41901: Probability and Statistics Nick Polson Fall, 2016
October 1, 2016
Getting Started
I Syllabus
http://faculty.chicagobooth.edu/nicholas.polson/teaching/41900/
I General Expectations
1. Read the notes / Practice
2. Be on schedule
3. DeGroot and Schervish: Probability and Statistics
Course Expectations
Homework: 20%. Assignments handed in at class. I encourage you to do assignments in groups. Otherwise it's no fun! Grading is X−, X, X+.
Final: 80%, Week 11.
Grading: PhD course.
Table of Contents
Chapter 1: Probability, Risk and Utility
Chapter 2: Conditional Probability and Bayes
Chapter 3: Distributions
Chapter 4: Expectations
Chapter 5: Modern Regression Methods
Chapter 6: Bayesian Statistics
Chapter 7: Hierarchical Models
Chapter 8: Bayesian Asset Allocation
Chapter 9: Brownian Motion and Sports
Chapter 10: Stochastic Processes, Martingales, SV
Introduction
1. W1-W8 Probability. Reading: DeGroot and Schervish
Chap 1 & 2: Probability and Conditional Probability
Chap 3: Random Variables
Chap 4: Expectations
Chap 5: Special Distributions
Chap 6: Hierarchical Models
2. W9-W10 Martingales and Stochastic Processes. Reading: DeGroot and Schervish "Probability and Statistics" (4th Edition)
Outline
1. Probability (axioms, subjective, utility, laws, conditional, Bayes theorem, applications, DNA testing, cdf's and pdf's)
2. Distributions (discrete, continuous, binomial, Poisson, gamma, Weibull, beta, exponential family, Wishart); Transformations and Expectations (distribution of functions, expectations, mgf's, convergence, conditional and marginal, bivariate normal, inequalities and identities)
3. Modern Regression Methods (Ridge, Lasso)
4. Bayesian Methods (Hierarchical Models, Shrinkage, Asset Allocation)
5. Martingales and Stochastic Processes (Brownian motion, hitting times, stopping times, ruin problem, transformations, Girsanov's theorem, stochastic calculus)
Probability: 41901 Week 1: Probability, Risk and Utility Nick Polson http://faculty.chicagobooth.edu/nicholas.polson/teaching/41901/
Overview Probability and Paradoxes
1. Birthday Problem
2. Exchange Paradox
3. Probability: Axioms and Subjectivity
4. Expected Utility: Preferences
5. Risk
6. Probability and Psychology
7. St. Petersburg Paradox
8. Allais Paradox
9. Kelly Criterion
Reading: DeGroot and Schervish: Chapter 1, p. 1-47.
Ellsberg Paradox
Probability is counter-intuitive!!!
I Two urns:
1. 100 balls with 50 red and 50 blue.
2. A mix of red and blue, but you don't know the proportion.
I Which urn would you like to bet on?
I People don't like the "uncertainty" about the distribution of red/blue balls in the second urn.
Likelihood of Death
120 Stanford graduates were asked to estimate the causes of death (per 100 deaths):

Cause                      Estimated   Actual
Heart disease                             34
Cancer                                    23
Other natural causes                      35
Total (natural)                58         92
Accident                                   5
Homicide                                   1
Other unnatural causes                     2
Total (unnatural)              32          8

I The P's don't even sum up to one!
People vastly overestimate probability of violent death
Probability
Probability is a language designed to communicate uncertainty. It's immensely useful, and there are only a few basic rules.
1. If an event A is certain to occur, it has probability 1, denoted P(A) = 1.
2. Either an event A occurs or it does not: P(not A) = 1 − P(A).
3. If two events are mutually exclusive (both cannot occur simultaneously) then P(A or B) = P(A) + P(B).
4. P(A and B) = P(A given B)P(B) = P(A|B)P(B).
Envelope Paradox
The following problem is known as the "exchange paradox".
I A swami puts m dollars in one envelope and 2m in another. He hands one envelope to you and one to your opponent. The amounts are placed randomly, so there is a probability of 1/2 that you get either envelope. You open your envelope and find x dollars. Let y be the amount in your opponent's envelope.
Envelope Paradox
I You know that y = x/2 or y = 2x. You are thinking about whether you should switch your opened envelope for the unopened envelope of your friend. It is tempting to do an expected value calculation as follows:

E(y) = (1/2) · (x/2) + (1/2) · 2x = (5/4) x > x

Therefore, it looks as if you should switch no matter what value of x you see. A consequence of this, following the logic of backwards induction, is that even if you didn't open your envelope you would want to switch!
Bayes Rule
I Where's the flaw in this argument? Use Bayes rule to update the probabilities of which envelope your opponent has! Assume a prior p(m) on the number of dollars the swami places in the smaller envelope.
I Such an assumption then allows us to calculate an odds ratio

p(y = x/2 | x) / p(y = 2x | x)

concerning the likelihood of which envelope your opponent has.
I Then, the expected value is given by

E(y) = p(y = x/2 | x) · (x/2) + p(y = 2x | x) · 2x

and the condition E(y) > x becomes a decision rule.
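To make this concrete, here is a small Python sketch (my own illustration, not from the notes) that computes E(y | x) under an explicit, assumed prior p(m) on the smaller amount. With a uniform prior on m ∈ {1, ..., 100}, switching is good at x = 10 but bad at x = 150, where only m = 75 is possible:

```python
from fractions import Fraction

def other_envelope_value(x, prior):
    """Expected value of the opponent's envelope given you observe x.
    prior: dict m -> p(m) for the smaller amount m (an assumed prior)."""
    w_double = prior.get(x, 0)             # you hold m = x, so y = 2x
    w_half = prior.get(Fraction(x, 2), 0)  # you hold 2m, so y = x/2
    total = w_double + w_half
    if total == 0:
        return None  # x impossible under this prior
    p_double = Fraction(w_double) / total
    return p_double * 2 * x + (1 - p_double) * Fraction(x, 2)

# Uniform prior on m in {1, ..., 100} (a hypothetical choice)
uniform = {m: Fraction(1, 100) for m in range(1, 101)}
print(other_envelope_value(10, uniform))   # 25/2: switch, since 12.5 > 10
print(other_envelope_value(150, uniform))  # 75: keep, since 75 < 150
```

Once a proper prior is fixed, the "always switch" conclusion disappears: for large observed x the posterior puts all weight on holding the bigger envelope.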
Birthday Problem Example “.. for the ‘one chance in a million’ will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us” (Fisher, 1937)
1. Birthdays: 23 people for a 50/50 chance of a match.
2. Almost Birthdays: Only need 13 people for a Birthday match to within a day.
3. Multiple Matches: 88 people for a triple match.
Solution
Easier to calculate P(no match).
I The first person can have any birthday, so

P(no match) = 1 × (364/365) × ... × ((365 − N + 1)/365)

Substituting N = 23 you get P(no match) ≈ 1/2.
I General rule-of-thumb: with N people and c categories,

N ≈ 1.2 √c for a 50/50 chance.

I To apply this to Birthdays we use c = 365, giving N ≈ 23.
For near-Birthdays, c = 121 and N = 1.2 √121 ≈ 13.
Example
I We can calculate

P(no match) = ∏_{i=1}^{N−1} (1 − i/c)
            = exp( ∑_{i=1}^{N−1} log(1 − i/c) )
            ≈ exp(−N²/(2c))

where log(1 − i/c) ≈ −i/c for i ≪ c and ∑_{i=1}^{N−1} i = N(N − 1)/2.
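A quick numerical sketch checking the exact product against the exp(−N²/2c) approximation and the 1.2 √c rule of thumb:

```python
import math

def p_no_match(N, c=365):
    """Exact P(no match): product of (1 - i/c) for i = 1, ..., N-1."""
    p = 1.0
    for i in range(1, N):
        p *= 1 - i / c
    return p

exact = p_no_match(23)                 # ~0.493 for N = 23
approx = math.exp(-23**2 / (2 * 365))  # ~0.48, close to the exact value
print(round(exact, 3), round(approx, 2))
print(round(1.2 * math.sqrt(365)))     # rule of thumb gives N = 23
```

The approximation is within a percent of the exact answer already at N = 23, which is why the rule of thumb works so well.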
Bayes Theorem
Many problems in decision making can be solved using Bayes rule.
I Rule-based decision making. Artificial Intelligence.
I It's counterintuitive! But gives the "right" answer.
Bayes Rule:

P(A|B) = P(A ∩ B)/P(B) = P(B|A)P(A)/P(B)

Law of Total Probability:

P(B) = P(B|A)P(A) + P(B|Ā)P(Ā)
Bayes Theorem
Disease Testing example. Let D = 1 indicate you have a disease and T = 1 indicate that you test positive for it. The probability tree:

D = 1 with probability .02:  then T = 1 with probability .95, T = 0 with probability .05
D = 0 with probability .98:  then T = 1 with probability .01, T = 0 with probability .99

For example, P(D = 0, T = 0) = .98 × .99 = .9702.
If you take the test and the result is positive, you are really interested in the question: Given that you tested positive, what is the chance you have the disease?
Bayes Theorem
We have a probability table

         T = 0    T = 1
D = 0    0.9702   0.0098
D = 1    0.001    0.019

Bayes Probability

P(D = 1 | T = 1) = 0.019/(0.019 + 0.0098) = 0.66
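The calculation from the table, written out as a short Python sketch:

```python
# Joint probabilities (D, T) from the table above
joint = {(0, 0): 0.9702, (0, 1): 0.0098,
         (1, 0): 0.001,  (1, 1): 0.019}

p_t1 = joint[(0, 1)] + joint[(1, 1)]  # P(T = 1) by total probability
posterior = joint[(1, 1)] / p_t1      # P(D = 1 | T = 1)
print(round(posterior, 2))            # 0.66
```

Despite the test being 95% accurate, a positive result only raises the probability of disease to 0.66 because the disease is rare: most positives come from the large healthy population.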
Probability
Subjective probability is very different from the (non-operational) approach based on long-run averages.
1. Kreps (1989) Notes on the Theory of Choice
2. Ramsey (1926) Truth and Probability
3. deFinetti (1930) Theory of Probability
4. Kolmogorov (1931) Probability
5. von Neumann and Morgenstern (1944) Theory of Economic Games and Behaviour
6. Savage (1956) Foundations of Statistics
Principle of Coherence: A set of subjective probability beliefs must avoid sure loss.
Subjective Probability, Preferences and Utility
I Axiomatic probability doesn't help you find probabilities. It describes how to find probabilities of more complicated events.
I Subjective probability measures your probability as a propensity to bet (willingness to pay). Utility effects are "assumed" to be linear (risk neutral), so you value everything in terms of expected values.
I P(A) "measures" how much you are willing to pay to enter a transaction that pays $1 if A occurs and 0 otherwise.
Odds
We can express probabilities in terms of Odds via

O(A) = (1 − P(A))/P(A)  or  P(A) = 1/(1 + O(A))

I For example, if O(A) = 1 then for every $1 bet you will pay out $1. An event with probability 1/2.
I If O(A) = 2, or 2:1, then for a $1 bet you'll pay back $3. In terms of probability, P = 1/3.
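The odds-against convention above can be captured in two small helper functions (a sketch):

```python
from fractions import Fraction

def odds_against(p):
    """O(A) = (1 - P(A)) / P(A)."""
    return (1 - p) / p

def prob_from_odds(o):
    """P(A) = 1 / (1 + O(A))."""
    return 1 / (1 + o)

print(odds_against(Fraction(1, 2)))  # 1: even money
print(prob_from_odds(Fraction(2)))   # 1/3: odds of 2:1 against
```

Exact rational arithmetic via `Fraction` keeps the round-trip between odds and probabilities free of floating-point noise.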
US Presidential Election 2016 Oddschecker
Figure: Presidential Odds 2016
I The best odds on Hillary are 6/4. This means that you risk $4 to win $6, offered by Ladbrokes.
I There's competition between the Bookmakers.
I The odds will dynamically change in time as new information arrives. You can lock in fixed odds at the time you bet.
Probability and Psychology
How do people form probabilities or expectations in reality? Psychologists have categorized many different biases that people have in their beliefs or judgments.
Loss Aversion
The most important finding of Kahneman and Tversky is that people are loss averse. Utilities are defined over gains and losses rather than over final (or terminal) wealth, an idea first proposed by Markowitz. This is a violation of the EU postulates.
Let (x, y) denote a bet with gain x with probability y. To illustrate this, subjects were asked:
In addition to whatever you own, you have been given $1000; now choose between the gambles A = (1000, 0.5) and B = (500, 1).
B was the more popular choice.
Example
The same subjects were then asked: In addition to whatever you own, you have been given $2000; now choose between the gambles C = (−1000, 0.5) and D = (−500, 1).
I This time C was more popular.
I The key here is that the final wealth positions are identical, yet people chose differently. The subjects are apparently focusing only on gains and losses. When they are not given any information about prior winnings, they choose B over A and C over D. Clearly for a risk-averse person this is the rational choice.
I This effect is known as loss aversion.
Representativeness
When people try to determine the probability that evidence A was generated by model B, they often use the representativeness heuristic: they evaluate the probability by the degree to which A reflects the essential characteristics of B. A common bias is base rate neglect, or ignoring prior evidence. For example, in tossing a fair coin a sequence such as HHTHTHHTHH with seven heads is quite likely to appear, and yet people draw conclusions from too few data points: they think 7 heads out of 10 is representative of the true process and conclude p = 0.7.
Expected Utility (EU) Theory
Normative. Let P, Q be two probability distributions or risky gambles/lotteries. pP + (1 − p)Q is the compound or mixture lottery. The rational agent (You) will have preferences between gambles.
I We write P ≻ Q if and only if You strictly prefer P to Q. If two lotteries are indifferent we write P ∼ Q.
I EU: given a number of plausible axioms – completeness, transitivity, continuity and independence – preferences are an expectation of a utility function.
I The theory is a normative one and not necessarily descriptive. It suggests how a rational agent should formulate beliefs and preferences, not how they actually behave.
I The expected utility U(P) of a risky gamble then satisfies

P ≽ Q  ⟺  U(P) ≥ U(Q)
Attitudes to Risk
The solution depends on your risk preferences:
I Risk neutral: a risk neutral person is indifferent about fair bets. Linear utility.
I Risk averse: a risk averse person prefers certainty over fair bets: E(U(X)) < U(E(X)). Concave utility.
I Risk loving: a risk loving person prefers fair bets over certainty. Convex utility.
St. Petersburg Paradox
What are you willing to pay to enter the following game?
I I toss a fair coin and when the first head appears, on the Tth toss, I pay you $2^T.
I The probability of the first head on the Tth toss is 2^{−T}, so

E(X) = ∑_{T=1}^∞ 2^T 2^{−T} = 2(1/2) + 4(1/4) + 8(1/8) + ... = 1 + 1 + 1 + ... → ∞

I Bernoulli (1754) used this to introduce utility and value bets as E(u(X)).
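A numerical sketch of the divergence and of the log-utility fix (my illustration; the $4 certainty equivalent assumes base-2 log utility):

```python
# Truncated expected value: each term 2^T * 2^(-T) contributes exactly 1
partial = sum(2**t * 2**-t for t in range(1, 31))
print(partial)  # 30.0 after 30 terms; grows without bound

# Under log2 utility, E[log2(2^T)] = E[T] = sum of t * 2^(-t) = 2,
# so the certainty equivalent of the game is 2^2 = $4 (finite).
expected_log2 = sum(t * 2**-t for t in range(1, 200))
print(round(2 ** expected_log2))  # 4
```

The expected payout is infinite, yet a log-utility agent would pay only a few dollars: concave utility tames the heavy tail.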
Allais Paradox
You have to make a choice between the following gambles.
I First compare the "Gambles"
G1: $5 million with certainty
G2: $25 million with p = 0.10, $5 million with p = 0.89, $0 with p = 0.01
I Now choose between the Gambles
G3: $5 million with p = 0.11, $0 with p = 0.89
G4: $25 million with p = 0.10, $0 with p = 0.90
Fact: If G1 ≽ G2 then G3 ≽ G4, and vice-versa.
Solution: Expected Utility
Given (subjective) probabilities P = (p1, p2, p3), write E(U|P) for expected utility.
I Without loss of generality we can set u(0) = 0 and, for the high prize, u($25 million) = 1. Which leaves one free parameter u = u($5 million).
I Hence to compare gambles with probabilities P and Q we look at the difference

E(u|P) − E(u|Q) = (p2 − q2) u + (p3 − q3)

I For the two comparisons we get

E(u|G1) − E(u|G2) = 0.11u − 0.1
E(u|G3) − E(u|G4) = 0.11u − 0.1

The order is the same, given your u.
I If your utility satisfies u < 0.1/0.11 = 0.909 you take the "riskier" gamble.
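A sketch checking the consistency claim numerically: with u(0) = 0 and u($25M) = 1, the G1-vs-G2 and G3-vs-G4 preferences flip together at the cutoff u = 0.1/0.11:

```python
def eu(p5, p25, u):
    """Expected utility with u(0) = 0, u($25M) = 1, u($5M) = u."""
    return p5 * u + p25 * 1.0

G1, G2 = (1.00, 0.00), (0.89, 0.10)  # (P($5M), P($25M))
G3, G4 = (0.11, 0.00), (0.00, 0.10)

for u in (0.95, 0.85):  # above and below the cutoff 0.1/0.11 = 0.909
    print(u, eu(*G1, u) > eu(*G2, u), eu(*G3, u) > eu(*G4, u))
# For any u the two comparisons agree, so choosing G1 and G4
# together (as most people do) violates expected utility.
```

Both preference questions reduce to the sign of 0.11u − 0.1, which is why no utility function can rationalize the popular G1-and-G4 pattern.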
Utility Functions
Power and log-utilities.
I Constant relative risk aversion (CRRA).
I Advantage that the optimal rule is unaffected by wealth effects.
The CRRA utility of wealth takes the form

U_γ(W) = (W^{1−γ} − 1)/(1 − γ)

I The special case U(W) = log(W) for γ = 1.
This leads to a myopic Kelly criterion rule.
Kelly Criterion
The Kelly Criterion corresponds to betting under binary uncertainty.
I Consider a sequence of i.i.d. bets where

p(Xt = 1) = p and p(Xt = −1) = q = 1 − p

The optimal allocation is ω* = p − q = 2p − 1.
I Maximising the expected long-run growth rate leads to the solution

max_ω E(ln(1 + ωXt)) = max_ω [ p ln(1 + ω) + (1 − p) ln(1 − ω) ]
                     ≤ p ln p + q ln q + ln 2, attained at ω* = p − q
Kelly Criterion
If one believes the event is certain, i.e. p = 1, then one bets all wealth and a priori one is certain to double invested wealth. If the bet is fair, i.e. p = 1/2, one bets nothing, ω* = 0, due to risk-aversion.
I Let p denote the probability of a gain and O = (1 − p)/p the odds. We can generalize the rule to the case of asymmetric payouts (a, b) where

p(Xt = 1) = p and p(Xt = −1) = q = 1 − p

I Then the expected utility function is

p ln(1 + bω) + (1 − p) ln(1 − aω)

I The optimal solution is

ω* = (bp − aq)/(ab)
Kelly Criterion
I If a = b = 1 this reduces to the pure Kelly criterion.
I A common case occurs when a = 1. We can now interpret b as the odds O that the market is willing to offer the investor if the event occurs, and so we write b = O. The rule becomes

ω* = (p · O − q)/O
Example
I Two possible market opportunities: one where the market offers you 4/1 when you have personal odds of 3/1, and a second where it offers you 12/1 while you think the odds are 9/1. In expected return these two scenarios are identical, both offering a 33% gain. In terms of maximizing long-run growth, however, they are not identical.
Example
I The table shows the Kelly criterion advises allocating over twice as much capital to the lower odds proposition: a 1/16 weight versus 1/40.

Market   You   p      ω*
4/1      3/1   1/4    1/16
12/1     9/1   1/10   1/40

Table: Kelly rule

I The optimal allocation ω* = (pO − q)/O is

((1/4) × 4 − (3/4))/4 = 1/16  and  ((1/10) × 12 − (9/10))/12 = 1/40
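The two allocations can be reproduced exactly, and the first verified by brute-force maximization of the expected log-growth (a sketch):

```python
import math
from fractions import Fraction

def kelly(p, b, a=1):
    """w* = (b p - a q) / (a b) for win payout b, loss a per unit staked."""
    q = 1 - p
    return (b * p - a * q) / (a * b)

print(kelly(Fraction(1, 4), 4))    # 1/16: market 4/1, personal odds 3/1
print(kelly(Fraction(1, 10), 12))  # 1/40: market 12/1, personal odds 9/1

# Brute-force check: w near 1/16 maximizes p ln(1 + b w) + q ln(1 - w)
p, b = 0.25, 4
best = max((i / 1000 for i in range(250)),
           key=lambda w: p * math.log(1 + b * w) + (1 - p) * math.log(1 - w))
print(abs(best - 1 / 16) < 0.002)  # True
```

The grid search confirms the closed form: the first derivative pb/(1 + bω) − q/(1 − ω) vanishes exactly at ω = 1/16.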
Assignment 1
Probability and Paradox
1. Envelope Paradox
2. Galton's Paradox
3. Prosecutors' Fallacy
4. St Petersburg Paradox
5. Kelly Criterion
6. OJ Simpson Case
Probability: 41901 Week 2: Conditional Probability and Bayes Rule Nick Polson http://faculty.chicagobooth.edu/nicholas.polson/teaching/41901/
Topics
1. Conditional Probability
2. Bayes Rule
3. Prisoner's Dilemma / Monty Hall
4. Using Probability as Evidence
5. Island Problem and DNA Evidence
6. Combining Evidence
7. Prosecutors and base-rate fallacies
Reading: DeGroot and Schervish: Chapter 2, p. 49-97.
Conditional Probability Bayes Rule
In its simplest form:
I Two events A and B. Bayes rule

P(A|B) = P(A ∩ B)/P(B) = P(B|A)P(A)/P(B)

I Law of Total Probability

P(B) = P(B|A)P(A) + P(B|Ā)P(Ā)

Hence we can calculate the denominator of Bayes rule.
Prisoner's Dilemma
Example
I Three prisoners A, B, C. Each believes he is equally likely to be set free.
I Prisoner A goes to the warden W and asks if he is getting axed.
I The Warden can't tell A anything about himself. He provides the new information: WB = "B is to be executed".
Uniform prior probabilities lead to

Prior P(Pardon):  A 0.33, B 0.33, C 0.33

I Posterior? Compute P(A|WB).
I What happens if C overhears the conversation? Compute P(C|WB).
Monty Hall Problem
The problem is named after the host of the long-running TV show, Let's Make a Deal.
I Marilyn vos Savant had a column with the correct answer that many Mathematicians thought was wrong!
I A contestant is given the choice of 3 doors. There is a prize (a car, say) behind one of the doors and something worthless behind the other two doors: two goats.
Puzzle
The game is as follows:
I You pick a door.
I Monty then opens one of the other two doors, revealing a goat.
I You have the choice of switching doors.
Is it advantageous to switch?
Conditional Probabilities: Assume you pick door A at random. Then P(A) = 1/3. You need to figure out P(A|MB) after Monty reveals that door B has a goat.
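A quick Monte Carlo sketch of the game confirms that switching wins about 2/3 of the time:

```python
import random
random.seed(0)

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)
        pick = random.randrange(3)
        # Monty opens a door that is neither your pick nor the car
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:  # move to the remaining unopened door
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

print(round(play(switch=False), 2))  # ~0.33
print(round(play(switch=True), 2))   # ~0.67
```

Sticking wins only when your original pick was right (probability 1/3); switching wins in the other 2/3 of cases, exactly as the conditional-probability argument says.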
Using Probability as Evidence
Suppose that the event G is that the suspect is the criminal.
I Compute the conditional probability of guilt P(G|evidence).
Evidence E: suspect and criminal possess the same trait γ (e.g. DNA match).
I Bayes Theorem

P(G|evidence) = P(evidence|G)P(G)/P(evidence)

I In terms of odds

P(I|evidence)/P(G|evidence) = [P(evidence|I)/P(evidence|G)] · [P(I)/P(G)]
Posterior Odds
Posterior odds equals the likelihood ratio times the prior odds (given 'side' information).
1. Prior Odds O(G)?
2. In many problems this needs to be carefully thought about.
3. "What if" analysis?
4. The likelihood ratio is common to all observers and updates everyone's prior probabilities.
Prosecutor's Fallacy
The most common fallacy is confusing P(evidence|G) with P(G|evidence).
I Bayes rule yields

P(G|evidence) = P(evidence|G)P(G)/P(evidence)

I Your assessment of P(G) will matter.
Island Problem
Suppose there's a criminal on an island of N + 1 people.
I Let I denote innocence and G guilt.
I Evidence E: the suspect matches a trait with the criminal.
I The probabilities are

p(E|I) = p and p(E|G) = 1

Bayes factor
Bayes factors are likelihood ratios.
I The Bayes factor is given by

p(E|I)/p(E|G) = p

I If we start with a uniform prior distribution we have

p(I) = N/(N + 1) and odds(I) = N

I Priors will matter!
Island Problem (contd)
Posterior probability related to odds:

p(I|y) = odds(I|y)/(1 + odds(I|y))

I Prosecutors' fallacy: the posterior probability p(I|y) ≠ p(y|I) = p.
I Suppose that N = 10³ and p = 10⁻³. Then odds(I|y) = Np = 1 and

p(I|y) = 1/(1 + 1) = 1/2

The odds on innocence are odds(I|y) = 1. There's a 50/50 chance that the criminal has been found.
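The island calculation as a two-line sketch (prior odds of innocence N, Bayes factor p):

```python
from fractions import Fraction

def p_innocent(N, p):
    """Posterior P(innocent | match): prior odds(I) = N, Bayes factor p."""
    odds = N * p                # posterior odds on innocence
    return odds / (1 + odds)

print(p_innocent(1000, Fraction(1, 1000)))  # 1/2
```

Even a one-in-a-thousand match probability leaves a 50/50 chance of innocence here, because a thousand other islanders were equally likely suspects a priori.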
Sally Clark Case: Independence or Bayes?
Sally Clark was accused and convicted of killing her two children. They could have both died of SIDS.
I The chance of a family which are non-smokers and over 25 having a SIDS death is around 1 in 8,500.
I The chance of a family which has already had a SIDS death having a second is around 1 in 100.
I The chance of a mother killing her two children is around 1 in 1,000,000.
Bayes or Independence
1. Under Bayes

P(both SIDS) = P(first SIDS) P(second SIDS | first SIDS) = (1/8500) · (1/100) = 1/850,000

The 1/100 comes from taking into account genetics.
2. Independence, as the court assumed, gets you

P(both SIDS) = (1/8500)(1/8500) ≈ 1/73,000,000

3. By Bayes rule

p(I|E)/p(G|E) = P(E ∩ I)/P(E ∩ G)

where P(E ∩ I) = P(E|I)P(I) needs a discussion of p(I).
Comparison
I Hence putting these two together gives the odds of innocence as

p(I|E)/p(G|E) = (1/850,000)/(1/1,000,000) ≈ 1.18

In terms of posterior probabilities

p(G|E) = 1/(1 + O(I|E)) ≈ 0.46

I If you use independence

p(I|E)/p(G|E) = 1/73 and p(G|E) ≈ 0.99

The suspect looks guilty.
OJ Simpson case
Prosecutor's Fallacy: P(G|T) ≠ P(T|G).
I Dr. Robin Cotton testified that the chance of the blood being someone else's other than Simpson's was one in 17 million or more.
I Suppose that you are a juror in a murder case of a husband who is accused of killing his wife. The husband is known to have battered her in the past. Consider the three events: G "husband murders wife in a given year", M "wife is murdered in a given year", B "husband is known to batter his wife".
I You now have the following probability facts. Only one tenth of one percent of husbands who batter their wife actually murder them. Conditional on eventually murdering their wife, there is a one in ten chance it happens in a given year.
OJ Simpson (contd)
I In a given year, there are about 5 murders per 100,000 of population in the United States. How do you use the prior information?

p(G|M, B)/p(I|M, B) = [p(M|G, B)/p(M|I, B)] · [p(G|B)/p(I|B)]

By assumption, we have p(M|G, B) = 1,

p(M|I, B) = 1/20,000 and p(G|B)/p(I|B) = 1/10,000

I Putting it all together

p(G|M, B)/p(I|M, B) = 2 and p(G|M, B) = 2/3

More than a 50/50 chance that your spouse murdered you!
Fallacy
p(G|B) ≠ p(G|B, M). Alan Dershowitz stated to the press that fewer than 1 in 2000 batterers go on to murder their wives (in any given year).
I We estimate p(M|Ḡ, B) = p(M|Ḡ) = 1/20,000.
I The Bayes factor is then

p(G|M, B)/p(Ḡ|M, B) = (1/2000)/(1/20,000) = 10

which implies posterior probabilities

p(Ḡ|M, B) = 1/11 and p(G|M, B) = 10/11

Hence it's over a 90% chance that O.J. is guilty based on this information! Dershowitz intended this information to exonerate O.J.
Proving Innocence
DNA evidence is very good at proving innocence:

O(G|E) = [p(y|G)/p(y|I)] O(G)

I A small Bayes factor p(y|G)/p(y|I) implies O(G|E) is also small.
I Hence, we see that no match implies innocence.
I The match probability is the probability that an unrelated individual chosen randomly from a reference population will match the profile.
More than One Criminal
Let E be the evidence that there are two criminals and the suspect matches trait C.
I Rare blood type (p = 0.1) and Common blood type C (p = 0.6).
I Evidence for the prosecution?
I Bayes: P(C|I) = 0.6 and P(C|G) = 0.5

BF = P(C|G)/P(C|I) = 0.5/0.6 = 0.83 < 1!

I Evidence for the Defense. Posterior odds on guilt are decreased.
The GateCrasher
Example
I 1000 people at an event where, after collecting receipts (no tickets), you realize only 499 paid.
I On the basis of the base-rate evidence – the only potentially relevant evidence – there's a 0.501 probability that a random person is guilty.
I Does this civil case deserve to succeed on the balance of probabilities?
Base Rate Fallacies
A "Witness" is 80% certain they saw a "checker" C taxi in the accident.
I What's your P(C|E)?
I Need P(C). Say P(C) = 0.2 and P(E|C) = 0.8.
I Then your posterior is

P(C|E) = (0.8 · 0.2)/(0.8 · 0.2 + 0.2 · 0.8) = 0.5

Therefore O(C|E) = 1, a 50/50 bet.
Updating Fallacies
When you have a small sample size, Bayes rule still updates probabilities.
I Two players: A is either a 70% or a 30% winner.
I Observe A beats B 3 times out of 4.
I What's P(A is the 70% player)?
I Most people don't update quickly enough in light of new data: Edwards (1968).
Sequential Learning: Bellman
Bellman's principle of optimality: An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. (Bellman, 1956)
In Math!! The value function for a finite horizon in discrete time with no discounting is given by

V(x, t) = max_d ( u(x, d, t) + V(p(x, d, t), t + 1) ) for t < T
Secretary Problem
Also called the matching or marriage problem.
I You will see items (spouses) from a distribution of types F(x).
I You clearly would like to pick the maximum. You see these chronologically. After you decide no, you can't go back and select it.
I Strategy: wait for a fraction

1/e = 1/2.718281828... ≈ 0.3679

of the sequence, then select the first item greater than the current best.
Strategy
What's your best strategy?
I Turns out it's insensitive to the choice of distribution.
I Although there is the random sample i.i.d. assumption lurking.
I You'll no doubt get married between 18 and 60. Waiting 1/e along this sequence gets you to age 32!
I Then, pick the next best person!
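A simulation sketch of the 1/e rule: observe the first n/e candidates without committing, then take the first one that beats them all. The success rate is close to 1/e ≈ 0.37:

```python
import math
import random
random.seed(2)

def secretary(n=50, trials=20_000):
    cutoff = round(n / math.e)  # observe-only phase
    wins = 0
    for _ in range(trials):
        xs = [random.random() for _ in range(n)]
        best_seen = max(xs[:cutoff])
        # First later candidate beating everyone observed (else stuck with last)
        chosen = next((x for x in xs[cutoff:] if x > best_seen), xs[-1])
        wins += (chosen == max(xs))
    return wins / trials

print(round(secretary(), 2))  # close to 1/e, about 0.37
```

Note the distribution of the draws never enters the strategy, only their ranks: this is the insensitivity claimed above.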
Q-Learning and Deal-No Deal
Rule of Thumb: Continue as long as there are two large prizes left.
I The continuation value is large. For example, with three prizes and two large ones, risk averse people will naively choose deal, when if they incorporated the continuation value they would choose no deal.
I Data: Suzanne and Frank's choices on the Dutch version of the show.
Table: Susanne's choices. Prizes range from €0.01 to €250,000. Over the nine rounds the bank's offer rose from 11% to 100% of the average remaining prize (finally €125,000 offered against an average of €125,000), and Susanne chose "No Deal" in every round.
Table: Frank's choices. Prizes range from €0.01 to €5,000,000. The bank's offer rose from 4% of the average remaining prize (€17,000 against €383,427) to 105% and then 120% (€6,000 against an average of €5,005) in the final rounds; Frank chose "No Deal" in every round, even when the offer exceeded the average remaining prize.
Q-Learning
Rationalising Waiting
There's a matrix of Q-values that solves the problem.
I Let s denote the current state of the system and a an action.
I The Q-value, Qt(s, a), is the value of using action a today and then proceeding optimally in the future. We use a = 1 to mean no deal and a = 0 to mean deal.
I The Bellman equation for Q-values becomes

Qt(s, a) = u(s, a) + ∑_{s*} P(s*|s, a) max_a Qt+1(s*, a)
Value Function
The value function and optimal action are given by

V(s) = max_a Q(s, a) and a* = argmax_a Q(s, a)

I Transition Matrix. Consider the problem where you have three prizes left. Now s is the current state of three prizes, s* ∈ {all sets of two prizes} and

P(s*|s, a = 1) = 1/3

where the transition matrix is uniform over the next states.
I There's no continuation for P(s*|s, a = 0).
Utility
I Utility. The utility of the next state depends on the contestant's value for money and the bidding function of the banker:

u(B(s*)) = (B(s*)^{1−γ} − 1)/(1 − γ)

in the power utility case.
Banker's Function
Expected value implies B(s) = s̄, the average of the remaining prizes.
I The website uses the following criteria: with three prizes left

B(s) = 0.305 × big + 0.5 × small

and with two prizes left

B(s) = 0.355 × big + 0.5 × small
Example
Three prizes left: s = {750, 500, 25}.
I Assume the contestant is risk averse with log-utility U(x) = ln x.
I If the Banker offers the expected value we get

u(B(s = {750, 500, 25})) = ln(1275/3) = 6.052

and so Qt(s, a = 0) = 6.052.
I In the continuation problem, s* ∈ {s1*, s2*, s3*} where

s1* = {750, 500}, s2* = {750, 25} and s3* = {500, 25}.
Q-values
We'll have offers 625, 387.5, 262.5 under the expected value.
I As the banker offers the expected value, the optimal action at time t + 1 is to take the deal a = 0, with Q-values given by

Qt(s, a = 1) = ∑_{s*} P(s*|s, a = 1) max_a Qt+1(s*, a)
            = (1/3)(ln(625) + ln(387.5) + ln(262.5)) = 5.989

as the immediate utility u(s, a) = 0.
I Hence as

Qt(s, a = 1) = 5.989 < 6.052 = Qt(s, a = 0)

the optimal action is a* = 0, deal.
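The two Q-values for this state can be reproduced directly (a sketch under the stated assumptions: log utility, expected-value banker, deal taken next round):

```python
import math

s = [750, 500, 25]  # three prizes left

# Deal now: log utility of the expected-value offer
q_deal = math.log(sum(s) / 3)  # ln(425)

# No deal: each two-prize continuation state is equally likely, and the
# contestant takes next round's expected-value offer there.
pairs = [(750, 500), (750, 25), (500, 25)]
q_nodeal = sum(math.log(sum(p) / 2) for p in pairs) / 3

print(round(q_deal, 3), round(q_nodeal, 3))  # 6.052 5.989 -> deal is optimal
```

The gap between 6.052 and 5.989 is Jensen's inequality at work: a risk-averse contestant prefers the sure offer to a fair lottery over the same expected amount.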
Utilities
Continuation value: not large enough to overcome the generous (expected value) offer made by the banker.
I Sensitivity analysis: a different Banker's bidding function. If we use the function from the website (2 prizes):

B(s) = 0.355 × big + 0.5 × small

Hence

B(s1* = {750, 500}) = 516.25
B(s2* = {750, 25}) = 278.75
B(s3* = {500, 25}) = 190
Optimal Action
I The optimal action with two prizes left is found by comparing, for each state:

Qt+1(s1*, a = 1) = (1/2)(ln(750) + ln(500)) = 6.415
                 > 6.246 = Qt+1(s1*, a = 0) = ln(516.25)
Qt+1(s2*, a = 1) = (1/2)(ln(750) + ln(25)) = 4.919
                 < 5.630 = Qt+1(s2*, a = 0) = ln(278.75)
Qt+1(s3*, a = 1) = (1/2)(ln(500) + ln(25)) = 4.716
                 < 5.247 = Qt+1(s3*, a = 0) = ln(190)

Hence the future optimal policy will be no deal under s1*, and deal under s2* and s3*.
Solve for Q-values
I Therefore solving for Q-values at the previous step gives

Qt(s, a = 1) = ∑_{s*} P(s*|s, a = 1) max_a Qt+1(s*, a)
             = (1/3)(6.415 + 5.630 + 5.247) = 5.764

with a monetary equivalent of exp(5.764) = 318.6.
Premium
With three prizes we have

Qt(s, a = 0) = u(B(s = {750, 500, 25})) = ln(0.305 × 750 + 0.5 × 25) = ln(241.25) = 5.486

The contestant is offered $241.
I Now we have Qt(s, a = 1) = 5.764 > 5.486 = Qt(s, a = 0), and the optimal action is a* = 1, no deal. The continuation value is large.
I The premium is $241 compared to $319, a 32% premium.
Overview
Bayes accounts for probability in many scenarios:
I Prisoner's Dilemma
I Monty Hall
I Island Problem
I Sally Clark
I O.J. Simpson
I Bellman and Sequential Learning
Probability: 41901 Week 3: Distributions Nick Polson http://faculty.chicagobooth.edu/nicholas.polson/teaching/41901/
Topics
1. Markov Dependence
2. SP500 Hidden Markov Model
3. Random Variables
4. Transformations
5. Expectations
6. Moment Generating Functions
Reading: DeGroot and Schervish: Chapter 3, p. 97-181, and Chapter 4, p. 181-247.
Markov Dependence
I We can always factor a joint distribution as

p(Xn, Xn−1, . . . , X1) = p(Xn | Xn−1, . . . , X1) · · · p(X2 | X1) p(X1)

Example
I A process has the Markov Property if

p(Xn | Xn−1, . . . , X1) = p(Xn | Xn−1)

I Only the most recent value matters when determining the probabilities.
SP500 daily ups and downs
I Daily return data from 1948 to 2007 for the SP500 index of stocks.
I Can we calculate the probability of ups and downs?
A real world probability model: Hidden Markov Models
Are stock returns a random walk? Hidden Markov Models (Baum-Welch, Viterbi).
I Daily returns on the SP500 stock market index. Build a hidden Markov model to predict the ups and downs.
I Suppose that stock market returns on the next four days are X1, . . . , X4.
I Let's empirically determine conditionals and marginals.
SP500 Data
Marginal and Bivariate Distributions
I Empirically, what do we get? Daily returns from 1948 − 2007.

x           Down    Up
P(Xi = x)   0.474   0.526

I Finding p(X2|X1) is twice as much computational effort: counting UU, UD, DU, DD transitions.

Xi            Down    Up
Xi−1 = Down   0.519   0.481
Xi−1 = Up     0.433   0.567
Conditioned on two days
I Let's do p(X3|X2, X1):

Xi−2   Xi−1   Down    Up
Down   Down   0.501   0.499
Down   Up     0.412   0.588
Up     Down   0.539   0.461
Up     Up     0.449   0.551

I We could also do the distribution p(X2, X3|X1). This is a joint, marginal and conditional distribution all at the same time: joint because more than one variable (X2, X3), marginal because it ignores X4, and conditional because it's given X1.
Joint Probabilities
I Under Markov dependence

P(UUD) = p(X1 = U) p(X2 = U|X1 = U) p(X3 = D|X2 = U)
       = (0.526)(0.567)(0.433) = 0.129

I Under independence we would have got

P(UUD) = P(X1 = U) p(X2 = U) p(X3 = D)
       = (0.526)(0.526)(0.474) = 0.131
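The two calculations in Python, using the empirical probabilities above:

```python
p_up = 0.526               # marginal P(Up)
p_uu, p_ud = 0.567, 0.433  # P(Up|Up), P(Down|Up) from the transition table

markov = p_up * p_uu * p_ud       # P(UUD) under Markov dependence
indep = p_up * p_up * (1 - p_up)  # P(UUD) under independence
print(round(markov, 3), round(indep, 3))  # 0.129 0.131
```

The two answers differ only slightly here because the estimated transition probabilities are close to the marginal: the daily SP500 up/down series is nearly, but not exactly, a fair coin.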
Special Distributions
See Common Distributions:
1. Bernoulli and Binomial
2. Hypergeometric
3. Poisson
4. Negative Binomial
5. Normal Distribution
6. Gamma Distribution
7. Beta Distribution
8. Multinomial Distribution
9. Bivariate Normal Distribution
10. Wishart Distribution ...
Continuous Random Variables
1. We will be interested in random variables X : Ω → R.
2. The cumulative distribution function of X is defined as

FX(x) = P(X ≤ x), with 1 − FX(x) = P(X > x)

3. The probability density function of X is defined as

fX(x) = F′X(x) and P(X ≤ x) = ∫_{−∞}^{x} fX(t) dt

4. We have the following properties: fX(x) ≥ 0 and FX(x) is non-decreasing.
5. lim_{x→−∞} FX(x) = 0 and lim_{x→∞} FX(x) = 1.
Intuition: cdf
I For a discrete random variable, FX(x) is a step function with the heights of the steps being P(X = xi).
I For a continuous variable, it's a continuous non-decreasing curve.
Intuition: pdf
I The intuitive interpretation of a pdf is that for small ∆x

P(x < X ≤ x + ∆x) = ∫_{x}^{x+∆x} fX(t) dt ≈ fX(x)∆x

X lies in a small interval with probability proportional to fX(x).
I We will also consider the multivariate generalization of this.
Transformations
The cdf identity gives
    FY (y) = P(Y ≤ y) = P(g(X) ≤ y) = ∫_{g(x)≤y} fX (x) dx
I Hence if the function g(·) is monotone we can invert:
I If g is increasing, FY (y) = P(X ≤ g−1 (y)) = FX (g−1 (y))
I If g is decreasing, FY (y) = P(X ≥ g−1 (y)) = 1 − FX (g−1 (y))
Example: Exponential
I Suppose X ∼ U (0, 1) and Y = g(X) = − log X. Then Y ∼ Exp(1).
I Suppose X ∼ Γ(α, β); then what is the pdf for Y = 1/X?
I The Inverse Gamma distribution: used for stochastic variances in regression models.
Transformation Identity
1. Theorem 1: Let X have pdf fX (x) and let Y = g(X). Then if g is a monotone function we have
    fY (y) = fX (g−1 (y)) | (d/dy) g−1 (y) |
There's also a multivariate version of this that we'll see later.
I Suppose X is a continuous rv: what's the pdf for Y = X² ?
I Let X ∼ N (0, 1): what's the pdf for Y = X² ?
Probability Integral Transform
Theorem: Suppose that U ∼ U [0, 1]. Then for any continuous distribution function F, the random variable X = F−1 (U ) has distribution function F.
I Remember that for u ∈ [0, 1], P(U ≤ u) = u, so we have
    P(X ≤ x) = P( F−1 (U ) ≤ x ) = P(U ≤ F(x)) = F(x)
Hence, we can simulate X as F_X^{−1} (U ).
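The probability integral transform is exactly how the exponential distribution is simulated in practice. A minimal Python sketch (the inverse cdf F−1(u) = −log(1 − u)/λ follows from the exponential cdf on the next slides; the function name is our own):

```python
import math
import random

# Inverse-cdf (probability integral transform) sampling for Exp(lam):
# F(x) = 1 - exp(-lam*x), so F^{-1}(u) = -log(1 - u)/lam.
def sample_exponential(lam, n, seed=0):
    rng = random.Random(seed)
    return [-math.log(1 - rng.random()) / lam for _ in range(n)]

draws = sample_exponential(lam=2.0, n=100_000)

# Empirical cdf at x = 1 should be close to F(1) = 1 - e^{-2} ~ 0.865
ecdf_1 = sum(d <= 1.0 for d in draws) / len(draws)
print(round(ecdf_1, 2))
```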
Inequalities and Identities
1. Markov: P(g(X) ≥ c) ≤ E(g(X))/c where g(X) ≥ 0
2. Chebyshev: P(|X − µ| ≥ c) ≤ Var(X)/c²
3. Jensen: E(φ(X)) ≤ φ(E(X)) for concave φ (the inequality reverses for convex φ)
4. Cauchy-Schwarz: corr(X, Y) ≤ 1
Chebyshev follows from Markov. See Mike Steele's book on the Cauchy-Schwarz inequality.
Probability: 41901 Week 4: Expectations Nick Polson http://faculty.chicagobooth.edu/nicholas.polson/teaching/41901/
Expectation and Variance
I The expectation of X is given by
    E(X) = ∫_{−∞}^{∞} x fX (x) dx
I The variance of X is given by
    Var(X) = E (X − µ)² = E(X²) − µ²
I Moments can also be found from derivatives of moment generating functions (mgfs).
Conditional Expectation
Let X denote our random variable and Y a conditioning variable.
I In the continuous case
    E(g(X)) = ∫_{−∞}^{∞} g(x) fX (x) dx
I The conditional expectation (given Y) is determined by
    E(g(X)|Y) = ∫_{−∞}^{∞} g(x) fX|Y (x|Y) dx
where fX|Y (x|y) denotes the conditional density of X given Y.
Iterated Expectation
I The following fact is known as the law of iterated expectation
    EY ( EX|Y (X|Y) ) = E(X)
This is a very useful way of calculating E(X) (martingales).
I Sometimes called the Tower property of expectations.
I There is a similar decomposition for a variance:
    Var(X) = E (Var(X|Y)) + Var (E(X|Y)) .
Exponential Distribution
Example
I One of the most important distributions is the exponential
    fX (x) = λ e^{−λx} and FX (x) = ∫_0^x λ e^{−λt} dt = 1 − e^{−λx}
λ is called a parameter.
I The expectation of X is
    E(X) = ∫_0^∞ x λ e^{−λx} dx = ∫_0^∞ x d(−e^{−λx}) = 1/λ
I The variance of X is
    Var(X) = E(X²) − µ² = 1/λ²
Simulation
I Suppose that X ∼ FX (x) and let Y = g(X). How do we find FY (y) and fY (y)?
I von Neumann: given a uniform U, how do we find X = g(U )?
I In the bivariate case (X, Y) → (U, V ), we need to find f_{U,V} (u, v) from f_{X,Y} (x, y).
I Applications: Simulation, MCMC and particle filtering (PF).
Transformations
An important application: how do we transform multiple random variables? Here's the basic set-up.
I Suppose that we have random variables (X, Y) ∼ f_{X,Y} (x, y).
I A transformation of interest is given by U = g(X, Y) and V = h(X, Y).
I The problem is how to compute f_{U,V} (u, v)?
Jacobian
    J = ∂(x, y)/∂(u, v) = | ∂x/∂u  ∂x/∂v |
                          | ∂y/∂u  ∂y/∂v |
Bivariate Change of Variable
I Theorem (change of variable):
    f_{U,V} (u, v) = f_{X,Y} (h1 (u, v), h2 (u, v)) | ∂(x, y)/∂(u, v) |
The last term is the Jacobian. It can be calculated in two ways:
    ∂(x, y)/∂(u, v) = 1 / ( ∂(u, v)/∂(x, y) )
I So we don't always need the inverse transformation (x, y) = (g−1 (u, v), h−1 (u, v)).
Example: Exponentials
Example
I Suppose that X and Y are independent, identically distributed random variables, each with an Exp(λ) distribution. Let
    U = X + Y and V = X/(X + Y)
I The joint probability density is
    f_{X,Y} (x, y) = λ² e^{−λ(x+y)} for 0 < x, y < ∞
I The inverse transformation is x = uv, y = u(1 − v), with range of definition 0 < u < ∞, 0 < v < 1.
Jacobian
I We can calculate the Jacobian as
    J = | ∂x/∂u  ∂x/∂v | = |   v     u  |   so  |J| = |−uv − u(1 − v)| = u
        | ∂y/∂u  ∂y/∂v |   | 1 − v  −u  |
I Thus the joint density of (U, V ) is
    f_{U,V} (u, v) = λ² u e^{−λu}
I Therefore
    U ∼ Γ(2, λ) and V ∼ U (0, 1)
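The conclusion is easy to sanity-check by simulation. A minimal Python sketch (sample sizes and seed are arbitrary choices of ours): the mean of U should be near the Γ(2, λ) mean 2/λ, and the mean of V near the uniform mean 1/2.

```python
import random

# Simulation check: if X, Y ~ Exp(lam) iid, then U = X + Y ~ Gamma(2, lam)
# (mean 2/lam) and V = X/(X+Y) ~ U(0,1) (mean 1/2).
rng = random.Random(1)
lam, n = 1.5, 200_000
xs = [rng.expovariate(lam) for _ in range(n)]
ys = [rng.expovariate(lam) for _ in range(n)]

us = [x + y for x, y in zip(xs, ys)]
vs = [x / (x + y) for x, y in zip(xs, ys)]

mean_u = sum(us) / n          # should be near 2/lam = 1.333...
mean_v = sum(vs) / n          # should be near 0.5
print(round(mean_u, 2), round(mean_v, 2))
```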
Example: Normals
Suppose that X, Y are i.i.d., each with a standard normal N (0, 1) distribution.
I Let D = X² + Y² and Θ = tan−1 (Y/X).
I The joint distribution is
    f_{X,Y} (x, y) = (1/√(2π)) e^{−x²/2} (1/√(2π)) e^{−y²/2} for −∞ < x, y < ∞
I We can calculate the Jacobian as
    J = | ∂d/∂x  ∂d/∂y | = |     2x           2y      |   so  |J| = 2
        | ∂θ/∂x  ∂θ/∂y |   | −y/(x²+y²)   x/(x²+y²)  |
Example: Normal Simulation
I Thus the joint density of (D, Θ) is
    f_{D,Θ} (d, θ) = (1/4π) e^{−d/2} where 0 ≤ d < ∞, 0 ≤ θ ≤ 2π
I Therefore, we have
    D ∼ Exp(1/2) and Θ ∼ U (0, 2π)
I Box-Muller transform for simulating normal random variables: draw
    D = −2 log U1 and Θ = 2πU2
    X = √D cos(Θ) = √(−2 log U1) cos(2πU2)
    Y = √D sin(Θ) = √(−2 log U1) sin(2πU2)
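The Box-Muller recipe above translates directly into code. A minimal Python sketch (the function name is ours; we use 1 − U1 only to avoid log 0): the output should have mean near 0 and variance near 1.

```python
import math
import random

# Box-Muller transform: two independent uniforms give two independent
# standard normals via D = -2*log(U1), Theta = 2*pi*U2.
def box_muller(u1, u2):
    r = math.sqrt(-2.0 * math.log(u1))   # sqrt(D), D ~ Exp(1/2)
    theta = 2.0 * math.pi * u2           # Theta ~ U(0, 2*pi)
    return r * math.cos(theta), r * math.sin(theta)

rng = random.Random(42)
draws = [z for _ in range(50_000)
         for z in box_muller(1.0 - rng.random(), rng.random())]

mean = sum(draws) / len(draws)
var = sum(z * z for z in draws) / len(draws) - mean ** 2
print(round(mean, 2), round(var, 2))   # near 0 and 1 for N(0,1)
```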
Simulating a Cauchy
A Cauchy is a ratio of normals: C = X/Y.
I Let X, Y ∼ N (0, 1) be independent:
    f_{X,Y} (x, y) = (1/2π) e^{−(x²+y²)/2}, −∞ < x, y < ∞
I Transformation U = X/Y and V = |Y|. Inverse: x = uv, y = v.
I Jacobian
    J = | v  u | = v
        | 0  1 |
I Joint and marginal (the factor 2 accounts for the two branches y = ±v):
    f_{U,V} (u, v) = (1/2π) v e^{−v²(1+u²)/2}, 0 < v < ∞, −∞ < u < ∞
    f_U (u) = 2 ∫_0^∞ f_{U,V} (u, v) dv = 2 ∫_0^∞ (1/2π) v e^{−v²(1+u²)/2} dv = 1/( π (1 + u²) )
A Cauchy distribution.
Discrete Distributions
Consider a countably infinite set of possible outcomes Ω = {ω1 , ω2 , . . .} with associated probabilities pi . We must have
    pi ≥ 0 and ∑_{i=1}^∞ pi = 1
I The Bernoulli distribution
Sample space consists of two outcomes Ω = {H, T } with p = P(H ) and q = 1 − p = P(T ).
I Other distributions: Geometric, Poisson and Negative Binomial
Binomial Distribution
This models the number of heads occurring in n successive trials of the Bernoulli coin.
I Thus Ω = {0, 1, . . . , n} and
    fX (x) = (n choose x) p^x (1 − p)^{n−x}, 0 ≤ x ≤ n
I Generalizations? Random probability and hierarchical exchangeable models.
Sums of Binomial Distributions
Distribution of sums
I The probability generating function is
    E(z^X) = ∑_{r=0}^n z^r (n choose r) p^r q^{n−r} = (pz + q)^n
I Therefore, if X ∼ Bin(n, p) and Y ∼ Bin(m, p) are independent, X + Y ∼ Bin(n + m, p)
I In general, if Xi ∼ Bin(ni , p) are independent, then ∑_{i=1}^k Xi ∼ Bin( ∑_{i=1}^k ni , p )
Sums of Poisson Distributions
What about X + Y where X ∼ Poi(λ) and Y ∼ Poi(µ) are independent?
I The probability generating function is
    E(z^X) = ∑_{r=0}^∞ z^r e^{−λ} λ^r / r! = e^{−λ(1−z)}
Let z = e^t to get the moment generating function.
I Hence for the sum X + Y we have
    e^{−λ(1−z)} e^{−µ(1−z)} = e^{−(λ+µ)(1−z)}
Hence X + Y ∼ Poi(λ + µ).
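The pgf argument can be verified numerically without any simulation: the convolution of the two Poisson pmfs must match the Poi(λ + µ) pmf term by term. A minimal Python sketch (the rates 2.0 and 3.5 are arbitrary illustrative values):

```python
import math

# Numerical check that the Poisson family is closed under addition:
# convolving the Poi(lam) and Poi(mu) pmfs reproduces Poi(lam + mu).
def poi_pmf(k, rate):
    return math.exp(-rate) * rate ** k / math.factorial(k)

lam, mu = 2.0, 3.5
for k in range(10):
    conv = sum(poi_pmf(j, lam) * poi_pmf(k - j, mu) for j in range(k + 1))
    assert abs(conv - poi_pmf(k, lam + mu)) < 1e-12

print("X + Y ~ Poi(lam + mu) confirmed on k = 0..9")
```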
Sums of Random Variables
I For any random variables, the expectation of the sum is the sum of expectations:
    E(X1 + . . . + Xn ) = E(X1 ) + . . . + E(Xn )
I If X1 , . . . , Xn are independent and identically distributed (i.i.d.)
    Var( (X1 + . . . + Xn )/n ) = Var(X1 )/n
Cauchy Distribution
I The Cauchy distribution has a pdf given by
    fX (x) = 1/( π (1 + x²) )
I It is a special case of the tν -distribution, which has density proportional to
    fX (x|ν) = Cν (1 + x²/ν)^{−(1+ν)/2}
I ν = 1: T is a Cauchy
I ν → ∞: T →d N (0, 1)
I E(T ) = 0 (for ν > 1) and Var(T ) = ν/(ν − 2) for ν > 2.
Sums of Cauchy's
Example
I If X1 , . . . , Xn ∼ C(0, 1) are i.i.d. then
    (1/n)(X1 + . . . + Xn ) ∼ C(0, 1)
I Averaging a set of Cauchy's leads back to the same Cauchy!
The normal distribution behaves differently: if Xi ∼ N (µ, σ²) then
    (1/n)(X1 + . . . + Xn ) ∼ N (µ, σ²/n)
Eventually X̄ → µ, a constant (law of large numbers).
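This stability property is striking to see numerically. A minimal Python sketch (draws via the inverse cdf F−1(u) = tan(π(u − 1/2)); we compare interquartile ranges rather than variances, since the Cauchy has none): averaging 100 Cauchy draws leaves the spread unchanged at the C(0, 1) value of 2.

```python
import math
import random
import statistics

# Averaging Cauchy draws does not reduce spread: the mean of n iid C(0,1)
# variables is again C(0,1). Compare the IQR of sample means for n = 1
# and n = 100; both should sit near 2, the IQR of C(0,1).
rng = random.Random(7)

def cauchy_draw():
    # inverse-cdf draw: F^{-1}(u) = tan(pi*(u - 1/2))
    return math.tan(math.pi * (rng.random() - 0.5))

def iqr(xs):
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return q3 - q1

means_1 = [cauchy_draw() for _ in range(20_000)]
means_100 = [sum(cauchy_draw() for _ in range(100)) / 100
             for _ in range(20_000)]
print(round(iqr(means_1), 1), round(iqr(means_100), 1))  # both near 2
```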
Moment Generating Functions
I Definition: The moment generating function (mgf) is defined by
    mX (θ) = E( e^{θX} ), that is,
    mX (θ) = ∫_{−∞}^{∞} e^{θx} fX (x) dx
I Here θ is a parameter.
You'll need to check when the function mX (θ) exists.
Moment Identity
I There is an equivalence between knowing the mgf and the set of moments of the distribution, given by the following
Theorem: If X has mgf mX (θ) then E(X^n ) = m_X^{(n)} (0).
Here m_X^{(n)} (0) denotes the nth derivative evaluated at zero.
I What about a C(0, 1) random variable?
Characteristic Functions
Characteristic Function: φX (t).
I This function exists for all distributions and all values of t. It is defined by
    φX (t) = E( e^{itX} )
I Here e^{itX} = cos(tX) + i sin(tX) with i² = −1.
I Technically it is better to do everything in terms of characteristic functions as they always exist:
    |φX (t)| ≤ E( |e^{itX}| ) = 1
I What is φC (t) for the Cauchy distribution? φC (t) = e^{−|t|} . The mgf only exists at the origin.
I We can calculate
    φX (t) = (1/π) ∫_{−∞}^{∞} e^{itx} / (1 + x²) dx = e^{−|t|}
I Cauchy's residue theorem calculates the complex integral.
Properties
Here's a list of special properties
I Theorem: If MX (t) = MY (t) then FX (x) = FY (y).
I mgf of an average: MX̄ (t) = MX (t/n)^n
I Example: Normal and Cauchy distributions.
I M_{aX+b} (t) = e^{bt} MX (at)
I Probability generating function pX (z) = E(z^X ) with z = e^θ .
mgf for normals
I The mgf of X ∼ N (µ, σ²) is given by
    mX (θ) = exp( µθ + σ²θ²/2 )
I Hence the sum of two independent normals is normal, as the mgf is
    exp( µX θ + σ²θ²/2 ) exp( µY θ + τ²θ²/2 )
which gives X + Y ∼ N (µX + µY , σ² + τ²).
Sums of mgfs
I If X1 , . . . , Xn are independent random variables with mgfs mXi (θ), then
    m_{X1 +...+Xn} (θ) = ∏_{i=1}^n mXi (θ)
I This follows from the identity
    E( e^{θ(X1 +...+Xn )} ) = ∏_{i=1}^n E( e^{θXi} )
I If they are also identically distributed,
    m_{X1 +...+Xn} (θ) = mX (θ)^n
mgf = Laplace transform; characteristic function = Fourier transform.
Properties of mgfs
Theorem: The mgf mX (θ) = E( e^{θX} ) uniquely determines the distribution of X provided it is defined on some open interval of θ values. Moreover
    m^{(r)} (0) = E(X^r )
I The key identity is
    (d/dθ) mX (θ) = E( (d/dθ) e^{θX} ) = E( X e^{θX} )
Distribution of X̄
I Let (X1 , . . . , Xn ) be an iid sample from any distribution. Then
    mX̄ (θ) = mX (θ/n)^n
I Example: Suppose the Xi 's are N (µ, σ²). Then
    mX̄ (θ) = exp( µθ + (σ²/n) θ²/2 )
The distribution of X̄ is normal.
I Example: Suppose the Xi 's are Γ(α, β). Then X̄ ∼ Γ(nα, β/n).
Convergence Concepts
I Convergence in Probability: A sequence of random variables X1 , X2 , . . . converges in probability to a random variable X if, for every ε > 0,
    lim_{n→∞} P(|Xn − X| ≥ ε) = 0
or equivalently lim_{n→∞} P(|Xn − X| ≤ ε) = 1.
I Weak Law of Large Numbers: Let X1 , X2 , . . . be a random sample (iid) with E(Xi ) = µ and Var(Xi ) = σ² < ∞. Then, for every ε > 0, X̄n = (1/n) ∑ Xi satisfies
    lim_{n→∞} P(|X̄n − µ| ≥ ε) = 0
Almost Sure Convergence
I Almost surely: A sequence of random variables X1 , X2 , . . . converges almost surely to a random variable X if, for every ε > 0,
    P( lim_{n→∞} |Xn − X| ≥ ε ) = 0
I Strong Law of Large Numbers: Let X1 , X2 , . . . be a random sample (iid) with E(Xi ) = µ and Var(Xi ) = σ² < ∞. Then, for every ε > 0,
    P( lim_{n→∞} |X̄n − µ| ≥ ε ) = 0
where X̄n = (1/n) ∑ Xi . Hence X̄n →a.s. µ.
We don't actually need finite variance (see the martingale convergence proof).
Central Limit Theorem
I Let X1 , X2 , . . . be a random sample (iid) with E(Xi ) = µ and Var(Xi ) = σ² < ∞. Suppose that the mgf exists in a neighborhood of zero, i.e. mX (θ) exists for some θ > 0.
Theorem: The sample mean converges in distribution to a normal random variable:
    lim_{n→∞} P( (X1 + . . . + Xn − nµ)/(σ√n) ≤ x ) = (1/√(2π)) ∫_{−∞}^x e^{−y²/2} dy = Φ(x)
which is the distribution function of a standard N (0, 1).
I Equivalently, √n (X̄n − µ)/σ →D Z.
I One can expand the mgf m(θ) and consider the limit as n → ∞.
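A quick simulation makes the theorem concrete. A minimal Python sketch using Exp(1) draws (so µ = σ = 1; the choices of n and the number of replications are ours): the standardized sums should look approximately N(0, 1).

```python
import random
import statistics

# CLT sketch: standardized sums of iid Exp(1) draws (mu = sigma = 1)
# should be approximately N(0,1) for moderate n.
rng = random.Random(3)
n, reps = 200, 20_000
zs = []
for _ in range(reps):
    s = sum(rng.expovariate(1.0) for _ in range(n))
    zs.append((s - n) / n ** 0.5)   # (S_n - n*mu)/(sigma*sqrt(n))

# P(Z <= 0) should be near Phi(0) = 0.5, and the spread near 1
frac_neg = sum(z <= 0 for z in zs) / reps
print(round(frac_neg, 2), round(statistics.pstdev(zs), 2))
```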
Sketch Proof
Key fact: Taylor series expansion of the mgf. Let Y = (X − µ)/σ.
I M_{√n(X̄n −µ)/σ} (t) = MY ( t/√n )^n
I MY ( t/√n ) = 1 + (1/2)( t/√n )² + RY ( t/√n )
I Therefore
    lim_{n→∞} MY ( t/√n )^n = exp( t²/2 )
Order Statistics
I The order statistics of a random sample X1 , . . . , Xn are the sample values in ascending order, denoted by X(1) , . . . , X(n) , or equivalently
    X(1) = min_{1≤i≤n} Xi = Xmin and X(n) = max_{1≤i≤n} Xi = Xmax
I The median Xmed is also widely studied: it is a consistent estimator for the location of a distribution for a wide family and is less sensitive to extreme observations (breakdown point).
Proof
I Consider the cases X(1) and X(n) .
I For the maximum,
    F_{X(n)} (x) = P( max_{1≤i≤n} Xi ≤ x ) = (FX (x))^n
I Differentiating with respect to x gives
    f_{X(n)} (x) = n fX (x) (FX (x))^{n−1}
I For the minimum, let's compute
    1 − F_{X(1)} (x) = P(Xmin ≥ x) = (1 − FX (x))^n
Differentiating with respect to x gives
    f_{X(1)} (x) = n fX (x) (1 − FX (x))^{n−1}
General Case
I Theorem: Let X(1) , . . . , X(n) denote the order statistics from a population with cdf FX (x) and pdf fX (x). Then the pdf of X(j) is
    f_{X(j)} (x) = n! / ( (j − 1)! (n − j)! ) fX (x) (FX (x))^{j−1} (1 − FX (x))^{n−j}
I For the general case we get
    F_{X(j)} (x) = ∑_{k=j}^n (n choose k) (FX (x))^k (1 − FX (x))^{n−k}
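The general cdf formula is easy to check exactly: for U(0, 1) samples it must reduce to x^n for the maximum and 1 − (1 − x)^n for the minimum. A minimal Python sketch (function names are ours):

```python
import math

# Order-statistic cdf: F_{X_(j)}(x) = sum_{k=j}^n C(n,k) F(x)^k (1-F(x))^(n-k).
# For U(0,1) it must reduce to x**n (maximum) and 1-(1-x)**n (minimum).
def order_stat_cdf(x, n, j, F):
    return sum(math.comb(n, k) * F(x) ** k * (1 - F(x)) ** (n - k)
               for k in range(j, n + 1))

F = lambda x: x                    # U(0,1) cdf on [0,1]
n = 5
for x in (0.2, 0.5, 0.9):
    assert abs(order_stat_cdf(x, n, n, F) - x ** n) < 1e-12              # max
    assert abs(order_stat_cdf(x, n, 1, F) - (1 - (1 - x) ** n)) < 1e-12  # min

print("order-statistic cdf formula checks out for n = 5")
```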
Exchangeability: Bayes Billiard Balls
I You roll a billiard ball on a table and record its distance θ ∈ [0, 1].
I You roll n more balls; what's the distribution of the number of balls Y = y1 + . . . + yn to the left of θ?
    P(Y = k | θ) = (n choose k) θ^k (1 − θ)^{n−k} where θ ∼ p(θ)
    P(Y = k) = ∫_0^1 (n choose k) θ^k (1 − θ)^{n−k} p(θ) dθ
If p(θ) ∼ U (0, 1) then Y is uniform: each value has probability 1/(n + 1).
Independence versus Exchangeability
I de Finetti's theorem: a sequence of 0-1 exchangeable random variables is a mixture of iid Bernoulli sequences for some prior distribution f (p):
    p( y1 + . . . + yn = k ) = ∫_0^1 p^{∑_{i} yi} (1 − p)^{n−∑_{i} yi} f (p) dp
I Independence yields
    p(y1 = 0, y2 = 0, . . . , yn = 0) = p^n
I Independence underestimates the odds of a tail event (Black Swan).
Moments Matter
I Prior moments matter. Consider the Binomial-Beta modeling of a sequence of successes and failures. The marginal probability of a failure is the prior mean:
    p(y1 = 0) = E(p)
I If I observe a failure, the conditional probability of the next failure is a ratio of moments:
    p(y2 = 0 | y1 = 0) = E(p²)/E(p)
I Under independence this would just be the prior mean, E(p).
Probability: 41901 Week 5: Modern Regression Methods Nick Polson http://faculty.chicagobooth.edu/nicholas.polson/teaching/41901/
Topics
I Keynes vs Buffett: Regression
I Superbowl Spread Data
I Logistic Regression
I Least Squares
I Ridge Regression
I Lasso Regression
Keynes versus Buffett
CAPM regressions:
    keynes = 15.08 + 1.83 market
    buffett = 18.06 + 0.486 market
Keynes and Market returns, King's College Cambridge

    Year   Keynes   Market
    1928    -3.4      7.9
    1929     0.8      6.6
    1930   -32.4    -20.3
    1931   -24.6    -25.0
    1932    44.8     -5.8
    1933    35.1     21.5
    1934    33.1     -0.7
    1935    44.3      5.3
    1936    56.0     10.2
    1937     8.5     -0.5
    1938   -40.1    -16.1
    1939    12.9     -7.2
    1940   -15.6    -12.9
    1941    33.5     12.5
    1942    -0.9      0.8
    1943    53.9     15.6
    1944    14.5      5.4
    1945    14.6      0.8

Figure: Keynes vs Cash over the period.
Example: Super Bowl Spread Data
Table: favorite, underdog, point spread, outcome (favorite's margin of victory) and upset indicator for Super Bowls I-XLVI.
Superbowl Spread Data
Let's build a regression model
    model = lm(Outcome ~ Spread)

    Residuals:
         Min       1Q   Median       3Q      Max
     -35.334   -6.503    0.788    8.475   33.761

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   0.5444     4.3790   0.124   0.9016
    Spread        0.9299     0.5083   1.829   0.0741 .

    Residual standard error: 15.74 on 44 degrees of freedom
    Multiple R-squared: 0.07068, Adjusted R-squared: 0.04956
    F-statistic: 3.347 on 1 and 44 DF, p-value: 0.07413

    SuperBowl = 0.544 + 0.9299 Spread
How Close is the Spread?
Plot the empirical versus theoretical:
    plot(Outcome, Spread, pch=20, col=4)   # Plot data
    abline(super, col=2)                   # Add the least squares line
    abline(a=0, b=1, col=3)                # Add the line outcome = spread
I The two lines are nearly on top of each other!
I There's still a lot of uncertainty left over. This is measured by the residual standard error: s = 15.7
Figure: Regression model for the Superbowl data, Spread versus Outcome.
Data
Let's look at favourites and underdogs.

Figure: Spread versus Outcome, labelled by favourite (left panel) and underdog (right panel).
Residuals
I We should check our modelling assumptions.
I The studentized residuals are: rstudent(model)
Figure: Histogram of the studentized residuals and Normal Q-Q plot (rstudent(super)).
Logistic Regression
Suppose that we have a response variable of ones and zeroes, Y = 1, 0, and X is our usual set of predictors/covariates. The logistic regression model is linear in the log-odds:
    log( P(win)/(1 − P(win)) ) = x⊤β
Data: NBA point spreads. Logistic regression: does the point spread predict whether the favourite wins or not? The β will be highly statistically significant.
Data
Let's look at the histogram of the data first, by group.

Figure: Histograms of spread, for favwin = 1 versus favwin = 0.
R: Logit
Logistic Regression
    nbareg = glm(favwin ~ spread - 1, family=binomial)
    summary(nbareg)

    Call: glm(formula = favwin ~ spread - 1, family = binomial)

    Deviance Residuals:
        Min      1Q  Median      3Q     Max
    -2.5741  0.1587  0.4620  0.8135  1.1119

    Coefficients:
           Estimate Std. Error z value
    spread  0.15600    0.01377   11.33

Least Squares
I The least squares estimator is
    β̂ = (X⊤X)−1 X⊤y
This can be numerically unstable when X⊤X is ill-conditioned, which happens when p is large.
I Ridge Regression
    β̂ridge = (X⊤X + λI )−1 X⊤y
over a regularisation path of λ's.
Least Absolute: L1-norm
Least Absolute Shrinkage and Selection Operator
I (Lasso) Show that the solution to the lasso objective
    arg min_β (1/2)(y − β)² + λ|β|
is the soft-thresholding operator defined by
    β̂ = soft(y; λ) = sgn(y)(|y| − λ)₊
Here sgn is the sign function and (x)₊ = max(x, 0).
I Hint: define a slack variable z = |β| and solve the joint constrained optimisation problem, which is differentiable.
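The soft-thresholding operator is a one-liner in code. A minimal Python sketch (the function name is ours): values inside [−λ, λ] are snapped to zero, everything else moves λ toward the origin.

```python
# Soft-thresholding: the closed-form lasso solution of the scalar problem
# min_b (1/2)*(y - b)**2 + lam*|b|, i.e. b = sign(y) * max(|y| - lam, 0).
def soft(y, lam):
    if y > lam:
        return y - lam
    if y < -lam:
        return y + lam
    return 0.0

print(soft(3.0, 1.0), soft(-3.0, 1.0), soft(0.5, 1.0))  # 2.0 -2.0 0.0
```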
Linear Regression Shortcomings
Setup: yi = xi⊤β + ei where β = (β1 , . . . , βp ) for 1 ≤ i ≤ n.
Equivalently: y = Xβ + e and min_β ||y − Xβ||².
I Predictive Ability: The MLE (maximum likelihood) or OLS (ordinary least squares) estimator is designed to have zero bias. That means it can suffer from high variance! There's a bias/variance tradeoff that we have to manage.
I Interpretability
I Cross-Validation: Fit the model on training data. Use the model to predict Ŷ-values for the holdout sample. Calculate the predicted MSE
    (1/N) ∑_{j=1}^N (Yj − Ŷj )²
Smallest MSE wins.
Ridge Regression
Similar to least squares, but shrinks the β̂'s towards zero. Formally, we have the minimisation problem
    β̂ridge = arg min_β ||y − Xβ||² + λ ∑_{j=1}^p |βj |²
I Equivalently called Tikhonov regularisation: ∑_{j=1}^p |βj |² = ||β||².
I Nothing is for free: we have to pick λ.
I Regularisation path: sensitivity of the β̂'s to λ.
I Out-of-sample cross-validation for λ.
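In the single-predictor case the ridge estimator has a transparent closed form, which makes the shrinkage visible. A minimal Python sketch (data and function name are our own illustration):

```python
# One-predictor ridge estimator in closed form:
# beta_hat(lam) = sum(x_i*y_i) / (sum(x_i**2) + lam).
# lam = 0 recovers least squares; larger lam shrinks toward zero.
def ridge_1d(x, y, lam):
    return sum(xi * yi for xi, yi in zip(x, y)) / (sum(xi * xi for xi in x) + lam)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]               # roughly y = 2x
print(round(ridge_1d(x, y, 0.0), 2))   # OLS slope, near 2
print(round(ridge_1d(x, y, 30.0), 2))  # shrunk toward 0
```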
Example
Simulated Data: n = 50, p = 30 and σ² = 1. True model is linear with 10 large coefficients between 0.5 and 1.
I Linear Regression: bias squared = 0.006 and variance = 0.627.
    Prediction error = 1 + 0.006 + 0.627 = 1.633
I We'll do better by shrinking the coefficients to reduce the variance.
I How big a gain will we get with Ridge/Bayes regression?
Figure: Histogram of the true coefficients. Shrinkage will help.
Figure: Gain in prediction error, linear regression versus ridge regression, as the amount of shrinkage varies.

Ridge Regression
At best: bias squared = 0.077 and variance = 0.403.
    Prediction error = 1 + 0.077 + 0.403 = 1.48
Bias-Variance Tradeoff
Simulated data: n = 50, p = 30.

Figure: Linear MSE, Ridge MSE, Ridge Bias² and Ridge Variance as functions of λ.
Lasso Regression
A shrinkage method that performs automatic selection. Designed to solve the minimisation problem
    β̂lasso = arg min_β ||y − Xβ||² + λ ∑_{j=1}^p |βj |
I An L1 -regularisation penalty.
I Nice feature: it forces some of the β̂'s to zero. Automatic variable selection!
I Computationally fast: convex optimisation.
I Still have to pick λ.
I Regularisation and predictive cross-validation.
Regularisation Path: Coefficient Estimates

Figure: Lasso coefficient paths against log λ for the prostate data (lcavol, svi, lweight, lbph, pgg45, gleason, lcp, age).

What about Ridge?

Figure: Ridge coefficient estimates against log λ for the same predictors.
Prostate Data CV Regularisation Path
Hastie, Friedman and Tibshirani (2013). Elements of Statistical Learning.

Figure: Lasso cross-validated mean-squared error against log λ.
What about Ridge?

Figure: Ridge cross-validated mean-squared error against log λ.
MSE Example
I A very popular statistical criterion is mean squared error.
I MSE is defined by
    MSE = ∑_Y (Y − Ŷ)²
I You make a prediction Ŷ about the variable Y.
I After the outcome Y, you calculate (Y − Ŷ)².
I In data mining, it is popular to use a holdout sample: after you've built your statistical model, you test it out-of-sample in terms of its mean squared error performance.
Example
I The Brier score of a set of forecasts is defined by
    ∑_Y (Y − Ŷ)
I Then you check to see if it's statistically significantly different from zero.
I It's less sensitive to outliers, but not as popular as MSE.
I Your forecasts are calibrated, or unbiased, if E(Ŷ) = Y.
Netflix Prize
I A massive dataset of Netflix customers' preferences: people's ratings by movies, 180 million data-points.
I Movies rated on a scale of 1-5.
I The challenge was a 10% improvement on the current Netflix prediction algorithm.
I Build a model to make predictions to recommend movies you haven't seen yet.
MSE Goal
I $1 million prize if you could improve the Netflix rating system by 10%.
I After two years, it was finally won in Oct 2009.
I The MSE goal was a level of 0.854.
I After two years of improving their MSE and being in the lead for most of 2009, a noted Bayesian statistician, Chris Volinsky, and his colleagues ended up losing with 4 minutes to go!
Statistical Effects
I Downside momentum: once people start panning movies, they give lower ratings even to ones they like.
I Time trends: Memento (short term memory loss) got a 3.4 rating on opening days, 4 after a few weeks.
I Model averaging: 700 different models (regressions, ...) with a weighted averaged prediction seems to work best.
I New prize: data includes demographics, age, gender, movie details.
Probability: 41901 Week 6: Bayesian Statistics Nick Polson http://faculty.chicagobooth.edu/nicholas.polson/teaching/41901/
Introduction to Bayesian Methods
Modern Statistical Learning
I Bayes Rule and Probabilistic Learning
I Computationally challenging: MCMC and Particle Filtering
I Many applications in Finance: asset pricing and corporate finance problems.
Lindley, D.V. Making Decisions Bernardo, J. and A.F.M. Smith Bayesian Theory
Recent Bayesian Books
Bayesian Theory and Applications I Hierarchical Models and MCMC I Bayesian Nonparametrics Machine Learning I Dynamic State Space Models . . .
Popular Books
McGrayne (2012): The Theory that would not Die I History of Bayes-Laplace I Code breaking I Bayes search: Air France . . .
Nate Silver: NYT
Silver (2012): The Signal and The Noise I Presidential Elections I Bayes dominant methodology I Predicting College Basketball/Oscars . . .
Things to Know Explosion of Models and Algorithms starting in 1950s I Bayesian Regularisation and Sparsity I Hierarchical Models and Shrinkage I Hidden Markov Models I Nonlinear Non-Gaussian State Space Models
Algorithms I Monte Carlo Method (von Neumann and Ulam, 1940s) I Metropolis-Hastings (Metropolis, 1950s) I Gibbs Sampling (Geman and Geman, Gelfand and Smith, 1980s) I Sequential Particle Filtering
Probabilistic Reasoning
Bayesians only make probability statements.
I Bayesian Probability (Ramsey, 1926; de Finetti, 1931)
  1. Beta-Binomial Learning: Black Swans
  2. Elections: Nate Silver
  3. Baseball: Kenny Lofton and Derek Jeter
I Monte Carlo (von Neumann and Ulam, Metropolis, 1940s)
I Shrinkage Estimation (Lindley and Smith, Efron and Morris, 1970s)
Bayes Learning: Beta-Binomial
Recursive Bayesian Learning
I Likelihood for T Bernoulli trials:
    p(y|θ) = ∏_{t=1}^T p(yt |θ) = θ^{∑_{t=1}^T yt} (1 − θ)^{T − ∑_{t=1}^T yt} .
I Initial distribution θ ∼ B(a, A). The Beta pdf is given by
    p(θ|a, A) = θ^{a−1} (1 − θ)^{A−1} / B(a, A) ,
I Posterior is Beta:
    p(θ|y) ∼ B(aT , AT ) with aT = a + ∑_{t=1}^T yt , AT = A + T − ∑_{t=1}^T yt
The posterior mean and variance are
    E[θ|y] = aT /(aT + AT ) and var[θ|y] = aT AT / ( (aT + AT )² (aT + AT + 1) ) ,
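The conjugate update is just bookkeeping on the two Beta parameters. A minimal Python sketch of the formulas above (the function name is ours; 7 successes in 10 trials under a uniform B(1, 1) prior is an illustrative example):

```python
# Conjugate Beta-Binomial update: a Beta(a, A) prior on theta plus
# s successes in T Bernoulli trials gives a Beta(a + s, A + T - s)
# posterior, with the stated posterior mean and variance.
def beta_binomial_update(a, A, successes, T):
    a_post = a + successes
    A_post = A + T - successes
    mean = a_post / (a_post + A_post)
    var = a_post * A_post / ((a_post + A_post) ** 2 * (a_post + A_post + 1))
    return a_post, A_post, mean, var

a_T, A_T, post_mean, post_var = beta_binomial_update(a=1, A=1, successes=7, T=10)
print(a_T, A_T, round(post_mean, 3))  # 8 4 0.667
```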
Black Swans
Observing white swans and having never seen a black swan (Taleb, 2008, The Black Swan: The Impact of the Highly Improbable). What's the probability of a Black Swan event sometime in the future?
I Suppose that after T trials you have only seen successes, (y1 , . . . , yT ) = (1, . . . , 1). The next trial being a success has
    p(yT+1 = 1 | y1 , . . . , yT ) = (T + 1)/(T + 2)
For large T this is almost certain. Here a = A = 1.
I The probability of never seeing a Black Swan is given by
    p(yT+1 = 1, . . . , yT+n = 1 | y1 , . . . , yT ) = (T + 1)/(T + n + 1) → 0 as n → ∞
I A Black Swan will eventually happen: don't be surprised when it actually happens.
Baseball Hierarchical Model
I Batter-pitcher match-up.
I Prior information on overall ability of a player.
I Small sample size, pitcher variation.
I Let pi denote Jeter's ability. The observed number of hits yi satisfies
    (yi |pi ) ∼ Bin(Ti , pi ) with pi ∼ Be(α, β)
where Ti is the number of at-bats against pitcher i.
I A priori E(pi ) = α/(α + β) = p̄i .
I The extra heterogeneity leads to a prior variance
    Var(pi ) = p̄i (1 − p̄i )φ where φ = (α + β + 1)−1 .
Sports Data: Baseball
Table: Kenny Lofton hitting

    Pitcher       At-bats   Hits   ObsAvg
    J.C. Romero         9      6     .667
    S. Lewis            5      3     .600
    B. Tomko           20     11     .550
    T. Hoffman          6      3     .500
    K. Tapani          45     22     .489
    A. Cook             9      4     .444
    J. Abbott          34     14     .412
    A.J. Burnett       15      6     .400
    K. Rogers          43     17     .395
    A. Harang           6      2     .333
    K. Appier          49     15     .306
    R. Clemens         62     14     .226
    C. Zambrano         9      2     .222
    N. Ryan            10      2     .200
    E. Hanson          41      7     .171
    E. Milton          19      1     .056
    M. Prior            7      0     .000
    Total            7630   2283     .299

Kenny Lofton versus individual pitchers.
Sports Data: Baseball
On Aug 29, 2006, Kenny Lofton (career .299 average, and current .308 average for the 2006 season) was facing the pitcher Milton (current record 1 for 19). He was rested and replaced by a .275 hitter.
I Is putting in a weaker player really a better bet?
I Was this just an over-reaction to bad luck in the Lofton-Milton matchup?
    P(≤ 1 hit in 19 attempts | p = 0.3) = 0.01
an unlikely 1-in-100 event.
I Bayes: Lofton's batting estimates vary from .265 to .340, the lowest being against Milton.
Conclusion: .265 < .275, so resting Lofton against Milton is justified.
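The 1-in-100 number above is a simple binomial tail calculation. A minimal Python sketch (function name is ours):

```python
import math

# The slide's calculation: probability of at most 1 hit in 19 at-bats
# for a true .300 hitter, X ~ Bin(19, 0.3).
def binom_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

p_at_most_1 = binom_pmf(0, 19, 0.3) + binom_pmf(1, 19, 0.3)
print(round(p_at_most_1, 3))  # 0.010, the "unlikely 1-in-100 event"
```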
Bayes Model
Batter-pitcher match-up
I Small sample sizes and pitcher variation.
I Let pi denote Lofton's ability. The observed number of hits yi satisfies
    (yi |pi ) ∼ Bin(Ti , pi ) with pi ∼ Be(α, β)
where Ti is the number of at-bats against pitcher i.
Hierarchical Model
Table: Derek Jeter hierarchical model estimates

    Pitcher        At-bats   Hits   ObsAvg   EstAvg   95% Int
    R. Mendoza           6      5     .833     .322   (.282, .394)
    H. Nomo             20     12     .600     .326   (.289, .407)
    A.J. Burnett         5      3     .600     .320   (.275, .381)
    E. Milton           28     14     .500     .324   (.291, .397)
    D. Cone              8      4     .500     .320   (.218, .381)
    R. Lopez            45     21     .467     .326   (.291, .401)
    K. Escobar          39     16     .410     .322   (.281, .386)
    J. Wettland          5      2     .400     .318   (.275, .375)
    T. Wakefield        81     26     .321     .318   (.279, .364)
    P. Martinez         83     21     .253     .312   (.254, .347)
    K. Benson            8      2     .250     .317   (.264, .368)
    T. Hudson           24      6     .250     .315   (.260, .362)
    J. Smoltz            5      1     .200     .314   (.253, .355)
    F. Garcia           25      5     .200     .314   (.253, .355)
    B. Radke            41      8     .195     .311   (.247, .347)
    D. Kolb              5      0     .000     .316   (.258, .363)
    J. Julio            13      0     .000     .312   (.243, .350)
    Total             6530   2061     .316

Derek Jeter 2006 season versus individual pitchers.
Bayes Estimates
I Stern (2007) estimates φ̂ = 0.002 (Jeter): ability doesn't vary much across the population of pitchers.
I The extremes are shrunk the most, as are the matchups with the smallest sample sizes.
I Jeter had a season .308 average. We see that his Bayes estimates vary from .311 to .327 and that he is very consistent.
I If all players had a similar record, then the assumption of a constant batting average would make sense.
Bayes Elections: Nate Silver
Multinomial-Dirichlet
Predicting the Electoral Vote (EV)
I Multinomial-Dirichlet: (p̂ |p) ∼ Multi(p) and (p|α) ∼ Dir(α), so
    pObama = (p1 , . . . , p51 | p̂ ) ∼ Dir(α + p̂ )
I Flat uninformative prior α ≡ 1.
http://www.electoral-vote.com/evp2012/Pres/prespolls.csv
I Calculate probabilities via simulation (rdirichlet):
    p( pj,O | data ) and p(EV > 270 | data)
I The election vote prediction is given by the sum
    EV = ∑_{j=1}^{51} EV (j) E( pj | data )
where the EV (j) are the electoral votes for the individual states.
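The simulation step can be sketched in a few lines of Python. This is a toy three-state illustration, not the full 51-state dataset: each state's Democratic two-party share is drawn from its Beta posterior (the two-candidate case of the Dirichlet, with a flat prior), electoral votes are awarded on a majority, and the draws estimate the EV distribution. The poll numbers below are hypothetical.

```python
import random

# Toy electoral-vote simulation: per-state Beta posteriors on the
# two-party share, majority rule per draw, Monte Carlo over draws.
rng = random.Random(0)
states = [                 # (dem poll count, rep poll count, electoral votes)
    (52, 45, 9),           # hypothetical state A
    (50, 46, 27),          # hypothetical state B
    (46, 50, 20),          # hypothetical state C, leaning the other way
]

def simulate_ev(n_sims=10_000, prior=1.0):
    evs = []
    for _ in range(n_sims):
        total = 0
        for dem, rep, ev in states:
            share = rng.betavariate(prior + dem, prior + rep)
            if share > 0.5:
                total += ev
        evs.append(total)
    return evs

evs = simulate_ev()
mean_ev = sum(evs) / len(evs)
p_win = sum(e >= 36 for e in evs) / len(evs)  # majority of the toy 56 EVs
print(round(mean_ev, 1), round(p_win, 2))
```

With the real data one would loop over all 51 states and report p(EV > 270 | data), as on the slide.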
Polling Data: electoral-vote.com
I Electoral Vote (EV), polling data: Mitt and Obama percentages

    State            M.pct  O.pct  EV      State            M.pct  O.pct  EV
    1  Alabama          58     36   9      26 Missouri         48     48  11
    2  Alaska           55     37   3      27 Montana          49     46   3
    3  Arizona          50     46  10      28 Nebraska         60     34   5
    4  Arkansas         51     44   6      29 Nevada           43     47   5
    5  California       33     55  55      30 New.Hampshire    42     53   4
    6  Colorado         45     52   9      31 New.Jersey       34     55  15
    7  Connecticut      31     56   7      32 New.Mexico       43     51   5
    8  Delaware         38     56   3      33 New.York         31     62  31
    9  D.C.             13     82   3      34 North.Carolina   49     46  15
    10 Florida          46     50  27      35 North.Dakota     43     45   3
    11 Georgia          52     47  15      36 Ohio             47     45  20
    12 Hawaii           32     63   4      37 Oklahoma         61     34   7
    13 Idaho            68     26   4      38 Oregon           34     48   7
    14 Illinois         35     59  21      39 Pennsylvania     46     52  21
    15 Indiana          48     48  11      40 Rhode.Island     31     45   4
    16 Iowa             37     54   7      41 South.Carolina   59     39   8
    17 Kansas           63     31   6      42 South.Dakota     48     41   3
    18 Kentucky         51     42   8      43 Tennessee        55     39  11
    19 Louisiana        50     43   9      44 Texas            57     38  34
    20 Maine            35     56   4      45 Utah             55     32   5
    21 Maryland         39     54  10      46 Vermont          36     57   3
    22 Massachusetts    34     53  12      47 Virginia         44     47  13
    23 Michigan         37     53  17      48 Washington       39     51  11
    24 Minnesota        42     53  10      49 West.Virginia    53     44   5
    25 Mississippi      46     33   6      50 Wisconsin        42     53  10
                                           51 Wyoming          58     32   3
Figure: Election 2008 Prediction: histogram of simulated electoral votes (sim.EV). Obama 370.
Electoral College Results Probability
[Figure: posterior density of Electoral Votes for Obama, with the 270-EV-to-win threshold and the actual Obama total marked.]
Figure: Election 2012 Prediction. Obama 332.
SAT Scores
SAT (200 − 800): 8 high schools; estimate the coaching effects.

School Estimated yj St. Error σj Average Treatment θj
A 28 15 ?
B 8 10 ?
C -3 16 ?
D 7 11 ?
E -1 9 ?
F 1 11 ?
G 18 10 ?
H 12 18 ?

I θj is the average effect of the coaching program in school j.
I yj is the estimated treatment effect for school j, with standard error σj.
Estimates?
Two programs appear to work (improvements of 18 and 28).
I Large standard errors. Overlapping confidence intervals?
I A classical hypothesis test fails to reject the hypothesis that the θj's are equal.
I The pooled estimate is
  θ̂ = ∑j (yj/σj²) / ∑j (1/σj²) = 7.9
with a standard error of 4.2.
I Neither separate nor pooled estimates seem sensible.
Bayesian shrinkage!
Data and Model
The hierarchical model is given by:
I Likelihood: yj|θj ∼ N(θj, σj²), with σj² assumed known. Unequal variances give differential shrinkage.
I Prior Distribution: θj ∼ N(µ, τ²) for 1 ≤ j ≤ 8. The traditional random effects model: an exchangeable prior for the treatment effects. As τ → 0 we get complete pooling; as τ → ∞, separate estimates.
I Hyper-prior Distribution: p(µ, τ²) ∝ 1/τ.
The posterior p(µ, τ²|y) can be used to "estimate" (µ, τ²).
Posterior
I Joint Posterior Distribution
  p(θ, µ, τ|y) ∝ p(µ, τ)p(θ|µ, τ)p(y|θ)
             ∝ p(µ, τ²) ∏_{j=1}^{8} N(θj|µ, τ²) ∏_{j=1}^{8} N(yj|θj, σj²)
             ∝ τ^{−9} exp( −(1/2) ∑j (θj − µ)²/τ² − (1/2) ∑j (yj − θj)²/σj² )
I MCMC!
Inference
Report posterior quantiles:

School 2.5% 25% 50% 75% 97.5%
A -2 6 10 16 32
B -5 4 8 12 20
C -12 3 7 11 22
D -6 4 8 12 21
E -10 2 6 10 19
F -9 2 6 10 19
G -1 6 10 15 27
H -7 4 8 13 23
µ -2 5 8 11 18
τ 0.3 2.3 5.1 8.8 21
Probability: 41901 Week 7: Hierarchical Models Nick Polson http://faculty.chicagobooth.edu/nicholas.polson/teaching/41901/
Bayesian Shrinkage
Bayesian shrinkage provides a way of modeling complex datasets.
1. Coin Tossing: Lindley’s Paradox 2. Baseball batting averages: Stein’s Paradox 3. Toxoplasmosis 4. SAT scores 5. Clinical Trials
Bayesian Inference Key Idea: Explicit use of probability for summarizing uncertainty. 1. A probability distribution for data given parameters f (y|θ ) Likelihood 2. A probability distribution for unknown parameters p(θ ) Prior 3. Inference for unknowns conditional on observed data Inverse probability (Bayes Theorem); Formal decision making (Loss, Utility)
Posterior Inference
I Use Bayes theorem to derive posterior distributions:
  p(θ|y) = p(y|θ)p(θ)/p(y)  where  p(y) = ∫ p(y|θ)p(θ)dθ
I This allows you to make probability statements, and they can be very different from p-values!
I Hypothesis testing and sequential problems.
I Markov chain Monte Carlo (MCMC) and particle filtering (PF).
Conjugate Priors
I Definition: Let F denote the class of distributions f(y|θ). A class Π of prior distributions is conjugate for F if the posterior distribution is in the class Π for all f ∈ F, π ∈ Π, y ∈ Y.
I Example: Binomial/Beta. Suppose that Y1, . . . , Yn ∼ Ber(p). Let p ∼ Beta(α, β) where (α, β) are known hyper-parameters. The beta family is very flexible and has prior mean E(p) = α/(α + β).
Posterior
I The posterior for p is p(p|Ȳ), where Ȳ is a sufficient statistic.
I Bayes theorem gives
  p(p|y) ∝ f(y|p)p(p|α, β) ∝ p^{∑yi}(1 − p)^{n−∑yi} · p^{α−1}(1 − p)^{β−1} = p^{α+∑yi−1}(1 − p)^{n−∑yi+β−1}
so p|y ∼ Beta(α + ∑yi, β + n − ∑yi).
I The posterior mean can be written as a weighted combination of the sample mean ȳ and the prior mean E(p): a shrinkage estimator.
Example
Example: Poisson/Gamma. Suppose that Y1, . . . , Yn ∼ Poi(λ). Let λ ∼ Gamma(α, β) where (α, β) are known hyper-parameters.
I The posterior is then
  p(λ|y) ∝ exp(−nλ)λ^{∑yi} · λ^{α−1} exp(−βλ)
so λ|y ∼ Gamma(α + ∑yi, n + β).
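Both conjugate updates above reduce to simple parameter arithmetic. A minimal Python sketch, with hypothetical data:

```python
from collections import namedtuple

Beta = namedtuple("Beta", "a b")
Gamma = namedtuple("Gamma", "shape rate")

def beta_update(prior, y):
    """Bernoulli data y: Beta(a, b) -> Beta(a + sum y, b + n - sum y)."""
    return Beta(prior.a + sum(y), prior.b + len(y) - sum(y))

def gamma_update(prior, y):
    """Poisson counts y: Gamma(shape, rate) -> Gamma(shape + sum y, rate + n)."""
    return Gamma(prior.shape + sum(y), prior.rate + len(y))

post_p = beta_update(Beta(1, 1), [1, 0, 1, 1, 0])   # 3 successes in 5 trials
post_lam = gamma_update(Gamma(2, 1), [4, 2, 3])     # hypothetical counts

post_mean_p = post_p.a / (post_p.a + post_p.b)      # shrinkage estimator
post_mean_lam = post_lam.shape / post_lam.rate
```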
Normal-Normal Posterior Distribution
Using Bayes rule, p(µ|y) ∝ p(y|µ)p(µ).
I The posterior is given by
  p(µ|y) ∝ exp( −(1/2σ²) ∑_{i=1}^{n}(yi − µ)² − (1/2τ²)(µ − µ0)² )
I Hence µ|y ∼ N(µ̂B, Vµ) where
  µ̂B = (n/σ²)/(n/σ² + 1/τ²) ȳ + (1/τ²)/(n/σ² + 1/τ²) µ0  and  Vµ^{−1} = n/σ² + 1/τ²
A shrinkage estimator.
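The posterior mean formula is just a precision-weighted average, which can be sketched directly (the data values below are hypothetical):

```python
def normal_posterior(ybar, n, sigma2, mu0, tau2):
    """mu | y ~ N(mu_B, V) with V^{-1} = n/sigma^2 + 1/tau^2."""
    prec = n / sigma2 + 1.0 / tau2
    w = (n / sigma2) / prec              # weight on the sample mean
    mu_B = w * ybar + (1 - w) * mu0      # shrinkage toward the prior mean mu0
    return mu_B, 1.0 / prec

# hypothetical data: ybar = 2 from n = 10 observations with sigma^2 = 4
mu_B, V = normal_posterior(ybar=2.0, n=10, sigma2=4.0, mu0=0.0, tau2=1.0)
```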
Lindley’s Paradox
Often evidence which, for a Bayesian statistician, strikingly supports the null leads to rejection by standard classical procedures.
I Do Bayes and classical always agree? Bayes computes the probability of the null being true given the data, p(H0|D). That's not the p-value. Why not?
I Surely they agree asymptotically?
I How do we model the prior and compute likelihood ratios L(H0|D) in the Bayesian framework?
Bayes t-ratio
Edwards, Lindman and Savage (1963): there's a simple approximation for the likelihood ratio,
  L(p0) ≈ √(2n/π) exp(−t²/2)
I The key is that the Bayes test has the factor √n, which asymptotically favours the null.
I There is only a big problem when 2 < t < 4, but this is typically the most interesting case!
Coin Tossing
Let's start with intuition: imagine a coin-tossing experiment where you want to determine whether the coin is "fair", H0: p = 1/2. There are four experiments.

Expt n r L(p0)
1 50 32 0.81
2 100 60 1.09
3 400 220 2.17
4 10,000 5,098 11.68
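The table can be reproduced, up to rounding in the normal approximation, from L(p0) ≈ √(2n/π) exp(−t²/2) with the usual t-ratio for H0: p = 1/2:

```python
import math

def L_p0(n, r):
    t = (r - n / 2) / math.sqrt(n / 4)   # classical t-ratio for H0: p = 1/2
    return math.sqrt(2 * n / math.pi) * math.exp(-t * t / 2)

# the four experiments from the table above
values = {(50, 32): 0.81, (100, 60): 1.09, (400, 220): 2.17, (10_000, 5_098): 11.68}
```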
Coin Tossing
There are lots of implications:
I Classical: In each case the t-ratio is approximately 2, so we reject H0 (a fair coin) at the 5% level in each experiment.
I Bayes: L(p0) grows to infinity, so there is overwhelming evidence for H0. Connelly shows that the Monday effect disappears when you compute the Bayes version.
Shrinkage Estimation
Stein showed that it is possible to make a uniform improvement on the MLE in terms of MSE. Why was this resisted?
I Mistrust of the statistical interpretation of Stein's result, in particular the loss function.
I Difficulties in adapting the procedure to special cases.
I Long familiarity with the good properties of the MLE.
I Any gains from a "complicated" procedure could not be worth the extra trouble (Tukey: savings of not more than 10% in practice).
Baseball Batting Averages
Data: 18 major-league players after 45 at bats (1970 season)

Player ȳi E(pi|D) season average
Clemente 0.400 0.290 0.346
Robinson 0.378 0.286 0.298
Howard 0.356 0.281 0.276
Johnstone 0.333 0.277 0.222
Berry 0.311 0.273 0.273
Spencer 0.311 0.273 0.270
Kessinger 0.311 0.268 0.263
Alvarado 0.267 0.264 0.210
Santo 0.244 0.259 0.269
Swoboda 0.244 0.259 0.230
Unser 0.222 0.254 0.264
Williams 0.222 0.254 0.256
Scott 0.222 0.254 0.303
Petrocelli 0.222 0.254 0.264
Rodriguez 0.222 0.254 0.226
Campaneris 0.200 0.259 0.285
Munson 0.178 0.244 0.316
Alvis 0.156 0.239 0.200
Baseball Data
First Shrinkage Estimator: Efron and Morris
[Figure: predicted average from the first 45 at bats plotted against the average of the remainder of the season, comparing the MLE with the James-Stein (J−S) estimator.]
Figure: Baseball Shrinkage
Shrinkage
Let θi denote the end-of-season average.
I Lindley: shrink to the overall grand mean ȳ,
  θ̂i = c ȳi + (1 − c) ȳ  where  c = 1 − (k − 3)σ² / ∑(ȳi − ȳ)²
I Baseball data: c = 0.212 and ȳ = 0.265. Compute ∑(θ̂i − ȳi^obs)² and see which is lower:
  MLE = 0.077  STEIN = 0.022
That's a factor of 3.5 times better!
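The comparison can be checked directly from the table of 18 players above. A short sketch:

```python
# first-45-at-bats averages and end-of-season averages from the table above
y45 = [0.400, 0.378, 0.356, 0.333, 0.311, 0.311, 0.311, 0.267, 0.244, 0.244,
       0.222, 0.222, 0.222, 0.222, 0.222, 0.200, 0.178, 0.156]
season = [0.346, 0.298, 0.276, 0.222, 0.273, 0.270, 0.263, 0.210, 0.269, 0.230,
          0.264, 0.256, 0.303, 0.264, 0.226, 0.285, 0.316, 0.200]

c, ybar = 0.212, 0.265
js = [c * y + (1 - c) * ybar for y in y45]          # shrink toward grand mean

sse_mle = sum((y - s) ** 2 for y, s in zip(y45, season))   # about 0.077
sse_js = sum((z - s) ** 2 for z, s in zip(js, season))     # about 0.022
```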
Batting Averages
Clemente's batting averages over the 1970 season: .400 after 45 at bats; .346 for the remainder; .352 overall.
[Figure: Clemente's running batting average against the number of at bats (square root scale, 5 to 400), falling from .400 after 45 at bats toward .346 over the remainder of the season and .352 overall.]
Further Paradoxes
Shrinkage on Clemente is too severe:
  zCl = 0.265 + 0.212(0.400 − 0.265) = 0.294.
The 0.212 seems a little severe.
I Limited translation rules: cap the maximum shrinkage, e.g. at 80%.
I Not enough shrinkage, e.g. O'Connor (y = 1, n = 2):
  zO'C = 0.265 + 0.212(1 − 0.265) = 0.421.
Still better than Ted Williams' 0.406 in 1941.
I Foreign car sales (k = 19) will further improve MSE performance! It will change the shrinkage factors.
I Clearly an improvement over the Stein estimator is the positive-part version
  θ̂^{S+} = max( 1 − (k − 2)/∑ Ȳi², 0 ) Ȳi
Prior Information
Extra information
I Empirical distribution of all major-league players:
  θi ∼ N(0.270, 0.015)
The 0.270 provides another origin to shrink to, and the prior variance 0.015 would give a different shrinkage factor.
I To fully understand, we should build a probabilistic model and use the posterior mean as our estimator of the unknown parameters.
Unequal Variances
Model: Yi|θi ∼ N(θi, Di) where θi ∼ N(θ0, A) = N(0.270, 0.015).
I The Di can be different: unequal variances.
I Bayes posterior means are given by
  E(θi|Y) = (1 − Bi)Yi  where  Bi = Di/(Di + A)
and Â is estimated from the data; see Efron and Morris (1975).
I Different shrinkage factors for different variances Di. Since Di ∝ ni^{−1}, smaller sample sizes are shrunk more. Makes sense.
Toxoplasmosis Data
A blood disease that is endemic in tropical regions. Data: 5000 people in El Salvador, drawn in varying sample sizes from 36 cities.
I Estimate the "true" prevalences θi, 1 ≤ i ≤ 36.
I Allocation of resources: should we spend funds on the city with the highest observed occurrence of the disease?
I Same shrinkage factors?
I Shrinkage procedure (Efron and Morris, p. 315): zi = ci yi, where yi are the observed relative rates (normalized so ȳ = 0). The smaller sample sizes get shrunk more. The most gentle shrinkage factors are in the range 0.6 → 0.9, but some are 0.1 → 0.3.
Clinical Trials
Novick and Grizzle: Bayesian analysis of clinical trials. Four treatments for duodenal ulcers. Doctors assess the state of the patient; data are collected sequentially. Armitage is the classic text for computing the sequential rejection region for the classical tests (α-spending function; one can only look at prespecified times).

Treat Excellent Fair Death
A 76 17 7
B 89 10 1
C 86 13 1
D 88 9 3

Conclusion: cannot reject at the 5% level. Use a conjugate binomial/beta prior model and do some sensitivity analysis.
Binomial-Beta
Let pi be the death-rate proportion under treatment i.
I To compare treatment A to B, directly compute P(p1 > p2|D).
I Prior Beta(α, β) with prior mean E(p) = α/(α + β); posterior Beta(α + ∑xi, β + n − ∑xi).
I Example: For A, Beta(1, 1) → Beta(8, 94). For B, Beta(1, 1) → Beta(2, 100).
I Inference: P(p1 > p2|D) ≈ 0.98.
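The comparison probability has no simple closed form, but it is easy to estimate by simulation from the two posteriors. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
pA = rng.beta(8, 94, size=n)    # posterior death rate, treatment A
pB = rng.beta(2, 100, size=n)   # posterior death rate, treatment B
prob = (pA > pB).mean()         # Monte Carlo estimate of P(p1 > p2 | D)
```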
Sensitivity Analysis
It's important to do a sensitivity analysis.

Treat Excellent Fair Death
A 76 17 7
B 89 10 1
C 86 13 1
D 88 9 3

I Model: Poisson/Gamma. Prior Γ(m, z), with λi the expected death rate.
I Compute P(λ1/λ2 > c|D) under different priors:

Prior (m, z) (0, 0) (100, 2) (200, 5)
P(λ1/λ2 > 1.3|D) 0.95 0.88 0.79
P(λ1/λ2 > 1.6|D) 0.91 0.80 0.64
Multinomial Probit
I Multinomial Probit: you observe a choice variable Y ∈ {0, 1, . . . , p − 1}, where consumers' choices are formed in a random-utility setting
  W = Xβ + e,  Y(W) = i if max W = Wi
I Distribution of consumers' tastes β ∼ p(β). Applications: marketing, economics, political science. Gelman et al (2006) and Rossi et al (2007).
Stochastic Volatility
I Stochastic Volatility: you observe a sequence of returns rt with expected return µt and volatility √Vt.
I Hierarchical model defined by the sequence of conditionals
  rt = µt + √Vt εt
  µt = regression
  log Vt = α + β log Vt−1 + σv εtv
I Here α/(1 − β) describes the long-run average (log-)volatility and σv is the volatility of volatility.
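The hierarchy above simulates forward one conditional at a time. A minimal sketch with illustrative parameter values (α, β, σv, µ are assumptions, not estimates from the notes):

```python
import numpy as np

rng = np.random.default_rng(2)
T, alpha, beta, sigma_v, mu = 1000, -0.5, 0.95, 0.2, 0.0

log_v = np.empty(T)
r = np.empty(T)
log_v[0] = alpha / (1 - beta)          # start at the long-run mean
r[0] = mu + np.exp(log_v[0] / 2) * rng.standard_normal()
for t in range(1, T):
    # AR(1) log-volatility: log V_t = alpha + beta log V_{t-1} + sigma_v eps
    log_v[t] = alpha + beta * log_v[t - 1] + sigma_v * rng.standard_normal()
    # return: r_t = mu_t + sqrt(V_t) eps
    r[t] = mu + np.exp(log_v[t] / 2) * rng.standard_normal()
```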
Bayes Portfolio Selection
I de Finetti and Markowitz: mean-variance portfolio with shrinkage constraints,
  min_x (1/2) xᵀΣx − xᵀµ − λ c(x)
I Bayes predictive moments:
  min_x (1/2) xᵀ Var(Rt+1|Dt) x  subject to  xᵀι = 1 and xᵀE(Rt+1|Dt) = µP
I PT (1999): hierarchical model with
  E(Rt+1|Dt) = (t R̄ + λ µ0)/(t + λ)
Results
I Different shrinkage factors for different history lengths.
I Portfolio allocation in the SP500 index.
I Entry/exit; splits; spin-offs etc. For example, there were 73 replacements to the SP500 index in the period 1/1/94 to 12/31/96.
I Advantage? E(α|Dt) = 0.39, that is 39 bps per month, which on an annual basis is 468 bps. The posterior mean for β is E(β|Dt) = 0.745.
I x̄M = 12.25% and x̄PT = 14.05%.
SP Composition (index weights, %; ***** indicates not in the index at that date)

Company Symbol 6/96 12/89 12/79 12/69
General Electric GE 2.800 2.485 1.640 1.569
Coca Cola KO 2.342 1.126 0.606 1.051
Exxon XON 2.142 2.672 3.439 2.957
ATT T 2.030 2.090 5.197 5.948
Philip Morris MO 1.678 1.649 0.637 *****
Royal Dutch RD 1.636 1.774 1.191 *****
Merck MRK 1.615 1.308 0.773 0.906
Microsoft MSFT 1.436 ***** ***** *****
Johnson/Johnson JNJ 1.320 0.845 0.689 *****
Intel INTC 1.262 ***** ***** *****
Procter and Gamble PG 1.228 1.040 0.871 0.993
Walmart WMT 1.208 1.084 ***** *****
IBM IBM 1.181 2.327 5.341 9.231
Hewlett Packard HWP 1.105 0.477 0.497 *****
Pepsi PEP 1.061 0.719 ***** *****
Pfizer PFE 0.918 0.491 0.408 0.486
Dupont DD 0.910 1.229 0.837 1.101
AIG AIG 0.910 0.723 ***** *****
Mobil MOB 0.906 1.093 1.659 1.040
Bristol Myers Squibb BMY 0.878 1.247 ***** 0.484
GTE GTE 0.849 0.975 0.593 0.705
General Motors GM 0.848 1.086 2.079 4.399
Disney DIS 0.839 0.644 ***** *****
Citicorp CCI 0.831 0.400 0.418 *****
BellSouth BLS 0.822 1.190 ***** *****
Motorola MOT 0.804 ***** ***** *****
SP Composition (continued)

Company Symbol 6/96 12/89 12/79 12/69
Ford F 0.798 0.883 0.485 0.640
Chevron CHV 0.794 0.990 1.370 0.966
Amoco AN 0.733 1.198 1.673 0.758
Eli Lilly LLY 0.720 0.814 ***** *****
Abbott Labs ABT 0.690 0.654 ***** *****
AmerHome Products AHP 0.686 0.716 0.606 0.793
FedNatlMortgage FNM 0.686 ***** ***** *****
McDonald's MCD 0.686 0.545 ***** *****
Ameritech AIT 0.639 0.782 ***** *****
Cisco Systems CSCO 0.633 ***** ***** *****
CMB CMB 0.621 ***** ***** *****
SBC SBC 0.612 0.819 ***** *****
Boeing BA 0.598 0.584 0.462 *****
MMM MMM 0.581 0.762 0.838 1.331
BankAmerica BAC 0.560 ***** 0.577 *****
Bell Atlantic BEL 0.556 0.946 ***** *****
Gillette G 0.535 ***** ***** *****
Kodak EK 0.524 0.570 1.106 *****
Chrysler C 0.507 ***** ***** 0.367
Home Depot HD 0.497 ***** ***** *****
Colgate COL 0.489 0.499 ***** *****
Wells Fargo WFC 0.478 ***** ***** *****
Nations Bank NB 0.453 ***** ***** *****
Amer Express AXP 0.450 0.621 ***** *****
Chicago Bears 2014-2015 Season: Hierarchical Model
Bayes learning: update our beliefs in light of new information.
I In the 2014-2015 season the Bears suffered back-to-back 50-point defeats: Patriots-Bears 51 − 23, Packers-Bears 55 − 14.
I Their next game was at home against the Minnesota Vikings. The current line against the Vikings was −3.5 points, slightly over a field goal.
What's the Bayes approach to learning the line?
Hierarchical Model
Hierarchical model: model the current average win/lose margin this year,
  ȳ|θ ∼ N(θ, σ²/n) = N(θ, 18.342²/9)
  θ ∼ N(0, τ²)
Pre-season prior mean µ0 = 0, standard deviation τ = 4. Data: ȳ = −9.22.
I Bayes shrinkage estimator:
  E(θ|ȳ, τ) = τ²/(τ² + σ²/n) · ȳ
The shrinkage factor is then 0.3.
I Our updated estimate is E(θ|ȳ, τ) = −2.76 > −3.5, where the current line is −3.5.
I Based on our hierarchical model this is an over-reaction. A one-point change in the line is about 3% on a probability scale.
Bayes Bears
The last two defeats had 50+ points scored by the opponent.

> bears = c(-3, 8, 8, -21, -7, 14, -13, -28, -41)
> mean(bears)
[1] -9.222222
> sd(bears)
[1] 18.34242
> tau = 4
> sig2 = sd(bears) * sd(bears) / 9
> tau^2 / (sig2 + tau^2)
[1] 0.2997225
> 0.29997 * -9.22
[1] -2.765723
> pnorm(-2.76 / 18)
[1] 0.4390677
Home advantage is worth 3 points; the Vikings had an average record. Result: Bears 21, Vikings 13.
Probability: 41901 Week 8: Bayesian Asset Allocation Nick Polson http://faculty.chicagobooth.edu/nicholas.polson/teaching/41901/
Overview Long history of Bayes in Finance I de Finetti-Markowitz: Mean-Variance Allocation I Kelly-Breiman-Merton: Maximise Expected Growth Rate
Applications 1. Sir Isaac Newton: Lost £20, 000. “I can calculate the motion of heavenly bodies, but not the madness of people” South Sea Bubble Act (1721) Master of the Mint (1700-1727). Newton invented the Gold standard. 2. David Ricardo (1815) Bought all issuance of UK Gilts the day before the Battle of Waterloo 3. J.M. Keynes (1920-1945) King’s College Fund.
South Sea Bubble: 1720
Also owned the East India Company: £30, 000
Post 1720 Bubble Volatility: persistent, asymmetric (leverage) and fat-tails
Dutch East India Company
London Stock Market: 1723-1794
South Sea Bubble Isaac Newton
South Sea Bubble
[Figure: share prices (0 to 800) of the Bank of England, Royal African Company, Old East India Company and South Sea Company over 1719-1720.]
Figure: Bubble Act 1721
Keynes Investment Returns 1928-1945

Year Keynes Market
1928 -3.4 7.9
1929 0.8 6.6
1930 -32.4 -20.3
1931 -24.6 -25.0
1932 44.8 -5.8
1933 35.1 21.5
1934 33.1 -0.7
1935 44.3 5.3
1936 56.0 10.2
1937 8.5 -0.5
1938 -40.1 -16.1
1939 12.9 -7.2
1940 -15.6 -12.9
1941 33.5 12.5
1942 -0.9 0.8
1943 53.9 15.6
1944 14.5 5.4

[Figure: cumulative wealth of Keynes versus the riskless rate ("Keynes vs Cash") over the 17 periods.]
Keynes Optimal Bayes Rebalancing
[Figure, left: return from the Bayes portfolio against the fraction of Keynes in the portfolio (0 to 2.5), with the optimal allocation ω at the peak. Right: wealth paths of Keynes, the riskless rate and the "universal" portfolio ("Keynes vs Universal vs Cash") over the 17 periods.]
Bayesian Learning ω*
Cover-Stern: Universal Portfolios. Returns are i.i.d. and the allocation solves maxω E(ln(1 + ω′X)).
I Realised wealth is WT(X, ω) = ∏_{t=1}^{T}(1 + ω′Xt), with weight distribution
  FX(ω) = WT(X, ω) / ∫Ω WT(X, ω)dω
I Estimate ω* with
  ω̂ = EF(ω) = ∫Ω ω WT(X, ω)dω / ∫Ω WT(X, ω)dω
I Dynamically rebalance to guarantee the "area-under-parabola" return.
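The wealth-weighted estimate ω̂ can be approximated on a grid. A minimal sketch for two assets with hypothetical ±5% coin-flip returns:

```python
import numpy as np

# hypothetical gross returns: two assets that each move +/-5% per period
rng = np.random.default_rng(3)
X = rng.choice([0.05, -0.05], size=(500, 2))

omegas = np.linspace(0.0, 1.0, 101)          # candidate constant-mix weights
# wealth of each constantly rebalanced mix: prod_t (1 + w x_1t + (1-w) x_2t)
W = np.prod(1 + np.outer(X[:, 0], omegas) + np.outer(X[:, 1], 1 - omegas), axis=0)

omega_hat = np.sum(omegas * W) / np.sum(W)   # wealth-weighted estimate of omega*
```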
Parrondo's Paradoxes
Two losing bets can be combined into a winner.
I Bernoulli market: gross return 1 + f or 1 − f with p = 0.51 and f = 0.05.
I Positive arithmetic expectation, yet the expected log-growth rate is negative:
  p log(1 + f) + (1 − p) log(1 − f) = −0.00025 < 0
I Caveat: growth is governed by the median/entropy.
I Related: Brownian ratchets and the cross-entropy of Markov processes.

Two Losing Bets + Volatility
[Figure, left: return from the Bayes portfolio against the fraction of Parrondo1 in the portfolio. Right: wealth paths of Parrondo1, Parrondo2 and the "universal" portfolio over 5000 periods; ex ante vs ex post realisation.]
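One way to see how two losing bets combine into a winner is volatility pumping: a 50/50 rebalanced mix of two independent such bets has positive log-growth even though each bet alone has negative log-growth. This is an illustrative calculation, not the exact construction behind the figure:

```python
import math

p, f = 0.51, 0.05
q = 1 - p

# a single bet: positive arithmetic expectation but negative log-growth
g_single = p * math.log(1 + f) + q * math.log(1 - f)     # about -0.00025

# 50/50 rebalanced mix of two independent such bets: the offsetting outcomes
# (probability 2pq) return exactly 1 so contribute log(1) = 0, and the
# log-growth turns positive
g_mix = p**2 * math.log(1 + f) + q**2 * math.log(1 - f)
```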
Breiman-Kelly-Merton Rule
Kelly criterion: optimal wager in a binary setting,
  ω* = (p · O − q)/O
Merton's rule in the continuous setting is Kelly:
  ω* = (1/γ)(µ/σ²)
1. µ: (excess) expected return
2. σ: volatility
3. γ: risk aversion
4. ω*: optimal position size
5. p = Prob(Up), q = Prob(Down), O = Odds
Example: Kelly Criterion
S&P500:
I Consider logarithmic utility (CRRA with γ = 1). This is a pure Kelly rule.
I We assume i.i.d. log-normal stock returns with an annualized expected excess return of 5.7% and a volatility of 16%, which is consistent with long-run equity returns. In our continuous-time formulation ω* = 0.057/0.16² = 2.22, so the Kelly criterion implies that the investor borrows 122% of wealth to invest a total of 222% in stocks. This illustrates the risk profile of the Kelly criterion. One also sees that the allocation is highly sensitive to estimation error in µ̂. We consider dynamic learning in a later section and show how the long horizon and learning affect the allocation today.
Fractional Kelly
The fractional Kelly rule leads to a more realistic allocation.
I Suppose that γ = 3. Then the information ratio is
  µ/σ = 0.057/0.16 = 0.357  and  ω* = (1/3)(0.057/0.16²) = 74.2%
An investor with this level of risk aversion has a more reasonable 74.2% allocation.
I This analysis ignores the equilibrium implications. If every investor acted this way, it would drive up prices and drive down the equity premium of 5.7%.
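The two allocations above are one-line calculations:

```python
mu, sigma, gamma = 0.057, 0.16, 3.0

kelly = mu / sigma**2            # gamma = 1: pure Kelly, about 2.23x leverage
fractional = kelly / gamma       # gamma = 3: about a 74% stock allocation
```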
Black-Litterman
Black-Litterman: a Bayesian framework for combining investor views with market equilibrium.
I In a multivariate returns setting the optimal allocation rule is
  ω* = (1/γ) Σ⁻¹ µ
The question is how to specify the pair (µ, Σ).
I For example, given Σ̂, BL derive Bayesian inference for µ from a market-equilibrium model and a priori views on the returns of pre-specified portfolios, which take the form
  (µ̂|µ) ∼ N(µ, τΣ̂)  and  (Q|µ) ∼ N(Pµ, Ω)

Posterior Views
I Combining views, the implied posterior is
  (µ|µ̂, Q) ∼ N(Bb, B)
I The mean and variance are specified by
  B⁻¹ = (τΣ̂)⁻¹ + P′Ω⁻¹P  and  b = (τΣ̂)⁻¹µ̂ + P′Ω⁻¹Q
These posterior moments then define the optimal allocation rule.
Probability: 41901 Week 9: Stochastic Processes: Brownian Motion http://faculty.chicagobooth.edu/nicholas.polson/teaching/41901/
Brownian Motion
A stochastic process {Bt, t ≥ 0} in continuous time taking real values is a Brownian motion (or Wiener process) if:
I B0 = 0;
I for each s ≥ 0 and t > 0, Bs+t − Bs ∼ N(0, t);
I for each t > s > 0, the increment Bt − Bs is independent of Bs − B0 = Bs;
I Bt is continuous in t ≥ 0.
Bt is standard BM; σBt includes a volatility.
Brown (1827) and Einstein (1905).
Brownian Motion with Drift
I Here we have the process
  Xt = µt + σBt,  dXt = µ dt + σ dBt
I We can think of the Euler discretization as
  Xt+∆ − Xt = µ∆ + σ√∆ εt  where εt ∼ N(0, 1).
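The Euler discretization simulates directly. A minimal sketch with illustrative drift and volatility values:

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, dt, n = 0.1, 0.3, 1 / 252, 252   # illustrative parameter choices

# X_{t+dt} - X_t = mu*dt + sigma*sqrt(dt)*eps, eps ~ N(0, 1)
eps = rng.standard_normal(n)
increments = mu * dt + sigma * np.sqrt(dt) * eps
X = np.concatenate([[0.0], np.cumsum(increments)])   # path with X_0 = 0
```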
Geometric Brownian Motion
I Geometric Brownian motion starting at X0 = 1 evolves as
  Xt = exp(µt + σBt)
We have E(Xt) = e^{(µ + σ²/2)t}.
I The equivalent stochastic differential (SDE) form is
  dXt = Xt ( (µ + σ²/2)dt + σdBt )
Brownian Motion for Sports Scores: Implied Volatility
Stern (1994): Brownian Motion and Sports Scores. Polson and Stern (2014): The Implied Volatility of a Sports Game.
I "Team A" rarely loses if they are ahead at halftime.
I Approximately 75% of games are won by the team that leads at halftime: there's a 0.5 probability that the same team wins both halves and a 0.25 probability that the first-half winner defeats the second-half winner. This only applies to equally matched teams.
I For the team that is ahead after 3/4 of the game, the empirical frequency of winning is: basketball 80%, baseball 90%, football 80% and hockey 80%.
Model Checking
µ home point advantage. Football 3 points, basketball 5 − 6 points. I Stern estimates for basketball a 1.5 points difference for the first
three quarters with a standard deviation of 7.5 points. Fourth quarter the difference is only 0.22 points with a standard deviation of 7. The q-q plot look normals. Correlations between periods are small and so the random walk model appears to be reasonable.
Evolution of a Game: Volatility of Outcome, Superbowl Probability
[Figure, left: distribution of the final winning margin with µ = −4, σ = 10.60 and P(X > 0) = 0.353. Right: simulated second-half evolution of the margin X over time.]
Brownian Motion for Sports
Probability of team A winning, p = P(X(1) > 0), given point spread (or drift) µ and standard deviation (or volatility) σ, with final score X(1) ∼ N(µ, σ²).
I Given the normality assumption,
  p = P(X(1) > 0) = Φ(µ/σ)
where Φ is the standard normal cdf.
I Table 1 uses Φ to convert team A's advantage µ to a probability scale using the information ratio µ/σ:

µ/σ 0 0.25 0.5 0.75 1 1.25 1.5 2
p = Φ(µ/σ) 0.5 0.60 0.69 0.77 0.84 0.89 0.93 0.977

Probability of winning p versus the ratio µ/σ. Evenly matched teams have µ/σ = 0 and p = 0.5.
Conditional Probability
After time t has elapsed:
I A current lead of l for team A gives the conditional distribution
  (X(1)|X(t) = l) = (X(1) − X(t)) + l ∼ N(l + µ(1 − t), σ²(1 − t))
I The probability of team A winning at time t given a current lead of l points is
  pt = P(X(1) > 0|X(t) = l) = Φ( (l + µ(1 − t)) / (σ√(1 − t)) )
I With point spread µ = −4 and volatility σ = 10.6, team A has a µ/σ = −4/10.6 = −0.38 volatility-point disadvantage. The probability of winning is Φ(−0.38) = 0.353 < 0.5.
Implied Volatility
The initial point spread is the market's expectation of the outcome. The probability that the underdog team wins is p = Φ(µ/σ) = Φ(−4/10.6) = 35.3%.
I The remaining volatility σ√(1 − t) is a decreasing function of t, illustrating that the volatility dissipates over the course of a game:

t 0 1/4 1/2 3/4 1
σ√(1 − t) 10.6 9.18 7.50 5.3 0

I We can infer the implied volatility, σIV, by solving p = Φ(µ/σIV) for σIV, which gives
  σIV = µ / Φ⁻¹(p)
SuperBowl XLVII: Ravens vs 49ers TradeSports.com
Figure: SuperBowl XLVII
Super Bowl XLVII: Ravens vs San Francisco 49ers Super Bowl XLVII was held at the Superdome in New Orleans on February 3, 2013. We will track X(t) which corresponds to the Raven’s lead over the 49ers at each point in time. Table 3 provides the score at the end of each quarter.
t 0 1/4 1/2 3/4 1
Ravens 0 7 21 28 34
49ers 0 3 6 23 31
X(t) 0 4 15 5 3

SuperBowl XLVII by quarter
Initial Market
The initial point spread was set at the Ravens being a four-point underdog: µ = E(X(1)) = −4. The Ravens upset the 49ers 34 − 31, so X(1) = 34 − 31 = 3, with the point spread being beaten by 7 points.
I To determine the market's assessment of the probability that the Ravens would win at the beginning of the game we use the money-line odds: San Francisco −175 and Baltimore Ravens +155. A bettor would have to place $175 to win $100 on the 49ers, while a bet of $100 on the Ravens would lead to a win of $155. Converting both money-lines to implied win probabilities gives
  pSF = 0.686 and pBal = 100/(100 + 155) = 0.392
Probabilities of Winning
The probabilities do not sum to one: pSF + pBal = 0.686 + 0.392 = 1.078. This "excess" probability is the mechanism by which the oddsmakers derive their compensation; the difference is the "market vig", also known as the bookmaker's edge, here 7.8%.
I Put differently, if bettors place money proportionally across both teams then the bookies make 7.8% of the total staked. We use the mid-point of the spread to determine p, implying that
  p = (1/2) pBal + (1/2)(1 − pSF) = 0.353
From the Ravens' perspective, we have p = P(X(1) > 0) = 0.353. Baltimore's win probability started trading at p0^mkt = 0.38.
Half Time Analysis
The Ravens took a commanding 21 − 6 lead at half time.
I The market was trading at p_{1/2}^mkt = 0.90.
During the 34-minute blackout, 42,760 contracts changed hands, with Baltimore's win probability ticking down from 95 to 94. The win-probability peak of 95% occurred again after a third-quarter kickoff return for a touchdown. At the end of the fourth quarter, however, when the 49ers nearly went into the lead with a touchdown, Baltimore's win probability had dropped to 30%.
Implied Volatility
To calculate the implied volatility of the Superbowl we substitute the pair (µ, p) = (−4, 0.353) into our definition σIV = µ/Φ⁻¹(p) and solve.
I We obtain
  σIV = µ/Φ⁻¹(p) = −4/(−0.377) = 10.60
where Φ⁻¹(p) = qnorm(0.353) = −0.377. The 4-point advantage assessed for the 49ers makes them under a (1/2)σ favorite.
I The outcome X(1) = 3 was within one standard deviation of the pregame model, which had an expectation of µ = −4 and volatility of σ = 10.6.
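The implied-volatility inversion uses only the normal quantile function (qnorm in the R snippets; inv_cdf in Python):

```python
from statistics import NormalDist

def implied_vol(mu, p):
    # sigma_IV = mu / Phi^{-1}(p)
    return mu / NormalDist().inv_cdf(p)

sigma_iv = implied_vol(-4, 0.353)    # about 10.6, as in the text
```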
Half Time Probabilities
What's the probability of the Ravens winning given their lead at half time? At half time Baltimore led by 15 points, 21 to 6. The conditional mean for the final outcome is 15 + 0.5 × (−4) = 13 and the conditional volatility is 10.6√(1 − t) = 7.5. These imply a probability of 0.96 for Baltimore to win the game.
I A second estimate of the probability of winning given the half-time lead can be obtained directly from the betting market: traded contracts on TradeSports.com yield a half-time probability of p_{1/2} = 0.90.
What's the implied volatility for the second half?
pt^mkt reflects all available information.
I For example, at half-time t = 1/2 we would update
  σIV,t=1/2 = (l + µ(1 − t)) / (Φ⁻¹(pt)√(1 − t)) = 13/(Φ⁻¹(0.9)/√2) ≈ 14
where qnorm(0.9) = 1.28.
I As 14 > 10.6, the market was expecting a more volatile second half, possibly anticipating a comeback from the 49ers.
How can we form a valid betting strategy? Given the initial implied volatility σ = 10.6, at half time the Ravens have an l + µ(1 − t) = 13 point edge.
I With σ = 10.6 we would assess a win probability of
  p_{1/2} = Φ( 13/(10.6/√2) ) = 0.96
versus the market rate p_{1/2}^mkt = 0.90.
I To determine our optimal bet size ωbet on the Ravens we might appeal to the Kelly criterion (Kelly, 1956), which yields
  ωbet = p_{1/2} − q_{1/2}/Omkt = 0.96 − 0.04/(1/9) = 0.60
where q_{1/2} = 1 − p_{1/2} = 0.04 and the market odds are Omkt = 0.10/0.90 = 1/9.
[Figure: final-margin distributions for µ = 4 with σ = 6, 16 and 32, giving P(win) = 0.75, 0.6 and 0.55 respectively; and the evolving distribution of the lead X(t) over t ∈ [0, 1] for a 4-point line with P(win) = 0.6.]
Figure: The family of probability distributions for X(t), for any t between 0 and 1. Any vertical slice for a specific value t1 corresponds to a normal distribution for X(t1 ). The darker the shading at t, the higher the probability of the corresponding score-difference for X(t) along that particular time slice.
NBA: Lakers vs Celtics: Game 7, 2010 The Los Angeles Lakers hosting Boston Celtics in Game 7 of the NBA Finals. I At the end of the 3rd quarter, the Celtics are ahead by 17 points,
and look to be dominating the game. Yet the Lakers come storming back over the final 12 minutes of play, and end up scoring the decisive basket in the final minute to earn a 1-point victory, and the championship. I As a discerning basketball fan, you decide to be a bit more
measured in your assessment of the result. Should you be impressed by such an incredible feat? Or was the comeback by the Lakers really not that improbable, after all?
Final Score Pre-game odds the Lakers were thought to have a 60% chance of winning Expected margin of victory of 4 points—“the line”. I We model the Lakers’ lead X(t) at any given time t in the game.
X(t) negative indicates a deficit. Let’s assume that a game begins at t = 0 and ends at t = 1.
I Currently X(0.75) = −17: the Lakers are down by 17 points at the 3/4 mark of the game. We would like to know the probability that X(1) > 0, which corresponds to a lead at time t = 1, when the final buzzer sounds.
Probability Model Our original question in terms of probability statements means we are interested in P X(1) > 0 | X(0.75) = −17 .
I A natural choice for modeling X(1) is a normal distribution.
Therefore, we write X(1) ∼ N (µ, σ2 ) to indicate that the final margin of victory, X(1), follows a normal distribution with mean µ and variance σ2 . I Volatility σ must ensure that P(X(1)
> 0) = 0.6.
We solve and get σ = 16 (or 15.8 exactly)
Assessing the Comeback We’re now ready to answer our original question: how impressive was the Lakers’ 17-point comeback in the fourth quarter? I We phrased this question in terms of
P( X(1) > 0 | X(0.75) = −17 ). If this number is very small, then the Lakers have clearly accomplished something remarkable.
I To compute this quantity, observe that
  X(1) = X(0.75) + ( X(1) − X(0.75) ).

Distribution of Time-to-Go
I By applying the rules above, X(1) − X(0.75) has the same distribution as X(0.25). Therefore
  X(1) = X(0.75) + X(0.25)  (in distribution),
and we condition on the knowledge that X(0.75) = −17. Under our modeling assumptions, X(0.25) ∼ N(0.25µ, 0.25σ²) for µ = 4 and σ² = 16², the mean and variance we computed using the Vegas odds.
mean and variance has roughly a 2.3% chance of being 17 or more. According to our model, we have seen an unlikely event, but hardly a miracle!
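The 2.3% figure follows directly from the N(0.25µ, 0.25σ²) distribution of the final-quarter swing:

```python
from statistics import NormalDist

mu, sigma = 4, 16
# X(0.25) ~ N(0.25*mu, 0.25*sigma^2), i.e. mean 1 and standard deviation 8;
# the comeback requires the Lakers to gain 17 or more in the final quarter
p_comeback = 1 - NormalDist(0.25 * mu, 0.5 * sigma).cdf(17)
```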
Implications
Stern uses a probit model to check this, with regressors l/√(1 − t) and µ√(1 − t). He estimates, for the whole game, µ̂ = 4.87 and σ̂ = 15.82.
I The home team has a greater than 50% chance of winning if behind by 2 points at halftime.
I Figure 3 plots the probability when t = 0.9 (5 mins left).
I Table 3 provides estimates for baseball: µ̂ = 0.34 and σ̂ = 4.04.
I Our model is highly plausible.
Example
Suppose you have a team with µ = 6 and a lead l = 6 at half-time (t = 1/2), with an initial volatility of σ = 12. A one-σ move increases the probability of winning from Φ(1/2) = 0.69 to Φ(1) = 0.84.
I The standard deviation varies in time according to σ/√t; for σ = 10:

  t      1    2        3        4
  σ/√t   10   10/1.41  10/1.73  5

I Always think in terms of the number of standard deviations that a team is ahead (or behind) at any one time, and then convert to the probability scale.
Black-Scholes
Let X(t) represent a stock price at time t. The Black-Scholes model will provide a convenient way of estimating probabilities such as P(X(1) > 1000 | X(0) = 530).
I We know that X(t) > 0 for all values of t. A simple model uses a log-normal distribution to describe stock prices. If Z ∼ N(0, 1) is a standard normal, then X = e^Z is a standard log-normal random variable.
I For example, we will assume that the log-return has a normal distribution:
  log(X(1)/X(0)) = µ + σZ where Z ∼ N(0, 1), so log(X(1)/X(0)) ∼ N(µ, σ²).
Model
The model can then be written X(1) = X(0)e^(µ+σZ). Hence X(1) has a log-normal distribution.
I µ is the continuously compounded mean of log-returns. From the properties of the mean of a log-normal,
  E(X(1)) = X(0)e^(µ + σ²/2).
It is common to define µ̃ = µ + σ²/2, or µ = µ̃ − σ²/2, so that X(1) = X(0)e^(µ̃ − σ²/2 + σZ).
I We are now in a position to calculate the probability of interest:
  P(X(1) > 1000 | X(0) = 530) = P(log(X(1)/X(0)) > log(1000/530)) = 1 − Φ((0.6348 − µ)/σ).
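This probability is a one-line calculation once µ and σ are fixed. A sketch in Python (the course code elsewhere is in R); the drift and volatility below are hypothetical illustration values, not estimates from the slides:

```python
from math import erf, log, sqrt

def Phi(z):
    # standard normal cdf
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

X0, target = 530.0, 1000.0
mu, sigma = 0.10, 0.20          # hypothetical annual log-drift and volatility
z = (log(target / X0) - mu) / sigma
p = 1.0 - Phi(z)                # P(X(1) > 1000 | X(0) = 530)
print(round(p, 4))
```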
Option Pricing
Bachelier (1900), Theory of Speculation, was popularized by Black and Scholes (1973).
I The formula prices a call option as follows. The stock price is currently X(0) and its future prices are assumed to follow a geometric Brownian motion with constant drift µ and volatility σ, as specified by
  dX(t) = µX(t)dt + σX(t)dB_t,
where µ is the continuously compounded growth rate and B_t = √t Z ∼ N(0, t).
I The fair price of an option with exercise price K and payout max(X(T) − K, 0) at the expiry date T is given by
  E[e^(−rT) max(X(T) − K, 0)] = X(0)Φ(d1) − Ke^(−rT)Φ(d2),
where r is the risk-free rate, Φ is the standard normal cdf, and the parameters are
  d1 = [ln(X(0)/K) + (r + σ²/2)T]/(σ√T) and d2 = [ln(X(0)/K) + (r − σ²/2)T]/(σ√T).
Risk Neutral Distribution
Solving the differential equation for X(t), we see that
  X(t) = X(0)e^((µ − σ²/2)t + σ√t Z).
I To obtain the risk-neutral evolution of prices, denoted by Q, we simply replace µ by the risk-free rate r:
  X(t) = X(0)e^((r − σ²/2)t + σ√t Z).
Hence, calculating the expectation of the call option payout requires
  E^Q[e^(−rT) max(a e^(µ_T + σ√T Z) − c, 0)], where µ_T = (r − σ²/2)T,
c = K is the exercise price and a = X(0) is the current stock price.
I If we look at incremental movements, we also have
  X(t) = X(s)e^((r − σ²/2)(t − s) + σ(B_t − B_s)),
which implies E(e^(−r(t−s)) X(t) | X(s)) = X(s): under Q, discounted prices form a martingale.
Calculation
I We can calculate the option price using the tail properties of a normal distribution. Let µ_T = (r − σ²/2)T. Then we need to calculate
  E[e^(−rT) max(a e^(µ_T + σ√T Z) − c, 0)]
    = e^(−rT) ∫ max(a e^(µ_T + σ√T Z) − c, 0) p(Z)dZ
    = e^(−rT) ∫_{a e^(µ_T + σ√T Z) − c ≥ 0} (a e^(µ_T + σ√T Z) − c) p(Z)dZ
    = e^(−rT) ∫_{σ√T Z ≥ ln(c/a) − µ_T} a e^(µ_T + σ√T Z) p(Z)dZ − c e^(−rT) P(σ√T Z > ln(c/a) − µ_T)
    = a P(σ√T Z > ln(c/a) − µ_T − σ²T) − c e^(−rT) P(σ√T Z > ln(c/a) − µ_T),
where the first integral is calculated as a tail normal probability after completing the square on the term σ√T Z − Z²/2.
Black-Scholes
This is the same as the Black-Scholes formula
  E[e^(−rT) max(X(T) − K, 0)] = X(0)Φ(d1) − Ke^(−rT)Φ(d2).
I The parameters are given by
  d1 = [ln(a/c) + (r + σ²/2)t]/(σ√t) and d2 = [ln(a/c) + (r − σ²/2)t]/(σ√t),
where a = X(0), c = K and t = T.
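The formula above translates directly into a few lines of code. A minimal sketch in Python (the course code elsewhere is in R), with the standard normal cdf built from the error function:

```python
from math import erf, exp, log, sqrt

def Phi(z):
    # standard normal cdf
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def bs_call(S0, K, r, sigma, T):
    # Black-Scholes price of a European call with strike K, expiry T
    d1 = (log(S0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S0 * Phi(d1) - K * exp(-r * T) * Phi(d2)

price = bs_call(100, 100, 0.05, 0.2, 1.0)
print(round(price, 2))   # about 10.45
```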
Extensions
There are many limitations and extensions:
1. Jumps (mergers and acquisitions, earnings announcements)
2. Dividends
3. Down-and-out features and exotics
4. Stochastic volatility and the leverage effect (negative shocks to stocks imply volatility spikes)
5. Contagion
I One important concept is the idea of an implied volatility: all the departures from the model will show up as volatility smiles and smirks.
I Adding stochastic volatility and jumps.
Probability: 41901 Week 9: Martingales and Time Series Nick Polson http://faculty.chicagobooth.edu/nicholas.polson/teaching/41901/
Overview This lecture will cover the following topics I Martingales
1. Doob's Optional Stopping Theorem
2. Hitting times
3. Distribution of a Maximum and Hitting Time
4. Ruin Problem
I Time Series
1. Winning Kentucky Derby times 2. Pandemic SEIR models
Martingales
Martingales: E(Xn | Xn−1, ..., X0) = Xn−1
Sub-martingales: E(Xn | Xn−1, ..., X0) ≥ Xn−1
Super-martingales: E(Xn | Xn−1, ..., X0) ≤ Xn−1
I Random Walk: Sn = X1 + ... + Xn
I Asymmetric Random Walk: Zn = (q/p)^Sn
I Brownian Motion: E(Bt | Bs) = Bs; also Bt² − t, ...
A Markov process is a statement about probability distributions, not expectations.
Fair and Unfair Games
Think of Xn − Xn−1 as your winnings from a $1 stake in game n.
I In the martingale case (fair game): E(Xn − Xn−1 | Fn−1) = 0
I In the super-martingale case (unfair game): E(Xn − Xn−1 | Fn−1) ≤ 0
I A series of bets Cn is previsible if Cn is Fn−1-measurable.
Gambling Strategies
Dubins and Savage (1960s): How to Gamble If You Must. The winnings process is
  Yn = Σ_i Ci(Xi − Xi−1), so that Yn − Yn−1 = Cn(Xn − Xn−1).
I Here Cn is Fn−1-measurable, e.g. Cn = f(X1, ..., Xn−1). Based on iterated expectation,
  E(Yn − Yn−1 | Fn−1) = Cn E(Xn − Xn−1 | Fn−1) ≤ 0,
or equals zero if X is a martingale.
Geometric Brownian Motion: Xt = exp(λWt − λ²t/2).
Recall that if Z ∼ N(µ, σ²), then the mgf is E(e^(tZ)) = e^(tµ + t²σ²/2), ∀t.
I Since for t > s,
  Xt = Xs exp(λ(Wt − Ws) − λ²(t − s)/2),
and Wt − Ws is independent of Fs with distribution N(0, t − s),
I therefore
  E(Xt | Xs) = Xs E[exp(λ(Wt − Ws) − λ²(t − s)/2)] = Xs.
Properties
Given a martingale sequence X0, X1, ..., XT−1, XT, we have E(Xt | Xt−1) = Xt−1.
I By the tower property of expectations,
  E(XT) = E(E(XT | XT−1)) = E(XT−1) = ... = E(X0).
I Valuation: transform the problem to a martingale, then value payouts as if things are a martingale:
  E^Q(XT) = X0.
Gambler's Ruin
For the simple random walk St we can represent the fortune of a gambler at time t, where they win $1 with probability p and lose $1 with probability q = 1 − p.
I They start with initial capital c = S0 and are trying to achieve barrier a before bankruptcy at 0. Let Ta denote the first hitting time of barrier a.
I We need P(Ta < T0). We'll analyze this probability later using martingale methods.
I We have the following picture.
Ruin Problem
Feller (1956) and Niederhoffer (1997):
I Question: compute the probability of Brownian motion exiting first from the interval [a, b] (where a ≤ 0, b ≥ 0) through the right-hand endpoint b, say.
I This is a ruin problem because if the Brownian motion represents your fortune evolving while playing some game, and you start with fortune x and wish to compute the probability that your fortune reaches some level c before you go bankrupt, this would be the situation where b = c − x and a = −x.
I We'll use Doob's theorem to compute P(Tb < Ta).
Using Martingales
A stochastic process {Xt, t ≥ 0} is a martingale relative to Brownian motion {Wt : t ≥ 0} or, more strictly, its filtration {Ft : t ≥ 0}, if for each t ≥ 0:
I Xt is adapted to Ft, so that it is an Ft-r.v.
I E(|Xt|) < ∞ ∀t
I E(Xt | Fs) = Xs, for each s, 0 ≤ s ≤ t.
I Think of Xs as denoting your fortune at time s.
I By iterated expectation we have E(Xt) = E(X0) ∀t ≥ 0.
Stopping Times
Definition: A map T : Ω → Z+ = {0, 1, 2, ...} is a stopping time if {ω : T(ω) ≤ n} ∈ Fn.
I Intuitively: T = the time you decide to stop playing the game, which depends only on the history up to (and including) time n.
I Theorem (stopped supermartingales are supermartingales): If X is a supermartingale and T is a stopping time, then the stopped process X^T is a super-martingale and, in particular, E(X_min(T,n)) ≤ E(X0).
I If X is a martingale, then E(X_min(T,n)) = E(X0).
Note: This is not the same as E(XT) = E(X0).
Gambler's Doubling Strategy
Xn is a simple random walk starting at 0 initial wealth.
I T = inf{n : Xn = 1}. We know that P(T < ∞) = 1.
I However, 1 = E(XT) ≠ E(X0) = 0. With probability one I always make a dollar?
I The stopping theorem tells us that E(X_min(T,n)) = E(X0) ∀n. The problem is that E(T) = ∞ for the doubling strategy.
Doob's Optional-Stopping Theorem
Theorem: Let T be a stopping time and X a super-martingale. Then E(XT) ≤ E(X0) in each of the following situations:
I T is bounded (for some N, T(ω) < N ∀ω)
I X is bounded and T is a.s. finite
I E(T) < ∞ and, for some K, |Xn(ω) − Xn−1(ω)| < K.
In the martingale case, we have E(XT) = E(X0).
Waiting times
Monkey typing Shakespeare.
I A monkey types symbols at random, one per unit time, producing an infinite sequence Xn of random variables with values in the set of possible symbols.
I If it types only capital letters, and is equally likely to type any one of the 26 letters, how long on average will it take to type the sequence ABRACADABRA?
Example
I A fair coin will be tossed, producing a sequence like HHHTHTT...
I The first player picks their favorite sequence of length three (e.g. HHH). The second player picks their sequence.
I The winner is the player whose sequence appears first in the sequence of coin tosses.
Solution
Imagine a casino game: a gambler bets $1 on each toss of the coin. On the first toss, the first gambler starts to bet on the occurrence of the sequence.
I The house collects the $1 and pays out fair odds.
I If the first gambler wins, he lets his winnings ride on the second toss. A second gambler enters and starts betting at the beginning of the sequence.
I There will be I gamblers who win for a particular sequence.
I Check that X_T = T − Σ_I W_i is a martingale. Intuitively this is just the balance in the casino's book.
I Now apply optional stopping: E(XT) = E(X0).
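The optional-stopping argument reduces E(T) to a sum over the self-overlaps of the target word: a gambler who entered just before a matching suffix is still "alive" at time T. A minimal sketch in Python (the course code elsewhere is in R):

```python
def expected_wait(word, alphabet_size=26):
    # E[T] = sum of alphabet_size**k over every k such that the length-k
    # suffix of the word equals its length-k prefix (including k = len(word))
    n = len(word)
    total = 0
    for k in range(1, n + 1):
        if word[:k] == word[-k:]:
            total += alphabet_size ** k
    return total

# ABRACADABRA overlaps with itself at k = 1 ("A"), k = 4 ("ABRA"), k = 11
print(expected_wait("ABRACADABRA"))   # 26**11 + 26**4 + 26
```

The same function answers the coin-tossing game: `expected_wait("HHH", 2)` gives 2 + 4 + 8 = 14 tosses on average.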
Random Walk and Brownian Motion
Discrete-time limit:
I {Xn : n ≥ 0} iid with E(Xn) = 0 and Var(Xn) = 1.
I Let Sn = X1 + ... + Xn be a random walk. Then
  Bn(t) = S_[nt]/√n →_D B(t),
a standard Brownian motion Bt.
Distribution of a Maximum
Reflection principle: what is the joint distribution of Wt and Mt = max_{0≤s≤t} Ws?
I Mt represents the highest level reached in the time interval [0, t].
I Mt ≥ 0 is non-decreasing in t. For a > 0, we define Ta as before, so that {Mt ≥ a} = {Ta ≤ t}.

Joint Distribution of Mt and Ta
I Taking T = Ta, we have for a ≥ x and all t ≥ 0,
  P(Mt ≥ a, Wt ≤ x) = P(Ta ≤ t, Wt ≤ x) = P(Ta ≤ t, 2a − x ≤ W̃t) = P(2a − x ≤ W̃t) = 1 − Φ((2a − x)/√t),
where W̃ is the path reflected about level a after time Ta (since 2a − x ≥ a, the event {W̃t ≥ 2a − x} already implies Ta ≤ t).
Hitting a Level
By symmetry, P(Mt ≥ a, Wt ≥ 2a − x) = P(Mt ≥ a, Wt ≤ x), whence we see that
  P(Ta ≤ t) = P(Mt ≥ a) = 2(1 − Φ(a/√t)).
Letting t → ∞, we have that P(Ta < ∞) = 1, since Φ(0) = 0.5.
I Differentiating with respect to t gives the probability density function for Ta as
  f_Ta(t) = a exp(−a²/2t)/√(2πt³), where t > 0.
I This is known as the inverse Gaussian distribution. We can check directly that E(Ta) = ∞ (as the density is O(t^(−3/2)) as t → ∞).
I This is related to the problem with the doubling-up strategy.
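The density and the cdf formula can be checked against each other numerically: integrating f_Ta over (0, t] should reproduce 2(1 − Φ(a/√t)). A sketch in Python (the course code elsewhere is in R), using a midpoint Riemann sum:

```python
from math import erf, exp, pi, sqrt

def Phi(z):
    # standard normal cdf
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def f_Ta(t, a):
    # hitting-time density a*exp(-a^2/(2t)) / sqrt(2*pi*t^3)
    return a * exp(-a * a / (2.0 * t)) / sqrt(2.0 * pi * t**3)

a, T, n = 1.0, 4.0, 100000
dt = T / n
# midpoint Riemann sum of the density over (0, T]
cdf = sum(f_Ta((i + 0.5) * dt, a) for i in range(n)) * dt
print(round(cdf, 4), round(2.0 * (1.0 - Phi(a / sqrt(T))), 4))  # the two agree
```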
Example: Winning Kentucky Derby Times Trend, Random Walk, Mean Reversion?
Figure: Winning Speed (mph), Kentucky Derby, 1880-2000
Figure: Winning Speed, time trend removed (outliers Stone Street, Riley, Kingman), 1880-2000
Figure: MA(1) predictions, outliers removed: Today's Horses are getting Worse!
Kentucky Derby Recent Winners
The last six winners are below. Secretariat's winning time is 1 min and 59 2/5 seconds.

> tail(mydata)
    year date     winner            mins secs timeinsec distance speed
135 2009 2-May-09 Mine That Bird    2    2.66 122.66    1.25     36.
136 2010 1-May-10 Super Saver       2    4.04 124.04    1.25     36.
137 2011 7-May-11 Animal Kingdom    2    2.04 122.04    1.25     36.
138 2012 5-May-12 I'll Have Another 2    1.83 121.83    1.25     36.
139 2013 4-May-13 Orb               2    2.89 122.89    1.25     36.
140 2014 2-May-14 California Chrome 2    3.66 123.66    1.25     36.
Example: History of Pandemics
Bill Gates (12/11/2009): "I'm most worried about a worldwide pandemic ..."

Early-period Pandemics   Dates     Size of Mortality
Plague of Athens         430 BC    25% of population
Black Death              1347      30% of Europe
London Plague            1666      20% of population

Recent Flu Epidemics (1900-2010)   Dates     Size
Spanish Flu                        1918-19   40-50 million
Asian Flu (H2N2)                   1957-58   2 million
Hong Kong Flu (H3N2)               1968-69   6 million

The Spanish Flu killed more than WW1. The 2009 H1N1 Flu killed 18,449 people worldwide.
Typical Influenza Mortality Pattern
I 1918 influenza and pneumonia mortality, UK.
The pattern of mortality led to models of growth.
SEIR Epidemic Models
Growth is "self-reinforcing": infection is more likely when there are more infectives.
I An individual comes into contact with the disease at rate β1
I The susceptible individual contracts the disease with probability β2
I Each exposed individual becomes infectious at rate α per unit time
I Each infective recovers at rate γ per unit time
  St + Et + It + Rt = N
Current models: the SEIR (susceptible-exposed-infectious-recovered) model, a dynamic model that extends earlier models to include exposure and recovery.
The coupled SEIR model:
  Ṡ = −βSI
  Ė = βSI − αE
  İ = αE − γI
  Ṙ = γI

Figure: SEIR model trajectories, number of people in the Susceptible, Exposed, Infected and Recovered compartments over time.
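The coupled system above can be integrated with a simple forward-Euler step. A minimal sketch in Python (the course code elsewhere is in R); the rates and the contact scaling by N are illustrative assumptions, not values from the notes:

```python
# Forward-Euler integration of the SEIR system
beta, alpha, gamma = 0.5, 0.3, 0.2    # illustrative rates
N = 100.0
S, E, I, R = 99.0, 0.0, 1.0, 0.0      # one initial infective
dt = 0.01
for step in range(int(25 / dt)):      # integrate t over [0, 25]
    dS = -beta * S * I / N            # contacts scaled by population size
    dE = beta * S * I / N - alpha * E
    dI = alpha * E - gamma * I
    dR = gamma * I
    S, E, I, R = S + dt * dS, E + dt * dE, I + dt * dI, R + dt * dR
print(round(S + E + I + R, 6))        # the compartments sum to N throughout
```

The four right-hand sides sum to zero, so the total population is conserved at every Euler step, a quick sanity check on any implementation.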
Infectious disease models
Daniel Bernoulli (1766), the first model of disease transmission, for smallpox: "I wish simply that, in matters which so closely concern the well being of the human race, no decision shall be made without all the knowledge which a little analysis and calculation can provide."
I R. A. Ross (Nobel Prize in Medicine, 1902): a mathematical model of malaria transmission, which ultimately led to malaria control (the Ross-McDonald model).
I Kermack and McKendrick: the susceptible-infectious-recovered (SIR) model. London Plague 1665-1666; Cholera: London 1865, Bombay 1906.
Example: London Plague, 1666
Village of Eyam, near Sheffield. Model of transmission from infectives, I, to susceptibles, S.

Date (1666)   Susceptibles  Infectives
Initial       254           7
July 3        235           15
July 19       201           22
Aug 3         153           29
Aug 19        121           21
Sept 3        108           8
Sept 19       97            8
Oct 3         –             –
Oct 19        83            0

Initial population N = 261 = S0 + I0; final population S∞ = 83.
Modeling Growth: SI
Coupled differential equations: Ṡ = −βSI, İ = (βS − α)I.
I Estimates:
  β̂/α = ln(S0/S∞)/(S0 − S∞) = 6.54 × 10⁻³, i.e. α/β = 153.
I The predicted maximum number of infectives is 30.4, very close to the observed 29.
Key: S and I are observed, and α, β are estimated in hindsight.
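Both estimates follow directly from the SI phase curve I(S) = I0 + S0 − S + ρ ln(S/S0), with ρ = α/β, which is maximized at S = ρ. A sketch in Python (the course code elsewhere is in R), using the Eyam figures from the slides:

```python
from math import log

# Eyam data from the table above
S0, I0, S_inf = 254.0, 7.0, 83.0
# beta/alpha estimated from the final-size relation
beta_over_alpha = log(S0 / S_inf) / (S0 - S_inf)
rho = 1.0 / beta_over_alpha                  # alpha/beta, about 153
# Maximum infectives along the phase curve, attained at S = rho
I_max = I0 + S0 - rho + rho * log(rho / S0)
print(round(beta_over_alpha, 5), round(I_max, 1))   # about 0.00654 and 30.5
```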
R0: Basic Reproductive Ratio
An epidemic grows when R0 := βS0/γ > 1. R0 is used as the currency in public health (CDC).
I R0 can be interpreted as the mean number of secondary infections that one infectious person introduced into a completely susceptible population produces during his/her infectious lifetime.
I R0 reflects not only the severity and virulence, but also the demographics, health and mixing characteristics of the society.
I Caveat: R0 is very epidemic-instance dependent.
Transmission Rates: R0 for the 1918 Episode
I 1918-19 influenza pandemic estimates:

Mills et al. 2004      45 US cities          3 (2-4)
Viboud et al. 2006     England and Wales     1.8
Massad et al. 2007     Sao Paulo, Brazil     2.7
Nishiura 2007          Prussia, Germany      3.41
Chowell et al. 2006    Geneva, Switzerland   2.7-3.8
Chowell et al. 2007    San Francisco         2.7-3.5

The larger the R0, the more severe the epidemic. Transmission parameters vary substantially from epidemic to epidemic.
Probability: 41901 Week 10: Stochastic Volatility, MCMC and PF Nick Polson http://faculty.chicagobooth.edu/nicholas.polson/teaching/41901/
Topics
1. Random Walks and Mean Reversion 2. Continuous Time Models 3. Time-varying Volatility 4. Stochastic Volatility 5. Applications: Portfolio Choice and Option Pricing
Random Walks and Mean Reversion
A random walk has evolution Pt+1 = Pt + et, where E(et) = 0, and so E(Pt+1 | Pt) = Pt.
I Mean reversion:
  Pt+1 = µ + βPt + et, where |β| < 1.
The long-run average is given by P̄ = µ + βP̄, i.e. P̄ = µ/(1 − β).
I Log-returns are log(Pt+1/Pt).
Time-Varying Volatility
A very popular tool is to allow for time-varying volatility.
I Let's first make volatility time-varying, σt:
  Pt+1 = µ + Pt + σt et
I What if yesterday's movement was a large down-move versus a large up-move?
I Is volatility mean-reverting? Let êt² be yesterday's squared residual:
  σ²t+1 = α + βσ²t + γêt²
I How about an asymmetry effect? Black (1976):
  log σ²t+1 = α + β log σ²t + γet − ν|et|
I See Rosenberg (1972), Engle (1982). Nelson (1991): EGARCH.
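The squared-residual recursion above is easy to simulate. A minimal sketch in Python (the course code elsewhere is in R); the parameters are illustrative, not fitted values:

```python
import random

# Simulate the volatility recursion sigma2[t+1] = a + b*sigma2[t] + g*e[t]^2
a, b, g = 0.05, 0.90, 0.05         # stationary when b + g < 1
random.seed(1)
sigma2 = a / (1.0 - b - g)         # start at the long-run variance (= 1 here)
path = []
for t in range(1000):
    e = (sigma2 ** 0.5) * random.gauss(0.0, 1.0)   # residual with sd sigma_t
    sigma2 = a + b * sigma2 + g * e * e
    path.append(sigma2)
print(min(path) > 0.0)             # variance stays positive by construction
```

Positivity holds because a > 0 and the feedback term γê² is non-negative, one reason the log-variance (EGARCH) form is needed once signed asymmetry terms enter.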
VIX Index
There is typically a spread between the implied and historical volatilities.
Figure: Implied vs historical volatility (%), 1990-2000.
Volatility: VIX vs SP500
Crisis of 2008: a large jump. Mean-reverting diffusion or shot-noise process? ("How innovation happens: The ferment of finance", The Economist)
Figure: The SP500 VIX Index in the 1990s, a pattern of spikes and crises.
Single Stocks and Earnings: INTC
Figure: Predictable increase in volatility before, and sharp drop after, earnings.
Stochastic Volatility Jump Models
I We'll use an SVJ model:
  dSt = µt^P St dt + √Vt St dWt^s(P) + d( Σ_{j=1}^{Nt(P)} S_{τj−}(e^{Zj(P)} − 1) )
  dVt = κv(θv − Vt)dt + σv √Vt dWt^v(P)
I Here Wt^s(P) and Wt^v(P) are Brownian motions, and Nt(P) counts the number of jump times, τj, prior to time t. The drift µt^P includes the equity risk premium and the jump compensator.
Filtering, Smoothing, Learning and Prediction
Data yt depends on a latent state variable, xt:
  Observation equation: yt = f(xt, εt^y)
  State evolution: xt+1 = g(xt, εt+1^x)
I Posterior distribution p(xt | y^t), where y^t = (y1, ..., yt).
I Prediction and Bayesian updating:
  p(xt+1 | y^t) = ∫ p(xt+1 | xt) p(xt | y^t) dxt,  (1)
updated by Bayes rule
  p(xt+1 | y^{t+1}) ∝ p(yt+1 | xt+1) p(xt+1 | y^t),  (2)
that is, Posterior ∝ Likelihood × Prior.
SV: Motivating Example
SV models (JPR, EJP):
  yt+1 = √(Vt+1) εt+1^y
  log(Vt+1) = αv + βv log(Vt) + σv εt+1^v
Typical parameters: αv = 0, βv = 0.95, and σv = 0.10.
Figure: simulated continuously compounded returns (top) and stochastic volatility (bottom) over 1200 periods.
Galton 1877: First Particle Filter
Streaming Data: Online Learning
Construct an essential state vector Zt+1:
  p(Zt+1 | y^{t+1}) = ∫ p(Zt+1 | Zt, yt+1) dP(Zt | y^{t+1})
                    ∝ ∫ p(Zt+1 | Zt, yt+1) [propagate] p(yt+1 | Zt) dP(Zt | y^t) [resample]
1. Resample with weights proportional to p(yt+1 | Zt^(i)) and generate {Zt^ζ(i)}_{i=1}^N
2. Propagate with Zt+1^(i) ∼ p(Zt+1 | Zt^ζ(i), yt+1) to obtain {Zt+1^(i)}_{i=1}^N
Parameters: p(θ | Zt+1) drawn "offline"
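The two-step recursion can be sketched as a bootstrap filter, where propagation uses the state equation and resampling uses likelihood weights. A minimal Python version (the course code elsewhere is in R) for the nonlinear model yt = xt/(1 + xt²) + vt used below; the seed and particle count are assumed simulation settings:

```python
import random
from math import exp, sqrt

random.seed(42)
N, T = 2000, 50

# Simulate a path: x_t = x_{t-1} + w_t,  y_t = x_t/(1+x_t^2) + v_t
x_true = random.gauss(1.0, sqrt(10.0))
ys = []
for t in range(T):
    x_true += random.gauss(0.0, sqrt(0.5))
    ys.append(x_true / (1.0 + x_true**2) + random.gauss(0.0, 1.0))

def lik(y, x):
    # N(y; x/(1+x^2), 1) kernel, normalizing constant dropped
    m = x / (1.0 + x * x)
    return exp(-0.5 * (y - m) ** 2)

# Bootstrap filter: propagate, weight by the likelihood, resample
particles = [random.gauss(1.0, sqrt(10.0)) for _ in range(N)]
means = []
for y in ys:
    particles = [p + random.gauss(0.0, sqrt(0.5)) for p in particles]
    w = [lik(y, p) for p in particles]
    particles = random.choices(particles, weights=w, k=N)
    means.append(sum(particles) / N)
print(len(means))
```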
Nonlinear Model
I The observation and evolution dynamics are
  yt = xt/(1 + xt²) + vt, where vt ∼ N(0, 1)
  xt = xt−1 + wt, where wt ∼ N(0, 0.5)
I Initial condition x0 ∼ N(1, 10).
Fundamental question: how do the filtering distributions p(xt | y^t) propagate in time?
Figure: simulated observations yt = xt/(1 + xt²) + vt (left) and latent states xt (right) over 100 time periods.
Simulate Data

# MC sample size
N = 1000
# Posterior at time t=0: p(x[0]|y[0]) = N(1,10)
x = rnorm(N,1,sqrt(10))
hist(x,main="Pr(x[0]|y[0])")
# Obtain draws from prior p(x[1]|y[0])
x1 = x + rnorm(N,0,sqrt(0.5))
hist(x1,main="Pr(x[1]|y[0])")
# Likelihood function p(y[1]|x[1])
y1 = 5
ths = seq(-30,30,length=1000)
plot(ths,dnorm(y1,ths/(1+ths^2),1),type="l",xlab="",ylab="")
title(paste("p(y[1]=",y1,"|x[1])",sep=""))
Nonlinear Filtering
Figure: histograms of Pr(x[0]|y[0]), Pr(x[1]|y[0]), the likelihood p(y[1]=5|x[1]), Pr(x[1]|y[1]), Pr(x[2]|y[1]) and p(y[2]=-2|x[2]).
Resampling
Observe y1 = 5, y2 = −2. Key: resample and propagate particles.

# Computing resampling weights
w = dnorm(y1,x1/(1+x1^2),1)
# Resample to obtain draws from p(x[1]|y[1])
k = sample(1:N,size=N,replace=TRUE,prob=w)
x = x1[k]
hist(x,main="Pr(x[1]|y[1])")
# Obtain draws from prior p(x[2]|y[1])
x2 = x + rnorm(N,0,sqrt(0.5))
hist(x2,main="Pr(x[2]|y[1])")
...
Propagation of MC error
Figure: filtered means mx over 100 time periods for N = 1000, 2000, 5000 and 10000 particles; the Monte Carlo error shrinks as N grows.
Dynamic Linear Model (DLM): Kalman Filter
Kalman filter for linear Gaussian systems.
I FFBS (Forward Filtering, Backward Sampling) determines the posterior distributions of the states, p(xt | y^t) and p(xt | y^T), and also the joint distribution p(x^T | y^T) of the hidden states.
I Discrete Hidden Markov Models (HMM): Baum-Welch, Viterbi.
I With parameters known, the Kalman filter gives the exact recursions.
Simulate DLM
Dynamic linear model: yt = xt + vt and xt = α + βxt−1 + wt.
Simulate data:

T1=50; alpha=-0.01; beta=0.98; sW=0.1; sV=0.25
W=sW^2; V=sV^2
y = rep(0,2*T1)
ht = rep(0,2*T1)
for (t in 2:(2*T1)){
  ht[t] = rnorm(1,alpha+beta*ht[t-1],sW)
  y[t] = rnorm(1,ht[t],sV)
}
ht = ht[(T1+1):(2*T1)]
y = y[(T1+1):(2*T1)]
Exact calculations
Kalman filter recursions:

m=rep(0,T1); C=rep(0,T1); m0=0; C0=100
R = C0+W; Q = R+V; A = R/Q
m[1] = m0+A*(y[1]-m0); C[1] = R-A^2*Q
for (t in 2:T1){
  R = C[t-1]+W
  Q = R+V
  A = R/Q
  m[t] = m[t-1]+A*(y[t]-m[t-1])
  C[t] = R-A^2*Q
}

# Quantiles
q025 = function(x){quantile(x,0.025)}
q975 = function(x){quantile(x,0.975)}
h025 = apply(hs,2,q025)
sd   = sqrt(apply(hs,2,var))
DLM Data
Figure: observations y(t) with the filtered mean m(t) and bands m(t) ± 2*sqrt(C(t)).
Bootstrap Filter
Now run the particle filter:

# Bootstrap filter
M = 1000
h = rnorm(M,m0,sqrt(C0))
hs = NULL
for (t in 1:T1){
  h1 = rnorm(M,alpha+beta*h,sW)
  w  = dnorm(y[t],h1,sV)
  w  = w/sum(w)
  h  = sample(h1,size=M,replace=T,prob=w)
  hs = cbind(hs,h)
}
DLM Filtering
Figure: the exact (Kalman) filtered mean of x(t) vs the SMC estimate; the two are nearly indistinguishable.
Streaming Data
How do parameter distributions change in time?
Online Dynamic Learning
I Real-time surveillance
I Bayes means sequential updating of information
I Update the posterior density p(θ | y^t) with every new observation (t = 1, ..., T): "sequential learning"
Bayes theorem: p(θ | y^t) ∝ p(yt | θ) p(θ | y^{t−1})
Sample – Resample
Figure: particle evolution, showing the posterior at t, the prior at t+1, the weights at t+1, the posterior at t+1, the prior at t+2 and the weights at t+2.
Resample – Sample
Figure: particle evolution, showing the posterior at t, the reweights at t+1, the posterior at t+1, the reweights at t+2 and the posterior at t+2.
Particle Methods: Blind Propagation
Figure: Propagate-Resample is replaced by Resample-Propagate
Portfolio Choice
How to allocate assets between risky assets?
1. Portfolio analysis: de Finetti (1941), "On a Problem of Re-insurance"; Markowitz (1952), "Portfolio Selection".
2. Cash or stocks? Samuelson (1969) and Merton (1971). The optimal allocation to stocks is
  ω* = (1/γ)(µ − r)/σ²,
where γ is the agent's risk aversion. Dynamically: µt and Vt and hedging demands.
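Plugging in round numbers shows how the rule behaves. A sketch in Python (the course code elsewhere is in R); the equity premium and volatility below are illustrative magnitudes, not values from the slides:

```python
# Samuelson/Merton rule: w* = (mu - r) / (gamma * sigma^2)
def merton_weight(mu, r, sigma, gamma):
    return (mu - r) / (gamma * sigma**2)

mu, r, sigma = 0.07, 0.029, 0.15   # illustrative stock return, bill rate, vol
for gamma in (2, 4, 6):
    print(gamma, round(merton_weight(mu, r, sigma, gamma), 2))
```

With these inputs, a γ = 2 investor holds roughly 90% in stocks, and the weight halves as risk aversion doubles.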
Historical Return Data
I Returns (%) for US stocks etc:

Period      Stocks  Bonds  Bills  Gold   Inflation
1802-1998   7.0     3.5    2.9    -0.1   1.3
1802-1870   7.0     4.8    5.1    0.2    0.1
1871-1925   6.6     3.7    3.2    -0.8   0.6
1926-1998   7.4     2.2    0.7    0.2    3.1
1946-1998   7.8     1.3    0.6    -0.7   4.2

I What does the Samuelson/Merton rule yield? Portfolio strategy: max_ωt E[U(Wt+1) | Rt], so
  ωt* = (1/γ) E[Rt+1 − Rf | Rt] / Var[Rt+1 | Rt].
SP500 Data
Market timing: a time-varying, mean-reverting risk premia model. Volatility timing: SV and leverage.

Strategy    Mean   SD    Sharpe
S&P500      13.6   14.9  0.49
SV: γ = 2   16.1   17    0.53
SV: γ = 4   11.2   10.2  0.68
SV: γ = 6   9.6    7.1   0.94
Sequential Posteriors
Figure: sequential posterior draws of the SV parameters αµ, αv, σµ, βv, ρ and σv, 1976-2000.
Wealth: SP500, fixed v, parameter learning
Figure: cumulative wealth for γ = 2, 4 and 6 (market, leverage, no leverage), 1974-1999.