41901: Probability and Statistics Nick Polson Fall, 2016

October 1, 2016

Getting Started

- Syllabus: http://faculty.chicagobooth.edu/nicholas.polson/teaching/41900/
- General Expectations:
  1. Read the notes / practice
  2. Be on schedule
  3. DeGroot and Schervish: Probability and Statistics

Course Expectations

Homework (20%): assignments, handed in at class. I encourage you to do assignments in groups; otherwise it's no fun! Grading is X−, X, X+.
Final (80%): Week 11.
Grading: PhD course.

Table of Contents

Chapter 1: Probability, Risk and Utility
Chapter 2: Conditional Probability and Bayes
Chapter 3: Distributions
Chapter 4: Expectations
Chapter 5: Modern Regression Methods
Chapter 6: Bayesian Statistics
Chapter 7: Hierarchical Models
Chapter 8: Bayesian Asset Allocation
Chapter 9: Brownian Motion and Sports
Chapter 10: Stochastic Processes, Martingales, SV

Introduction

1. W1-W8: Probability
   Reading: DeGroot and Schervish, "Probability and Statistics" (4th Edition)
   Chap 1 & 2: Probability and Conditional Probability
   Chap 3: Random Variables
   Chap 4: Expectations
   Chap 5: Special Distributions
   Chap 6: Hierarchical Models
2. W9-W10: Martingales and Stochastic Processes

Outline

1. Probability: axioms, subjective probability, utility, laws, conditional probability, Bayes theorem, applications, DNA testing, cdf's and pdf's
2. Distributions: discrete, continuous, binomial, Poisson, gamma, Weibull, beta, exponential family, Wishart. Transformations and Expectations: distributions of functions, expectations, mgf's, convergence, conditional and marginal distributions, bivariate normal, inequalities and identities
3. Modern Regression Methods: Ridge, Lasso
4. Bayesian Methods: Hierarchical Models, Shrinkage, Asset Allocation
5. Martingales and Stochastic Processes: Brownian motion, hitting times, stopping times, ruin problem, transformations, Girsanov's theorem, stochastic calculus

Probability: 41901 Week 1: Probability, Risk and Utility Nick Polson http://faculty.chicagobooth.edu/nicholas.polson/teaching/41901/

Overview: Probability and Paradoxes

1. Birthday Problem
2. Exchange Paradox
3. Probability: Axioms and Subjectivity
4. Expected Utility: Preferences
5. Risk
6. Probability and Psychology
7. St. Petersburg Paradox
8. Allais Paradox
9. Kelly Criterion

Reading: DeGroot and Schervish, Chapter 1, pp. 1-47.

Ellsberg Paradox

Probability is counter-intuitive! Two urns:

1. 100 balls with 50 red and 50 blue.
2. A mix of red and blue, but you don't know the proportion.

Which urn would you like to bet on? People don't like the "uncertainty" about the distribution of red/blue balls in the second urn.

Likelihood of Death

120 Stanford graduates were asked to estimate the likelihood of various causes of death:

  Cause                    Estimated   Actual
  Heart disease                            34
  Cancer                                   23
  Other natural causes                     35
  Total natural                  58        92

  Accident                                  5
  Homicide                                  1
  Other unnatural causes                    2
  Total unnatural                32         8

- The estimated probabilities don't even sum to one!
- People vastly overestimate the probability of violent death.

Probability

Probability is a language designed to communicate uncertainty. It's immensely useful, and there are only a few basic rules:

1. If an event A is certain to occur, it has probability 1, denoted P(A) = 1.
2. Either an event A occurs or it does not: P(not A) = 1 − P(A).
3. If two events are mutually exclusive (they cannot both occur simultaneously), then P(A or B) = P(A) + P(B).
4. P(A and B) = P(A given B) P(B) = P(A|B) P(B).

Envelope Paradox

The following problem is known as the "exchange paradox".

- A swami puts m dollars in one envelope and 2m in another. He hands one envelope to you and one to your opponent. The amounts are placed randomly, so there is a probability of 1/2 that you get either envelope. You open your envelope and find x dollars. Let y be the amount in your opponent's envelope.
- You know that y = x/2 or y = 2x. You are thinking about whether you should switch your opened envelope for the unopened envelope of your opponent. It is tempting to do an expected value calculation as follows:

  E(y) = (1/2)(x/2) + (1/2)(2x) = (5/4)x > x

- Therefore it looks as if you should switch no matter what value of x you see. A consequence, following the logic of backwards induction, is that even if you hadn't opened your envelope you would want to switch!

Bayes Rule

Where's the flaw in this argument? Use Bayes rule to update the probabilities of which envelope your opponent has!

- Assume a prior p(m) for the number of dollars the swami places in the envelope.
- Such an assumption then allows us to calculate an odds ratio

  p(y = x/2 | x) / p(y = 2x | x)

  concerning the likelihood of which envelope your opponent has.
- Then the expected value is given by

  E(y) = p(y = x/2 | x) (x/2) + p(y = 2x | x) (2x)

  and the condition E(y) > x becomes a decision rule.

Birthday Problem

Example. ".. for the 'one chance in a million' will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us" (Fisher, 1937)

1. Birthdays: 23 people for a 50/50 chance of a match.
2. Almost birthdays: only 13 people needed for a birthday match to within a day.
3. Multiple matches: 88 people for a triple match.

Solution

Easier to calculate P(no match).

- The first person has a particular birthday; each subsequent person must miss all the birthdays so far:

  P(no match) = 1 × (364/365) × ... × ((365 − N + 1)/365)

  Substituting N = 23 you get P(no match) ≈ 1/2.
- General rule-of-thumb: with N people and c categories, N ≈ 1.2 √c gives a 50/50 chance of a match.
- To apply this to birthdays we use c = 365, giving N = 1.2 √365 ≈ 23. For near birthdays, c = 121 and N = 1.2 √121 ≈ 13.

Example

We can calculate

  P(no match) = ∏_{i=1}^{N−1} (1 − i/c)
              = exp( ∑_{i=1}^{N−1} log(1 − i/c) )
              ≈ exp( −N²/(2c) )

where log(1 − i/c) ≈ −i/c for i ≪ c and ∑_{i=1}^{N−1} i = N(N − 1)/2.
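The approximation is easy to check numerically. A minimal R sketch (the function names birthday_exact and birthday_approx are mine, for illustration):

  # exact P(no match) versus the exp(-N^2/(2c)) approximation
  birthday_exact  <- function(N, c = 365) prod(1 - (1:(N - 1)) / c)
  birthday_approx <- function(N, c = 365) exp(-N^2 / (2 * c))
  birthday_exact(23)    # 0.4927
  birthday_approx(23)   # 0.4845
  1.2 * sqrt(365)       # rule of thumb: about 23 people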

Bayes Theorem

Many problems in decision making can be solved using Bayes rule.

- Rule-based decision making. Artificial intelligence.
- It's counterintuitive! But it gives the "right" answer.

Bayes Rule:

  P(A|B) = P(A ∩ B) / P(B) = P(B|A) P(A) / P(B)

Law of Total Probability:

  P(B) = P(B|A) P(A) + P(B|Ā) P(Ā)

Bayes Theorem

Disease testing example. Let D = 1 indicate you have a disease and T = 1 indicate that you test positive for it. The probability tree is:

  P(D = 1) = 0.02,  P(D = 0) = 0.98
  P(T = 1 | D = 1) = 0.95,  P(T = 0 | D = 1) = 0.05
  P(T = 1 | D = 0) = 0.01,  P(T = 0 | D = 0) = 0.99

Multiplying along the branches gives the joint probabilities, e.g. P(T = 0, D = 0) = 0.99 × 0.98 = 0.9702 and P(T = 1, D = 1) = 0.95 × 0.02 = 0.019.

If you take the test and the result is positive, you are really interested in the question: Given that you tested positive, what is the chance you have the disease?

Bayes Theorem

We have a probability table:

          D = 0     D = 1
  T = 0   0.9702    0.001
  T = 1   0.0098    0.019

Bayes probability:

  P(D = 1 | T = 1) = 0.019 / (0.019 + 0.0098) = 0.66
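The same calculation in R, starting from the joint table above (a small sketch; the object names are mine):

  joint <- matrix(c(0.9702, 0.0098, 0.0010, 0.0190), nrow = 2,
                  dimnames = list(T = c("0", "1"), D = c("0", "1")))
  p_T1 <- sum(joint["1", ])          # P(T = 1) = 0.0288
  joint["1", "1"] / p_T1             # P(D = 1 | T = 1) = 0.66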

Probability

Subjective probability is very different from the (non-operational) approach based on long-run averages.

1. Kreps (1988): Notes on the Theory of Choice
2. Ramsey (1926): Truth and Probability
3. de Finetti (1930): Theory of Probability
4. Kolmogorov (1933): Foundations of the Theory of Probability
5. von Neumann and Morgenstern (1944): Theory of Games and Economic Behavior
6. Savage (1954): The Foundations of Statistics

Principle of Coherence: a set of subjective probability beliefs must avoid sure loss.

Subjective Probability, Preferences and Utility

- Axiomatic probability doesn't help you find probabilities. It describes how to find probabilities of more complicated events.
- Subjective probability measures your probability as a propensity to bet (willingness to pay). Utility effects are "assumed" to be linear (risk neutral), so you value everything in terms of expected values.
- P(A) "measures" how much you are willing to pay to enter a transaction that pays $1 if A occurs and $0 otherwise.

Odds

We can express probabilities in terms of odds via

  O(A) = (1 − P(A)) / P(A)  or  P(A) = 1 / (1 + O(A))

- For example, if O(A) = 1 then for every $1 bet you will be paid out $1: an event with probability 1/2.
- If O(A) = 2, or 2:1, then for a $1 bet you'll be paid back $3. In terms of probability, P(A) = 1/3.

US Presidential Election 2016

Oddschecker. Figure: Presidential odds, 2016.

- The best odds on Hillary are 6/4. This means that you risk $4 to win $6, offered by Ladbrokes.
- There's competition between the bookmakers.
- The odds will change dynamically over time as new information arrives. You can lock in fixed odds at the time you bet.

Probability and Psychology

How do people form probabilities or expectations in reality? Psychologists have categorized many different biases that people have in their beliefs or judgments.

Loss Aversion. The most important finding of Kahneman and Tversky is that people are loss averse: utilities are defined over gains and losses rather than over final (or terminal) wealth, an idea first proposed by Markowitz. This is a violation of the EU postulates.

Let (x, y) denote a bet paying gain x with probability y. To illustrate this, subjects were asked: "In addition to whatever you own, you have been given $1000; now choose between the gambles A = (1000, 0.5) and B = (500, 1)." B was the more popular choice.

Example

The same subjects were then asked: "In addition to whatever you own, you have been given $2000; now choose between the gambles C = (−1000, 0.5) and D = (−500, 1)."

- This time C was more popular.
- The key here is that the final wealth positions are identical, yet people chose differently. The subjects are apparently focusing only on gains and losses. When they are not given any information about prior winnings, they choose B over A and D over C; for risk-averse people this is the rational choice.
- This effect is known as loss aversion.

Representativeness

When people try to determine the probability that evidence A was generated by model B, they often use the representativeness heuristic: they evaluate the probability by the degree to which A reflects the essential characteristics of B.

A common bias is base rate neglect, i.e. ignoring prior evidence. For example, in tossing a fair coin the sequence HHTHTHHTHH, with seven heads in ten tosses, is quite likely to appear, and yet people draw conclusions from too few data points: they see 7 heads as representative of the true process and conclude p = 0.7.

Expected Utility (EU) Theory

Normative. Let P, Q be two probability distributions or risky gambles/lotteries; pP + (1 − p)Q is the compound or mixture lottery. The rational agent (You) will have preferences between gambles.

- We write P ≻ Q if and only if You strictly prefer P to Q. If two lotteries are indifferent we write P ∼ Q.
- EU: given a number of plausible axioms (completeness, transitivity, continuity and independence), preferences are represented by the expectation of a utility function:

  P ⪰ Q  ⟺  U(P) ≥ U(Q)

- The theory is a normative one and not necessarily descriptive. It suggests how a rational agent should formulate beliefs and preferences, not how people actually behave.

Attitudes to Risk

The solution depends on your risk preferences:

- Risk neutral: a risk-neutral person is indifferent about fair bets. Linear utility.
- Risk averse: a risk-averse person prefers certainty over fair bets: E(U(X)) < U(E(X)). Concave utility.
- Risk loving: a risk-loving person prefers fair bets over certainty. Convex utility.

What you are willing to pay depends on your preferences.

St. Petersburg Paradox

What are you willing to pay to enter the following game?

- I toss a fair coin and when the first head appears, on the Tth toss, I pay you $2^T.
- The probability of the first head on the Tth toss is 2^{−T}, so

  E(X) = ∑_{T=1}^{∞} 2^T 2^{−T} = 1 + 1 + 1 + ... → ∞

- Bernoulli (1738) used this to introduce utility and to value bets as E(u(X)).

Allais Paradox

You have to make a choice between the following gambles.

- First compare the gambles:
  G1: $5 million with certainty
  G2: $25 million w.p. 0.10; $5 million w.p. 0.89; $0 w.p. 0.01
- Now choose between the gambles:
  G3: $5 million w.p. 0.11; $0 w.p. 0.89
  G4: $25 million w.p. 0.10; $0 w.p. 0.90

Fact: under expected utility, if G1 ≽ G2 then G3 ≽ G4, and vice-versa.

Solution: Expected Utility

Given (subjective) probabilities P = (p1, p2, p3) over the prizes ($0, $5 million, $25 million), write E(u|P) for expected utility.

- Without loss of generality we can set u($0) = 0 and, for the high prize, u($25 million) = 1, which leaves one free parameter u = u($5 million).
- Hence to compare gambles with probabilities P and Q we look at the difference

  E(u|P) − E(u|Q) = (p2 − q2) u + (p3 − q3)

- For our comparisons we get

  E(u|G1) − E(u|G2) = 0.11u − 0.1
  E(u|G3) − E(u|G4) = 0.11u − 0.1

  The ordering is the same, given your u.
- If your utility satisfies u < 0.1/0.11 = 0.909 you take the "riskier" gamble.

Utility Functions

Power and log utilities:

- Constant relative risk aversion (CRRA).
- Advantage: the optimal rule is unaffected by wealth effects.

The CRRA utility of wealth takes the form

  U_γ(W) = (W^{1−γ} − 1) / (1 − γ)

The special case γ = 1 gives U(W) = log(W). This leads to a myopic Kelly criterion rule.

Kelly Criterion

The Kelly criterion corresponds to betting under binary uncertainty.

- Consider a sequence of i.i.d. bets where

  P(X_t = 1) = p and P(X_t = −1) = q = 1 − p

- Maximising the expected long-run growth rate,

  max_ω E(ln(1 + ωX)) = max_ω { p ln(1 + ω) + (1 − p) ln(1 − ω) } ≤ p ln p + q ln q + ln 2

  leads to the optimal allocation ω* = p − q = 2p − 1.

Kelly Criterion

If one believes the event is certain, i.e. p = 1, then one bets all wealth and a priori one is certain to double invested wealth. If the bet is fair, i.e. p = 1/2, one bets nothing, ω* = 0, due to risk aversion.

- Let p denote the probability of a gain and O = (1 − p)/p the odds. We can generalize the rule to the case of asymmetric payouts (a, b), where

  P(X_t = 1) = p and P(X_t = −1) = q = 1 − p

  and a win multiplies the stake by (1 + bω) while a loss multiplies it by (1 − aω).
- Then the expected utility function is

  p ln(1 + bω) + (1 − p) ln(1 − aω)

- The optimal solution is

  ω* = (bp − aq) / (ab)

Kelly Criterion

- If a = b = 1 this reduces to the pure Kelly criterion.
- A common case occurs when a = 1. We can now interpret b as the odds O that the market is willing to offer the investor if the event occurs, and so we write b = O. The rule becomes

  ω* = (pO − q) / O

Example

- Two possible market opportunities: one where the market offers you 4/1 when you have personal odds of 3/1, and a second where it offers you 12/1 while you think the odds are 9/1. The market odds are a third better than your personal odds in both cases. In terms of maximizing long-run growth, however, the two bets are not identical.

Example

- Table 1 shows the Kelly criterion advises allocating twice as much capital to the lower-odds proposition: a 1/16 weight versus 1/40.

  Market   You   p      ω*
  4/1      3/1   1/4    1/16
  12/1     9/1   1/10   1/40

  Table: Kelly rule

- The optimal allocation ω* = (pO − q)/O is

  ((1/4) × 4 − (3/4)) / 4 = 1/16  and  ((1/10) × 12 − (9/10)) / 12 = 1/40
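A one-line R sketch of the rule ω* = (pO − q)/O, checked on the two examples above (the function name kelly is mine):

  kelly <- function(p, O) (p * O - (1 - p)) / O
  kelly(p = 1/4,  O = 4)    # 0.0625 = 1/16
  kelly(p = 1/10, O = 12)   # 0.025  = 1/40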

Assignment 1

Probability and Paradox:

1. Envelope Paradox
2. Galton's Paradox
3. Prosecutor's Fallacy
4. St Petersburg Paradox
5. Kelly Criterion
6. OJ Simpson Case

Probability: 41901 Week 2: Conditional Probability and Bayes Rule Nick Polson http://faculty.chicagobooth.edu/nicholas.polson/teaching/41901/

Topics

1. Conditional Probability
2. Bayes Rule
3. Prisoner's Dilemma / Monty Hall
4. Using Probability as Evidence
5. Island Problem and DNA Evidence
6. Combining Evidence
7. Prosecutor's and base-rate fallacies

Reading: DeGroot and Schervish, Chapter 2, pp. 49-97.

Conditional Probability

Bayes rule, in its simplest form.

- Two events A and B. Bayes rule:

  P(A|B) = P(A ∩ B) / P(B) = P(B|A) P(A) / P(B)

- Law of Total Probability:

  P(B) = P(B|A) P(A) + P(B|Ā) P(Ā)

  Hence we can calculate the denominator of Bayes rule.

Prisoner's Dilemma

Example:

- Three prisoners A, B, C. Each believes they are equally likely to be pardoned.
- Prisoner A goes to the warden W and asks if he is getting axed.
- The warden can't tell A anything about A himself, so he provides the new information WB = "B is to be executed".

Uniform prior probabilities lead to:

  Prior P(Pardon):   A 0.33   B 0.33   C 0.33

- Posterior? Compute P(A | WB).
- What happens if C overhears the conversation? Compute P(C | WB).

Monty Hall Problem

The problem is named after the host of the long-running TV show Let's Make a Deal.

- Marilyn vos Savant had a column with the correct answer that many mathematicians thought was wrong!
- A contestant is given the choice of 3 doors. There is a prize (a car, say) behind one of the doors and something worthless behind the other two doors: two goats.

Puzzle. The game is as follows:

- You pick a door.
- Monty then opens one of the other two doors, revealing a goat.
- You have the choice of switching doors.

Is it advantageous to switch?

Conditional probabilities: assume you pick door A at random, so P(A) = 1/3. You need to figure out P(A | MB) after Monty opens door B and reveals a goat.
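A quick Monte Carlo check in R (a sketch; monty is a hypothetical helper). Since Monty always reveals a goat, switching wins exactly when your first pick was wrong:

  monty <- function(n = 1e5) {
    car  <- sample(1:3, n, replace = TRUE)   # door hiding the car
    pick <- sample(1:3, n, replace = TRUE)   # contestant's first pick
    stay <- mean(pick == car)                # win rate if you never switch
    c(stay = stay, switch = 1 - stay)        # switching wins when staying loses
  }
  monty()   # roughly stay = 1/3, switch = 2/3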

Using Probability as Evidence

Suppose that the event G is that the suspect is the criminal.

- Compute the conditional probability of guilt, P(G | evidence).
- Evidence E: suspect and criminal possess the same trait γ (e.g. a DNA match).
- Bayes Theorem:

  P(G | evidence) = P(evidence | G) P(G) / P(evidence)

- In terms of odds:

  P(I | evidence) / P(G | evidence) = [ P(evidence | I) / P(evidence | G) ] × [ P(I) / P(G) ]

Posterior Odds

Posterior odds equal the likelihood ratio times the prior odds (given 'side' information).

1. Prior odds O(G)?
2. In many problems this needs to be thought about carefully.
3. "What if" analysis?
4. The likelihood ratio is common to all observers and updates everyone's prior probabilities.

Prosecutor's Fallacy

The most common fallacy is confusing P(evidence | G) with P(G | evidence).

- Bayes rule yields

  P(G | evidence) = P(evidence | G) P(G) / P(evidence)

- Your assessment of P(G) will matter.

Island Problem

Suppose there's a criminal on an island of N + 1 people.

- Let I denote innocence and G guilt.
- Evidence E: the suspect matches a trait with the criminal.
- The probabilities are

  p(E | I) = p and p(E | G) = 1

Bayes Factor

Bayes factors are likelihood ratios.

- The Bayes factor is given by

  p(E | I) / p(E | G) = p

- If we start with a uniform prior distribution we have

  p(I) = N / (N + 1) and odds(I) = N

- Priors will matter!

Island Problem (contd)

The posterior probability is related to the posterior odds by

  p(I | y) = odds(I | y) / (1 + odds(I | y))

- Prosecutor's fallacy: the posterior probability p(I | y) ≠ p(y | I) = p.
- Suppose that N = 10³ and p = 10⁻³. Then the posterior odds on innocence are

  odds(I | y) = p · N = 10⁻³ · 10³ = 1, so p(I | y) = 1/2.

  There's a 50/50 chance that the criminal has been found.

Sally Clark Case: Independence or Bayes?

Sally Clark was accused and convicted of killing her two children. They could both have died of SIDS.

- The chance of a family who are non-smokers and over 25 having a SIDS death is around 1 in 8,500.
- The chance of a family which has already had a SIDS death having a second is around 1 in 100.
- The chance of a mother killing her two children is around 1 in 1,000,000.

Bayes or Independence

1. Under Bayes,

   P(both SIDS) = P(first SIDS) P(second SIDS | first SIDS) = (1/8500) × (1/100) = 1/850,000

   The 1/100 comes from taking genetics into account.
2. Independence, as the court assumed, gets you

   P(both SIDS) = (1/8500) × (1/8500) ≈ 1/73,000,000

3. By Bayes rule,

   p(I | E) / p(G | E) = P(E ∩ I) / P(E ∩ G)

   where P(E ∩ I) = P(E | I) P(I) needs a discussion of p(I).

Comparison

- Hence putting these two together gives the posterior odds of innocence as

  p(I | E) / p(G | E) = (1/850,000) / (1/1,000,000) ≈ 1.18

  In terms of posterior probabilities,

  p(G | E) = 1 / (1 + 1.18) ≈ 0.46

- If you use independence,

  p(I | E) / p(G | E) = 1/73 and p(G | E) ≈ 0.99

  The suspect looks guilty.

OJ Simpson Case

Prosecutor's Fallacy: P(G | T) ≠ P(T | G).

- Dr. Robin Cotton testified that the chance of the blood being someone's other than Simpson's was one in 17 million or more.
- Suppose that you are a juror in a murder case of a husband who is accused of killing his wife. The husband is known to have battered her in the past. Consider the three events: G = "husband murdered his wife in a given year", M = "wife is murdered in a given year", B = "husband is known to batter his wife".
- You now have the following probability facts. Only one tenth of one percent of husbands who batter their wives actually murder them. Conditional on eventually murdering their wife, there is a one in ten chance it happens in a given year.

OJ Simpson (contd)

- In a given year, there are about 5 murders per 100,000 of population in the United States. How do you use the prior information?

  p(G | M, B) / p(I | M, B) = [ p(M | G, B) / p(M | I, B) ] × [ p(G | B) / p(I | B) ]

  By assumption, we have

  p(M | G, B) = 1,  p(M | I, B) = 1/20,000  and  p(G | B) / p(I | B) = 1/10,000

- Putting it all together,

  p(G | M, B) / p(I | M, B) = 2 and p(G | M, B) = 2/3

  More than a 50/50 chance that your spouse murdered you!

Fallacy: p(G | B) ≠ p(G | B, M)

Alan Dershowitz stated to the press that fewer than 1 in 2,000 batterers go on to murder their wives (in any given year).

- We estimate p(M | Ḡ, B) = p(M | Ḡ) = 1/20,000.
- The posterior odds are then

  p(G | M, B) / p(Ḡ | M, B) = (1/2,000) / (1/20,000) = 10

  which implies posterior probabilities

  p(Ḡ | M, B) = 1/11 and p(G | M, B) = 10/11

Hence there's over a 90% chance that O.J. is guilty based on this information! Dershowitz intended this information to exonerate O.J.

Proving Innocence

DNA evidence is very good at proving innocence:

  O(G | E) = [ p(y | G) / p(y | I) ] × O(G)

- A small likelihood ratio p(y | G) / p(y | I) implies O(G | E) is also small.
- Hence we see that no match implies innocence.
- The match probability is the probability that an unrelated individual chosen randomly from a reference population will match the profile.

More than One Criminal

Suppose the evidence E is that there were two criminals, and the suspect matches the common blood type C.

- Rare blood type (p = 0.1) and common blood type C (p = 0.6).
- Evidence for the prosecution?
- Bayes: P(C | I) = 0.6 and P(C | G) = 0.5, so

  BF = P(C | G) / P(C | I) = 0.5 / 0.6 = 0.83 < 1!

- Evidence for the defense: the posterior odds on guilt are decreased.

The Gatecrasher

Example:

- 1,000 people attend an event, and after collecting receipts (no tickets) you realize only 499 paid.
- On the basis of the base-rate evidence, the only potentially relevant evidence, there's a 0.501 probability that a random person is guilty.
- Does this civil case deserve to succeed on the balance of probabilities?

Base Rate Fallacies

A witness is 80% reliable and says she saw a "checker" C taxi in the accident.

- What's your P(C | E)?
- You need P(C). Say P(C) = 0.2, with reliability P(E | C) = 0.8 and P(E | C̄) = 0.2.
- Then your posterior is

  P(C | E) = (0.8 × 0.2) / (0.8 × 0.2 + 0.2 × 0.8) = 0.5

  Therefore O(C | E) = 1: a 50/50 bet.

Updating Fallacies

When you have a small sample size, Bayes rule still updates probabilities.

- Two players: A is either a 70% or a 30% winner against B.
- You observe A beat B 3 times out of 4.
- What's P(A is the 70% player)?
- Most people don't update quickly enough in light of new data (Edwards, 1968).
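The exact Bayesian update, as an R sketch (assuming a 50/50 prior over the two player types, which the slide leaves implicit):

  prior <- c(strong = 0.5, weak = 0.5)                  # assumed prior
  like  <- c(strong = 0.7^3 * 0.3, weak = 0.3^3 * 0.7)  # 3 wins out of 4
  post  <- prior * like / sum(prior * like)
  post["strong"]   # about 0.84; most people report far less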

Sequential Learning: Bellman

Bellman's principle of optimality: "An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision." (Bellman, 1956)

In math: the value function for a finite horizon in discrete time with no discounting is given by

  V(x, t) = max_d { u(x, d, t) + V(p(x, d, t), t + 1) } for t < T

Secretary Problem

Also called the matching or marriage problem.

- You will see items (spouses) drawn from a distribution of types F(x).
- You clearly would like to pick the maximum. You see candidates chronologically; after you decide no, you can't go back and select one.
- Strategy: wait for a fraction

  1/e = 1/2.718281828 ≈ 0.3678

  of the sequence, then select the first item greater than the current best.

Strategy

What's your best strategy?

- It turns out to be insensitive to the choice of distribution.
- Although there is the random-sample i.i.d. assumption lurking.
- You'll no doubt get married between 18 and 60. Waiting a fraction 1/e along this sequence gets you to about age 32.
- Then pick the next best person!
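A simulation sketch of the 1/e rule on i.i.d. candidates (secretary is a hypothetical helper; the classical success criterion, picking the overall best, is assumed):

  secretary <- function(n = 100, reps = 1e4) {
    wins <- replicate(reps, {
      x <- runif(n)                        # candidate qualities
      k <- floor(n / exp(1))               # observe the first n/e, select none
      best_seen <- max(x[1:k])
      later <- which(x[(k + 1):n] > best_seen) + k
      pick  <- if (length(later) > 0) later[1] else n
      x[pick] == max(x)                    # did we pick the best?
    })
    mean(wins)                             # about 0.37, i.e. 1/e
  }
  secretary()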

Q-Learning and Deal or No Deal

Rule of thumb: continue as long as there are two large prizes left.

- The continuation value is large. For example, with three prizes and two large ones, risk-averse people will naively choose "deal", when if they incorporated the continuation value they would choose "no deal".
- Data: Suzanne's and Frank's choices on the Dutch version of the show.

Susanne's Choices

The original table lists the 26 prizes (from e0.01 up to e250,000) and which remain after each round. The summary rows are:

  Round           1        2        3        4        5        6        7        8        9
  Average e   32,094   21,431   26,491   34,825   46,417   50,700   62,750   83,667  125,000
  Offer e      3,400    4,350   10,000   15,600   25,000   31,400   46,000   75,300  125,000
  Offer %        11%      20%      38%      45%      54%      62%      73%      90%     100%
  Decision   No Deal  No Deal  No Deal  No Deal  No Deal  No Deal  No Deal  No Deal  No Deal

Table: Frank's Choices

The original table lists the 26 prizes (from e0.01 up to e5,000,000) and which remain after each round. The summary rows are:

  Round           1        2        3        4        5        6        7        8        9
  Average e  383,427   64,502   85,230   95,004   85,005  102,006    2,508    3,343    5,005
  Offer e     17,000    8,000   23,000   44,000   52,000   75,000    2,400    3,500    6,000
  Offer %         4%      12%      27%      46%      61%      74%      96%     105%     120%
  Decision   No Deal  No Deal  No Deal  No Deal  No Deal  No Deal  No Deal  No Deal  No Deal

Q-Learning: Rationalising Waiting

There's a matrix of Q-values that solves the problem.

- Let s denote the current state of the system and a an action. We use a = 1 to mean "no deal" and a = 0 to mean "deal".
- The Q-value, Q_t(s, a), is the value of using action a today and then proceeding optimally in the future.
- The Bellman equation for Q-values becomes

  Q_t(s, a) = u(s, a) + ∑_{s*} P(s* | s, a) max_a Q_{t+1}(s*, a)

Value Function

The value function and optimal action are given by

  V(s) = max_a Q(s, a) and a* = argmax_a Q(s, a)

- Transition matrix. Consider the problem where you have three prizes left, so s is the current set of three prizes. Then s* ∈ {all sets of two prizes} and

  P(s* | s, a = 1) = 1/3

  i.e. the transition to the next state is uniform.
- There's no continuation for a = 0: P(s* | s, a = 0) = 0.

Utility

- Utility. The utility of the next state depends on the contestant's value for money and the bidding function of the banker:

  u(B(s*)) = (B(s*)^{1−γ} − 1) / (1 − γ)

  in the power utility case.

Banker's Function

Offering the expected value implies B(s) = s̄, the average of the remaining prizes s.

- The website uses the following criteria: with three prizes left,

  B(s) = 0.305 × big + 0.5 × small

  and with two prizes left,

  B(s) = 0.355 × big + 0.5 × small

Example

Three prizes left: s = {750, 500, 25}.

- Assume the contestant is risk averse with log-utility U(x) = ln x.
- If the banker offers the expected value we get

  u(B(s = {750, 500, 25})) = ln(1275/3) = 6.052

  and so Q_t(s, a = 0) = 6.052.
- In the continuation problem, s* ∈ {s1*, s2*, s3*} where

  s1* = {750, 500}, s2* = {750, 25} and s3* = {500, 25}.

Q-values

Under expected-value offers, the banker will offer 625, 387.5 and 262.5 in the three continuation states.

- As the banker offers the expected value, the optimal action at time t + 1 is to take the deal, a = 0, with Q-values given by

  Q_t(s, a = 1) = ∑_{s*} P(s* | s, a = 1) max_a Q_{t+1}(s*, a)
               = (1/3)(ln(625) + ln(387.5) + ln(262.5)) = 5.989

  as the immediate utility is u(s, a = 1) = 0.
- Hence, as

  Q_t(s, a = 1) = 5.989 < 6.052 = Q_t(s, a = 0),

  the optimal action is a* = 0: deal.

Utilities

Continuation value: not large enough to overcome the generous (expected value) offer from the banker.

- Sensitivity analysis with a different banker bidding function. If we use the function from the website (two prizes),

  B(s) = 0.355 × big + 0.5 × small

  then

  B(s1* = {750, 500}) = 516.25
  B(s2* = {750, 25}) = 278.75
  B(s3* = {500, 25}) = 190

Optimal Action

- With two prizes left, the contestant compares continuing against the banker's offer:

  Q_{t+1}(s1*, a = 1) = (1/2)(ln(750) + ln(500)) = 6.415 > 6.246 = ln(516.25) = Q_{t+1}(s1*, a = 0)
  Q_{t+1}(s2*, a = 1) = (1/2)(ln(750) + ln(25)) = 4.919 < 5.630 = ln(278.75) = Q_{t+1}(s2*, a = 0)
  Q_{t+1}(s3*, a = 1) = (1/2)(ln(500) + ln(25)) = 4.716 < 5.247 = ln(190) = Q_{t+1}(s3*, a = 0)

  Hence the future optimal policy is no deal under s1*, and deal under s2* and s3*.

Solve for Q-values

- Therefore, solving for Q-values at the previous step gives

  Q_t(s, a = 1) = ∑_{s*} P(s* | s, a = 1) max_a Q_{t+1}(s*, a)
               = (1/3)(6.415 + 5.630 + 5.247) = 5.764

  with monetary equivalent exp(5.764) = 318.62.

Premium

With three prizes we have

  Q_t(s, a = 0) = u(B(s = {750, 500, 25})) = ln(0.305 × 750 + 0.5 × 25) = ln(241.25) = 5.486

so the contestant is offered $241.

- Now we have Q_t(s, a = 1) = 5.764 > 5.486 = Q_t(s, a = 0), so the optimal action is a* = 1: no deal. The continuation value is large.
- The certainty equivalent of continuing is $319 against an offer of $241: a 33% premium.
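The whole three-prize example can be reproduced in a few lines of R (a sketch under the slides' assumptions, log utility and the website banker functions; all object names are mine):

  s <- c(750, 500, 25)
  pairs <- list(c(750, 500), c(750, 25), c(500, 25))    # continuation states s*
  bank2 <- function(s) 0.355 * max(s) + 0.5 * min(s)    # two-prize banker function
  bank3 <- function(s) 0.305 * max(s) + 0.5 * min(s)    # three-prize banker function

  q_deal   <- sapply(pairs, function(p) log(bank2(p)))  # Q_{t+1}(s*, deal)
  q_nodeal <- sapply(pairs, function(p) mean(log(p)))   # Q_{t+1}(s*, no deal)
  v <- pmax(q_deal, q_nodeal)                           # optimal continuation values

  c(deal = log(bank3(s)), nodeal = mean(v))             # 5.486 versus 5.764: no deal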

Overview

Bayes accounts for probability in many scenarios:

- Prisoner's Dilemma
- Monty Hall
- Island Problem
- Sally Clark
- O.J. Simpson
- Bellman and Sequential Learning

Probability: 41901 Week 3: Distributions Nick Polson http://faculty.chicagobooth.edu/nicholas.polson/teaching/41901/

Topics

1. Markov Dependence
2. SP500 Hidden Markov Model
3. Random Variables
4. Transformations
5. Expectations
6. Moment Generating Functions

Reading: DeGroot and Schervish, Chapter 3 (pp. 97-181) and Chapter 4 (pp. 181-247).

Markov Dependence

- We can always factor a joint distribution as

  p(X_n, X_{n−1}, ..., X_1) = p(X_n | X_{n−1}, ..., X_1) ... p(X_2 | X_1) p(X_1)

- A process has the Markov property if

  p(X_n | X_{n−1}, ..., X_1) = p(X_n | X_{n−1})

- Only the most recent value matters when determining the probabilities.

SP500 Daily Ups and Downs

- Daily return data from 1948 to 2007 for the SP500 index of stocks.
- Can we calculate the probability of ups and downs?

A Real-World Probability Model: Hidden Markov Models

Are stock returns a random walk? Hidden Markov Models (Baum-Welch, Viterbi).

- Daily returns on the SP500 stock market index: build a hidden Markov model to predict the ups and downs.
- Suppose that stock market returns on the next four days are X_1, ..., X_4.
- Let's empirically determine the conditionals and marginals.

SP500 Data: Marginal and Bivariate Distributions

- Empirically, what do we get? Daily returns from 1948-2007:

  x             Down    Up
  P(X_i = x)    0.474   0.526

- Finding p(X_2 | X_1) is twice as much computational effort: counting UU, UD, DU, DD transitions.

  X_i              Down    Up
  X_{i−1} = Down   0.519   0.481
  X_{i−1} = Up     0.433   0.567

Conditioned on Two Days

- Let's do p(X_3 | X_2, X_1):

  X_{i−2}   X_{i−1}   Down    Up
  Down      Down      0.501   0.499
  Down      Up        0.412   0.588
  Up        Down      0.539   0.461
  Up        Up        0.449   0.551

- We could also find the distribution p(X_2, X_3 | X_1). This is a joint, marginal and conditional distribution all at the same time: joint because it involves more than one variable (X_2, X_3), marginal because it ignores X_4, and conditional because it's given X_1.

Joint Probabilities

- Under Markov dependence,

  P(UUD) = p(X_1 = U) p(X_2 = U | X_1 = U) p(X_3 = D | X_2 = U)
         = (0.526)(0.567)(0.433) = 0.129

- Under independence we would have got

  P(UUD) = P(X_1 = U) p(X_2 = U) p(X_3 = D) = (0.526)(0.526)(0.474) = 0.131
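Estimating such a transition matrix in R is a one-liner once the returns are coded as up/down. A sketch on simulated stand-in returns (the real exercise would use the SP500 series):

  r <- rnorm(10000, mean = 0.0003, sd = 0.01)     # stand-in for daily returns
  x <- ifelse(r > 0, "Up", "Down")
  trans <- table(from = head(x, -1), to = tail(x, -1))
  prop.table(trans, margin = 1)                   # rows are p(X_i | X_{i-1})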

Special Distributions

See Common Distributions:

1. Bernoulli and Binomial
2. Hypergeometric
3. Poisson
4. Negative Binomial
5. Normal Distribution
6. Gamma Distribution
7. Beta Distribution
8. Multinomial Distribution
9. Bivariate Normal Distribution
10. Wishart Distribution ...

Continuous Random Variables

1. We will be interested in random variables X : Ω → ℝ.
2. The cumulative distribution function of X is defined as

   F_X(x) = P(X ≤ x) and 1 − F_X(x) = P(X > x)

3. The probability density function of X is defined as

   f_X(x) = F'_X(x) and P(X ≤ x) = ∫_{−∞}^{x} f_X(t) dt

4. We have the following properties: f_X(x) ≥ 0 and F_X(x) is non-decreasing.
5. lim_{x→−∞} F_X(x) = 0 and lim_{x→∞} F_X(x) = 1.

Intuition: cdf

- For a discrete random variable, F_X(x) is a step function with the heights of the steps being P(X = x_i).
- For a continuous variable, it's a continuous non-decreasing curve.

Intuition: pdf

- The intuitive interpretation of a pdf is that, for small Δx,

  P(x < X ≤ x + Δx) = ∫_{x}^{x+Δx} f_X(t) dt ≈ f_X(x) Δx

  X lies in a small interval with probability proportional to f_X(x).
- We will also consider the multivariate generalization of this.

Transformations

The cdf identity gives F_Y(y) = P(Y ≤ y) = P(g(X) ≤ y).

- Hence if the function g(·) is monotone we can invert to get

  F_Y(y) = ∫_{g(x) ≤ y} f_X(x) dx

- If g is increasing, F_Y(y) = P(X ≤ g⁻¹(y)) = F_X(g⁻¹(y)).
- If g is decreasing, F_Y(y) = P(X ≥ g⁻¹(y)) = 1 − F_X(g⁻¹(y)).

Exponential

Example:

- Suppose X ∼ U(0, 1) and Y = g(X) = −log X. Then Y ∼ Exp(1).
- Suppose X ∼ Γ(α, β); what is the pdf of Y = 1/X? The inverse gamma distribution: stochastic variances in regression models.

Transformation Identity

Theorem 1: Let X have pdf f_X(x) and let Y = g(X). Then if g is a monotone function we have

  f_Y(y) = f_X(g⁻¹(y)) | d g⁻¹(y) / dy |

There's also a multivariate version of this that we'll see later.

- Suppose X is a continuous rv; what's the pdf of Y = X²?
- Let X ∼ N(0, 1); what's the pdf of Y = X²?

Probability Integral Transform

Theorem: Suppose that U ∼ U[0, 1]. Then for any continuous distribution function F, the random variable X = F⁻¹(U) has distribution function F.

- Remember that for u ∈ [0, 1], P(U ≤ u) = u, so we have

  P(X ≤ x) = P(F⁻¹(U) ≤ x) = P(U ≤ F(x)) = F(x)

  Hence X = F_X⁻¹(U) has cdf F_X.
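This is exactly how one simulates from a distribution given only uniforms. For the exponential, F⁻¹(u) = −log(1 − u)/λ, so a minimal R sketch is:

  lambda <- 2
  u <- runif(10000)
  x <- -log(1 - u) / lambda
  c(mean(x), 1 / lambda)   # both close to 0.5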

Inequalities and Identities

1. Markov: P(g(X) ≥ c) ≤ E(g(X)) / c, where g(X) ≥ 0.
2. Chebyshev: P(|X − μ| ≥ c) ≤ Var(X) / c².
3. Jensen: E(φ(X)) ≤ φ(E(X)) for concave φ.
4. Cauchy-Schwarz: corr(X, Y) ≤ 1.

Chebyshev follows from Markov. See Mike Steele on Cauchy-Schwarz.

Probability: 41901 Week 4: Expectations Nick Polson http://faculty.chicagobooth.edu/nicholas.polson/teaching/41901/

Expectation and Variance

- The expectation of X is given by

  E(X) = ∫_{−∞}^{∞} x f_X(x) dx

- The variance of X is given by

  Var(X) = E((X − μ)²) = E(X²) − μ²

- Moments can also be found from derivatives of moment generating functions (mgfs).

Conditional Expectation

Let X denote our random variable and Y a conditioning variable.

- In the continuous case,

  E(g(X)) = ∫_{−∞}^{∞} g(x) f_X(x) dx

- The conditional expectation (given Y) is determined by

  E(g(X) | Y) = ∫_{−∞}^{∞} g(x) f_{X|Y}(x | y) dx

  where f_{X|Y}(x | y) denotes the conditional density of X given Y.

Iterated Expectation

- The following fact is known as the law of iterated expectation:

  E_Y( E_{X|Y}(X | Y) ) = E(X)

  This is a very useful way of calculating E(X) (martingales).
- Sometimes called the Tower property of expectations.
- A similar decomposition holds for the variance:

  Var(X) = E(Var(X | Y)) + Var(E(X | Y))

Exponential Distribution

Example:

- One of the most important distributions is the exponential:

  f_X(x) = λ e^{−λx} and F_X(x) = ∫_0^x λ e^{−λt} dt = 1 − e^{−λx}

  λ is called a parameter.
- The expectation of X is

  E(X) = ∫_0^∞ x λ e^{−λx} dx = ∫_0^∞ x d(−e^{−λx}) = 1/λ

- The variance of X is

  Var(X) = E(X²) − μ² = 1/λ²

Simulation

Suppose that X ∼ F_X(x) and let Y = g(X). How do we find F_Y(y) and f_Y(y)?

- von Neumann: given a uniform U, how do we find X = g(U)?
- In the bivariate case (X, Y) → (U, V), we need to find f_{U,V}(u, v) from f_{X,Y}(x, y).
- Applications: simulation, MCMC and particle filtering.

Transformations

An important application: how do we transform multiple random variables? Here's the basic set-up.

- Suppose that we have random variables (X, Y) ∼ f_{X,Y}(x, y), and a transformation of interest given by U = g(X, Y) and V = h(X, Y).
- The problem is how to compute f_{U,V}(u, v). The Jacobian is

  J = ∂(x, y)/∂(u, v) = det [ ∂x/∂u  ∂x/∂v ; ∂y/∂u  ∂y/∂v ]

Bivariate Change of Variable

- Theorem (change of variable):

  f_{U,V}(u, v) = f_{X,Y}( h1(u, v), h2(u, v) ) | ∂(x, y)/∂(u, v) |

  The last term is the Jacobian. It can be calculated in two ways, since

  ∂(x, y)/∂(u, v) = 1 / ( ∂(u, v)/∂(x, y) )

- So we don't always need the inverse transformation (x, y) = (g⁻¹(u, v), h⁻¹(u, v)).

Example: Exponentials

Example:

- Suppose that X and Y are independent, identically distributed random variables, each with an Exp(λ) distribution. Let

  U = X + Y and V = X / (X + Y)

- The joint probability distribution is

  f_{X,Y}(x, y) = λ² e^{−λ(x+y)} for 0 < x, y < ∞

- The inverse transformation is x = uv, y = u(1 − v), with range of definition 0 < u < ∞, 0 < v < 1.

Jacobian

- We can calculate the Jacobian as

  J = det [ ∂x/∂u  ∂x/∂v ; ∂y/∂u  ∂y/∂v ] = det [ v  u ; 1 − v  −u ] = −uv − u(1 − v) = −u

  so |J| = u.
- Thus the joint density of (U, V) is

  f_{U,V}(u, v) = λ² u e^{−λu}

- Therefore U ∼ Γ(2, λ) and V ∼ U(0, 1), independently.

Example: Normals

Suppose that X, Y are i.i.d., each with a standard normal N(0, 1) distribution.

- Let D = X² + Y² and Θ = tan⁻¹(Y/X).
- The joint distribution is

  f_{X,Y}(x, y) = (1/√(2π)) e^{−x²/2} (1/√(2π)) e^{−y²/2} for −∞ < x, y < ∞

- We can calculate the Jacobian as

  J = det [ ∂d/∂x  ∂d/∂y ; ∂θ/∂x  ∂θ/∂y ] = det [ 2x  2y ; −y/(x²+y²)  x/(x²+y²) ] = 2

Example: Normal Simulation

- Thus the joint density of (D, Θ) is

  f_{D,Θ}(d, θ) = (1/4π) e^{−d/2} where 0 ≤ d < ∞, 0 ≤ θ ≤ 2π

- Therefore we have D ∼ Exp(1/2) and Θ ∼ U(0, 2π), independently.
- Box-Muller transform for simulating normal random variables: draw D = −2 log U_1 and Θ = 2π U_2, then

  X = √D cos(Θ) = √(−2 log U_1) cos(2π U_2)
  Y = √D sin(Θ) = √(−2 log U_1) sin(2π U_2)
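A direct R implementation of the transform (box_muller is a hypothetical helper name):

  box_muller <- function(n) {
    u1 <- runif(n); u2 <- runif(n)
    d     <- -2 * log(u1)          # D ~ Exp(1/2)
    theta <- 2 * pi * u2           # Theta ~ U(0, 2*pi)
    cbind(x = sqrt(d) * cos(theta), y = sqrt(d) * sin(theta))
  }
  z <- box_muller(1e5)
  colMeans(z); apply(z, 2, sd)     # roughly 0 and 1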

Simulating a Cauchy

A Cauchy is a ratio of normals: C = X/Y.

- Let X, Y ∼ N(0, 1), so

  f_{X,Y}(x, y) = (1/2π) e^{−(x²+y²)/2}, −∞ < x, y < ∞

- Transformation: U = X/Y and V = |Y|, with inverse x = uv, y = v.
- Jacobian:

  J = det [ v  u ; 0  1 ] = v

- Joint and marginal (the factor 2 sums over the sign of Y):

  f_{U,V}(u, v) = (1/2π) v e^{−v²(1+u²)/2}, 0 < v < ∞, −∞ < u < ∞

  f_U(u) = 2 ∫_0^∞ f_{U,V}(u, v) dv = 2 ∫_0^∞ (1/2π) v e^{−v²(1+u²)/2} dv = (1/π) (1/(1 + u²))

  A Cauchy distribution.

Discrete Distributions

Consider a countably infinite set of possible outcomes Ω = {ω_1, ω_2, ...} with associated probabilities p_i. We must have

  p_i ≥ 0 and ∑_{i=1}^{∞} p_i = 1

- The Bernoulli distribution: the sample space consists of two outcomes, Ω = {H, T}, with p = P(H) and q = 1 − p = P(T).
- Other distributions: Geometric, Poisson and Negative Binomial.

Binomial Distribution

This models the number of heads occurring in n successive trials of the Bernoulli coin.

- Thus Ω = {0, 1, ..., n} and

  f_X(x) = (n choose x) p^x (1 − p)^{n−x}, 0 ≤ x ≤ n

- Generalizations? Random probability and hierarchical exchangeable models.

Sums of Binomial Distributions

Distribution of sums:

- The probability generating function is

  E(z^X) = ∑_{r=0}^{n} z^r (n choose r) p^r q^{n−r} = (pz + q)^n

- Therefore if X ∼ Bin(n, p) and Y ∼ Bin(m, p) are independent, X + Y ∼ Bin(n + m, p).
- In general, if X_i ∼ Bin(n_i, p) then ∑_{i=1}^{k} X_i ∼ Bin( ∑_{i=1}^{k} n_i, p ).

Sums of Poisson Distributions

What about X + Y, where X ∼ Poi(λ) and Y ∼ Poi(μ)?

- The probability generating function is

  E(z^X) = ∑_{r=0}^{∞} z^r e^{−λ} λ^r / r! = e^{−λ(1−z)}

  Let z = e^t to get the moment generating function.
- Hence for the sum X + Y we have

  e^{−λ(1−z)} e^{−μ(1−z)} = e^{−(λ+μ)(1−z)}

  Hence X + Y ∼ Poi(λ + μ).

Sums of Random Variables

- For any random variables, the expectation of the sum is the sum of the expectations:

  E(X_1 + ... + X_n) = E(X_1) + ... + E(X_n)

- If X_1, ..., X_n are independent and identically distributed (i.i.d.),

  Var( (X_1 + ... + X_n) / n ) = Var(X_1) / n

Cauchy Distribution

- The Cauchy distribution has pdf

  f_X(x) = (1/π) (1/(1 + x²))

- It is a special case of the t_ν-distribution, which has density proportional to

  f_X(x | ν) = C_ν (1 + x²/ν)^{−(1+ν)/2}

- ν = 1: T is a Cauchy.
- ν → ∞: T →_d N(0, 1).
- E(T) = 0 and Var(T) = ν/(ν − 2) for ν > 2.

Sums of Cauchys

Example:

- If X_1, ..., X_n ∼ C(0, 1) then

  (1/n)(X_1 + ... + X_n) ∼ C(0, 1)

  Averaging a set of Cauchys leads back to the same Cauchy!
- The normal distribution behaves differently: if X_i ∼ N(μ, σ²),

  (1/n)(X_1 + ... + X_n) ∼ N(μ, σ²/n)

  Eventually X̄ → μ, a constant (law of large numbers).

Moment Generating Functions

- Definition. The moment generating function (mgf) is defined by

  m_X(θ) = E(e^{θX}) = ∫_{−∞}^{∞} e^{θx} f_X(x) dx

- Here θ is a parameter. You'll need to check when the function m_X(θ) exists.

Moment Identity

- There is an equivalence between knowing the mgf and the set of moments of the distribution, given by the following.

  Theorem: If X has mgf m_X(θ), then E(X^n) = m_X^{(n)}(0),

  where m_X^{(n)}(0) denotes the nth derivative evaluated at zero.
- What about a C(0, 1) random variable?

Characteristic Functions

Characteristic function: φ_X(t).

- This function exists for all distributions and all values of t. It is defined by

  φ_X(t) = E(e^{itX})

  where e^{itX} = cos(tX) + i sin(tX) with i² = −1.
- Technically it is better to do everything in terms of characteristic functions as they always exist:

  |φ_X(t)| ≤ E(|e^{itX}|) = 1

What is φ_C(t) for the Cauchy distribution? φ_C(t) = e^{−|t|}.

- The mgf only exists at the origin.
- We can calculate

  φ_X(t) = (1/π) ∫_{−∞}^{∞} e^{itx} / (1 + x²) dx = e^{−|t|}

- Cauchy's residue theorem calculates the complex integral.

Properties

Here's a list of special properties:

- Theorem: If M_X(t) = M_Y(t) then F_X(x) = F_Y(x).
- mgf of an average: M_X̄(t) = M_X(t/n)^n.
- Example: Normal and Cauchy distributions.
- M_{aX+b}(t) = e^{bt} M_X(at).
- Probability generating function: p_X(z) = E(z^X) with z = e^θ.

mgf for Normals

- The mgf of X ∼ N(μ, σ²) is given by

  m_X(θ) = exp( μθ + σ²θ²/2 )

- Hence the sum of two independent normals is normal, as the mgf is

  exp( μ_X θ + σ²θ²/2 ) exp( μ_Y θ + τ²θ²/2 )

  which gives X + Y ∼ N( μ_X + μ_Y, σ² + τ² ).

Sums of mgfs

- If X_1, ..., X_n are independent random variables with mgfs m_{X_i}(θ), then

  m_{X_1 + ... + X_n}(θ) = ∏_{i=1}^{n} m_{X_i}(θ)

- This follows from the identity

  E( e^{θ(X_1 + ... + X_n)} ) = ∏_{i=1}^{n} E( e^{θ X_i} )

- If they are identically distributed,

  m_{X_1 + ... + X_n}(θ) = m_X(θ)^n

mgf = Laplace transform; characteristic function = Fourier transform.

Properties of mgfs

Theorem: The mgf m_X(θ) = E(e^{θX}) uniquely determines the distribution of X, provided it is defined for some open interval of θ values. Then

  m^{(r)}(0) = E(X^r)

- The key identity is

  d/dθ m_X(θ) = E( d/dθ e^{θX} ) = E( X e^{θX} )

Distribution of X̄

- Let (X_1, ..., X_n) be an i.i.d. sample from any distribution; then

  m_X̄(θ) = m_X(θ/n)^n

- Example: suppose the X_i's are N(μ, σ²). Then

  m_X̄(θ) = exp( μθ + (σ²/n) θ²/2 )

  so the distribution of X̄ is normal.
- Example: suppose the X_i's are Γ(α, β); then X̄ ∼ Γ(nα, β/n).

Convergence Concepts

- Convergence in probability: a sequence of random variables X_1, X_2, ... converges in probability to a random variable X if, for every ε > 0,

  lim_{n→∞} P(|X_n − X| ≥ ε) = 0

  or equivalently lim_{n→∞} P(|X_n − X| ≤ ε) = 1.
- Weak Law of Large Numbers: let X_1, X_2, ... be a random sample (i.i.d.) with E(X_i) = μ and Var(X_i) = σ² < ∞. Then, for every ε > 0, X̄_n = (1/n) ∑ X_i satisfies

  lim_{n→∞} P(|X̄_n − μ| ≥ ε) = 0

Almost Sure Convergence

- Almost surely: a sequence of random variables X_1, X_2, ... converges almost surely to a random variable X if, for every ε > 0,

  P( lim_{n→∞} |X_n − X| ≥ ε ) = 0

- Strong Law of Large Numbers: let X_1, X_2, ... be a random sample (i.i.d.) with E(X_i) = μ and Var(X_i) = σ² < ∞. Then, for every ε > 0,

  P( lim_{n→∞} |X̄_n − μ| ≥ ε ) = 0

  where X̄_n = (1/n) ∑ X_i. Hence X̄_n →_{a.s.} μ. In fact a finite variance is not needed (see the martingale convergence proof).

Central Limit Theorem

- Let X_1, X_2, ... be a random sample (i.i.d.) with E(X_i) = μ and Var(X_i) = σ² < ∞. Suppose that the mgf exists in a neighborhood of zero, i.e. m_X(θ) exists for some θ > 0.

Theorem: The standardized sample mean converges in distribution to a normal random variable:

  lim_{n→∞} P( (X_1 + ... + X_n − nμ) / (σ√n) ≤ x ) = ∫_{−∞}^{x} (1/√(2π)) e^{−y²/2} dy = Φ(x)

which is the distribution function of a standard N(0, 1).

- One can expand the mgf m(θ) and consider the limit as n → ∞.
- √n (X̄_n − μ)/σ →_D Z.

Sketch Proof

Key fact: Taylor series expansion of the mgf. Let Y = (X − μ)/σ, so that

  M_{√n(X̄_n − μ)/σ}(t) = M_Y(t/√n)^n

- Expanding,

  M_Y(t/√n) = 1 + (1/2)(t/√n)² + R_Y(t/√n)

- Therefore

  lim_{n→∞} M_Y(t/√n)^n = exp(t²/2)

Order Statistics

- The order statistics of a random sample X_1, ..., X_n are the sample values in ascending order, denoted by X_(1), ..., X_(n); equivalently

  X_(1) = min_{1≤i≤n} X_i = X_min and X_(n) = max_{1≤i≤n} X_i = X_max

- The median X_med is also widely studied, as it is a consistent estimator of the location of a distribution for a wide family and is less sensitive to extreme observations (breakdown point).

Proof

- Consider the cases X_(1) and X_(n):

  F_{X_(n)}(x) = P( max_{1≤i≤n} X_i ≤ x ) = (F_X(x))^n

- Differentiating with respect to x gives

  f_{X_(n)}(x) = n f_X(x) (F_X(x))^{n−1}

- For the minimum, compute

  1 − F_{X_(1)}(x) = P(X_min ≥ x) = (1 − F_X(x))^n

  Differentiating with respect to x gives

  f_{X_(1)}(x) = n f_X(x) (1 − F_X(x))^{n−1}

General Case

- Theorem: Let X_(1), ..., X_(n) denote the order statistics from a population with cdf F_X(x) and pdf f_X(x). Then the pdf of X_(j) is

  f_{X_(j)}(x) = ( n! / ((j−1)! (n−j)!) ) f_X(x) (F_X(x))^{j−1} (1 − F_X(x))^{n−j}

- For the cdf in the general case we get

  F_{X_(j)}(x) = ∑_{k=j}^{n} (n choose k) (F_X(x))^k (1 − F_X(x))^{n−k}

Exchangeability: Bayes Billiard Balls

- You roll a billiard ball on a table and record its distance θ ∈ [0, 1].
- You then roll n balls. What's the distribution of the number of balls Y = y_1 + ... + y_n to the left of θ?

  P(Y = k | θ) = (n choose k) θ^k (1 − θ)^{n−k} where θ ∼ p(θ)

  P(Y = k) = ∫_0^1 (n choose k) θ^k (1 − θ)^{n−k} p(θ) dθ

  If p(θ) ∼ U(0, 1) then Y is uniform: each value has probability 1/(n + 1).

Independence versus Exchangeability

- de Finetti's theorem: a sequence of 0-1 exchangeable random variables is a mixture of i.i.d. Bernoullis for some prior distribution f(p):

  p(y_1, ..., y_n) = ∫_0^1 p^{∑_{i=1}^n y_i} (1 − p)^{n − ∑_{i=1}^n y_i} f(p) dp

- Independence yields

  p(y_1 = 0, y_2 = 0, ..., y_n = 0) = p^n

  where p is the probability of a single zero.
- Independence underestimates the odds of a tail event (Black Swan).

Moments Matter

- Prior moments matter. Consider the Binomial-Beta modeling of a sequence of successes and failures. The marginal probability of a failure is the prior mean:

  p(y_1 = 0) = E(p)

- If I observe a failure, the conditional probability of the next failure is a ratio of moments:

  p(y_2 = 0 | y_1 = 0) = E(p²) / E(p)

- Under independence this would stay at the prior mean, E(p).

Probability: 41901 Week 5: Modern Regression Methods Nick Polson http://faculty.chicagobooth.edu/nicholas.polson/teaching/41901/

Topics

- Keynes vs Buffett: Regression
- Superbowl Spread Data
- Logistic Regression
- Least Squares
- Ridge Regression
- Lasso Regression

Keynes versus Buffett

CAPM regressions:

  keynes = 15.08 + 1.83 market
  buffett = 18.06 + 0.486 market

Keynes's annual returns running the King's College Cambridge endowment versus the market:

  Year    Keynes   Market
  1928     -3.4      7.9
  1929      0.8      6.6
  1930    -32.4    -20.3
  1931    -24.6    -25.0
  1932     44.8     -5.8
  1933     35.1     21.5
  1934     33.1     -0.7
  1935     44.3      5.3
  1936     56.0     10.2
  1937      8.5     -0.5
  1938    -40.1    -16.1
  1939     12.9     -7.2
  1940    -15.6    -12.9
  1941     33.5     12.5
  1942     -0.9      0.8
  1943     53.9     15.6
  1944     14.5      5.4
  1945     14.6      0.8

Figure: Keynes's returns versus cash over the period.

Example: Super Bowl Spread Data

       Favorite       Underdog      Spread  Outcome  Upset
   1   GreenBay       KansasCity     14.0     25.0     0
   2   GreenBay       Oakland        13.5     19.0     0
   3   Baltimore      NYJets         18.0     -9.0     1
   4   Minnesota      KansasCity     12.5    -16.0     1
   5   Dallas         Baltimore       2.0     -3.0     1
   6   Dallas         Miami           6.0     21.0     0
   7   Miami          Washington      1.0      7.0     0
   8   Miami          Minnesota       7.0     17.0     0
   9   Pittsburgh     Minnesota       3.0     10.0     0
  10   Pittsburgh     Dallas          6.0     -4.0     1
  11   Oakland        Minnesota       4.5     18.0     0
  12   Dallas         Denver          5.5     17.0     0
  13   Pittsburgh     Dallas          3.5      4.0     0
  14   Pittsburgh     LARams         10.5     12.0     0
  15   Philadelphia   Oakland         3.0    -17.0     1
  16   SanFrancisco   Cincinnati      1.0      5.0     0
  17   Miami          Washington      3.0    -10.0     1
  18   Washington     LARaiders       2.5    -29.0     1
  19   SanFrancisco   Miami           3.0     22.0     0
  20   Chicago        NewEngland     10.0     36.0     0
  21   NYGiants       Denver          9.5     19.0     0
  22   Denver         Washington      3.0    -32.0     1

The dataset continues in the same format through Super Bowl 46 (n = 46 games).

Superbowl Spread Data

Let's build a regression model:

  model = lm(Outcome ~ Spread)

  Residuals:
      Min      1Q  Median      3Q     Max
  -35.334  -6.503   0.788   8.475  33.761

  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)   0.5444     4.3790   0.124   0.9016
  Spread        0.9299     0.5083   1.829   0.0741 .

  Residual standard error: 15.74 on 44 degrees of freedom
  Multiple R-squared: 0.07068, Adjusted R-squared: 0.04956
  F-statistic: 3.347 on 1 and 44 DF, p-value: 0.07413

The fitted line is Outcome = 0.544 + 0.930 Spread.

How Close is the Spread?

Plot the empirical versus theoretical relationship:

  plot(Outcome, Spread, pch=20, col=4)   # plot data
  abline(super, col=2)                   # add the least squares line
  abline(a=0, b=1, col=3)                # add the line outcome = spread

- The two lines are nearly on top of each other!
- There's still a lot of uncertainty left over. This is measured by the residual standard error: s = 15.7.

Regression Model: Superbowl Data

Figure: Superbowl outcomes versus spreads, with the least-squares fit and the 45-degree line outcome = spread.

Data

Let's look at favourites and underdogs.

Figure: the same scatterplot with points labelled by favourite and by underdog.

Residuals

- We should check our modelling assumptions.
- The studentized residuals are given by rstudent(model).

Figure: histogram and normal Q-Q plot of the studentized residuals.

Logistic Regression

Suppose that we have a response variable of ones and zeroes, Y = 1, 0, and X is our usual set of predictors/covariates. The logistic regression model is linear in the log-odds:

  log( P(win) / (1 − P(win)) ) = x'β

Data: NBA point spreads. Logistic regression: does the point spread predict whether the favourite wins or not? The β will be highly statistically significant.

Data

Let's look at the histogram of the data first, by group.

Figure: histograms of the spread for favwin = 1 and favwin = 0.

R: Logit

Logistic regression:

  nbareg = glm(favwin ~ spread - 1, family = binomial)
  summary(nbareg)

  Call: glm(formula = favwin ~ spread - 1, family = binomial)

  Deviance Residuals:
      Min      1Q  Median      3Q     Max
  -2.5741  0.1587  0.4620  0.8135  1.1119

  Coefficients:
         Estimate Std. Error z value
  spread  0.15600    0.01377   11.33

Least Squares

- Least squares:

  β̂ = (X'X)⁻¹ X'y

  This can be numerically unstable when X'X is ill-conditioned, which happens when p is large.
- Ridge regression:

  β̂_ridge = (X'X + λI)⁻¹ X'y

  computed over a regularisation path of λ's.
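Both estimators are one line of linear algebra in R. A sketch on simulated data (λ fixed rather than cross-validated; all object names are mine):

  set.seed(1)
  n <- 50; p <- 30
  X <- matrix(rnorm(n * p), n, p)
  beta <- c(runif(10, 0.5, 1), rep(0, p - 10))           # 10 large coefficients
  y <- X %*% beta + rnorm(n)

  ols   <- solve(t(X) %*% X, t(X) %*% y)                 # least squares
  ridge <- solve(t(X) %*% X + 5 * diag(p), t(X) %*% y)   # ridge, lambda = 5
  c(sum((ols - beta)^2), sum((ridge - beta)^2))          # ridge is usually closer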

Least Absolute: L1-norm

Least Absolute Shrinkage and Selection Operator (Lasso). Show that the solution to the lasso objective

  arg min_β { (1/2)(y − β)² + λ|β| }

is the soft-thresholding operator defined by

  β̂ = soft(y; λ) = sgn(y) (|y| − λ)₊

Here sgn is the sign function and (x)₊ = max(x, 0).

- Hint: define a slack variable z = |β| and solve the joint constrained optimisation problem, which is differentiable.
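The operator itself, as a two-line R sketch:

  soft <- function(y, lambda) sign(y) * pmax(abs(y) - lambda, 0)
  soft(c(-3, -0.5, 0.5, 3), lambda = 1)   # -2  0  0  2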

Linear Regression Shortcomings

Setup: y_i = x_i'β + e_i where β = (β_1, ..., β_p), for 1 ≤ i ≤ n. Equivalently: y = Xβ + e and min_β ||y − Xβ||².

- Predictive ability: the MLE (maximum likelihood) or OLS (ordinary least squares) estimator is designed to have zero bias. That means it can suffer from high variance! There's a bias/variance tradeoff that we have to manage.
- Interpretability.
- Cross-validation: fit the model on training data; use the model to predict Y-values for the holdout sample; calculate the predicted MSE

  (1/N) ∑_{j=1}^{N} (Y_j − Ŷ_j)²

  Smallest MSE wins.

Ridge Regression

Similar to least squares, but it shrinks the β̂'s towards zero. Formally, we have the minimisation problem

  β̂_ridge = arg min_β ||y − Xβ||² + λ ∑_{j=1}^{p} |β_j|²

- Equivalently called Tikhonov regularisation: ∑_{j=1}^{p} |β_j|² = ||β||².
- Nothing is for free: we have to pick λ.
- Regularisation path: sensitivity of the β̂'s to λ.
- Out-of-sample cross-validation for λ̂.

Example

Simulated data: n = 50, p = 30 and σ² = 1. The true model is linear with 10 large coefficients between 0.5 and 1.

- Linear regression: bias squared = 0.006 and variance = 0.627. Prediction error = 1 + 0.006 + 0.627 = 1.633.
- We'll do better by shrinking the coefficients to reduce the variance.
- How big a gain will we get with Ridge/Bayes regression?

Example: True Coefficients

Figure: histogram of the true coefficients. Shrinkage will help.

Gain in Prediction Error

Figure: prediction error of linear regression versus ridge regression against the amount of shrinkage.

Ridge regression at its best: bias squared = 0.077 and variance = 0.403. Prediction error = 1 + 0.077 + 0.403 = 1.48.

Bias-Variance Tradeoff

Simulated data: n = 50, p = 30.

Figure: linear MSE versus ridge MSE, with the ridge bias² and variance components, as a function of λ.

Lasso Regression

A shrinkage method that performs automatic selection. It is designed to solve the minimisation problem

  β̂_lasso = arg min_β ||y − Xβ||² + λ ∑_{j=1}^{p} |β_j|

- An L1-regularisation penalty.
- Nice feature: it forces some of the β̂'s to zero. Automatic variable selection!
- Computationally fast: convex optimisation.
- Still have to pick λ.
- Regularisation and predictive cross-validation.

Regularisation Path: Coefficient Estimates

Figure: Lasso coefficient estimates versus log λ for the prostate data (lcavol, svi, lweight, lbph, pgg45, gleason, lcp, age).

What about Ridge?

Figure: Ridge coefficient estimates versus log λ for the same prostate data.

Prostate Data: CV Regularisation Path

Hastie, Tibshirani and Friedman (2013), The Elements of Statistical Learning.

Figure: Lasso cross-validated mean-squared error versus log λ.

What about Ridge?

Figure: Ridge cross-validated mean-squared error versus log λ.

MSE

Example:

- A very popular statistical criterion is mean squared error (MSE), defined by

  MSE = (1/N) ∑_{i=1}^{N} (Y_i − Ŷ_i)²

- You make a prediction Ŷ about the variable Y. After the outcome Y, you calculate (Y − Ŷ)².
- In data mining, it is popular to use a holdout sample: after you've built your statistical model, you test it out-of-sample in terms of its mean squared error performance.

Example

- The Brier score of a set of probability forecasts Ŷ for 0/1 outcomes Y is their mean squared error, (1/N) ∑ (Y − Ŷ)².
- You then check whether it is statistically significantly different from zero.
- It's less sensitive to outliers but not as popular as MSE.
- Your forecasts are calibrated, or unbiased, if E(Y | Ŷ) = Ŷ.

Netflix Prize

I A massive dataset of Netflix customers' preferences: people rating movies 1-5, 180 million data-points.
I Movies rated on a scale of 1-5.
I The challenge was a 10% improvement on the then-current Netflix prediction algorithm.
I Build a model to make predictions recommending movies you haven't seen yet.

MSE Goal

I $1 million prize if you could improve the Netflix rating system by 10%.
I Finally won in October 2009, after two years.
I The MSE goal was a level of 0.854.
I After two years of improving their MSE and leading for most of 2009, the noted Bayesian statistician Chris Volinsky and his colleagues ended up losing with 4 minutes to go!

Statistical Effects

I Downside momentum: once people start panning movies, they give lower ratings even to ones they like.
I Time trends: Memento (short-term memory loss) had a 3.4 rating on opening days, 4 after a few weeks.
I Model averaging: 700 different models (regressions, ...) with a weighted-average prediction seems to work best.
I New prize: data includes demographics (age, gender) and movie details.

Probability: 41901 Week 6: Bayesian Statistics Nick Polson http://faculty.chicagobooth.edu/nicholas.polson/teaching/41901/

Introduction to Bayesian Methods

Modern Statistical Learning
I Bayes rule and probabilistic learning.
I Computationally challenging: MCMC and particle filtering.
I Many applications in finance: asset pricing and corporate finance problems.

Lindley, D.V. Making Decisions.
Bernardo, J. and A.F.M. Smith. Bayesian Theory.

Recent Bayesian Books

Bayesian Theory and Applications
I Hierarchical models and MCMC
I Bayesian nonparametrics and machine learning
I Dynamic state space models ...

Popular Books

McGrayne (2012): The Theory that Would Not Die
I History of Bayes-Laplace
I Code breaking
I Bayes search: Air France ...

Nate Silver: NYT

Silver (2012): The Signal and the Noise
I Presidential elections
I Bayes as a dominant methodology
I Predicting college basketball/Oscars ...

Things to Know

Explosion of models and algorithms starting in the 1950s
I Bayesian regularisation and sparsity
I Hierarchical models and shrinkage
I Hidden Markov models
I Nonlinear, non-Gaussian state space models

Algorithms
I Monte Carlo method (von Neumann and Ulam, 1940s)
I Metropolis-Hastings (Metropolis, 1950s)
I Gibbs sampling (Geman and Geman; Gelfand and Smith, 1980s)
I Sequential particle filtering

Probabilistic Reasoning

Bayesians only make probability statements.
I Bayesian probability (Ramsey, 1926; de Finetti, 1931)
  1. Beta-binomial learning: Black Swans
  2. Elections: Nate Silver
  3. Baseball: Kenny Lofton and Derek Jeter
I Monte Carlo (von Neumann and Ulam, Metropolis, 1940s)
I Shrinkage estimation (Lindley and Smith; Efron and Morris, 1970s)

Bayes Learning: Beta-Binomial

Recursive Bayesian learning.
I Likelihood for Bernoulli data:

p(y|θ) = ∏_{t=1}^T p(y_t|θ) = θ^{∑_{t=1}^T y_t} (1 − θ)^{T − ∑_{t=1}^T y_t}

I Initial distribution θ ∼ B(a, A). The Beta pdf is given by

p(θ|a, A) = θ^{a−1}(1 − θ)^{A−1} / B(a, A)

I The posterior is Beta:

p(θ|y) ∼ B(a_T, A_T) where a_T = a + ∑_{t=1}^T y_t and A_T = A + T − ∑_{t=1}^T y_t

The posterior mean and variance are

E[θ|y] = a_T/(a_T + A_T) and var[θ|y] = a_T A_T / ( (a_T + A_T)²(a_T + A_T + 1) )
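A minimal sketch of the update above, with hypothetical Bernoulli data and the flat prior a = A = 1.

a <- 1; A <- 1
y <- c(1, 1, 0, 1, 0, 1, 1, 1)        # hypothetical Bernoulli data
aT <- a + sum(y); AT <- A + length(y) - sum(y)

aT / (aT + AT)                                    # posterior mean
aT * AT / ((aT + AT)^2 * (aT + AT + 1))           # posterior variance
curve(dbeta(x, aT, AT), 0, 1, xlab = "theta", ylab = "posterior density")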

Black Swans

Observing white swans and having never seen a black swan (Taleb, 2008, The Black Swan: The Impact of the Highly Improbable). What's the probability of a Black Swan event sometime in the future?
I Suppose that after T trials you have only seen successes: (y_1, ..., y_T) = (1, ..., 1). The probability that the next trial is a success is

p(y_{T+1} = 1|y_1, ..., y_T) = (T + 1)/(T + 2)

For large T this is almost certain. Here a = A = 1.
I The probability of never seeing a Black Swan in the next n trials is

p(y_{T+1} = 1, ..., y_{T+n} = 1|y_1, ..., y_T) = (T + 1)/(T + n + 1) → 0 as n → ∞

I A Black Swan will eventually happen – don't be surprised when it actually does.

Baseball Hierarchical Model

I Batter-pitcher match-up.
I Prior information on the overall ability of a player.
I Small sample sizes, pitcher variation.
I Let p_i denote Jeter's ability against pitcher i. Observed number of hits y_i:

(y_i|p_i) ∼ Bin(T_i, p_i) with p_i ∼ Be(α, β)

where T_i is the number of at-bats against pitcher i.
I A priori E(p_i) = α/(α + β) = p̄_i.
I The extra heterogeneity leads to a prior variance

Var(p_i) = p̄_i(1 − p̄_i)φ where φ = (α + β + 1)^{−1}

Sports Data: Baseball

Table: Kenny Lofton hitting

Pitcher        At-bats  Hits  ObsAvg
J.C. Romero          9     6    .667
S. Lewis             5     3    .600
B. Tomko            20    11    .550
T. Hoffman           6     3    .500
K. Tapani           45    22    .489
A. Cook              9     4    .444
J. Abbott           34    14    .412
A.J. Burnett        15     6    .400
K. Rogers           43    17    .395
A. Harang            6     2    .333
K. Appier           49    15    .306
R. Clemens          62    14    .226
C. Zambrano          9     2    .222
N. Ryan             10     2    .200
E. Hanson           41     7    .171
E. Milton           19     1    .056
M. Prior             7     0    .000
Total             7630  2283    .299

Kenny Lofton versus individual pitchers.

Sports Data: Baseball

On Aug 29, 2006, Kenny Lofton (career .299 average, and current .308 average for the 2006 season) was facing the pitcher Milton (current record 1 for 19). He was rested and replaced by a .275 hitter.
I Is putting in a weaker player really a better bet?
I Was this just an over-reaction to bad luck in the Lofton-Milton matchup?

P(≤ 1 hit in 19 attempts | p = 0.3) = 0.01, an unlikely 1-in-100 event.

I Bayes gives Lofton batting estimates that vary from .265 to .340, the lowest being against Milton.

Conclusion: since .265 < .275, resting Lofton against Milton is justified.

Bayes Model

Batter-pitcher match-up.
I Small sample sizes and pitcher variation.
I Let p_i denote Lofton's ability against pitcher i. Observed number of hits y_i:

(y_i|p_i) ∼ Bin(T_i, p_i) with p_i ∼ Be(α, β)

where T_i is the number of at-bats against pitcher i.

Hierarchical Model

Table: Derek Jeter hierarchical model estimates

Pitcher        At-bats  Hits  ObsAvg  EstAvg  95% Int
R. Mendoza           6     5    .833    .322  (.282, .394)
H. Nomo             20    12    .600    .326  (.289, .407)
A.J. Burnett         5     3    .600    .320  (.275, .381)
E. Milton           28    14    .500    .324  (.291, .397)
D. Cone              8     4    .500    .320  (.218, .381)
R. Lopez            45    21    .467    .326  (.291, .401)
K. Escobar          39    16    .410    .322  (.281, .386)
J. Wettland          5     2    .400    .318  (.275, .375)
T. Wakefield        81    26    .321    .318  (.279, .364)
P. Martinez         83    21    .253    .312  (.254, .347)
K. Benson            8     2    .250    .317  (.264, .368)
T. Hudson           24     6    .250    .315  (.260, .362)
J. Smoltz            5     1    .200    .314  (.253, .355)
F. Garcia           25     5    .200    .314  (.253, .355)
B. Radke            41     8    .195    .311  (.247, .347)
D. Kolb              5     0    .000    .316  (.258, .363)
J. Julio            13     0    .000    .312  (.243, .350)
Total             6530  2061    .316

Derek Jeter 2006 season versus individual pitchers.

Bayes Estimates

I Stern (2007) estimates φ̂ = 0.002 (Jeter): ability doesn't vary much across the population of pitchers.
I The extremes are shrunk the most – also the matchups with the smallest sample sizes.
I Jeter had a season .308 average. His Bayes estimates vary only from .311 to .327: he is very consistent.
I If all players had a similar record then the assumption of a constant batting average would make sense.

Bayes Elections: Nate Silver

Multinomial-Dirichlet: predicting the Electoral Vote (EV).
I Multinomial-Dirichlet: (p̂|p) ∼ Multi(p) and (p|α) ∼ Dir(α), giving the posterior

p_Obama = (p_1, ..., p_51 | p̂) ∼ Dir(α + p̂)

I Flat uninformative prior α ≡ 1.
  http://www.electoral-vote.com/evp2012/Pres/prespolls.csv
I Calculate probabilities via simulation with rdirichlet: p(p_{j,O}|data) and p(EV > 270|data) (a sketch follows after the polling data).
I The election vote prediction is given by the sum

EV = ∑_{j=1}^{51} EV(j) E(p_j|data)

where the EV(j) are the electoral votes for the individual states.

Polling Data: electoral-vote.com

I Electoral Vote (EV), Polling Data: Mitt and Obama percentages

State             M.pct  O.pct  EV
1  Alabama           58     36    9
2  Alaska            55     37    3
3  Arizona           50     46   10
4  Arkansas          51     44    6
5  California        33     55   55
6  Colorado          45     52    9
7  Connecticut       31     56    7
8  Delaware          38     56    3
9  D.C.              13     82    3
10 Florida           46     50   27
11 Georgia           52     47   15
12 Hawaii            32     63    4
13 Idaho             68     26    4
14 Illinois          35     59   21
15 Indiana           48     48   11
16 Iowa              37     54    7
17 Kansas            63     31    6
18 Kentucky          51     42    8
19 Louisiana         50     43    9
20 Maine             35     56    4
21 Maryland          39     54   10
22 Massachusetts     34     53   12
23 Michigan          37     53   17
24 Minnesota         42     53   10
25 Mississippi       46     33    6
26 Missouri          48     48   11
27 Montana           49     46    3
28 Nebraska          60     34    5
29 Nevada            43     47    5
30 New.Hampshire     42     53    4
31 New.Jersey        34     55   15
32 New.Mexico        43     51    5
33 New.York          31     62   31
34 North.Carolina    49     46   15
35 North.Dakota      43     45    3
36 Ohio              47     45   20
37 Oklahoma          61     34    7
38 Oregon            34     48    7
39 Pennsylvania      46     52   21
40 Rhode.Island      31     45    4
41 South.Carolina    59     39    8
42 South.Dakota      48     41    3
43 Tennessee         55     39   11
44 Texas             57     38   34
45 Utah              55     32    5
46 Vermont           36     57    3
47 Virginia          44     47   13
48 Washington        39     51   11
49 West.Virginia     53     44    5
50 Wisconsin         42     53   10
51 Wyoming           58     32    3

Figure: Histogram of sim.EV (density over 260 to 380 electoral votes). Election 2008 Prediction: Obama 370.

Electoral College Results Probability

Figure: Density of simulated electoral votes for Obama (250 to 350), with the 270-EV-to-win threshold and the actual Obama total marked. Election 2012 Prediction: Obama 332.

SAT Scores

SAT (200-800): 8 high schools; estimate coaching effects.

School  Estimated y_j  St. Error σ_j  Average Treatment θ_j
A              28            15              ?
B               8            10              ?
C              -3            16              ?
D               7            11              ?
E              -1             9              ?
F               1            11              ?
G              18            10              ?
H              12            18              ?

I θ_j: the average effect of the coaching program in school j.
I y_j: the estimated treatment effect for school j, with standard error σ_j.

Estimates? Two programs appear to work (improvements of 18 and 28).
I Large standard errors; overlapping confidence intervals.
I A classical hypothesis test fails to reject the hypothesis that the θ_j's are equal.
I The pooled estimate is

θ̂ = ∑_j (y_j/σ_j²) / ∑_j (1/σ_j²) = 7.9

with a standard error of 4.2.
I Neither separate nor pooled estimates seem sensible.

Bayesian shrinkage!

Data and Model

The hierarchical model is given by

y_j|θ_j ∼ N(θ_j, σ_j²), with σ_j² assumed known.

Unequal variances – differential shrinkage.
I Prior distribution: θ_j ∼ N(µ, τ²) for 1 ≤ j ≤ 8. A traditional random effects model: an exchangeable prior for the treatment effects. As τ → 0 we get complete pooling; as τ → ∞, separate estimates.
I Hyper-prior distribution: p(µ, τ²) ∝ 1/τ.

The posterior p(µ, τ²|y) can be used to “estimate” (µ, τ²).

Posterior

I Joint posterior distribution:

p(θ, µ, τ|y) ∝ p(µ, τ) p(θ|µ, τ) p(y|θ)
           ∝ p(µ, τ²) ∏_{j=1}^8 N(θ_j|µ, τ²) ∏_{j=1}^8 N(y_j|θ_j, σ_j²)
           ∝ τ^{−9} exp( −(1/2)∑_j (θ_j − µ)²/τ² − (1/2)∑_j (y_j − θ_j)²/σ_j² )

I MCMC! (A Gibbs-sampler sketch follows.)
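A minimal Gibbs-sampler sketch for this model. For simplicity it uses a flat prior on τ², a slight variant of the 1/τ hyper-prior above; the conditional draws are standard normal-normal updates.

y   <- c(28, 8, -3, 7, -1, 1, 18, 12)
sig <- c(15, 10, 16, 11, 9, 11, 10, 18)
J   <- length(y)

nmc   <- 5000
theta <- rep(0, J); mu <- 0; tau2 <- 1
draws <- matrix(0, nmc, J + 2)

for (m in 1:nmc) {
  V     <- 1 / (1 / sig^2 + 1 / tau2)                 # theta_j | mu, tau2, y
  theta <- rnorm(J, V * (y / sig^2 + mu / tau2), sqrt(V))
  mu    <- rnorm(1, mean(theta), sqrt(tau2 / J))      # mu | theta, tau2
  tau2  <- sum((theta - mu)^2) / rchisq(1, J - 2)     # tau2 | theta, mu
  draws[m, ] <- c(theta, mu, sqrt(tau2))
}
apply(draws, 2, quantile, c(0.025, 0.25, 0.5, 0.75, 0.975))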

Inference

Report posterior quantiles:

School  2.5%   25%   50%   75%  97.5%
A         -2     6    10    16     32
B         -5     4     8    12     20
C        -12     3     7    11     22
D         -6     4     8    12     21
E        -10     2     6    10     19
F         -9     2     6    10     19
G         -1     6    10    15     27
H         -7     4     8    13     23
µ         -2     5     8    11     18
τ        0.3   2.3   5.1   8.8     21

Probability: 41901 Week 7: Hierarchical Models Nick Polson http://faculty.chicagobooth.edu/nicholas.polson/teaching/41901/

Bayesian Shrinkage

Bayesian shrinkage provides a way of modeling complex datasets.

1. Coin tossing: Lindley's paradox
2. Baseball batting averages: Stein's paradox
3. Toxoplasmosis
4. SAT scores
5. Clinical trials

Bayesian Inference

Key idea: explicit use of probability for summarizing uncertainty.
1. A probability distribution for data given parameters, f(y|θ): the Likelihood.
2. A probability distribution for unknown parameters, p(θ): the Prior.
3. Inference for unknowns conditional on observed data: inverse probability (Bayes theorem); formal decision making (loss, utility).

Posterior Inference

I Bayes theorem to derive posterior distributions:

p(θ|y) = p(y|θ)p(θ)/p(y), where p(y) = ∫ p(y|θ)p(θ) dθ

I Allows you to make probability statements – and they can be very different from p-values!
I Hypothesis testing and sequential problems.
I Markov chain Monte Carlo (MCMC) and particle filtering (PF).

Conjugate Priors

I Definition: let F denote the class of distributions f(y|θ). A class Π of prior distributions is conjugate for F if the posterior distribution is in the class Π for all f ∈ F, π ∈ Π, y ∈ Y.
I Example: Binomial/Beta. Suppose that Y_1, ..., Y_n ∼ Ber(p). Let p ∼ Beta(α, β) where (α, β) are known hyper-parameters. The beta family is very flexible and has prior mean E(p) = α/(α + β).

Posterior

I The posterior for p is given by p(p|Ȳ); here Ȳ is a sufficient statistic. Bayes theorem gives

p(p|y) ∝ f(y|p) p(p|α, β)
       ∝ p^{∑ y_i}(1 − p)^{n − ∑ y_i} · p^{α−1}(1 − p)^{β−1}
       ∝ p^{α + ∑ y_i − 1}(1 − p)^{n − ∑ y_i + β − 1}
       ∼ Beta(α + ∑ y_i, β + n − ∑ y_i)

I The posterior mean can be written as a weighted combination of the sample mean ȳ and the prior mean E(p) – a shrinkage estimator.

Example

Example: Poisson/Gamma. Suppose that Y_1, ..., Y_n ∼ Poi(λ). Let λ ∼ Gamma(α, β) where (α, β) are known hyper-parameters.
I The posterior is then

p(λ|y) ∝ exp(−nλ) λ^{∑ y_i} · λ^{α−1} exp(−βλ) ∼ Gamma(α + ∑ y_i, n + β)

Normal-Normal

Posterior distribution: using Bayes rule we get p(µ|y) ∝ p(y|µ)p(µ).
I The posterior is given by

p(µ|y) ∝ exp( −(1/2σ²) ∑_{i=1}^n (y_i − µ)² − (1/2τ²)(µ − µ_0)² )

Hence µ|y ∼ N(µ̂_B, V_µ) where

µ̂_B = [ (n/σ²)/(n/σ² + 1/τ²) ] ȳ + [ (1/τ²)/(n/σ² + 1/τ²) ] µ_0 and V_µ^{−1} = n/σ² + 1/τ²

A shrinkage estimator.
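A minimal sketch of the posterior mean and variance above, with hypothetical inputs.

post_normal <- function(ybar, n, sigma2, mu0, tau2) {
  w <- (n / sigma2) / (n / sigma2 + 1 / tau2)   # weight on the sample mean
  V <- 1 / (n / sigma2 + 1 / tau2)
  c(mean = w * ybar + (1 - w) * mu0, var = V)
}
post_normal(ybar = 1.2, n = 25, sigma2 = 4, mu0 = 0, tau2 = 1)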

Lindley's Paradox

Evidence which, for a Bayesian statistician, strikingly supports the null often leads to rejection by standard classical procedures.
I Do Bayes and classical always agree? Bayes computes the probability of the null being true given the data, p(H_0|D). That's not the p-value – why?
I Surely they agree asymptotically?
I How do we model the prior and compute likelihood ratios L(H_0|D) in the Bayesian framework?

Bayes t-ratio

Edwards, Lindman and Savage (1963): there's a simple approximation for the likelihood ratio,

L(p_0) ≈ √(2π) √n exp(−t²/2)

I The key is that the Bayes test has the factor √n, which asymptotically favours the null.
I There is only a big problem when 2 < t < 4 – but this is typically the most interesting case!

Coin Tossing

Let's start with intuition: imagine a coin tossing experiment where you want to determine whether the coin is “fair”, H_0: p = 1/2. There are four experiments.

Expt         1     2     3       4
n           50   100   400  10,000
r           32    60   220   5,098
L(p_0)    0.81  1.09  2.17   11.68

Coin Tossing

There are lots of implications:
I Classical: in each case the t-ratio is approximately 2, so we reject H_0 (a fair coin) at the 5% level in each experiment.
I Bayes: L(p_0) grows to infinity, so there is overwhelming evidence for H_0.

Connelly shows that the Monday effect disappears when you compute the Bayes version.

Shrinkage Estimation

Stein showed that it is possible to make a uniform improvement on the MLE in terms of MSE. Why was it mistrusted?
I Mistrust of the statistical interpretation of Stein's result – in particular, the loss function.
I Difficulties in adapting the procedure to special cases.
I Long familiarity with the good properties of the MLE.
I Any gains from a “complicated” procedure could not be worth the extra trouble (Tukey: savings not more than 10% in practice).

Baseball Batting Averages

Data: 18 major-league players after 45 at bats (1970 season).

Player        ȳ_i    E(p_i|D)  Season average
Clemente     0.400     0.290      0.346
Robinson     0.378     0.286      0.298
Howard       0.356     0.281      0.276
Johnstone    0.333     0.277      0.222
Berry        0.311     0.273      0.273
Spencer      0.311     0.273      0.270
Kessinger    0.311     0.268      0.263
Alvarado     0.267     0.264      0.210
Santo        0.244     0.259      0.269
Swoboda      0.244     0.259      0.230
Unser        0.222     0.254      0.264
Williams     0.222     0.254      0.256
Scott        0.222     0.254      0.303
Petrocelli   0.222     0.254      0.264
Rodriguez    0.222     0.254      0.226
Campanens    0.200     0.259      0.285
Munson       0.178     0.244      0.316
Alvis        0.156     0.239      0.200

Baseball Data

First Shrinkage Estimator: Efron and Morris

Figure: Baseball Shrinkage. Predicted averages (0.15 to 0.35) from the first 45 at bats, the J-S estimator, and the average of the remainder of the season, for Clemente, Robinson, Howard, Johnstone, Berry, Kessinger, Alvarado, Santo, Petrocelli, Campaneris, Munson and Alvis.

Shrinkage

Let θ_i denote the end-of-season average.
I Lindley: shrink to the overall grand mean,

c = 1 − (k − 3)σ² / ∑(ȳ_i − ȳ)² and θ̂_i = c ȳ_i + (1 − c) ȳ

where ȳ is the overall grand mean.
I Baseball data: c = 0.212 and ȳ = 0.265. Compute ∑_i (θ̂_i − θ_i)², the squared error against the observed end-of-season averages, and see which is lower:

MLE = 0.077, STEIN = 0.022

That's a factor of 3.5 times better! (A short sketch of this calculation follows.)
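A minimal sketch of the calculation on the batting data above; the binomial variance step is an assumption about how σ² was obtained, so the numbers come out approximately.

ybar  <- c(.400, .378, .356, .333, .311, .311, .311, .267, .244,
           .244, .222, .222, .222, .222, .222, .200, .178, .156)  # first 45 at bats
theta <- c(.346, .298, .276, .222, .273, .270, .263, .210, .269,
           .230, .264, .256, .303, .264, .226, .285, .316, .200)  # rest of season

k      <- length(ybar)
gm     <- mean(ybar)                       # grand mean, about 0.265
s2     <- gm * (1 - gm) / 45               # binomial variance of a 45-at-bat average
shrink <- 1 - (k - 3) * s2 / sum((ybar - gm)^2)   # about 0.21
js     <- shrink * ybar + (1 - shrink) * gm       # shrink toward the grand mean

sum((ybar - theta)^2)                      # MLE squared error, about 0.077
sum((js - theta)^2)                        # Stein squared error, about 0.022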

Batting Averages

Figure: Clemente's batting average over the 1970 season: .400 after 45 at bats; .346 for the remainder of the season; .352 overall. (Batting average, 0.20 to 0.50, against the number of at bats on a square-root scale.)

Further Paradoxes

Shrinkage on Clemente is too severe:

z_Cl = 0.265 + 0.212(0.400 − 0.265) = 0.294

The 0.212 seems a little severe.
I Limited translation rules: cap the maximum shrinkage, e.g. at 80%.
I Not enough shrinkage, e.g. O'Connor (y = 1, n = 2): shrinking 0.500 toward 0.265 gives z_O'C = 0.421 – still better than Ted Williams' 0.406 in 1941.
I Adding foreign car sales (k = 19) will further improve MSE performance! It will change the shrinkage factors.
I Clearly an improvement over the Stein estimator is the positive-part version

θ̂^{S+}_i = max( 1 − (k − 2)/∑ Ȳ_i², 0 ) Ȳ_i

Prior Information

Extra information: the empirical distribution of all major league players,

θ_i ∼ N(0.270, 0.015)

The 0.270 provides another origin to shrink to, and the prior variance 0.015 gives a different shrinkage factor.
I To fully understand this, we should build a probabilistic model and use the posterior mean as our estimator of the unknown parameters.

Unequal Variances

Model: Y_i|θ_i ∼ N(θ_i, D_i) where θ_i ∼ N(θ_0, A) ∼ N(0.270, 0.015).
I The D_i can be different – unequal variances.
I Bayes posterior means are given by

E(θ_i|Y) = (1 − B_i)Y_i where B_i = D_i/(D_i + A)

where Â is estimated from the data; see Efron and Morris (1975).
I Different shrinkage factors for different variances D_i: D_i ∝ n_i^{−1}, so smaller sample sizes are shrunk more. Makes sense.

Toxoplasmosis Data

A disease of the blood that is endemic in tropical regions. Data: 5000 people in El Salvador drawn in varying sample sizes from 36 cities.
I Estimate the “true” prevalences θ_i, 1 ≤ i ≤ 36.
I Allocation of resources: should we spend funds on the city with the highest observed occurrence of the disease?
I Same shrinkage factors?
I Shrinkage procedure (Efron and Morris, p. 315): z_i = c_i y_i where the y_i are the observed relative rates (normalized so that ȳ = 0). The smaller sample sizes get shrunk more: the most gentle factors are in the range 0.6 to 0.9, but some are 0.1 to 0.3.

Clinical Trials

Novick and Grizzle: Bayesian analysis of clinical trials. Four treatments for duodenal ulcers; doctors assess the state of the patient, and data are collected sequentially. Armitage is the classic text for computing the sequential rejection region for the classical tests (α-spending function; one can only look at prespecified times).

Treat  Excellent  Fair  Death
A             76    17      7
B             89    10      1
C             86    13      1
D             88     9      3

Conclusion: cannot reject at the 5% level. Use a conjugate binomial/beta prior model and do some sensitivity analysis.

Binomial-Beta

Let p_i be the death-rate proportion under treatment i.
I To compare treatment A to B directly, compute P(p_1 > p_2|D).
I Prior beta(α, β) with prior mean E(p) = α/(α + β). Posterior beta(α + ∑ x_i, β + n − ∑ x_i).
I Example: for A, beta(1, 1) → beta(8, 94); for B, beta(1, 1) → beta(2, 100).
I Inference: P(p_1 > p_2|D) ≈ 0.98 (see the simulation sketch below).
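A minimal sketch: approximate P(p_1 > p_2|D) by simulating from the two beta posteriors above.

set.seed(1)
p1 <- rbeta(100000, 8, 94)    # treatment A posterior
p2 <- rbeta(100000, 2, 100)   # treatment B posterior
mean(p1 > p2)                 # approximately 0.98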

Sensitivity Analysis

It's important to do a sensitivity analysis.

Treat  Excellent  Fair  Death
A             76    17      7
B             89    10      1
C             86    13      1
D             88     9      3

I Model: Poisson/Gamma.
I Prior Γ(m, z), with λ_i the expected death rate under treatment i.
I Compute P(λ_1/λ_2 > c|D) under different priors (m, z):

Prob                   (0, 0)  (100, 2)  (200, 5)
P(λ_1/λ_2 > 1.3|D)       0.95      0.88      0.79
P(λ_1/λ_2 > 1.6|D)       0.91      0.80      0.64

Multinomial Probit

I Multinomial probit: you observe a choice variable Y ∈ {0, 1, ..., p − 1}, where consumers' choices are formed in a random-utility setting

W = Xβ + e, with Y(W) = i if max W = W_i

I Distribution of consumers' tastes: β ∼ p(β). Applications: marketing, economics, political science. Gelman et al. (2006) and Rossi et al. (2007).

Stochastic Volatility

I Stochastic volatility: you observe a sequence of returns r_t with expected return µ_t and volatility √V_t.
I A hierarchical model defined by the sequence of conditionals

r_t = µ_t + √V_t e_t
µ_t = regression
log V_t = α + β log V_{t−1} + σ_v e_t^v

I Here α/(1 − β) describes the long-run average (log-)volatility, and σ_v is the volatility of volatility.

Bayes Portfolio Selection

I de Finetti and Markowitz: mean-variance portfolio with shrinkage constraints,

min_x (1/2) xᵀΣx − xᵀµ − λ c(x)

I Bayes predictive moments:

min_x (1/2) xᵀ Var(R_{t+1}|D_t) x subject to xᵀι = 1 and xᵀE(R_{t+1}|D_t) = µ_P

I PT (1999): hierarchical model with

E(R_{t+1}|D_t) = (tR̄ + λµ_0)/(t + λ)

Results

I Different shrinkage factors for different history lengths.
I Portfolio allocation in the SP500 index.
I Entry/exit, splits, spin-offs etc.: for example, 73 replacements to the SP500 index in the period 1/1/94 to 12/31/96.
I Advantage? E(α|D_t) = 0.39, that is 39 bps per month, which on an annual basis is 468 bps. The posterior mean for β is 0.745.
I x̄_M = 12.25% and x̄_PT = 14.05%.

SP Composition

Company                 Symbol   6/96   12/89   12/79   12/69
General Electric        GE      2.800   2.485   1.640   1.569
Coca Cola               KO      2.342   1.126   0.606   1.051
Exxon                   XON     2.142   2.672   3.439   2.957
ATT                     T       2.030   2.090   5.197   5.948
Philip Morris           MO      1.678   1.649   0.637   *****
Royal Dutch             RD      1.636   1.774   1.191   *****
Merck                   MRK     1.615   1.308   0.773   0.906
Microsoft               MSFT    1.436   *****   *****   *****
Johnson/Johnson         JNJ     1.320   0.845   0.689   *****
Intel                   INTC    1.262   *****   *****   *****
Procter and Gamble      PG      1.228   1.040   0.871   0.993
Walmart                 WMT     1.208   1.084   *****   *****
IBM                     IBM     1.181   2.327   5.341   9.231
Hewlett Packard         HWP     1.105   0.477   0.497   *****
Pepsi                   PEP     1.061   0.719   *****   *****
Pfizer                  PFE     0.918   0.491   0.408   0.486
Dupont                  DD      0.910   1.229   0.837   1.101
AIG                     AIG     0.910   0.723   *****   *****
Mobil                   MOB     0.906   1.093   1.659   1.040
Bristol Myers Squibb    BMY     0.878   1.247   *****   0.484
GTE                     GTE     0.849   0.975   0.593   0.705
General Motors          GM      0.848   1.086   2.079   4.399
Disney                  DIS     0.839   0.644   *****   *****
Citicorp                CCI     0.831   0.400   0.418   *****
BellSouth               BLS     0.822   1.190   *****   *****
Motorola                MOT     0.804   *****   *****   *****
Ford                    F       0.798   0.883   0.485   0.640
Chervon                 CHV     0.794   0.990   1.370   0.966
Amoco                   AN      0.733   1.198   1.673   0.758
Eli Lilly               LLY     0.720   0.814   *****   *****
Abbott Labs             ABT     0.690   0.654   *****   *****
AmerHome Products       AHP     0.686   0.716   0.606   0.793
FedNatlMortgage         FNM     0.686   *****   *****   *****
McDonald's              MCD     0.686   0.545   *****   *****
Ameritech               AIT     0.639   0.782   *****   *****
Cisco Systems           CSCO    0.633   *****   *****   *****
CMB                     CMB     0.621   *****   *****   *****
SBC                     SBC     0.612   0.819   *****   *****
Boeing                  BA      0.598   0.584   0.462   *****
MMM                     MMM     0.581   0.762   0.838   1.331
BankAmerica             BAC     0.560   *****   0.577   *****
Bell Atlantic           BEL     0.556   0.946   *****   *****
Gillette                G       0.535   *****   *****   *****
Kodak                   EK      0.524   0.570   1.106   *****
Chrysler                C       0.507   *****   *****   0.367
Home Depot              HD      0.497   *****   *****   *****
Colgate                 COL     0.489   0.499   *****   *****
Wells Fargo             WFC     0.478   *****   *****   *****
Nations Bank            NB      0.453   *****   *****   *****
Amer Express            AXP     0.450   0.621   *****   *****

Chicago Bears 2014-2015 Season

Hierarchical model. Bayes learning: update our beliefs in light of new information.
I In the 2014-2015 season, the Bears suffered back-to-back 50-point defeats: Patriots-Bears 51-23 and Packers-Bears 55-14.
I Their next game was at home against the Minnesota Vikings. The current line against the Vikings was −3.5 points – slightly over a field goal.

What's the Bayes approach to learning the line?

Hierarchical Model

Hierarchical model: let's model the current average win/lose margin this year,

ȳ|θ ∼ N(θ, σ²/n) = N(θ, 18.342²/9) and θ ∼ N(0, τ²)

Pre-season prior mean µ_0 = 0, standard deviation τ = 4. Data: ȳ = −9.22.
I Bayes shrinkage estimator:

E(θ|ȳ, τ) = [ τ²/(τ² + σ²/n) ] ȳ

The shrinkage factor is then 0.3.
I Our updated estimate is E(θ|ȳ, τ) = −2.75 > −3.5, where the current line is −3.5.
I Based on our hierarchical model this is an over-reaction. A one-point change in the line is about 3% on a probability scale.

Bayes Bears

The last two defeats had 50+ points scored by the opponent.

bears <- c(-3, 8, 8, -21, -7, 14, -13, -28, -41)  # game margins so far
mean(bears)                  # -9.222222
sd(bears)                    # 18.34242
tau  <- 4
sig2 <- sd(bears)^2 / 9      # variance of the sample mean
tau^2 / (sig2 + tau^2)       # shrinkage factor: 0.2997225
0.2997 * -9.22               # posterior mean: about -2.76
pnorm(-2.76 / 18)            # 0.4390677

Home advantage is worth 3 points. The Vikings had an average record.
Result: Bears 21, Vikings 13.

Probability: 41901 Week 8: Bayesian Asset Allocation Nick Polson http://faculty.chicagobooth.edu/nicholas.polson/teaching/41901/

Overview

Long history of Bayes in finance:
I de Finetti-Markowitz: mean-variance allocation
I Kelly-Breiman-Merton: maximise the expected growth rate

Applications
1. Sir Isaac Newton: lost £20,000. “I can calculate the motion of heavenly bodies, but not the madness of people.” South Sea Bubble Act (1721). Master of the Mint (1700-1727); Newton invented the Gold Standard.
2. David Ricardo (1815): bought the entire issuance of UK Gilts the day before the Battle of Waterloo.
3. J.M. Keynes (1920-1945): King's College Fund.

South Sea Bubble: 1720

Also owned the East India Company: £30,000.

Post-1720 bubble volatility: persistent, asymmetric (leverage) and fat-tailed. (Figures: Dutch East India Company; London Stock Market 1723-1794; South Sea Bubble and Isaac Newton.)

Figure: Bubble Act 1721. Prices (0 to 800) of the Bank of England, Royal African Company, Old East India Company and South Sea Company over 1719.6 to 1720.8.

Keynes Investment Returns 1928-1945

Year   Keynes  Market
1928     -3.4     7.9
1929      0.8     6.6
1930    -32.4   -20.3
1931    -24.6   -25.0
1932     44.8    -5.8
1933     35.1    21.5
1934     33.1    -0.7
1935     44.3     5.3
1936     56.0    10.2
1937      8.5    -0.5
1938    -40.1   -16.1
1939     12.9    -7.2
1940    -15.6   -12.9
1941     33.5    12.5
1942     -0.9     0.8
1943     53.9    15.6
1944     14.5     5.4

Figure: Keynes and Rate returns (Keynes vs Cash) over the period.

Keynes Optimal Bayes Rebalancing

Figure: Universal Portfolios with Keynes and Rate. Left: the return from the Bayes portfolio as a function of the fraction of Keynes in the portfolio (optimal allocation ω). Right: Keynes vs Universal vs Cash over the period.

Bayesian Learning of ω*

Cover-Stern: universal portfolios. Returns are i.i.d. and the allocation solves max_ω E(ln(1 + ωᵀX)).
I Realised wealth is W_T(X, ω) = ∏_{t=1}^T (1 + ωᵀX_t), with weight distribution

F_X(ω) = W_T(X, ω) / ∫_Ω W_T(x, ω) dω

I Estimate ω* with

ω̂ = E_F(ω) = ∫_Ω ω W_T(X, ω) dω / ∫_Ω W_T(x, ω) dω

I Dynamically rebalance to guarantee the “area-under-parabola” return.

Parrando's Paradoxes

Two losing bets can be combined into a winner.
I Bernoulli market: 1 + f or 1 − f with p = 0.51 and f = 0.05.
I Positive expectation, yet the expected log-growth rate is negative:

p log(1 + f) + (1 − p) log(1 − f) = −0.00025 < 0

I Caveat: growth is governed by the median/entropy.

Brownian ratchets and the cross-entropy of Markov processes: two losing bets + volatility.

Figure: Parrando1 and Parrando2 returns over 5000 periods.

Parrando

Figure: Universal Portfolios with Parrando1 and Parrando2. Left: the return from the Bayes portfolio as a function of the fraction of Parrando1 in the portfolio (optimal allocation ω). Right: Parrando1, Parrando2 and the “universal” mix over 5000 periods – ex ante vs ex post realisation.

Breiman-Kelly-Merton Rule

Kelly criterion: the optimal wager in a binary setting is

ω* = (p·O − q)/O

Merton's rule in the continuous setting is the Kelly analogue

ω* = (1/γ)(µ/σ²)

1. µ: (excess) expected return
2. σ: volatility
3. γ: risk aversion
4. ω*: optimal position size
5. p = Prob(Up), q = Prob(Down), O = odds

(A small sketch of both rules follows.)
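A minimal sketch of the two rules, using numbers that appear later in these notes.

kelly  <- function(p, O)             { q <- 1 - p; (p * O - q) / O }
merton <- function(mu, sigma, gamma) { (1 / gamma) * mu / sigma^2 }

kelly(p = 0.96, O = 1/9)                      # the half-time Superbowl bet: 0.60
merton(mu = 0.057, sigma = 0.16, gamma = 1)   # pure Kelly for the S&P500: about 2.22
merton(mu = 0.057, sigma = 0.16, gamma = 3)   # fractional Kelly: about 0.74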

Example: Kelly Criterion

S&P500:
I Consider logarithmic utility (CRRA with γ = 1). This is a pure Kelly rule.
I We assume iid log-normal stock returns with an annualized expected excess return of 5.7% and a volatility of 16%, which is consistent with long-run equity returns. In our continuous-time formulation

ω* = 0.057/0.16² = 2.22

so the Kelly criterion implies that the investor borrows 122% of wealth to invest a total of 222% in stocks. This is the risk profile of the Kelly criterion. One also sees that the allocation is highly sensitive to estimation error in µ̂. We consider dynamic learning in a later section and show how the long horizon and learning affect the allocation today.

Fractional Kelly

The fractional Kelly rule leads to a more realistic allocation.
I Suppose that γ = 3. Then the information ratio is

µ/σ = 0.057/0.16 = 0.357 and ω* = (1/3)(0.057/0.16²) = 74.2%

An investor with this level of risk aversion has a more reasonable 74.2% allocation.
I This analysis ignores the equilibrium implications: if every investor acted this way, it would drive up prices and drive down the 5.7% equity premium.

Black-Litterman

Black-Litterman: a Bayesian framework for combining investor views with market equilibrium.
I In a multivariate returns setting the optimal allocation rule is

ω* = (1/γ) Σ^{−1} µ

The question is how to specify the pair (µ, Σ).
I For example, given Σ̂, BL derive Bayesian inference for µ given a market equilibrium model and a priori views on the returns of pre-specified portfolios, which take the form

(µ̂|µ) ∼ N(µ, τΣ̂) and (Q|µ) ∼ N(Pµ, Ω)

Posterior Views

I Combining views, the implied posterior is

(µ|µ̂, Q) ∼ N(Bb, B)

I The mean and variance are specified by

B^{−1} = (τΣ̂)^{−1} + PᵀΩ^{−1}P and b = (τΣ̂)^{−1}µ̂ + PᵀΩ^{−1}Q

These posterior moments then define the optimal allocation rule.
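A minimal sketch of the posterior above with hypothetical inputs: two assets and one view (“asset 1 outperforms asset 2 by 2%”). All numerical values are illustrative assumptions.

Sigma <- matrix(c(0.04, 0.01, 0.01, 0.09), 2, 2)  # assumed covariance
mu_eq <- c(0.05, 0.07)                            # equilibrium returns (mu-hat)
tau   <- 0.05
P     <- matrix(c(1, -1), 1, 2)                   # view portfolio
Q     <- 0.02                                     # view return
Omega <- matrix(0.001, 1, 1)                      # view uncertainty

Binv <- solve(tau * Sigma) + t(P) %*% solve(Omega) %*% P
b    <- solve(tau * Sigma) %*% mu_eq + t(P) %*% solve(Omega) %*% Q
B    <- solve(Binv)
mu_post <- B %*% b                                # posterior mean of mu

gamma <- 3
(1 / gamma) * solve(Sigma) %*% mu_post            # optimal allocation omega*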

Probability: 41901 Week 9: Stochastic Processes: Brownian Motion http://faculty.chicagobooth.edu/nicholas.polson/teaching/41901/

Brownian Motion

A stochastic process {B_t, t ≥ 0} in continuous time taking real values is a Brownian motion (or a Wiener process) if:
I for each s ≥ 0 and t > 0, B_{t+s} − B_s ∼ N(0, t);
I for each t > s > 0, the random variable B_t − B_s is independent of B_s − B_0 = B_s;
I B_0 = 0; and
I B_t is continuous in t ≥ 0.

B_t is standard BM; σB_t includes a volatility. Brown (1827) and Einstein (1905).

Brownian Motion with Drift

I Here we have the process

X_t = µt + σB_t, or in SDE form dX_t = µ dt + σ dB_t

I We can think of the Euler discretization as

X_{t+∆} − X_t = µ∆ + σ√∆ e_t where e_t ∼ N(0, 1)

(see the simulation sketch below).
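A minimal sketch: simulate Brownian motion with drift via the Euler scheme above, with hypothetical parameter values.

set.seed(1)
mu <- 0.5; sigma <- 1; dt <- 1/252; n <- 252

e <- rnorm(n)
X <- cumsum(mu * dt + sigma * sqrt(dt) * e)  # path started at X_0 = 0
plot(seq(dt, 1, by = dt), X, type = "l", xlab = "t", ylab = "X_t")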

Geometric Brownian Motion

I Geometric Brownian motion starting at X_0 = 1 evolves as

X_t = exp(µt + σB_t), with E(X_t) = e^{(µ + σ²/2)t}

I This is equivalent to the stochastic differential (SDE) form

dX_t = X_t( (µ + σ²/2) dt + σ dB_t )

Brownian Motion for Sports Scores

Implied volatility. Stern (1994): Brownian Motion and Sports Scores. Polson and Stern (2014): The Implied Volatility of a Sports Game.
I “Team A” rarely loses if they are ahead at halftime.
I Approximately 75% of games are won by the team that leads at halftime: there's a 0.5 probability that the same team wins both halves and a 0.25 probability that the first-half winner defeats the second-half winner. This only applies to equally matched teams.
I For the team that is ahead after 3/4 of the game, the empirical frequency of winning is: basketball 80%, baseball 90%, football 80% and hockey 80%.

Model Checking

µ is the home-point advantage: football 3 points, basketball 5-6 points.
I Stern estimates for basketball a 1.5-point difference for the first three quarters with a standard deviation of 7.5 points; in the fourth quarter the difference is only 0.22 points with a standard deviation of 7. The q-q plots look normal. Correlations between periods are small, so the random walk model appears to be reasonable.

Evolution of a Game

Volatility of outcome: Superbowl probability.

Figure: Left: the pregame distribution of the winning margin with µ = −4, σ = 10.60, giving P(X > 0) = 0.353. Right: game evolution of X(t) over the second half.

Brownian Motion for Sports

Probability of team A winning, p = P(X(1) > 0), given point spread (drift) µ and standard deviation (volatility) σ, with final score X(1) ∼ N(µ, σ²).
I Given the normality assumption, X(1) ∼ N(µ, σ²), we have

p = P(X(1) > 0) = Φ(µ/σ)

where Φ is the standard normal cdf.
I Table 1 uses Φ to convert team A's advantage µ to a probability scale via the information ratio µ/σ:

µ/σ           0   0.25   0.5  0.75     1  1.25   1.5      2
p = Φ(µ/σ)  0.5   0.60  0.69  0.77  0.84  0.89  0.93  0.977

Probability of winning p versus the ratio µ/σ. Evenly matched: µ/σ = 0 and p = 0.5.

Conditional Probability

After time t has elapsed:
I Condition on a current lead of l for team A:

(X(1)|X(t) = l) = (X(1) − X(t)) + l ∼ N(l + µ(1 − t), σ²(1 − t))

I The probability of team A winning at time t given a current lead of l points is

p_t = P(X(1) > 0|X(t) = l) = Φ( (l + µ(1 − t)) / (σ√(1 − t)) )

I With point spread µ = −4 and volatility σ = 10.6, team A has a µ/σ = −4/10.6 = −0.38 volatility-point disadvantage. The probability of winning is Φ(−0.38) = 0.353 < 0.5. (See the sketch below.)
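A minimal sketch of the win-probability formula above, checked against two numbers from these notes.

win_prob <- function(l, mu, sigma, t) {
  pnorm((l + mu * (1 - t)) / (sigma * sqrt(1 - t)))
}
win_prob(l = 0,  mu = -4, sigma = 10.6, t = 0)     # pregame: 0.353
win_prob(l = 15, mu = -4, sigma = 10.6, t = 0.5)   # Ravens at half time: about 0.96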

Implied Volatility

The initial point spread is the market's expectation of the expected outcome. The probability that the underdog team wins is p = Φ(µ/σ) = Φ(−4/10.6) = 35.3%.
I The volatility σ√(1 − t) is a decreasing function of t, illustrating that volatility dissipates over the course of a game:

t            0    1/4   1/2   3/4   1
σ√(1 − t)  10.6  9.18  7.50   5.3   0

Volatility decay over time.
I We can infer the implied volatility σ_IV by solving

p = Φ(µ/σ_IV), which gives σ_IV = µ/Φ^{−1}(p)

SuperBowl XLVII: Ravens vs 49ers

Figure: SuperBowl XLVII win-probability contract on TradeSports.com.

Super Bowl XLVII was held at the Superdome in New Orleans on February 3, 2013. We track X(t), the Ravens' lead over the 49ers at each point in time. Table 3 provides the score at the end of each quarter.

t        0   1/4   1/2   3/4    1
Ravens   0     7    21    28   34
49ers    0     3     6    23   31
X(t)     0     4    15     5    3

SuperBowl XLVII by quarter.

Initial Market

The initial point spread was set at the Ravens being a four-point underdog: µ = E(X(1)) = −4. The Ravens upset the 49ers 34-31, so X(1) = 3 and the point spread was beaten by 7 points.
I To determine the market's assessment of the probability that the Ravens would win at the beginning of the game we use the money-line odds: San Francisco −175 and Baltimore Ravens +155. This means a bettor would have to place $175 to win $100 on the 49ers, while a bet of $100 on the Ravens would win $155. Converting both money-lines to implied win probabilities:

p_SF = 175/(100 + 175) = 0.686 and p_Bal = 100/(100 + 155) = 0.392

Probabilities of Winning

The probabilities do not sum to one. This “excess” probability is the mechanism by which the oddsmakers derive their compensation; the difference is the “market vig”, also known as the bookmaker's edge:

p_SF + p_Bal = 0.686 + 0.392 = 1.078

providing a 7.8% edge for the bookmakers.
I Put differently, if bettors place money proportionally across both teams then the bookies make 7.8% of the total staked. We use the mid-point of the spread to determine p, implying that

p = (1/2) p_Bal + (1/2)(1 − p_SF) = 0.353

From the Ravens' perspective, p = P(X(1) > 0) = 0.353. Baltimore's win probability started trading at p_0^mkt = 0.38.

From the Ravens perspective, we have p = P(X(1) > 0) = 0.353. Baltimore’s win probability started trading at pmkt = 0.38 0

Half Time Analysis The Ravens took a commanding 21 − 6 lead at half time. I The market was trading at pmkt 1

= 0.90.

2

During the 34 minute blackout 42760 contracts changed hands with Baltimore’s win probability ticking down from 95 to 94. The win probability peak of 95% occured again after a third-quarter kickoff return for a touchdown. At the end of the four quarter, however, when the 49ers nearly went into the lead with a touchdown, Baltimore’s win probability had dropped to 30%.

Implied Volatility

To calculate the implied volatility of the Superbowl we substitute the pair (µ, p) = (−4, 0.353) into our definition σ_IV = µ/Φ^{−1}(p) and solve.
I We obtain

σ_IV = µ/Φ^{−1}(p) = −4/−0.377 = 10.60

where Φ^{−1}(p) = qnorm(0.353) = −0.377. The 4-point advantage assessed for the 49ers makes them just under a (1/2)σ favorite.
I The outcome X(1) = 3 was within one standard deviation of the pregame model, which had expectation µ = −4 and volatility σ = 10.6.

Half Time Probabilities

What's the probability of the Ravens winning given their lead at half time? At half time Baltimore led by 15 points, 21 to 6. The conditional mean for the final outcome is 15 + 0.5·(−4) = 13 and the conditional volatility is 10.6√(1 − t) = 10.6/√2 = 7.5. These imply a probability of 0.96 for Baltimore to win the game.
I A second estimate of the probability of winning given the half-time lead can be obtained directly from the betting market: traded contracts on TradeSports.com yield a half-time probability of p_{1/2} = 0.90.

What's the implied volatility for the second half?

p_t^mkt reflects all available information at time t.
I For example, at half time (t = 1/2) we would update

σ_{IV, t=1/2} = (l + µ(1 − t)) / ( Φ^{−1}(p_t)√(1 − t) ) = 13 / ( Φ^{−1}(0.9)/√2 ) = 14

where qnorm(0.9) = 1.28.
I As 14 > 10.6, the market was expecting a more volatile second half – possibly anticipating a comeback from the 49ers.

How can we form a valid betting strategy?

Given the initial implied volatility σ = 10.6: at half time, with the Ravens having an l + µ(1 − t) = 13-point edge,
I we would assess, with σ = 10.6, a

p_{1/2} = Φ( 13/(10.6/√2) ) = 0.96

probability of winning versus the market's p_{1/2}^mkt = 0.90.
I To determine our optimal bet size ω_bet on the Ravens we might appeal to the Kelly criterion (Kelly, 1956), which yields

ω_bet = p_{1/2} − q_{1/2}/O_mkt = 0.96 − 0.04/(1/9) = 0.60

where q_{1/2} = 1 − 0.96 = 0.04 and the market odds are O_mkt = 0.1/0.9 = 1/9.

Figure: Pregame distributions of the Lakers' final margin of victory under three volatility scenarios: (Mu = 4, Sigma = 6, P(win) = 0.75), (Mu = 4, Sigma = 16, P(win) = 0.6) and (Mu = 4, Sigma = 32, P(win) = 0.55).

Figure: The family of probability distributions for X(t), for any t between 0 and 1, with line = 4 points and P(win) = 0.6. Any vertical slice at a specific value t_1 corresponds to a normal distribution for X(t_1). The darker the shading at t, the higher the probability of the corresponding score-difference for X(t) along that particular time slice.

NBA: Lakers vs Celtics: Game 7, 2010

The Los Angeles Lakers hosting the Boston Celtics in Game 7 of the NBA Finals.
I At the end of the 3rd quarter, the Celtics are ahead by 17 points and look to be dominating the game. Yet the Lakers come storming back over the final 12 minutes of play, and end up scoring the decisive basket in the final minute to earn a 1-point victory, and the championship.
I As a discerning basketball fan, you decide to be a bit more measured in your assessment of the result. Should you be impressed by such an incredible feat? Or was the comeback by the Lakers really not that improbable after all?

Final Score

Pre-game odds: the Lakers were thought to have a 60% chance of winning, with an expected margin of victory of 4 points – “the line”.
I We model the Lakers' lead X(t) at any given time t in the game; X(t) negative indicates a deficit. Assume the game begins at t = 0 and ends at t = 1.
I Currently X(0.75) = −17: the Lakers are down by 17 points at the 3/4 mark. We would like to know the probability that X(1) > 0, a lead at time t = 1 when the final buzzer sounds.

Probability Model

Our original question in terms of probability statements: we are interested in

P( X(1) > 0 | X(0.75) = −17 )

I A natural choice for modeling X(1) is a normal distribution, so we write X(1) ∼ N(µ, σ²): the final margin of victory follows a normal distribution with mean µ and variance σ².
I The volatility σ must ensure that P(X(1) > 0) = 0.6. We solve and get σ = 16 (15.8 exactly).

Assessing the Comeback

We're now ready to answer our original question: how impressive was the Lakers' 17-point comeback in the fourth quarter?
I We phrased this question as P(X(1) > 0 | X(0.75) = −17). If this number is very small, then the Lakers have clearly accomplished something remarkable.
I To compute this quantity, observe that

X(1) =_D X(0.75) + ( X(1) − X(0.75) )

Distribution of Time-to-Go

I By applying the rules above,

X(1) − X(0.75) =_D X(1 − 0.75) = X(0.25), so X(1) =_D X(0.75) + X(0.25)

Condition on the knowledge that X(0.75) = −17. Under our modeling assumptions, X(0.25) ∼ N(0.25µ, 0.25σ²) with µ = 4 and σ² = 16², the mean and variance we computed from the Vegas odds.
I The upshot: a normally distributed random variable with this mean and variance has roughly a 2.3% chance of being 17 or more. According to our model, we have seen an unlikely event, but hardly a miracle! (See the one-line check below.)
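A one-line check of the comeback probability, since X(0.25) ∼ N(0.25·4, 0.25·16²) = N(1, 8²):

pnorm(17, mean = 1, sd = 8, lower.tail = FALSE)   # about 0.023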

Implications

Stern uses a probit model to check this, with regressors l/√(1 − t) and µ√(1 − t). He estimates, for the whole game, µ̂ = 4.87 and σ̂ = 15.82.
I The home team has a greater than 50% chance of winning if behind by 2 points at halftime.
I Figure 3 plots the probability when t = 0.9 (5 minutes left).
I Table 3 provides estimates for baseball: µ̂ = 0.34 and σ̂ = 4.04.

Our model is highly plausible.

Example

Suppose you have a team with µ = 6 and a lead of l = 6 at half time (t = 1/2), with an initial volatility of σ = 12. A one-σ move increases the probability of winning from Φ(1/2) = 0.69 to Φ(1) = 0.84.
I The standard deviation varies in time according to σ/√t; for σ = 10:

t        1          2          3    4
σ/√t    10    10/1.41    10/1.73    5

I Always think in terms of the number of standard deviations a team is ahead at any one time, and then convert to the probability scale.

Black-Scholes

Let X(t) represent a stock price at time t. The Black-Scholes model provides a convenient way of estimating probabilities such as P(X(1) > 1000 | X(0) = 530).
I We know that X(t) > 0 for all values of t, so a simple model is a log-normal distribution for stock prices. If Z ∼ N(0, 1) is standard normal then X = e^Z is a standard log-normal random variable.
I For example, we assume that the log-return is normally distributed:

log( X(1)/X(0) ) = µ + σZ ∼ N(µ, σ²) where Z ∼ N(0, 1)

Model

The model can then be written X(1) = X(0)e^{µ+σZ}; hence X(1) has a log-normal distribution.
I µ is the continuously compounded mean of log-returns. From the properties of the log-normal mean,

E(X(1)) = X(0)e^{µ + σ²/2}

It is common to define µ̃ = µ + σ²/2, i.e. µ = µ̃ − σ²/2, so that

X(1) = X(0)e^{µ̃ − σ²/2 + σB_1}

I We are now in a position to calculate the probability of interest:

P(X(1) > 1000 | X(0) = 530) = P( log(X(1)/X(0)) > log(1000/530) ) = 1 − Φ( (0.6348 − µ)/σ )

Option Pricing

Bachelier (1900), Theory of Speculation, was popularized by Black and Scholes (1973).
I The formula prices a call option as follows. The stock price is currently X(0) and its future price is assumed to follow a geometric Brownian motion with constant drift µ and volatility σ, as specified by

dX(t) = µX(t)dt + σX(t)dB_t

Here µ is the continuously compounded growth rate and B_t = √t Z ∼ N(0, t).
I The fair price of an option with exercise price K and payout max(X(T) − K, 0) at the expiry date T is given by

E( e^{−rT} max(X(T) − K, 0) ) = X(0)Φ(d_1) − Ke^{−rT}Φ(d_2)

where r is the risk-free rate, Φ is the standard normal cdf, and

d_1 = [ln(X(0)/K) + (r + σ²/2)T]/(σ√T) and d_2 = [ln(X(0)/K) + (r − σ²/2)T]/(σ√T)

Risk Neutral Distribution

Solving the differential equation for X(t) we see that

X(t) = X(0)e^{(µ − σ²/2)t + σ√t Z}

I To obtain the risk-neutral evolution of prices, denoted by Q, we simply replace µ by the risk-free rate r:

X(t) = X(0)e^{(r − σ²/2)t + σ√t Z}

Hence, calculating the expectation of the call option payout requires

E^Q( e^{−rT} max(a e^{σ√T Z} − c, 0) )

where c = K is the exercise price and a = X(0) is the current stock price.
I If we look at incremental movements, we also have

X(t) = X(s)e^{(r − σ²/2)(t−s) + σ(B_t − B_s)}

This implies E( e^{−r(t−s)} X(t) | X(s) ) = X(s): the discounted price is a martingale.

Calculation

I We can calculate the option price using the tail properties of a normal distribution. Let µ_t = (r − σ²/2)t. Then we need

E( e^{−rT} max(ae^{µ_T + σ√T Z} − c, 0) )
  = ∫ e^{−rT} max(ae^{µ_T + σ√T z} − c, 0) p(z) dz
  = e^{−rT} ∫_{ae^{µ_T + σ√T z} − c ≥ 0} ( ae^{µ_T + σ√T z} − c ) p(z) dz
  = ae^{−rT} ∫_{σ√T z ≥ ln(c/a) − µ_T} e^{µ_T + σ√T z} p(z) dz − ce^{−rT} P( σ√T Z > ln(c/a) − µ_T )
  = a P̃( σ√T Z > ln(c/a) − µ_T ) − ce^{−rT} P( σ√T Z > ln(c/a) − µ_T )

where the first integral is calculated as a tail normal probability (P̃) after completing the square on the term σ√T z − z²/2.

Black-Scholes

This is the same as the Black-Scholes formula

E( e^{−rT} max(X(T) − K, 0) ) = X(0)Φ(d_1) − Ke^{−rT}Φ(d_2)

I The parameters are given by

d_1 = [ln(a/c) + (r + σ²/2)t]/(σ√t) and d_2 = [ln(a/c) + (r − σ²/2)t]/(σ√t)

where a = X(0), c = K and t = T. (A small pricing sketch follows.)
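A minimal sketch of the call-price formula above, with hypothetical inputs.

bs_call <- function(X0, K, r, sigma, tt) {
  d1 <- (log(X0 / K) + (r + sigma^2 / 2) * tt) / (sigma * sqrt(tt))
  d2 <- d1 - sigma * sqrt(tt)
  X0 * pnorm(d1) - K * exp(-r * tt) * pnorm(d2)   # X(0)Phi(d1) - K e^{-rT} Phi(d2)
}
bs_call(X0 = 100, K = 100, r = 0.02, sigma = 0.2, tt = 1)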

Extensions

There are many limitations and extensions:
1. Jumps (mergers and acquisitions, earnings announcements)
2. Dividends
3. Down-and-out features and exotics
4. Stochastic volatility; the leverage effect (negative shocks to stocks imply volatility spikes)
5. Contagion

I One important concept is the idea of implied volatility: all departures from the model show up as volatility smiles and smirks.
I Adding stochastic volatility and jumps.

Probability: 41901 Week 9: Martingales and Time Series Nick Polson http://faculty.chicagobooth.edu/nicholas.polson/teaching/41901/

Overview

This lecture will cover the following topics:
I Martingales
  1. Doob's Optimal Stopping Theorem
  2. Hitting times
  3. Distribution of a maximum and hitting time
  4. Ruin problem
I Time Series
  1. Winning Kentucky Derby times
  2. Pandemic SEIR models

Martingales

Martingales: E(X_n|X_{n−1}, ..., X_0) = X_{n−1}
sub-martingales: E(X_n|X_{n−1}, ..., X_0) ≥ X_{n−1}
super-martingales: E(X_n|X_{n−1}, ..., X_0) ≤ X_{n−1}

I Random walk: S_n = X_1 + ... + X_n
I Asymmetric random walk: Z_n = (q/p)^{S_n}
I Brownian motion: E(B_t|B_s) = B_s
I B_t² − t, ...

A Markov process is a statement about probability distributions, not expectations.

Fair and Unfair Games

Think of X_n − X_{n−1} as your winnings from a $1 stake in game n.
I In the martingale case (fair game): E(X_n − X_{n−1}|F_{n−1}) = 0
I In the super-martingale case (unfair game): E(X_n − X_{n−1}|F_{n−1}) ≤ 0
I A series of bets C_n is previsible if C_n is F_{n−1}-measurable.

Gambling Strategies

Dubins and Savage (1960s): How to Gamble If You Must.

Y_n = ∑_i C_i(X_i − X_{i−1}), so Y_n − Y_{n−1} = C_n(X_n − X_{n−1})

I Here C_n is F_{n−1}-measurable, e.g. C_n = f(X_1, ..., X_{n−1}). By iterated expectation,

E(Y_n − Y_{n−1}|F_{n−1}) = C_n E(X_n − X_{n−1}|F_{n−1}) ≤ 0

with equality to zero if X is a martingale.

Geometric Brownian Motion: X_t = exp(λB_t − λ²t/2)

Recall that if Z ∼ N(µ, σ²) then the mgf is E(e^{tZ}) = e^{tµ + t²σ²/2} for all t.
I Since, for t > s,

X_t = X_s exp( λ(W_t − W_s) − λ²(t − s)/2 )

and W_t − W_s is independent of F_s with distribution N(0, t − s),
I therefore

E(X_t|X_s) = X_s E( exp(λ(W_t − W_s) − λ²(t − s)/2) ) = X_s

Properties

Given a martingale sequence X_0, X_1, ..., X_{T−1}, X_T, we have E(X_t|X_{t−1}) = X_{t−1}.
I By the tower property of expectations,

E(X_T) = E( E(X_T|X_{T−1}) ) = E(X_{T−1}) = ... = E(X_0)

I Valuation: transform the problem to a martingale, then value payouts as if things are a martingale:

E^Q(X_T) = X_0

Gambler's Ruin

For the simple random walk S_t we can represent the fortune of a gambler at time t: they win $1 with probability p and lose $1 with probability q = 1 − p.
I They start with initial capital c = S_0 and are trying to reach barrier a before bankruptcy at 0. Let T_a denote the first hitting time of barrier a.
I We need P(T_a < T_0). We'll analyze this probability later using martingale methods.

Ruin Problem

Feller (1956) and Niederhoffer (1997):
I Question: compute the probability that Brownian motion exits the interval [a, b] (where a ≤ 0, b ≥ 0) first through the right-hand endpoint b.
I This is a ruin problem: if the Brownian motion represents your fortune evolving while playing some game, and you start with fortune x and wish to compute the probability that your fortune reaches some level c before you go bankrupt, this is the situation with b = c − x and a = −x.
I We'll use Doob's theorem to compute P(T_b < T_a).

Using Martingales

A stochastic process {X_t, t ≥ 0} is a martingale relative to Brownian motion {W_t: t ≥ 0} or, more strictly, its filtration {F_t: t ≥ 0}, if for each t ≥ 0:
I X_t is adapted to F_t, so that it is an F_t-random variable;
I E(|X_t|) < ∞ for all t;
I E(X_t|F_s) = X_s for each s, 0 ≤ s ≤ t.

I Think of X_s as denoting your fortune at time s.
I By iterated expectation we have E(X_t) = E(X_0) for all t ≥ 0.

= time you decide to stop playing the game and depends only on the history up to (and including) time n

I Intuitively: T

I Theorem: (stopped supermartingales are supermartingales)

If X is a supermartingale and T is a stopping time then the stopped process XT is a super-martingale and in particular E(Xmin(T,n) ) ≤ E(X0 ) I If X is a martingale, then E(Xmin(T,n) )

= E ( X0 ) .

Note: This is not the same as E(XT ) = E(X0 ).

Gambler's Doubling Strategy

X_n is a simple random walk starting at initial wealth 0.
I T = inf{n: X_n = 1}. We know that P(T < ∞) = 1.
I However,

1 = E(X_T) ≠ E(X_0) = 0

With probability one I always make a dollar?
I The stopping theorem tells us that E(X_{min(T,n)}) = E(X_0) for all n. The problem is that E(T) = ∞ for the doubling strategy.

Doob's Optimal-Stopping Theorem

Theorem: let T be a stopping time and X a super-martingale. Then

E(X_T) ≤ E(X_0)

in each of the following situations:
I T is bounded (for some N, T(ω) < N for all ω);
I X is bounded and T is a.s. finite;
I E(T) < ∞ and, for some K, |X_n(ω) − X_{n−1}(ω)| < K.

In the martingale case, we have E(X_T) = E(X_0).

Waiting times

Monkey typing Shakespeare. I A monkey types symbols at random, one per unit time,

producing an infinite sequence Xn of random variables with values in the set of possible symbols. I If he only types capital letters and it is equally likely to type any

one of the 26 letters, how long on average will it take to type the sequence ABRACADABRA?

Example I A Fair coin will be tossed producing a sequence like

HHHTHTT . . .. I The first player picks their favorite sequence of length three (eg

HHH). The second player picks their sequence. I The winner is the player whose sequence appears first in the

sequence of coin tosses.

Solution

Imagine a casino game where a gambler bets $1 on each toss of the coin. On the first toss, the first gambler starts to bet on the occurrence of the sequence.
I The house collects the $1 and pays out fair odds.
I If the first gambler wins he lets his winnings ride on the second toss. A second gambler enters and starts betting at the beginning of the sequence.
I There will be I gamblers who win for a particular sequence.
I Check that X_T = T − ∑_I W_i is a martingale; intuitively this is just the balance in the casino's book.
I Now apply optimal stopping: E(X_T) = E(X_0).

Random Walk and Brownian Motion

Discrete-time limit:
I {X_n: n ≥ 0} iid with E(X_n) = 0 and Var(X_n) = 1.
I Let S_n = X_1 + ... + X_n be a random walk. Then

B_n(t) = S_{[nt]}/√n →_D B(t)

a standard Brownian motion B_t.

Distribution of a Maximum

Reflection principle: what is the joint distribution of W_t and M_t = max_{0≤s≤t} W_s?
I M_t represents the highest level reached in the time interval [0, t]; M_t ≥ 0 is non-decreasing in t.
I For a > 0, define T_a as before, so that {M_t ≥ a} = {T_a ≤ t}.

Joint Distribution of M_t and T_a

I Taking T = T_a, we have for a ≥ x and all t ≥ 0:

P(M_t ≥ a, W_t ≤ x) = P(T_a ≤ t, W_t ≤ x) = P(T_a ≤ t, 2a − x ≤ W̃_t) = P(2a − x ≤ W̃_t) = 1 − Φ( (2a − x)/√t )

Hitting a Level

By symmetry, P(M_t ≥ a, W_t ≥ 2a − x) = P(M_t ≥ a, W_t ≤ x), whence

P(T_a ≤ t) = P(M_t ≥ a) = 2( 1 − Φ(a/√t) )

Letting t → ∞ we have P(T_a < ∞) = 1, since Φ(0) = 0.5.
I Differentiating with respect to t gives the probability density function of T_a:

f_{T_a}(t) = a exp(−a²/2t) / √(2πt³), t > 0

I This is known as the inverse Gaussian distribution. We can check directly that E(T_a) = ∞ (as the density is O(t^{−3/2}) as t → ∞).
I This is related to the problem with the doubling-up strategy. (A quick simulation check follows.)
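A quick simulation check of P(T_a ≤ t) = 2(1 − Φ(a/√t)), using a discretized Brownian path (so the simulated value runs slightly low).

set.seed(1)
a <- 1; t <- 1; n <- 2000; nsim <- 1000
hit <- replicate(nsim, {
  W <- cumsum(rnorm(n, sd = sqrt(t / n)))  # discretized Brownian path on [0, t]
  max(W) >= a
})
mean(hit)                      # simulated P(T_a <= t)
2 * (1 - pnorm(a / sqrt(t)))   # theoretical value, about 0.317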

Example: Winning Kentucky Derby Times

Trend, random walk, or mean reversion?

Figure: Winning Speed. Winning speed (mph, roughly 32 to 36) of the Kentucky Derby, 1880 to 2000.

Figure: Winning Speed, time trend removed (speed mph, 31 to 35), with the outliers Stone Street, Riley and Kingman marked.

Kentucky Derby

Figure: MA(1) predictions, outliers removed (speed mph, 34 to 37). Today's horses are getting worse!

Kentucky Derby Recent Winners

The last six winners are below. Secretariat's winning time was 1 min 59 2/5 seconds.

> tail(mydata)
     year year_num      date            winner mins secs timeinsec distance
135  2009      135  2-May-09    Mine That Bird    2 2.66    122.66     1.25
136  2010      136  1-May-10       Super Saver    2 4.04    124.04     1.25
137  2011      137  7-May-11    Animal Kingdom    2 2.04    122.04     1.25
138  2012      138  5-May-12 I'll Have Another    2 1.83    121.83     1.25
139  2013      139  4-May-13               Orb    2 2.89    122.89     1.25
140  2014      140  2-May-14 California Chrome    2 3.66    123.66     1.25

Example: History of Pandemics

Bill Gates, 12/11/2009: “I'm most worried about a worldwide pandemic ...”

Early-period Pandemics   Dates    Size of Mortality
Plague of Athens         430 BC   25% of population
Black Death              1347     30% of Europe
London Plague            1666     20% of population

Recent Flu Pandemics (1900-2010)   Dates           Size
Spanish Flu                        1918-19         40-50 million
Asian Flu                          H2N2, 1957-58   2 million
Hong Kong Flu                      H3N2, 1968-69   6 million

The Spanish Flu killed more than WW1. The H1N1 Flu of 2009 killed 18,449 people worldwide.

Typical Influenza Mortality Pattern

I 1918 influenza and pneumonia mortality, UK. The pattern of mortality led to models of growth.

SEIR Epidemic Models

Growth is “self-reinforcing”: infection is more likely when there are more infectants.
I An individual comes into contact with the disease at rate β_1.
I The susceptible individual contracts the disease with probability β_2.
I Each exposed individual becomes infectious at rate α per unit time.
I Each infectant recovers at rate γ per unit time.

S_t + E_t + I_t + R_t = N

Current models: the SEIR susceptible-exposed-infectious-recovered model.

Dynamic models that extend earlier models to include exposure and recovery. The coupled SEIR model:

Ṡ = −βSI
Ė = βSI − αE
İ = αE − γI
Ṙ = γI

Figure: SEIR model. Number of people (0 to 100) in the Susceptible, Exposed, Infected and Recovered compartments over time (0 to 25). (An integration sketch follows.)
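A minimal sketch integrating the SEIR system above with the deSolve package; the parameter values and initial compartments are hypothetical.

library(deSolve)

seir <- function(t, y, p) {
  with(as.list(c(y, p)), {
    dS <- -beta * S * I
    dE <-  beta * S * I - alpha * E
    dI <-  alpha * E - gamma * I
    dR <-  gamma * I
    list(c(dS, dE, dI, dR))
  })
}

y0    <- c(S = 99, E = 0, I = 1, R = 0)           # N = 100
parms <- c(beta = 0.01, alpha = 0.5, gamma = 0.3)
out   <- ode(y = y0, times = seq(0, 25, 0.1), func = seir, parms = parms)
matplot(out[, 1], out[, -1], type = "l", xlab = "time", ylab = "number of people")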

Infectious disease models

Daniel Bernoulli (1766) built the first model of disease transmission, for smallpox: “I wish simply that, in matters which so closely concern the well-being of the human race, no decision shall be made without all the knowledge which a little analysis and calculation can provide.”
I R.A. Ross (Nobel Prize in Medicine, 1902): a mathematical model of malaria transmission, which ultimately led to malaria control (the Ross-McDonald model).
I Kermack and McKendrick: the susceptible-infectious-recovered (SIR) model. London Plague 1665-1666; cholera: London 1865, Bombay 1906.

Example: London Plague, 1666

The village of Eyam near Sheffield. Model of transmission from infectives, I, to susceptibles, S.

Date 1666   Susceptibles  Infectives
Initial           254          7
July 3            235         15
July 19           201         22
Aug 3             153         29
Aug 19            121         21
Sept 3            108          8
Sept 19            97          8
Oct 3               –          –
Oct 19             83          0

Initial population N = 261 = S_0; final population S_∞ = 83.

Modeling Growth: SI

Coupled differential equations:

Ṡ = −βSI, İ = (βS − α)I

I Estimates: β/α = 6.54 × 10^{−3} (equivalently α/β ≈ 153), from

β̂/α̂ = ln(S_0/S_∞) / (S_0 − S_∞)

The predicted maximum of 30.4 is very close to the observed 29.
Key: S and I are observed, and α, β are estimated in hindsight.

R_0: Basic Reproductive Ratio

R_0 := βS_0/γ > 1. R_0 is used as the currency in public health (CDC).
I R_0 can be interpreted as the mean number of secondary infections that one infectious person produces, during his/her infectious lifetime, in a completely susceptible population.
I R_0 reflects not only severity and virulence, but also the demographics, health and mixing characteristics of the society.
I Caveat: R_0 is very epidemic-instance dependent.

Transmission Rates R0 for the 1918 Episode

I 1918-19 influenza pandemic:

Study                  Location              R0
Mills et al. 2004      45 US cities          3 (2-4)
Viboud et al. 2006     England and Wales     1.8
Massad et al. 2007     Sao Paulo, Brazil     2.7
Nishiura 2007          Prussia, Germany      3.41
Chowell et al. 2006    Geneva, Switzerland   2.7-3.8
Chowell et al. 2007    San Francisco         2.7-3.5

The larger the R0, the more severe the epidemic. Transmission parameters vary substantially from epidemic to epidemic.

Probability: 41901 Week 10: Stochastic Volatility, MCMC and PF Nick Polson http://faculty.chicagobooth.edu/nicholas.polson/teaching/41901/

Topics

1. Random Walks and Mean Reversion
2. Continuous Time Models
3. Time-varying Volatility
4. Stochastic Volatility
5. Applications: Portfolio Choice and Option Pricing

Random Walks and Mean Reversion

I Random Walk has evolution Pt+1 = Pt + et. Here E(et) = 0, and so E(Pt+1|Pt) = Pt.
I Mean Reversion: Pt+1 = µ + βPt + et where |β| < 1. The long-run average is given by P̄ = µ + βP̄, or P̄ = µ/(1 − β).
I Log-returns are log(Pt+1/Pt).
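A quick simulation contrasting the two models; µ, β, and the noise scale are illustrative choices, not estimates.

# Random walk vs mean-reverting AR(1); mu and beta are illustrative
set.seed(41901)
T1 = 500
mu = 1; beta = 0.9                 # long-run average mu/(1-beta) = 10
P_rw = cumsum(rnorm(T1))           # P[t+1] = P[t] + e[t]
P_mr = numeric(T1)
for (t in 2:T1) P_mr[t] = mu + beta*P_mr[t-1] + rnorm(1)
matplot(cbind(P_rw, P_mr), type="l", lty=1, xlab="t", ylab="P")
abline(h = mu/(1-beta), lty=2)     # AR(1) long-run average
legend("topleft", c("random walk", "mean reversion"), col=1:2, lty=1)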

Time-Varying Volatility

A very popular tool is to allow for time-varying volatility.
I Let's first make volatility time-varying: Pt+1 = µ + Pt + σt et
I What if yesterday's movement was a large down move versus a large up move?
I Is volatility mean-reverting? Let ê²t be yesterday's squared residual:

σ²t+1 = α + βσ²t + γê²t

I How about an asymmetry effect? Black (1976):

log σ²t+1 = α + β log σ²t + γet − ν|et|

I See Rosenberg (1972), Engle (1982). Nelson (1991): EGARCH.
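A sketch of the symmetric volatility recursion σ²t+1 = α + βσ²t + γê²t in R; the parameter values are assumptions chosen only so that the recursion is stationary (β + γ < 1).

# sigma2[t+1] = alpha + beta*sigma2[t] + gamma*e[t]^2, assumed parameters
set.seed(1)
T1 = 1000
alpha = 0.05; beta = 0.90; gamma = 0.05   # beta + gamma < 1 for stationarity
sigma2 = numeric(T1)
sigma2[1] = alpha/(1 - beta - gamma)      # start at the unconditional variance
r = numeric(T1)
for (t in 1:(T1-1)) {
  r[t] = sqrt(sigma2[t])*rnorm(1)         # today's residual
  sigma2[t+1] = alpha + beta*sigma2[t] + gamma*r[t]^2
}
par(mfrow=c(2,1))
plot(r, type="l", ylab="return")
plot(sqrt(sigma2), type="l", ylab="volatility")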

VIX index

There is typically a spread between the implied and historical volatilities.

Figure: Implied and historical volatility (%), 1990-2000.

Volatility: VIX vs SP500

Crisis of 2008: Large jump. Mean-reverting diffusion or shot-noise process? ("How innovation happens: The ferment of finance," The Economist)

SP500 VIX Index in the 1990s

Figure: The pattern of spikes and crises in the 1990s

Single Stocks and Earnings: INTC

Figure: Predictable increase in volatility before and sharp drop after earnings.

Stochastic Volatility Jump Models

I We'll use an SVJ model:

dS_t = µ_t^P S_t dt + S_t √V_t dW_t^s(P) + d( Σ_{j=1}^{N_t(P)} S_{τ_j−} ( e^{Z_j(P)} − 1 ) )

dV_t = κ_v (θ_v − V_t) dt + σ_v √V_t dW_t^v(P)

I Here W_t^s(P) and W_t^v(P) are Brownian motions. N_t(P) counts the number of jump times, τ_j, prior to time t. µ_t^P is the equity risk premium, and the drift µ_t^P S_t includes the jump compensator.

Filtering, Smoothing, Learning and Prediction

Data yt depends on a latent state variable, xt:

Observation equation: y_t = f(x_t, ε_t^y)
State evolution: x_{t+1} = g(x_t, ε_{t+1}^x)

I Posterior distribution p(x_t | y^t), where y^t = (y_1, ..., y_t)
I Prediction and Bayesian updating:

p(x_{t+1} | y^t) = ∫ p(x_{t+1} | x_t) p(x_t | y^t) dx_t,   (1)

updated by Bayes rule

p(x_{t+1} | y^{t+1}) ∝ p(y_{t+1} | x_{t+1}) × p(x_{t+1} | y^t)   (2)
    [Posterior]    ∝    [Likelihood]      ×      [Prior]

SV: Motivating Example

SV models (JPR, EJP):

y_{t+1} = √V_{t+1} ε_{t+1}^y
log(V_{t+1}) = αv + βv log(V_t) + σv ε_{t+1}^v

Typical parameters: αv = 0, βv = 0.95, and σv = 0.10.

Figure: Simulated continuously compounded returns (top) and stochastic volatility (bottom) over time.
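A minimal simulation of the SV model above with the stated typical parameters; the seed and sample length are arbitrary choices.

# Simulate the log-AR(1) stochastic volatility model above
set.seed(41901)
T1 = 1200
alpha_v = 0; beta_v = 0.95; sigma_v = 0.10
logV = numeric(T1); y = numeric(T1)
for (t in 2:T1) {
  logV[t] = alpha_v + beta_v*logV[t-1] + sigma_v*rnorm(1)
  y[t]    = sqrt(exp(logV[t]))*rnorm(1)
}
par(mfrow=c(2,1))
plot(y, type="l", main="Continuously compounded returns", ylab="y")
plot(exp(logV), type="l", main="Stochastic volatility", ylab="V")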

Galton 1877: First Particle Filter

Streaming Data: Online Learning

Construct an essential state vector Z_{t+1}.

p(Z_{t+1} | y^{t+1}) = ∫ p(Z_{t+1} | Z_t, y_{t+1}) dP(Z_t | y^{t+1})
                     ∝ ∫ p(Z_{t+1} | Z_t, y_{t+1}) × p(y_{t+1} | Z_t) dP(Z_t | y^t)
                            [propagate]                  [resample]

1. Resample with weights proportional to p(y_{t+1} | Z_t^{(i)}) and generate {Z_t^{ζ(i)}}_{i=1}^N
2. Propagate with Z_{t+1}^{(i)} ∼ p(Z_{t+1} | Z_t^{ζ(i)}, y_{t+1}) to obtain {Z_{t+1}^{(i)}}_{i=1}^N

Parameters: p(θ | Z_{t+1}) drawn "offline".

Nonlinear Model

I The observation and evolution dynamics are

y_t = x_t/(1 + x_t²) + v_t, where v_t ∼ N(0, 1)
x_t = x_{t−1} + w_t, where w_t ∼ N(0, 0.5)

I Initial condition x_0 ∼ N(1, 10)

Fundamental question: How do the filtering distributions p(x_t | y^t) propagate in time?

Figure: Simulated states x and observations y from the nonlinear model y_t = x_t/(1 + x_t²) + v_t, over time.

Simulate Data

# MC sample size
N = 1000
# Posterior at time t=0: p(x[0]|y[0]) = N(1,10)
x = rnorm(N,1,sqrt(10))
hist(x,main="Pr(x[0]|y[0])")
# Obtain draws from prior p(x[1]|y[0])
x1 = x + rnorm(N,0,sqrt(0.5))
hist(x1,main="Pr(x[1]|y[0])")
# Likelihood function p(y[1]|x[1])
y1 = 5
ths = seq(-30,30,length=1000)
plot(ths,dnorm(y1,ths/(1+ths^2),1),type="l",xlab="",ylab="")
title(paste("p(y[1]=",y1,"|x[1])",sep=""))

Nonlinear Filtering

Figure: Histograms of the particle approximations Pr(x[0]|y[0]), Pr(x[1]|y[0]), Pr(x[1]|y[1]), and Pr(x[2]|y[1]), alongside the likelihoods p(y[1]=5|x[1]) and p(y[2]=−2|x[2]).

Resampling

Observe y1 = 5, y2 = −2. Key: resample and propagate particles (the full sequential loop is sketched below).

# Computing resampling weights
w = dnorm(y1,x1/(1+x1^2),1)
# Resample to obtain draws from p(x[1]|y[1])
k = sample(1:N,size=N,replace=TRUE,prob=w)
x = x1[k]
hist(x,main="Pr(x[1]|y[1])")
# Obtain draws from prior p(x[2]|y[1])
x2 = x + rnorm(N,0,sqrt(0.5))
hist(x2,main="Pr(x[2]|y[1])")
...
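The two manual steps above repeat for each new observation. A sketch of the complete bootstrap filter for the nonlinear model, assuming a vector y of observations has already been simulated from the model:

# Full bootstrap filter for the nonlinear model (y assumed simulated from it)
T1 = length(y)
N  = 1000
x  = rnorm(N, 1, sqrt(10))            # draws from p(x[0]|y[0]) = N(1,10)
xs = matrix(0, N, T1)                 # particles for p(x[t]|y[1:t])
for (t in 1:T1) {
  x1 = x + rnorm(N, 0, sqrt(0.5))     # propagate through the state equation
  w  = dnorm(y[t], x1/(1 + x1^2), 1)  # likelihood weights p(y[t]|x[t])
  k  = sample(1:N, size=N, replace=TRUE, prob=w)
  x  = x1[k]                          # resample
  xs[,t] = x
}
mx = apply(xs, 2, mean)               # filtered means E(x[t]|y[1:t])
plot(mx, type="l", xlab="Time", ylab="mx")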

Propagation of MC error

Figure: Filtered mean paths mx over time for N = 1000, 2000, 5000, and 10000 particles.

Dynamic Linear Model (DLM): Kalman Filter

Kalman filter for linear Gaussian systems.
I FFBS (Forward Filtering, Backward Sampling) determines the posterior distributions of the states, p(x_t|y^t) and p(x_t|y^T), and also the joint distribution p(x^T|y^T) of the hidden states.
I Discrete Hidden Markov Model, HMM (Baum-Welch, Viterbi).
I With parameters known, the Kalman filter gives the exact recursions.

Simulate DLM

Dynamic Linear Model: yt = xt + vt and xt = α + βxt−1 + wt

# Simulate data (ht plays the role of the state x[t])
T1=50; alpha=-0.01; beta=0.98; sW=0.1; sV=0.25
W=sW^2; V=sV^2
y = rep(0,2*T1)
ht = rep(0,2*T1)
for (t in 2:(2*T1)){
  ht[t] = rnorm(1,alpha+beta*ht[t-1],sW)
  y[t] = rnorm(1,ht[t],sV)
}
# Discard the first T1 draws as burn-in
ht = ht[(T1+1):(2*T1)]
y = y[(T1+1):(2*T1)]

Exact calculations

# Kalman Filter recursions
m=rep(0,T1); C=rep(0,T1); m0=0; C0=100
R = C0+W; Q = R+V; A = R/Q
m[1] = m0+A*(y[1]-m0); C[1] = R-A^2*Q
for (t in 2:T1){
  R = C[t-1]+W               # prior variance
  Q = R+V                    # predictive variance
  A = R/Q                    # Kalman gain
  m[t] = m[t-1]+A*(y[t]-m[t-1])
  C[t] = R-A^2*Q
}

# Quantiles of the particle draws hs (from the bootstrap filter below)
q025 = function(x){quantile(x,0.025)}
q975 = function(x){quantile(x,0.975)}
h025 = apply(hs,2,q025)
sd   = sqrt(apply(hs,2,var))

DLM Data

Figure: Simulated data y(t) with the Kalman filter mean m(t) and the bands m(t) − 2*sqrt(C(t)) and m(t) + 2*sqrt(C(t)).

Bootstrap Filter

Now run the particle filter.

# Bootstrap filter
M = 1000
h = rnorm(M,m0,sqrt(C0))
hs = NULL
for (t in 1:T1){
  h1 = rnorm(M,alpha+beta*h,sW)             # propagate
  w  = dnorm(y[t],h1,sV)                    # likelihood weights
  w  = w/sum(w)
  h  = sample(h1,size=M,replace=T,prob=w)   # resample
  hs = cbind(hs,h)
}
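Since the DLM is linear and Gaussian, the particle approximation can be checked against the exact Kalman recursions computed earlier; a minimal comparison sketch, assuming m (Kalman means) and hs (particles) from the code above are in the workspace.

# Check the particle approximation against the exact Kalman means
smc_mean = apply(hs, 2, mean)
plot(m, type="l", xlab="time", ylab="x(t)", main="Filtering")
lines(smc_mean, lty=2)
legend("topright", c("True", "SMC"), lty=1:2)
max(abs(m - smc_mean))   # Monte Carlo error, small for M = 1000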

DLM Filtering

Figure: True (Kalman) versus SMC filtered means x(t) over time.

Streaming Data

How do parameter distributions change in time?

Online Dynamic Learning
I Real-time surveillance
I Bayes means sequential updating of information
I Update the posterior density p(θ | y^t) with every new observation (t = 1, ..., T): "sequential learning"

Bayes theorem: p(θ | y^t) ∝ p(y_t | θ) p(θ | y^{t−1})

Sample – Resample

Figure: Particle clouds evolving from the posterior at t, through the prior and weights at t+1, to the posterior at t+1, and on through the prior and weights at t+2.

Resample – Sample

Figure: Particle clouds evolving from the posterior at t, through reweighting at t+1, to the posterior at t+1, and through reweighting at t+2 to the posterior at t+2.

Particle Methods: Blind Propagation

Figure: Propagate-Resample is replaced by Resample-Propagate

Portfolio Choice

How should we allocate wealth across risky assets?
1. Portfolio Analysis: de Finetti (1940), "On a Problem of Re-insurance"; Markowitz (1952), "Portfolio Selection"
2. Cash or Stocks? Samuelson (1969) and Merton (1971). Optimal allocation to stocks:

ω* = (1/γ) (µ − r)/σ²

γ is the agent's risk aversion. Dynamically: µt and Vt, and hedging demands.

Historical Return Data

I Here are the returns (%) for US stocks, bonds, bills, and gold:

Period      Stocks  Bonds  Bills  Gold   Inflation
1802-1998   7.0     3.5    2.9    -0.1   1.3
1802-1870   7.0     4.8    5.1    0.2    0.1
1871-1925   6.6     3.7    3.2    -0.8   0.6
1926-1998   7.4     2.2    0.7    0.2    3.1
1946-1998   7.8     1.3    0.6    -0.7   4.2

I What does the Samuelson/Merton rule yield?

Portfolio strategy: max_{ωt} E[U(W_{t+1}) | R^t]

ω*_t = (1/γ) E[R_{t+1} − R_f | R^t] / Var[R_{t+1} | R^t]
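A one-line illustration of the static Samuelson/Merton rule in R; µ and r are taken loosely from the 1802-1998 row of the table above, while σ = 0.15 is an assumed equity volatility, not a number from the slides.

# omega* = (1/gamma)*(mu - r)/sigma^2
# mu, r loosely from the 1802-1998 row above; sigma = 0.15 is an assumption
mu = 0.070; r = 0.029; sigma = 0.15
for (gamma in c(2, 4, 6))
  cat("gamma =", gamma, " omega* =", round((1/gamma)*(mu - r)/sigma^2, 2), "\n")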

SP500 Data

Market Timing: time-varying, mean-reverting risk premia model. Volatility Timing: SV and leverage.

Strategy    Mean   SD    Sharpe
S&P500      13.6   14.9  0.49
SV: γ = 2   16.1   17    0.53
SV: γ = 4   11.2   10.2  0.68
SV: γ = 6   9.6    7.1   0.94

Sequential Posteriors

Figure: Sequential posterior draws for the SV parameters αµ (×E-5), αv, σµ (×E-4), βv, ρ, and σv, 1976-2000.

Wealth - SP500 - fixed v - parameter learning

Figure: Cumulative wealth paths for γ = 2, 4, 6 under the Market, Leverage, and No Leverage strategies, 1974-1999.