Lecture 14. More on using dummy variables (dealing with seasonality). More things to worry about: measurement error in variables (which can lead to bias in OLS – endogeneity).
Have seen that dummy variables are useful when interested in measuring average differences between discrete groups, or for policy evaluation (the use of interaction terms leads to the difference-in-differences estimator). Now see how dummy variables can be used to deal with "seasonality" in data.
Using Dummy Variables to capture Seasonality in Data

Can also use dummy variables to pick out and control for seasonal variation in data. The idea is to include a set of dummy variables for each quarter (or month, or day), which will then net out the average change in a variable resulting from any seasonal fluctuations:

Yt = b0 + b1Q1 + b2Q2 + b3Q3 + b4X + ut

where the quarterly dummy Q1 = 1 if the observation belongs to the 1st quarter of the year (Jan–Mar) and = 0 otherwise (and similarly for Q2 and Q3). Hence the coefficient on Q1 gives the level of Y in the 1st quarter of the year relative to the constant (the Q4 level of Y), averaged over all Q1 observations in the data set. Series net of seasonal effects are said to be "seasonally adjusted".
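As a concrete illustration, the dummies can be created and the regression run in Stata roughly as follows (a sketch only: the names y, x and the quarterly date variable qdate are hypothetical, not variables from any data set used in these notes):

. gen quarter = quarter(dofq(qdate))   /* extract the quarter (1-4) from a %tq date variable */
. tabulate quarter, generate(Q)        /* creates the indicator variables Q1-Q4 */
. reg y Q1 Q2 Q3 x                     /* Q4 omitted, so it is the base category picked up by the constant */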
It may also be useful to model an economic series as a combination of a seasonal and a trend component:

Yt = b0 + b1Q1 + b2Q2 + b3Q3 + b4Trend + ut

where Trend = 1 in year 1, = 2 in year 2, ... , = T in year T.

Since dYt/dTrend = b4, and the coefficient measures the unit change in Y for a unit change in the trend variable (the units of measurement in this case being years), the trend term in the model above measures the annual change in the Y variable net of any seasonal influences.
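A minimal way to construct such a trend in Stata, assuming the data contain a numeric year variable (a sketch, not a command taken from the notes):

. sort year
. gen trend = year - year[1] + 1   /* = 1 in the first sample year, 2 in the second, ..., T in the last */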
In 2000 the UK Department of Transport announced that "By 2010 we want to achieve, compared with the average for 1994-98:
• a 40% reduction in the number of people killed or seriously injured in road accidents;
• a 50% reduction in the number of children killed or seriously injured; and
• a 10% reduction in the slight casualty rate, expressed as the number of people slightly injured per 100 million vehicle kilometres."
Did they reach this target?
The data set accidents.dta (on the course web site) contains quarterly information on the number of road accidents in the UK from 1983 to 2006
. twoway (line acc time, xline(2000))

[Graph: total road accidents, DoT quarterly data, plotted against time (1980–2010), with a vertical line at 2000]
The graph shows that road accidents vary more within years than between years. The seasonal influence can be seen from a regression of the number of accidents on 3 dummy variables (one for each quarter other than the default category – the 4th quarter).
. list acc year quart time Q1 Q2 Q3 Q4, clean

         acc   year   quart      time   Q1   Q2   Q3   Q4
  1.   67135   1983      Q1   1983.25    1    0    0    0
  2.   76622   1983      Q2   1983.5     0    1    0    0
  3.   82277   1983      Q3   1983.75    0    0    1    0
  4.   82550   1983      Q4   1984       0    0    0    1
  5.   69362   1984      Q1   1984.25    1    0    0    0
  6.   79124   1984      Q2   1984.5     0    1    0    0
A regression of road accident numbers on quarterly dummies (Q4 = winter is the default, given by the constant term: roughly 81,939 accidents on average in the 4th quarter) shows significantly fewer accidents in the first two quarters (and marginally fewer in the third) than in the fourth quarter (October–December). On average there are around 12,934 fewer accidents in the first quarter of the year than in the last.

. reg acc Q1 Q2 Q3

      Source |       SS       df       MS              Number of obs =     108
-------------+------------------------------           F(  3,   104) =   12.44
       Model |  2.4702e+09     3   823397888           Prob > F      =  0.0000
    Residual |  6.8839e+09   104  66191716.9           R-squared     =  0.2641
-------------+------------------------------           Adj R-squared =  0.2428
       Total |  9.3541e+09   107  87421796.5           Root MSE      =  8135.8

------------------------------------------------------------------------------
         acc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          Q1 |  -12933.74   2214.292    -5.84   0.000    -17324.77   -8542.716
          Q2 |  -7978.926   2214.292    -3.60   0.000    -12369.95   -3587.901
          Q3 |  -4067.185   2214.292    -1.84   0.069     -8458.21    323.8394
       _cons |   81938.52   1565.741    52.33   0.000      78833.6    85043.44
------------------------------------------------------------------------------

Saving the residual values after netting out the influence of the seasons is the basis for the production of "seasonally adjusted" data (a better guide to the underlying trend), used in many official government statistics. Can get a sense of how this works with the following commands after a regression:

. predict rhat, resid   /* saves the residuals in a new variable with the name "rhat" */
. twoway (line rhat time, xline(2000))
[Graph: residuals from the seasonal-dummy regression (roughly –20000 to 10000) plotted against time, with a vertical line at 2000]
Can see the seasonality is reduced and the trend is much clearer. The graph of the residuals is much smoother than the original series – it should be, since much of the seasonality has been taken out by the dummy variables. The graph also shows that, once seasonality is accounted for, there is little evidence of a change in the number of road accidents over time until the year 2000.

To model both the seasonal and the trend components of an economic series, simply include both seasonal dummies and a time trend in the regression model:

Yt = b0 + b1Q1 + b2Q2 + b3Q3 + b4TREND + ut

. reg logacc Q1 Q2 Q3 year

      Source |       SS       df       MS              Number of obs =     108
-------------+------------------------------           F(  4,   103) =   42.28
       Model |  1.11504664     4  .278761661           Prob > F      =  0.0000
    Residual |  .679143681   103  .006593628           R-squared     =  0.6215
-------------+------------------------------           Adj R-squared =  0.6068
       Total |  1.79419032   107  .016768134           Root MSE      =   .0812

------------------------------------------------------------------------------
      logacc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          Q1 |  -.1698453   .0221002    -7.69   0.000    -.2136758   -.1260149
          Q2 |  -.1005263   .0221002    -4.55   0.000    -.1443567   -.0566958
          Q3 |  -.0499164   .0221002    -2.26   0.026    -.0937469    -.006086
        year |  -.0102509   .0010032   -10.22   0.000    -.0122404   -.0082613
       _cons |   31.76723   2.002392    15.86   0.000     27.79596     35.7385
------------------------------------------------------------------------------

Can see that there is a downward trend in road accidents of roughly 1% a year (the coefficient on year is –0.0103 in a log specification), net of any seasonality. Could also use dummy variable interactions to test whether this trend is stronger after 2000. How? (A sketch is given after the next example.)

Can also use seasonal dummy variables to check whether an apparent association between variables is in fact caused by seasonality in the data.

. reg acc du

      Source |       SS       df       MS              Number of obs =      71
-------------+------------------------------           F(  1,    69) =    6.19
       Model |   236050086     1   236050086           Prob > F      =  0.0153
    Residual |  2.6325e+09    69  38151620.6           R-squared     =  0.0823
-------------+------------------------------           Adj R-squared =  0.0690
       Total |  2.8685e+09    70  40978741.5           Root MSE      =  6176.7

------------------------------------------------------------------------------
         acc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          du |  -4104.777   1650.228    -2.49   0.015    -7396.892    -812.662
       _cons |   79558.78   768.3058   103.55   0.000     78026.06    81091.51
------------------------------------------------------------------------------
The regression suggests a negative association between the change in the unemployment rate and the level of accidents (a 1 percentage point rise in the change in the unemployment rate is associated with a fall in the number of accidents of around 4,100, if this regression is to be believed). Might this be in part because seasonal movements in both data series are influencing the results? (The unemployment rate also varies seasonally, typically higher in Q1 of each year.)

. reg acc du q2-q4

      Source |       SS       df       MS              Number of obs =      71
-------------+------------------------------           F(  4,    66) =   47.37
       Model |  2.1275e+09     4   531865433           Prob > F      =  0.0000
    Residual |   741050172    66  11228032.9           R-squared     =  0.7417
-------------+------------------------------           Adj R-squared =  0.7260
       Total |  2.8685e+09    70  40978741.5           Root MSE      =  3350.8

------------------------------------------------------------------------------
         acc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          du |  -1030.818   1009.324    -1.02   0.311    -3045.999    984.3627
          q2 |   5132.594    1266.59     4.05   0.000     2603.766    7661.422
          q3 |   10093.64   1174.291     8.60   0.000     7749.089    12438.18
          q4 |   14353.92   1212.479    11.84   0.000     11933.13    16774.72
       _cons |   72488.21    834.607    86.85   0.000     70821.87    74154.56
------------------------------------------------------------------------------
Can see that once quarterly seasonal dummy variables are added, the apparent effect of unemployment disappears.
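Returning to the question above of how to test whether the downward trend is stronger after 2000: a minimal sketch (the variables post2000 and trend_post are created here for illustration; logacc, year and the quarterly dummies are those already in accidents.dta) is to interact the trend with a post-2000 dummy and test the interaction term:

. gen post2000 = (year >= 2000)               /* = 1 from 2000 onwards, 0 before */
. gen trend_post = (year - 2000)*post2000     /* extra trend that only operates after 2000 */
. reg logacc Q1 Q2 Q3 year post2000 trend_post
. test trend_post                             /* H0: no change in the trend after 2000 */

A significantly negative coefficient on trend_post would indicate a steeper decline after 2000.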
Measurement Error

Often a data set will contain imperfect measures of the data we would ideally like.

Aggregate data (GDP, Consumption, Investment): only best guesses of their theoretical counterparts, and frequently revised by government statisticians (so earlier estimates must have been subject to error).

Survey data (income, health, age): individuals often lie, forget or round to the nearest large number (£102 a week or £100?).

payperiod.dta: hist grossam if pyperiod==1 & grossam0 & grsp==1, bin(100) xline(100 200 300 350)

Proxy data (Ability, Intelligence, Permanent Income): difficult to agree on a definition, let alone measure.
Measurement Error in Dependent Variable

True:      ytrue = b0 + b1X + u      (1)
Observe:   y = ytrue + e             (2)

ie the dependent variable is measured with error e, and e is a random residual term just like u, so E(e) = 0.

Sub. (2) into (1):

y – e = b0 + b1X + u

Take the error term on the left to the other side:

y = b0 + b1X + u + e
y = b0 + b1X + v      where v = u + e      (3)
It is OK to estimate y = b0 + b1X + v by OLS, since

E(u) = E(e) = 0   (just random residuals, so the means are zero)

Cov(X,u) = 0   (no correlation between the original X variable and the original error term)

and also Cov(X,e) = 0   (nothing to suggest the X variable is correlated with the measurement error in the dependent variable).

So OLS estimates are unbiased in this case, but the standard errors are larger than they would be in the absence of measurement error, with the associated problems of inference (Type II error).
True:      Var(b̂1) = σu² / (N·Var(X))      from ytrue = b0 + b1X + u      (A)

Estimate:  Var(b̃1) = σv² / (N·Var(X))      from y = b0 + b1X + v

But v = u + e and so σv² = σu² + σe² (using the rules on covariances).

Hence   Var(b̃1) = (σu² + σe²) / (N·Var(X))      (B)

[Since Var(v) = Var(u+e) = Var(e) + Var(u) + 2Cov(e,u), and if we assume the things that cause measurement error in y are unrelated to the residual u, then Cov(e,u) = 0 and so Var(v) = Var(e) + Var(u).]

Hence   Var(b̃1) = (σu² + σe²) / (N·Var(X))   >   Var(b̂1) = σu² / (N·Var(X))

So the residual variance in the presence of measurement error in the dependent variable now also contains an additional contribution from the error in the y variable, σe². Standard errors are therefore larger in models where there is measurement error in the Y variable, and the bigger the measurement error, the larger the standard errors, the lower the t (and F) values and the greater the risk of Type II error (failing to reject a false null).
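The effect is easy to see with simulated data. The following is a minimal Stata sketch (the variable names, sample size and parameter values are all made up for illustration, not taken from any data set used in these notes):

. clear
. set obs 200
. set seed 12345
. gen x = rnormal()
. gen u = rnormal()
. gen e = 2*rnormal()              /* measurement error in y, made deliberately large */
. gen ytrue = 1 + 2*x + u
. gen yobs  = ytrue + e
. reg ytrue x                      /* slope close to 2, relatively small standard error */
. reg yobs x                       /* slope still close to 2 on average, but larger standard error */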
Measurement Error in Explanatory Variable

True:      ytrue = b0 + b1Xtrue + u      (1)
Observe:   X = Xtrue + w                 (2)

ie the right-hand-side variable is measured with error (w).

Sub. (2) into (1), ie use the fact that Xtrue = X – w:

ytrue = b0 + b1(X – w) + u
ytrue = b0 + b1X – b1w + u
ytrue = b0 + b1X + v      (3)

where now v = –b1w + u (so the residual term again consists of 2 components).

Hence (3) is the basis for OLS estimation. Does this matter?
In the true (2 variable) model, we know that OLS implies that

b̂1 = Cov(Xtrue, ytrue) / Var(Xtrue) = Cov(Xtrue, b0 + b1Xtrue + u) / Var(Xtrue) = b1 + Cov(Xtrue, u) / Var(Xtrue)      (4)

(just sub. in for ytrue and cancel terms)

Since by assumption Cov(Xtrue, u) = 0, then in (4) E(b̂1) = b1 and OLS is unbiased.

But in the presence of measurement error, we estimate

ytrue = b0 + b1X + v      not      ytrue = b0 + b1Xtrue + u

and now rewrite Cov(X,v) = Cov(Xtrue + w, –b1w + u)   (sub. in for X and v using (2) & (3)).
Expanding terms using the rules on covariances:

Cov(Xtrue + w, –b1w + u) = Cov(Xtrue, u) + Cov(Xtrue, –b1w) + Cov(w, u) + Cov(w, –b1w)

u and w are independent errors (caused by different factors), so there is no reason to expect them to be correlated with each other or with the true value of X (this means any error in X should not depend on the level of X), so

Cov(w, u) = Cov(Xtrue, u) = Cov(Xtrue, –b1w) = 0

This leaves

Cov(X, v) = Cov(w, –b1w) = –b1·Cov(w, w) = –b1·Var(w)

Hence   Cov(X, v) ≠ 0
In other words there is now a correlation between the X variable and the error term in (3). So if we estimate (3) by OLS, this violates the main assumption needed to get an unbiased estimate: we need the covariance between the explanatory variable and the residual to be zero (see earlier notes), and this is not the case when the explanatory variable is measured with error.

Given ytrue = b0 + b1X + v, (4) becomes

b̂1 = Cov(X, ytrue) / Var(X) = Cov(X, b0 + b1Xtrue + u) / Var(X) = b1 + Cov(X, v) / Var(X) = b1 – b1·Var(w) / Var(X) ≠ b1

so E(b̂1) ≠ b1 and OLS gives biased estimates in the presence of measurement error in the explanatory variable.
Not only that: it can be shown that the OLS estimates are always biased toward zero (Attenuation Bias):

if the true b1 > 0 then b̂1OLS < b1
if the true b1 < 0 then b̂1OLS > b1

ie closer to zero in both cases.
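A standard way to express the size of this bias (added here as a supporting note, using the definitions above): because X = Xtrue + w with Cov(Xtrue, w) = 0, we have Var(X) = Var(Xtrue) + Var(w), so the expression above can be written

b̂1 = b1·[1 – Var(w)/Var(X)] = b1·Var(Xtrue) / (Var(Xtrue) + Var(w))

The factor multiplying b1 lies between 0 and 1, so the estimate is shrunk toward zero, and the shrinkage is greater the larger the measurement-error variance Var(w) is relative to the variance of the true X.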
The problem is that Cov(X,v) ≠ 0, ie the residual and the right-hand-side variable are correlated. In such situations the X variable is said to be endogenous.
Example of Consequences of Measurement Error in Data

This example uses artificially generated measurement error to illustrate the basic issues.

. list      /* list observations in data set */

          y_true    y_observ   x_true   x_observ
  1.     75.4666     67.6011       80         60
  2.     74.9801     75.4438      100         80
  3.    102.8242    109.6956      120        100
  4.    125.7651    129.4159      140        120
  5.    106.5035    104.2388      160        140
  6.    131.4318    125.8319      180        200
  7.    149.3693    153.9926      200        220
  8.    143.8628    152.9208      220        240
  9.    177.5218    176.3344      240        260
 10.    182.2748    174.5252      260        280
The data set measerr.dta gives true (unobserved) values of y and x and their observed counterparts (the 1st 5 observations on x underestimate the true value by 20 and the last 5 overestimate it by 20).

First look at a regression of true y on true x:

. reg y_true x_true

      Source |       SS       df       MS              Number of obs =      10
-------------+------------------------------           F(  1,     8) =  105.60
       Model |  11879.9941     1  11879.9941           Prob > F      =  0.0000
    Residual |  900.003252     8  112.500407           R-squared     =  0.9296
-------------+------------------------------           Adj R-squared =  0.9208
       Total |  12779.9974     9  1419.99971           Root MSE      =  10.607

------------------------------------------------------------------------------
      y_true |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      x_true |   .5999999   .0583875   10.276   0.000      .465358    .7346417
       _cons |   25.00002   10.47727    2.386   0.044     .8394026    49.16064
Now look at the consequence of measurement error in the dependent variable:

. reg y_obs x_true

      Source |       SS       df       MS              Number of obs =      10
-------------+------------------------------           F(  1,     8) =   77.65
       Model |  11880.0048     1  11880.0048           Prob > F      =  0.0000
    Residual |  1223.99853     8  152.999817           R-squared     =  0.9066
-------------+------------------------------           Adj R-squared =  0.8949
       Total |  13104.0033     9  1456.00037           Root MSE      =  12.369

------------------------------------------------------------------------------
    y_observ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      x_true |   .6000001   .0680908    8.812   0.000     .4429824    .7570178
       _cons |   24.99999   12.21846    2.046   0.075    -3.175826    53.17581
Consequence: the coefficients are virtually identical (unbiased), but the standard errors are larger and hence the t values smaller and the confidence intervals wider.

Measurement error in the explanatory variable:

. reg y_true x_obs

      Source |       SS       df       MS              Number of obs =      10
-------------+------------------------------           F(  1,     8) =   83.15
       Model |  11658.3625     1  11658.3625           Prob > F      =  0.0000
    Residual |  1121.63494     8  140.204367           R-squared     =  0.9122
-------------+------------------------------           Adj R-squared =  0.9013
       Total |  12779.9974     9  1419.99971           Root MSE      =  11.841

------------------------------------------------------------------------------
      y_true |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    x_observ |   .4522529   .0495956     9.12   0.000     .3378852    .5666206
       _cons |   50.11701   9.225319     5.43   0.001     28.84338    71.39063
------------------------------------------------------------------------------
Consequence: both coefficients are biased. The slope coefficient is biased toward zero (0.45 compared with the true 0.60, ie the effect is underestimated by about 25%) and the intercept is biased upward (compare 50.1 with 25.0). The problem is that Cov(X,u) ≠ 0, ie the residual and the right-hand-side variable are correlated. In such situations the X variable is said to be endogenous.
Solution? – Get better data.

If that is not possible, do something to get round the problem: replace the variable causing the correlation with the residual with one that is not correlated with the residual but that at the same time is still related to the original variable. Any variable that has these 2 properties is called an Instrumental Variable (an idea usually credited to Phillip Wright).

More formally, an instrument Z for the variable of concern X satisfies

1) Cov(X,Z) ≠ 0   – correlated with the problem variable
2) Cov(Z,u) = 0   – but uncorrelated with the residual (so it does not suffer from measurement error and also is not correlated with any unobservable factors influencing the dependent variable)
Instrumental variable (IV) estimation proceeds as follows.

Given a model

y = b0 + b1X + u      (1)

multiply (1) by the instrument Z (Zy = Zb0 + b1ZX + Zu) and take covariances. It follows that

Cov(Z,y) = Cov(Z, b0 + b1X + u) = Cov(Z,b0) + b1Cov(Z,X) + Cov(Z,u)

Since Cov(Z,b0) = 0 (using the rules on the covariance with a constant) and Cov(Z,u) = 0 (if the assumption above about the properties of instruments is correct), then

Cov(Z,y) = 0 + b1Cov(Z,X) + 0
Solving Cov(Z,y) = 0 + b1Cov(Z,X) + 0 for b1 gives the formula used to calculate the instrumental variable estimator:

b1IV = Cov(Z,y) / Cov(Z,X)      (compare with b1OLS = Cov(X,y) / Var(X))
In the presence of measurement error (or endogeneity in general) the IV estimate is unbiased in large samples (though it may be biased in small samples) – technically the IV estimator is said to be consistent – while the OLS estimator is inconsistent in the presence of endogeneity. This makes IV a useful estimation technique to employ.
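In Stata, IV estimation of a model like this can be run with the built-in two-stage least squares command; a minimal sketch with hypothetical variable names y, x and instrument z (the worked example later in these notes uses the older ivreg syntax instead):

. ivregress 2sls y (x = z)    /* instrument the suspect regressor x with z */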
However it can be shown that (in the 2 variable case) the variance of the IV estimator is given by

Var(b1IV) = [s² / (N·Var(X))] · (1 / r²XZ)

where r²XZ is the square of the correlation coefficient between the endogenous variable and the instrument

(compared with the OLS variance   Var(b1OLS) = s² / (N·Var(X)) ).

Since r²XZ lies between 0 and 1, the IV variance is at least as large as the OLS variance, so IV estimation is less precise (less efficient) than OLS estimation. May sometimes want to trade off bias against efficiency.
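As a quick numerical illustration of the formula above: if the squared correlation between X and Z is r²XZ = 0.25, then 1/r²XZ = 4, so the IV variance is four times the OLS variance and the IV standard error is roughly twice the OLS standard error.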
Where to find an instrument?
Lecture 15. Potential solution to endogeneity – instrumental variable estimation. Tests for endogeneity. Other sources of endogeneity. Problems with weak instruments.
The problem with measurement error in X variables is that it makes Cov(X,u) ≠ 0, ie the residual and the right-hand-side variable are correlated. In such situations the X variable is said to be endogenous, and OLS will be biased toward zero (and inconsistent) in this case.
Solution? – Get better data. If that is not possible, do something to get round the problem: replace the variable causing the correlation with the residual with one that is not correlated with the residual but that at the same time is still related to the original variable. Any variable that has these 2 properties is called an Instrumental Variable. More formally, an instrument Z for the variable of concern X satisfies 1) Cov(X,Z) ≠ 0 – correlated with the problem variable; and 2) Cov(Z,u) = 0 – but uncorrelated with the residual (so it does not suffer from measurement error and also is not correlated with any unobservable factors influencing the dependent variable).
So b1IV = Cov(Z,y) / Cov(Z,X)      (compare with b1OLS = Cov(X,y) / Var(X))
In the presence of measurement error (or endogeneity in general) the IV estimate is unbiased in large samples (but may be biased in small samples) - technically the IV estimator is consistent – while the OLS estimator is inconsistent which makes IV a useful estimation technique to employ
However it can be shown that (in the 2 variable case) the variance of the IV estimator is given by

Var(b1IV) = [s² / (N·Var(X))] · (1 / r²XZ)

where r²XZ is the square of the correlation coefficient between the endogenous variable and the instrument (compared with the OLS variance Var(b1OLS) = s² / (N·Var(X)) ).

So IV estimation is less precise (less efficient) than OLS estimation, since 0 < r²XZ < 1 (it must be non-zero to satisfy the first requirement of an instrument); but the greater the correlation between X and Z, the larger is r²XZ, the smaller is Var(b1IV), and hence the lower the standard errors and the higher the t values.
So why not ensure that the correlation between X and the instrument Z is as high as possible?

– If X and Z are perfectly correlated then Z must also be correlated with u, and so suffers the same problems as X: the initial problem is not solved.
– Conversely, if the correlation between the endogenous variable and the instrument is small there are also problems.

Since we can always write the IV estimator as b1IV = Cov(Z,y)/Cov(Z,X), sub. in for y = b0 + b1X + u:

b1IV = Cov(Z, b0 + b1X + u) / Cov(Z,X)
     = [Cov(Z,b0) + b1Cov(Z,X) + Cov(Z,u)] / Cov(Z,X)
     = [0 + b1Cov(Z,X) + Cov(Z,u)] / Cov(Z,X)

So

b1IV = b1 + Cov(Z,u) / Cov(Z,X)

So if Cov(Z,X) is small then the IV estimate can be a long way from the true value b1.

So: always check the extent of the correlation between X and Z before any IV estimation (see later).
In large samples you can have as many instruments as you like – though finding good ones is a different matter. In large samples more is better; in small samples a minimum number of instruments is better (bias in small samples increases with the number of instruments).

Where to find good instruments? – Difficult. The appropriate instrument will vary depending on the issue under study.
In the case of measurement error, one could use the rank of X as an instrument (ie order the variable X by size and use the position in that ordering rather than the actual value). It is clearly correlated with the original value but, because it is a rank, it should not be affected by measurement error – though this assumes that the measurement error is not so large as to affect the (true) ordering of the X variable.
. egen rankx=rank(x_obs)   /* stata command to create the ranking of x_observ */
. list x_obs rankx

       x_observ   rankx
  1.         60       1
  2.         80       2
  3.        100       3
  4.        120       4
  5.        140       5
  6.        200       6
  7.        220       7
  8.        240       8
  9.        260       9
 10.        280      10
rankx runs from the smallest observed x to the largest. Now do instrumental variable estimation using rankx as the instrument for x_obs:

. ivreg y_t (x_ob=rankx)

Instrumental variables (2SLS) regression

      Source |       SS       df       MS              Number of obs =      10
-------------+------------------------------           F(  1,     8) =   84.44
       Model |  11654.5184     1  11654.5184           Prob > F      =  0.0000
    Residual |  1125.47895     8  140.684869           R-squared     =  0.9119
-------------+------------------------------           Adj R-squared =  0.9009
       Total |  12779.9974     9  1419.99971           Root MSE      =  11.861

------------------------------------------------------------------------------
      y_true |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    x_observ |    .460465   .0501086     9.19   0.000     .3449144    .5760156
       _cons |   48.72095   9.307667     5.23   0.001     27.25743    70.18447
------------------------------------------------------------------------------
Instrumented:  x_observ
Instruments:   rankx
------------------------------------------------------------------------------
Can see that both estimated coefficients are a little closer to their true values than the estimates from the regression with measurement error (but not by much) – in this case the rank of X is not a very good instrument. Note that the standard error in the instrumented regression is larger than the standard error in the regression of y_true on x_observed, as expected with IV estimation.
Testing for Endogeneity

It is good practice to compare OLS and IV estimates. If the estimates are very different, this may be a sign that things are amiss. Using the idea that IV estimation will always be (asymptotically) unbiased, whereas OLS will only be unbiased if Cov(X,u) = 0, we can do the following:

Wu-Hausman Test for Endogeneity

1. Given

y = b0 + b1X + u      (A)

regress the endogenous variable X on the instrument(s) Z:

X = d0 + d1Z + v      (B)

and save the residuals v̂.
2. Include this residual as an extra term in the original model: ie given y = b0 + b1X + u, estimate

y = b0 + b1X + b2·v̂ + e      (C)

and test whether b2 = 0 (using a t test).

If b2 = 0, conclude there is no correlation between X and u.
If b2 ≠ 0, conclude there is correlation between X and u.

Why? Because X = d0 + d1Z + v (endogenous X = instrument + something else), and so the only way X could be correlated with u in (A) is through v (since Z is not correlated with u by assumption). This means the residual u in (A) depends on v plus some other residual:

u = b2·v + e

So estimate (C) instead and test whether the coefficient on v̂ is significant. If it is, conclude that X and the error term are indeed correlated: there is endogeneity.

N.B. This test is only as good as the instruments used and is only valid asymptotically. This may be a problem in small samples, so you should generally use this test only with sample sizes well above 100.
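In Stata the two steps map onto a handful of commands; a generic sketch with hypothetical names y, x and instrument z (the worked example below does exactly this with real variables):

. reg x z              /* first stage: regress the suspect variable on the instrument(s) */
. predict vhat, resid  /* save the first-stage residuals */
. reg y x vhat         /* augmented regression */
. test vhat            /* H0: coefficient on vhat = 0, ie X can be treated as exogenous */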
Example: The data set ivdat.dta contains information on the number of GCSE passes of a sample of 16 year olds and the total income of the household in which they live. Income tends to be measured with error: individuals tend to mis-report incomes, particularly third-party incomes and non-labour income. The following regression may therefore be subject to measurement error in one of the right-hand-side variables (the gender dummy variable is less subject to error).

. reg nqfede inc1 female

      Source |       SS       df       MS              Number of obs =     252
-------------+------------------------------           F(  2,   249) =   14.55
       Model |  274.029395     2  137.014698           Prob > F      =  0.0000
    Residual |   2344.9706   249  9.41755263           R-squared     =  0.1046
-------------+------------------------------           Adj R-squared =  0.0974
       Total |     2619.00   251  10.4342629           Root MSE      =  3.0688

------------------------------------------------------------------------------
      nqfede |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        inc1 |   .0396859   .0087786     4.52   0.000      .022396    .0569758
      female |   1.172351    .387686     3.02   0.003     .4087896    1.935913
       _cons |   4.929297   .4028493    12.24   0.000      4.13587    5.722723
To test for endogeneity, first regress the suspect variable on the instrument and any exogenous variables in the original regression:

. reg inc1 ranki female

      Source |       SS       df       MS              Number of obs =     252
-------------+------------------------------           F(  2,   249) =  247.94
       Model |  81379.4112     2  40689.7056           Prob > F      =  0.0000
    Residual |   40863.626   249  164.110948           R-squared     =  0.6657
-------------+------------------------------           Adj R-squared =  0.6630
       Total |  122243.037   251  487.024053           Root MSE      =  12.811

------------------------------------------------------------------------------
        inc1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       ranki |   .2470712   .0110979    22.26   0.000     .2252136    .2689289
      female |   .2342779   1.618777     0.14   0.885    -2.953962    3.422518
       _cons |   .7722511   1.855748     0.42   0.678    -2.882712    4.427214
------------------------------------------------------------------------------
1. Save the residuals:

. predict uhat, resid
2. Include the residuals as an additional regressor in the original equation:

. reg nqfede inc1 female uhat

      Source |       SS       df       MS              Number of obs =     252
-------------+------------------------------           F(  3,   248) =    9.94
       Model |  281.121189     3  93.7070629           Prob > F      =  0.0000
    Residual |  2337.87881   248  9.42693069           R-squared     =  0.1073
-------------+------------------------------           Adj R-squared =  0.0965
       Total |     2619.00   251  10.4342629           Root MSE      =  3.0703

------------------------------------------------------------------------------
      nqfede |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        inc1 |   .0450854   .0107655     4.19   0.000     .0238819    .0662888
      female |   1.176652   .3879107     3.03   0.003     .4126329    1.940672
        uhat |  -.0161473   .0186169    -0.87   0.387    -.0528147    .0205201
       _cons |   4.753386   .4512015    10.53   0.000     3.864711    5.642062
------------------------------------------------------------------------------
The added residual is not statistically significantly different from zero, so conclude that there is no endogeneity bias in the OLS estimates; hence there is no need to instrument. Note you can also get this result by typing the following command after the ivreg command:

. ivendog

Tests of endogeneity of: inc1
H0: Regressor is exogenous
  Wu-Hausman F test:               0.75229   F(1,248)    P-value = 0.38659
  Durbin-Wu-Hausman chi-sq test:   0.76211   Chi-sq(1)   P-value = 0.38267
The first test is simply the square of the t value on uhat in the last regression (since t² = F). N.B. This test is only as good as the instruments used and is only valid asymptotically. This may be a problem in small samples, so you should generally use this test only with sample sizes well above 100.
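A small usage note (a software point rather than part of the original example): ivendog is a user-written command; if the IV regression is instead run with the built-in ivregress 2sls, the equivalent endogeneity tests can be obtained with the post-estimation command below.

. estat endogenous    /* Durbin and Wu-Hausman tests after ivregress 2sls */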