Dummy Variables and Omitted Variable Bias

DEPARTMENT OF ECONOMICS
Unit ECON 12122 Introduction to Econometrics
Notes 5
Dummy Variables and Omitted Variable Bias

These notes provide a summary of the lectures. They are not a complete account of the unit material. You should also consult the reading as given in the unit outline and the lectures.

1. Dummy Variables

Most explanatory variables come from random samples of economic agents or from aggregate national data produced by statistical offices such as the Office for National Statistics. In some circumstances it is convenient to invent an explanatory variable. These invented variables take the value zero or one (and no other value). They are useful if it is believed that some outside (usually non-economic) occurrence affected the relationship which is being investigated. Suppose, for instance, that we have the following simple regression model:

$$y_t = \beta_0 + \beta_1 x_t + u_t, \qquad t = 1, 2, \ldots, T$$

Suppose that for two periods (t = 14 and 15, say) there was an outside event which reduced the dependent variable $y_t$ below its normal level. In time series samples this could be the result of war, famine, bad weather or strikes. In cross-section samples dummy variables are also used to describe demographic characteristics (e.g. married/unmarried) or other features of the agents in the sample that can be measured in a 0/1 manner (e.g. educational attainment). In our example, which uses a time series sample, the two outlying observations may unduly influence the least squares estimates of $\beta_0$ and $\beta_1$. To prevent this, a dummy variable can be included in the model. Define $D_t$ in the following way:

$$D_t = \begin{cases} 0, & t = 1, 2, \ldots, 13, 16, \ldots, T \\ 1, & t = 14, 15 \end{cases}$$

The dummy variable $D_t$ is then included in the regression model:

$$y_t = \beta_0 + \beta_1 x_t + \beta_2 D_t + u_t, \qquad t = 1, 2, \ldots, T$$
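As a rough illustration of how such a dummy variable might be built and included in a regression, the sketch below uses simulated data. Everything numerical in it is an assumption made for the illustration (the sample size of 30, the parameter values, the random seed); only the definition of $D_t$ follows the notes. It uses NumPy's least squares routine rather than any particular econometrics package.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, illustrative only
T = 30
t = np.arange(1, T + 1)
x = rng.uniform(0.0, 10.0, size=T)
u = rng.normal(0.0, 0.5, size=T)

# Hypothetical "true" parameters chosen for the illustration.
beta0, beta1, beta2 = 2.0, 1.5, -4.0

# D_t = 1 for t = 14, 15 and 0 otherwise.
D = np.isin(t, [14, 15]).astype(float)
y = beta0 + beta1 * x + beta2 * D + u

# Regressor matrices: constant and x, then constant, x and D.
X_without = np.column_stack([np.ones(T), x])
X_with = np.column_stack([np.ones(T), x, D])

b_without, *_ = np.linalg.lstsq(X_without, y, rcond=None)
b_with, *_ = np.linalg.lstsq(X_with, y, rcond=None)

resid_with = y - X_with @ b_with
print("without dummy:", b_without)
print("with dummy:   ", b_with)
print("intercept for t = 14, 15:", b_with[0] + b_with[2])
```

Comparing the two sets of estimates gives a feel for how much the two outlying periods pull the estimates of $\beta_0$ and $\beta_1$ when the dummy is left out.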


Including $D_t$ has the following effect. For the periods when $D_t = 0$, we have the original model

$$y_t = \beta_0 + \beta_1 x_t + u_t, \qquad t = 1, 2, \ldots, 13, 16, \ldots, T$$

For the periods when $D_t = 1$, the model becomes

$$y_t = (\beta_0 + \beta_2) + \beta_1 x_t + u_t, \qquad t = 14, 15$$

The constant term has now changed and is now $(\beta_0 + \beta_2)$. It is also possible to use dummy variables to allow slope coefficients to change. Define a new dummy variable $xD_t$ in the following way:

$$xD_t = \begin{cases} 0, & t = 1, 2, \ldots, 13, 16, \ldots, T \\ x_t, & t = 14, 15 \end{cases}$$

The model now becomes

$$y_t = \beta_0 + \beta_1 x_t + \beta_2 xD_t + u_t, \qquad t = 1, 2, \ldots, T$$

Including $xD_t$ has the following effect. For the periods when $xD_t = 0$, we have the original model

$$y_t = \beta_0 + \beta_1 x_t + u_t, \qquad t = 1, 2, \ldots, 13, 16, \ldots, T$$

For the periods when $xD_t = x_t$, the model becomes

$$y_t = \beta_0 + (\beta_1 + \beta_2) x_t + u_t, \qquad t = 14, 15$$

The slope has now changed and is now $(\beta_1 + \beta_2)$. It is logically possible to include, at the same time, both a dummy variable which allows the intercept to change and one which allows the slope to change. In practice, extraneous events such as wars and famines are regarded as more likely to affect the intercept than the slope.

Dummy variables can also be used to “take out” seasonal effects with quarterly or monthly data. Notice that with one single event, as in the example above, there are two states (e.g. no war and war). To deal with two states, one dummy variable is needed. If dummy variables are being used to deal with four quarterly seasons, three dummy variables will be required (one fewer than the number of states).
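The rule of one fewer dummy than states can be illustrated with a small sketch. It assumes, as in the models above, that the regression contains a constant term: a full set of four quarterly dummies would then add up to the constant column in every period, so one season has to be left out as the base. The function name `seasonal_dummies`, the eight-quarter sample and the choice of the first quarter as the base are assumptions made for the illustration.

```python
# Quarter of each observation in two years of quarterly data.
quarters = [1, 2, 3, 4, 1, 2, 3, 4]

def seasonal_dummies(quarters, base=1):
    """Return one 0/1 column per quarter except the base quarter."""
    kept = [q for q in (1, 2, 3, 4) if q != base]
    return {q: [1 if obs == q else 0 for obs in quarters] for q in kept}

dummies = seasonal_dummies(quarters)
for q, column in dummies.items():
    print(f"Q{q}:", column)

# With all four dummies, each row would sum to 1, duplicating the constant.
all_four = [[1 if obs == q else 0 for q in (1, 2, 3, 4)] for obs in quarters]
print([sum(row) for row in all_four])
```

With the first quarter as the base, the coefficient on each remaining dummy would be read as the shift in the intercept relative to the first quarter.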


An Example:

We can apply the dummy variable technique to the model and the data of wages and unemployment used in Exercises 5 and 6. Recall that on the scatter diagram of $dw_t$ on $un_t$ for the sample of annual UK data 1974-1995, there were outliers associated with years of high rates of retail price inflation. The two particularly obvious years were 1975 and 1980. These years had the highest rates of increase of wages (27.5 and 21.9 per cent respectively) and the highest rates of price inflation (19.4 and 15.3 per cent respectively). It is possible to argue that the high rates of increase of wages in these two years were more to do with political factors than economic ones. Thus we could use a dummy variable to model the effect of these two years. We define the dummy variable in the following way:

$$D_t = \begin{cases} 0, & t = 1974, 1976, \ldots, 1979, 1981, \ldots, 1995 \\ 1, & t = 1975, 1980 \end{cases}$$

and include it in the regression. The least squares estimates (using the Stata computer programme) are

$$dw_t = \underset{(2.010)}{19.480} - \underset{(0.242)}{1.709}\, un_t + \underset{(2.392)}{13.272}\, D_t + v_t \qquad (1)$$

Standard errors in brackets, $R^2 = 0.871$, $s = 2.987$, $F = 63.97$, $T = 22$; $v_t$ is the least squares residual.

These results show that for the years in which the dummy variable is zero, the estimated intercept is 19.48. For 1975 and 1980 the estimate of the intercept is 19.48 + 13.272 = 32.752. The value of the slope coefficient is also higher than when the dummy variable was not included. A t test shows that the dummy variable is significant. The t statistic is

$$13.272 / 2.392 = 5.548$$

This statistic has a t distribution with 19 degrees of freedom under the null hypothesis that the coefficient of the dummy variable is zero. The 95% critical value (two-sided test at the 5% level) is 2.093. Since 5.548 exceeds 2.093, the null can be rejected. Thus the dummy variable is significant.
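The arithmetic of this t test can be reproduced in a few lines. The sketch below only uses figures quoted in the notes for equation (1); the critical value is typed in as given rather than computed, and the variable names are illustrative.

```python
# Figures reported for equation (1) in the notes.
coef_dummy = 13.272
se_dummy = 2.392
T, k = 22, 3                 # observations, estimated coefficients
critical_value = 2.093       # value for 19 d.f., as quoted in the notes

df = T - k                   # degrees of freedom
t_stat = coef_dummy / se_dummy
reject = abs(t_stat) > critical_value

print(f"degrees of freedom = {df}")
print(f"t statistic = {t_stat:.3f}")
print(f"reject H0 (coefficient is zero): {reject}")
```

The degrees of freedom are the 22 observations less the three estimated coefficients (the intercept, the unemployment coefficient and the dummy coefficient).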


2. Omitted Variable Bias

So far we have assumed that the linear regression model is the correct specification of the relationship between the dependent and the explanatory variables. But suppose it is not. The model is then termed “mis-specified”. There are a number of kinds of mis-specification, and each kind has different consequences for estimation and hypothesis testing. Here we will deal with one of the most common kinds: the omission of an explanatory variable.

The linear regression model can, of course, contain a number of explanatory variables. The number is limited by the number of observations. In general, the number of observations should be several times greater than the number of explanatory variables. Nevertheless, it is still possible for an explanatory variable to be omitted, either because its influence on the dependent variable is unknown or because it is difficult or impossible to find data on such a variable. We are interested in the consequences of this omission for the least squares estimators. We will take the simplest possible case. Suppose the true model is

$$y_t = a_1 x_t + a_2 z_t + u_t, \qquad t = 1, 2, \ldots, T \qquad (2)$$

where $u_t$ is an unobserved random variable with $E(u_t \mid x_t, z_t) = 0$. However, for whatever reason, the second explanatory variable $z_t$ is omitted and the econometrician assumes that the correct model is

$$y_t = a_1 x_t + u_t, \qquad t = 1, 2, \ldots, T \qquad (3)$$

The least squares estimate of $a_1$ will be

$$\hat{a}_1 = \frac{\sum x_t y_t}{\sum x_t^2} \qquad (4)$$

What are the properties of $\hat{a}_1$? As before (see Notes 3), we take the expression for $\hat{a}_1$ in (4) and substitute for $y_t$. On this occasion we substitute not from the false model (3), but from the true model (2). Thus

$$\hat{a}_1 = \frac{\sum x_t (a_1 x_t + a_2 z_t + u_t)}{\sum x_t^2} = a_1 + \frac{a_2 \sum x_t z_t}{\sum x_t^2} + \frac{\sum x_t u_t}{\sum x_t^2} \qquad (5)$$

We now wish to examine whether $\hat{a}_1$ is biased or unbiased. To do this we take expectations of (5):

$$E(\hat{a}_1) = E(a_1) + E\left(\frac{a_2 \sum x_t z_t}{\sum x_t^2}\right) + E\left(\frac{\sum x_t u_t}{\sum x_t^2}\right)$$

We will consider the right-hand side of this equation term by term. Starting with the simplest,

$$E(a_1) = a_1$$

because $a_1$ is a constant. Next we will consider the last term:

$$E\left(\frac{\sum x_t u_t}{\sum x_t^2}\right) = \frac{\sum x_t E(u_t)}{\sum x_t^2} = 0$$

because $E(u_t \mid x_t, z_t) = 0$. Thus

$$E(\hat{a}_1) = a_1 + E\left(\frac{a_2 \sum x_t z_t}{\sum x_t^2}\right)$$

However, in general,

$$E\left(\frac{a_2 \sum x_t z_t}{\sum x_t^2}\right) = \frac{a_2 \sum x_t z_t}{\sum x_t^2} \neq 0$$

Thus

$$E(\hat{a}_1) = a_1 + \frac{a_2 \sum x_t z_t}{\sum x_t^2} \neq a_1 \qquad (6)$$

The least squares estimator of $a_1$ is therefore biased. Rearranging (6), the bias is given by

$$E(\hat{a}_1) - a_1 = \frac{a_2 \sum x_t z_t}{\sum x_t^2} \qquad (7)$$
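The bias expression in (7) can be checked numerically with a small simulation. The sketch below is only an illustration under stated assumptions: the values of $a_1$ and $a_2$, the sample size, the way $z_t$ is generated and the number of replications are all invented for the purpose, and the regressors are held fixed across replications so that only the disturbances $u_t$ are redrawn.

```python
import random

random.seed(1)                      # fixed seed, illustrative only
T = 50
a1, a2 = 2.0, 3.0                   # hypothetical true coefficients

# Fixed regressors; z is built to be positively related to x.
x = [random.gauss(0.0, 1.0) for _ in range(T)]
z = [0.5 * xt + random.gauss(0.0, 1.0) for xt in x]

sum_xx = sum(xt * xt for xt in x)
sum_xz = sum(xt * zt for xt, zt in zip(x, z))
bias_formula = a2 * sum_xz / sum_xx          # right-hand side of (7)

def estimate_once():
    """Draw new disturbances, build y from the true model (2),
    and estimate a1 from the mis-specified model (3) using (4)."""
    u = [random.gauss(0.0, 1.0) for _ in range(T)]
    y = [a1 * xt + a2 * zt + ut for xt, zt, ut in zip(x, z, u)]
    return sum(xt * yt for xt, yt in zip(x, y)) / sum_xx

replications = 5000
mean_a1_hat = sum(estimate_once() for _ in range(replications)) / replications

print(f"bias from formula (7):   {bias_formula:.4f}")
print(f"average a1_hat minus a1: {mean_a1_hat - a1:.4f}")
```

The two printed numbers should be close to each other, since the average of $\hat{a}_1$ over many samples approximates $E(\hat{a}_1)$.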

It is possible that the right hand side (RHS) of (7) could be zero, in which case $\hat{a}_1$ is unbiased. In other cases $\hat{a}_1$ will be biased, but we can deduce the sign of the bias. To investigate these possibilities, it is necessary to look closely at the three components of the RHS of (7).

$a_2$ - this is the coefficient of the omitted explanatory variable $z_t$. If $a_2$ is zero, then $z_t$ is correctly omitted and (3) becomes the true model. In these circumstances $\hat{a}_1$ is unbiased. Whenever $a_2 \neq 0$, however, $\hat{a}_1$ will be biased unless $\sum x_t z_t$ is also zero.


$\sum x_t z_t$ - this measures the sample covariance between $x_t$ and $z_t$. If this covariance (or correlation) is zero, then $\hat{a}_1$ is unbiased. Thus the omission of an explanatory variable which is not correlated with the included explanatory variables will not bias the least squares estimators.

$\sum x_t^2$ - this measures the sample variance of the included explanatory variable. It is always positive.

Thus the sign of the bias of $\hat{a}_1$ is given by the signs of $a_2$ and $\sum x_t z_t$. If $a_2$ and $\sum x_t z_t$ have the same sign, the bias will be positive: on average $\hat{a}_1$ will be too high. If they have opposite signs, on average $\hat{a}_1$ will be too low. Sometimes it is possible to know the signs of $a_2$ and $\sum x_t z_t$, and thus know the sign of the omitted variable bias.

An example:

Returning to the example of the change in wages and unemployment considered above, the original simple regression model was

$$dw_t = a_0 + a_1 un_t + u_t, \qquad t = 1, 2, \ldots, T \qquad (8)$$

where $dw_t$ is the percentage change in wages in manufacturing, $un_t$ is the percentage unemployment rate and $u_t$ is an unobserved random disturbance, using a sample of UK annual data 1974-1995. The least squares estimate of $a_1$ implies a comparatively large negative effect of unemployment on wages. This estimate may be partly the consequence of a few outlying observations when the rate of inflation was particularly high. Rather than use a dummy variable to account for the years 1975 and 1980, the rate of inflation may be an omitted explanatory variable in the model. To investigate this possibility, the least squares estimates of the model with inflation included were computed as follows:

$$dw_t = \underset{(5.867)}{2.498} - \underset{(0.501)}{0.509}\, un_t + \underset{(0.292)}{1.175}\, dp_t + e_t \qquad (9)$$

Standard errors in brackets, $R^2 = 0.818$, $s = 3.547$, $F = 42.63$, $T = 22$, where $dp_t$ is the percentage change in retail prices and $e_t$ is the least squares residual.

We know that inflation fell during this period and that unemployment rose, so the covariance (or correlation) between the omitted and the included explanatory variables ($\sum x_t z_t$) is negative in this sample. Assuming that the coefficient of the omitted variable is positive (the least squares estimate is positive and we would expect the coefficient to be positive for theoretical reasons as well), the sign of the omitted variable bias is, in this example, negative. The consequence of this bias is that the estimated coefficient of unemployment in (8) will be smaller than it should be. You will be able to confirm that the least squares estimate of $a_1$ in (8) is smaller (a larger negative number) than −0.509. The comparatively large negative effect of unemployment on wages which was found by estimating (8) is now considerably reduced.

In addition, the estimated coefficient of unemployment (unlike that of inflation) is not significant in (9). We can tell this by calculating the absolute value of its t ratio:

$$0.509 / 0.501 = 1.016$$

This has a t distribution with 19 degrees of freedom. We cannot reject the null hypothesis that the coefficient of unemployment is zero (for the 95% critical value see above). Notice that the standard error here (0.501) is comparatively large. The 95% confidence interval goes from −1.56 to 0.54. This includes some relatively large negative numbers as well as some smaller positive ones. It is still possible that there may be a fairly substantial effect of unemployment on percentage wage changes. It would be unwise to conclude from these estimates that there is no unemployment effect (see Tests of Significance in Notes 4).

It is interesting to compare the results in (9) with those when the dummy variable is included, as in (1). Equation (1) has the higher $R^2$ and thus (1) is the better “fit” to the data. The estimates of the unemployment coefficient differ by a fairly wide margin (−1.709 against −0.509). Such a difference could have important policy implications. Which is the better equation will depend on a number of factors, such as the pattern of the residuals, which we have not discussed. However, it would seem that there are good theoretical reasons for including retail price inflation in the regression. It appears to capture whatever political effects occurred in 1975 and 1980 (the residuals in (9) for 1975 and 1980 are not particularly large). Where an economic explanatory variable is available, it is to be preferred to the “ad hoc” dummy variable. Dummy variables should really be reserved for events which are completely non-economic. Thus I would prefer the estimates in (9) to those in (1).

One problem with (9) concerns causation. Was price inflation driving up wages in this period, or were wage changes driving up prices? Or were both variables influencing each other? The answers to these and other questions will be dealt with in subsequent units in Econometrics.
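The t ratio and the 95% confidence interval quoted above for the unemployment coefficient in (9) can be reproduced from the reported estimate and standard error. The sketch below uses only figures given in the notes; the critical value of 2.093 for 19 degrees of freedom is typed in as quoted, and the variable names are illustrative.

```python
# Figures reported for equation (9) in the notes.
coef_un = -0.509
se_un = 0.501
critical_value = 2.093     # 19 d.f., as quoted earlier in the notes

t_ratio = abs(coef_un) / se_un
lower = coef_un - critical_value * se_un
upper = coef_un + critical_value * se_un

print(f"absolute t ratio = {t_ratio:.3f}")
print(f"95% confidence interval = ({lower:.2f}, {upper:.2f})")
print(f"significant at the 5% level: {t_ratio > critical_value}")
```

Because the interval contains zero, the test does not reject a zero coefficient, but the interval also shows how wide the range of plausible values is.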

David Winter April 2000
