On Adaptive Linear Regression

Arnab Maity and Michael Sherman
Department of Statistics, Texas A&M University, TAMU 3143, College Station, TX 77843-3143
[email protected] and [email protected]

Abstract

Ordinary Least Squares (OLS) is omnipresent in regression modeling. Occasionally Least Absolute Deviations (LAD) or another method is used as an alternative when there are outliers. Although some data adaptive estimators have been proposed, they are typically difficult to implement. In this note, we propose an easy-to-compute adaptive estimator which is simply a linear combination of OLS and LAD. We demonstrate large sample normality of our estimator and show that its performance is close to best for both light tailed (e.g., normal and uniform) and heavy tailed (e.g., double exponential and $t_3$) error distributions. We demonstrate this through three simulation studies and illustrate our method on state public expenditure and luteinizing hormone data sets. We conclude that our method is general and easy to use, and gives good efficiency across a wide range of error distributions.

Some Key Words: Adaptive regression, Heavy tailed error, Least absolute deviation regression, Mean squared error, Ordinary least squares regression.

1 Introduction

Linear regression is a conceptually simple yet very effective tool to explore linear relationships between a response variable and a set of explanatory variables. Generally, the model that relates the responses to the predictors is given by $y_i = x_i\beta + \epsilon_i$, for $i = 1, \ldots, n$, where $y_i$ denotes the response variable for the $i$th sample observation, $x_i$ is a vector of $k$ predictors and $\beta$ denotes their corresponding effects. The unobserved error terms, denoted by $\epsilon_i$, are typically assumed to have mean zero. The primary goal is typically to estimate the effects, $\beta$, and to assess their significance. One popular method of estimation is via ordinary least squares (OLS), where one minimizes the sum of squared errors. The estimate is then given by
$$\hat{\beta}_{ols} = \arg\min_\beta \sum_{i=1}^{n} (y_i - x_i\beta)^2.$$
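For concreteness, here is a minimal numpy sketch of this least squares fit; the function name is ours and is not part of the paper.

```python
# A minimal sketch of the OLS fit defined above, using numpy only.
# The n x k design matrix X and length-n response y are assumed given;
# np.linalg.lstsq returns the minimizer of the residual sum of squares.
import numpy as np

def ols_fit(X, y):
    """Return argmin_b sum_i (y_i - x_i b)^2."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat
```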

For this method to perform efficiently, however, some assumptions on the underlying distribution of the error terms need to hold; namely, that the $\epsilon_i$'s have expectation zero and are generated from a Gaussian distribution. Hence, one drawback of OLS is that it can perform poorly if the underlying error distribution is heavy tailed or if the data are prone to outliers.

To see this point, consider the state public expenditures data, available at the Data and Story Library (DASL). We take "Per capita state and local public expenditures in dollars" (EX) to be the response variable and "Percent change in population, 1950-1960" (GROW) as the predictor variable. Figure 1 displays the scatterplot of the two variables with the fitted least squares line, and Figure 2 shows the plot of residuals against the predictor variable. It is evident from the plot that there are three observations with very high x-values compared to the others. These observations are very likely to affect the least squares estimate of the effect of population growth on per capita expenditure. The residual plot leads to the same conclusion: there are potential outliers with high residual values, which may indicate heavy tailed behavior in the error distribution. This phenomenon can greatly affect the estimation and performance of the ordinary least squares regression estimators.

This kind of situation motivates the use of alternative regression methods such as Least Absolute Deviation (LAD) regression. In this procedure, one minimizes the sum of the absolute deviations of the errors to get the estimate
$$\hat{\beta}_{lad} = \arg\min_\beta \sum_{i=1}^{n} |y_i - x_i\beta|.$$
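Unlike OLS, the LAD criterion has no closed-form minimizer. As an illustration only (not the authors' implementation), one simple way to compute an approximate minimizer is iteratively reweighted least squares; a dedicated linear programming or quantile regression routine would typically be used in practice.

```python
# A sketch of LAD via iteratively reweighted least squares (IRLS).
# Names and tolerances are ours; any standard L1/quantile-regression
# solver could be substituted.
import numpy as np

def lad_fit(X, y, n_iter=100, delta=1e-8):
    """Approximate argmin_b sum_i |y_i - x_i b|."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # start at the OLS fit
    for _ in range(n_iter):
        r = y - X @ beta
        w = 1.0 / np.maximum(np.abs(r), delta)    # IRLS weights 1/|r_i|
        beta_new = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))
        if np.max(np.abs(beta_new - beta)) < 1e-10:
            break
        beta = beta_new
    return beta
```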

One major advantage of this procedure lies in the fact that the regression line is not affected drastically by the presence of a few outliers or influential observations. A similar kind of problem occurs when the error distribution is heavy tailed, in which case LAD generally performs better than OLS. Hence, in the presence of heavy tailed errors, least absolute deviation regression is often preferable to OLS. If, however, the errors are in fact generated from a Gaussian or lighter tailed distribution, LAD performs much worse than OLS. In practice, given a data set, the choice between OLS and LAD is often not clear and the decision becomes subjective.

In general, the issue of location parameter estimation when the errors are not Gaussian is well studied and many authors have suggested alternatives. For example, McDonald and Newey (1988) propose minimizing a data adaptive parametric loss function where the parameter is chosen based on the data; this amounts to the errors coming from a parametric family of distributions, and they considered a generalized t distribution as the error distribution. Newey (1988) considered using the generalized method of moments for adaptive regression. Harter (1974-1976) and Sposito (1990) suggested using $L_p$-estimators, where one minimizes the $L_p$ norm; $p = 1$ and $p = 2$ correspond to applying LAD and OLS, respectively. Sposito (1990), modifying the results of Harter, suggested using a single estimator among the midrange, OLS, LAD, or $L_{1.5}$, based on the sample estimate of kurtosis.

While the methods proposed above often perform well, they are not ready to use in the sense that one must implement them with new software. The method suggested by Sposito (1990) seems to be the simplest, and the extension of his rule to the context of linear regression is obvious. However, it requires implementing a nonstandard $L_p$ regression procedure and, as we will see in Sections 2 and 3, it can be improved upon by our proposal.

In this paper, we aim to provide an estimator which is ready to use by a general practitioner and does not require any complicated estimation procedure. We propose an estimator

that combines $\hat{\beta}_{ols}$ and $\hat{\beta}_{lad}$ such that it retains the efficiency of the LAD estimator in the presence of outliers or heavy tailed error distributions but does not lose much efficiency when the conditions of OLS are actually satisfied.

In Section 2 we discuss the construction of our estimator as an affine combination of $\hat{\beta}_{ols}$ and $\hat{\beta}_{lad}$, where the weights are chosen adaptively from the data. We will see that implementation of our estimator is straightforward, requires no new software, and hence is completely automatic. Moreover, the computational complexity of the proposed estimator is no larger than that of LAD. We also investigate the asymptotic properties of our estimator and provide plug-in and resampling based estimators of its mean squared error. In Section 3 we present simulation results comparing the performance of our estimator to OLS and LAD for several error distributions and sample sizes. We will see that our estimator is very close to the more efficient of LAD and OLS in all cases, making it a general prescription for linear regression problems, even for moderate sample sizes.

We note that our goal is to obtain an estimator with low MSE across a broad range of error distributions. No robustness is claimed in terms of breakdown points or size of singular sets; see, e.g., Ellis (1998) for an assessment of the robustness of OLS, LAD, and least median of squares regression via their singular sets. For general robust regression techniques, see, e.g., Huber (1990), Rousseeuw and Leroy (1987), and Maronna, Martin and Yohai (2006). The LAD estimator, for example, is often motivated by its relative robustness, as is the Sposito proposal. We compare the MSEs of these estimators with our proposal in Section 3. In Section 3.3 we examine the performance of our estimator when the data are generated from an AR(1) process with different error distributions and different strengths of correlation. The results show that the same properties hold in this case as well, even for moderate sample sizes. Finally, in Section 4, we revisit the state public expenditures data and also give a real-life correlated data example to illustrate the practical application of our method.


2 Methodology

Suppose we observe data $(y_i, x_i)$ for $i = 1, \ldots, n$ individuals. We model the relation between the variables as $y_i = x_i\beta + \epsilon_i$, where we assume only that the random errors $\epsilon_i$ are generated from a probability distribution, $f(\epsilon)$, which is symmetric about zero. This underlying error distribution may be heavy tailed. It is well known that when the error distribution is Normal$(0, \sigma_0^2)$, OLS is most efficient, with
$$n^{1/2}(\hat{\beta}_{ols} - \beta) \to \mathrm{Normal}(0, \sigma_0^2 V^{-1}), \qquad \text{where } V = \lim_{n\to\infty} n^{-1}\sum_{i=1}^{n} x_i^T x_i.$$
For the LAD estimator, Pollard (1991) showed that
$$n^{1/2}(\hat{\beta}_{lad} - \beta) \to \mathrm{Normal}\big[0, \{2f(0)\}^{-2} V^{-1}\big].$$
It is readily seen that if $f(\epsilon)$ is the double exponential distribution, then LAD is more efficient than OLS. In practical situations, for a given data set, it is often not clear whether the error distribution is Gaussian or not. Thus the choice between LAD and OLS remains subjective to the practitioner, and the efficiency of the estimator can depend heavily on this choice.

The issue of the adequacy of OLS under heavy tailed distributions has been well studied by several authors. As an alternative to OLS, robust regression methods such as $L_p$-regression procedures can be used, where one solves the minimization problem
$$\min_\beta \sum_{i=1}^{n} |y_i - x_i\beta|^p.$$
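To make the double exponential comparison above concrete, here is the standard calculation (our own working from the two limiting variances just stated). For the Laplace density $f(\epsilon) = (2b)^{-1}e^{-|\epsilon|/b}$, one has
$$f(0) = \frac{1}{2b}, \qquad \sigma_0^2 = 2b^2, \qquad \{2f(0)\}^{-2} = b^2 = \tfrac{1}{2}\,\sigma_0^2,$$
so the asymptotic variance of LAD is half that of OLS in this case.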

Note that LAD and OLS correspond to $p = 1$ and $p = 2$, respectively. In general, for location parameter estimation, the properties of $L_p$-estimators depend heavily on the error distribution (see, e.g., Rice and White, 1964; Ekblom and Henriksson, 1969). Therefore the choice of $p$ becomes an issue of great importance. Harter (1974-1976) and later Sposito (1990), modifying the results of Harter, proposed the following choice of $p$ based on the coefficient of kurtosis, $\kappa$, of the error distribution:

1. Use $L_\infty$ (the sample midrange) if $\kappa < 2.2$.
2. Use the least squares estimator if $2.2 \le \kappa \le 3$.
3. Use $p = 1.5$ if $3 < \kappa < 6$.
4. Use $p = 1$ if $\kappa \ge 6$.

Sposito (1990) provides a simulation study displaying the performance of sample point estimates of kurtosis for different sample sizes and distributions. For moderate sample sizes such as $n = 30$ and $50$, the sample kurtosis underestimates the true value for heavy tailed distributions such as the Laplace distribution. To see this phenomenon clearly, we conducted a simulation study with sample sizes $n = 20$, $50$ and $100$ for different distributions, based on 10,000 simulations. The results are displayed in Table 1. Take, for example, $n = 20$. For the double exponential distribution, the sample kurtosis is less than the population value of 6 in 90% of the data sets; following Sposito's choice, one will end up choosing the "wrong" value of $p$ 90% of the time. On the other hand, when the distribution is Gaussian, the sample kurtosis overestimates the actual kurtosis 27% of the time. (A code sketch of the selection rule follows.)
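As an illustrative sketch (function names ours), the rule can be written directly in terms of the non-excess sample kurtosis $m_4/m_2^2$, for which the normal value is 3:

```python
# A sketch of the Harter/Sposito rule quoted above.
import numpy as np

def sample_kurtosis(e):
    """Non-excess sample kurtosis m4 / m2^2 of a sample e."""
    e = np.asarray(e, dtype=float)
    e = e - e.mean()
    return np.mean(e**4) / np.mean(e**2)**2

def sposito_p(kappa):
    """Sposito's suggested exponent p (np.inf denotes the midrange)."""
    if kappa < 2.2:
        return np.inf   # L-infinity: sample midrange
    if kappa <= 3.0:
        return 2.0      # least squares
    if kappa < 6.0:
        return 1.5
    return 1.0          # least absolute deviations
```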

2.1 Adaptive Estimation

The heart of the problem discussed above lies in the fact that Sposito's rule dictates choosing only one estimator based on the estimated kurtosis. For moderately light tailed distributions this may not pose a serious problem, but for heavy tailed errors the issue is serious. To remedy the effect of the poor performance of the sample estimator of kurtosis, one natural course of action is to combine the estimators in the form of a weighted average, instead of considering a single estimator. We propose the data adaptive weighted estimator
$$\hat{\beta}_{alr} = w\hat{\beta}_{ols} + (1 - w)\hat{\beta}_{lad},$$
where the weight $w$ is chosen so that it reflects the nature of the error distribution. Ideally, we wish to give more weight to OLS for light tailed distributions, while for heavy tailed distributions LAD should get more weight. To assess the nature of the error distribution, we use the residuals from OLS and LAD and look at their sample kurtosis. We estimate the kurtosis by the average kurtosis of the two sets of residuals:
$$\kappa = \{\text{kurtosis(residuals from OLS)} + \text{kurtosis(residuals from LAD)}\}/2.$$
We then form smoothly data adaptive weights as
$$w(\kappa) = \begin{cases} 1, & \text{if } \kappa \le 3 \\ 3/\kappa, & \text{if } \kappa > 3. \end{cases} \qquad (1)$$

Notice that we use OLS for distributions with lighter tails than the Gaussian distribution (kurtosis = 3) but give more weight to LAD as the kurtosis increases. We will see in Section 3 that $\hat{\beta}_{alr}$ performs nearly as well as OLS for light tailed distributions but performs close to LAD when the errors come from a heavy tailed distribution.

Remark 1. It is important to note that the weight $w(\kappa)$ is chosen automatically from the data, and hence the practitioner does not have to make any subjective decision about which method to choose, making our estimator completely data driven and automatic. Also, to compute our estimator one has only to run OLS and LAD once, and hence the computational complexity of our estimator is no larger than that of LAD. (A sketch of the computation appears after Remark 2.)

Remark 2. Instead of using $w(\kappa)$, as defined in (1), one may use different weight functions. Two such examples are
$$w_1(\kappa) = \begin{cases} 1, & \text{if } \kappa \le 3 \\ (3/\kappa)^2, & \text{if } \kappa > 3; \end{cases} \qquad w_2(\kappa) = \begin{cases} 1, & \text{if } \kappa \le 3 \\ (3/\kappa)^{1/2}, & \text{if } \kappa > 3, \end{cases} \qquad (2)$$
see Figure 3 for a graph of these functions. We compare the performance of the three weight functions, $w(\kappa)$, $w_1(\kappa)$ and $w_2(\kappa)$, in Section 3.
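A sketch of the full procedure, reusing the `ols_fit`, `lad_fit` and `sample_kurtosis` sketches above (all names ours):

```python
# The adaptive estimator: run OLS and LAD once, average the kurtosis of
# the two residual vectors, and blend the fits with the weight in (1).
# Setting power=2 or power=0.5 gives the alternatives w1 and w2 in (2).
import numpy as np

def alr_weight(kappa, power=1.0):
    return 1.0 if kappa <= 3.0 else (3.0 / kappa) ** power

def alr_fit(X, y, power=1.0):
    """Return (beta_alr, w, kappa) using the weight w(kappa) of (1)."""
    b_ols = ols_fit(X, y)
    b_lad = lad_fit(X, y)
    kappa = 0.5 * (sample_kurtosis(y - X @ b_ols) +
                   sample_kurtosis(y - X @ b_lad))
    w = alr_weight(kappa, power)
    return w * b_ols + (1.0 - w) * b_lad, w, kappa
```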

2.2 Asymptotic Theory

We now investigate the asymptotic properties of our estimator. Note that the estimated kurtosis $\kappa \to \kappa_0$, the actual kurtosis, as $n \to \infty$, and hence $w(\kappa) \to w(\kappa_0)$. We therefore present our results for a fixed weight $w_0 = w(\kappa_0)$. We use well established results on OLS and the results of Phillips (1991) concerning the LAD estimator to derive the following result.

Result 1. Assume that the errors $\epsilon_i$ are i.i.d. and the density $f(\cdot)$ is symmetric about zero and analytic at zero. Then
$$n^{1/2}(\hat{\beta}_{alr} - \beta_0) = V^{-1} n^{-1/2} \sum_{i=1}^{n} x_i \left[w_0\epsilon_i + \mathrm{sgn}(\epsilon_i)(1 - w_0)/\{2f(0)\}\right] + o_p(1) \to \mathrm{Normal}(0, V_{alr})$$
as $n \to \infty$, where $\mathrm{sgn}(a) = a/|a|$ and
$$V_{alr} = V^{-1}\left[w_0^2\sigma_0^2 + (1 - w_0)^2/\{2f(0)\}^2 + w_0(1 - w_0)E(|\epsilon|)/f(0)\right].$$
A proof of this result is given in the Appendix.

Remark 3. It may seem odd that Sposito's rule is deficient due to the difficulty of estimating the actual kurtosis while our proposal also uses the sample kurtosis. The difference is that Sposito chooses a single estimator based on the sample kurtosis, while our estimator merely adjusts the relative weights on the OLS and LAD estimators. The numerical consequences of this can be seen in Section 3.

2.3 MSE Estimation

Here we discuss estimation of the mean squared error of our estimator. Using Result 1, one can form plug-in estimators of the MSE of $\hat{\beta}_{alr}$ by substituting sample based estimates for the population parameters in the expression for $V_{alr}$. For example, one can estimate $V$ by $\hat{V} = n^{-1}\sum_{i=1}^{n} x_i^T x_i$, $\sigma_0^2$ by the sample variance of the estimated residuals, and $E(|\epsilon|)$ by the sample mean of the absolute residuals. Estimation of $f(0)$, however, is difficult. Many authors have discussed this important problem, as it is closely related to estimating the variance of the LAD estimator; for an overview of several estimators see, e.g., Furno (1998). For example, one can estimate $f(0)$ using $1.4826\,(\mathrm{median}_i\,|\hat{e}_i - \mathrm{median}_j\,\hat{e}_j|)$ (see equation (5) of Furno, 1998), where $\hat{e}$ denotes the residuals.

As an alternative, to estimate $V_{alr}$ one can adopt resampling methods that are commonly used for LAD. This can be done by bootstrapping pairs (covariates and responses together) and calculating the bootstrap estimate of variance. Specifically, one proceeds as follows (a code sketch follows the list):

• Form the response-covariate pairs $(y_i, x_i)$, $i = 1, \ldots, n$;
• Resample from the response-covariate pairs and denote the bootstrap resample by $(y_i^*, x_i^*)$, $i = 1, \ldots, n$;
• Form $\hat{\beta}_{alr}^*$ using the resample;
• Repeat the steps above $B$ times and estimate $V_{alr}$ using the sample MSE of the $B$ replicates of $\hat{\beta}_{alr}^*$ about $\hat{\beta}_{alr}$ estimated from the original data.
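A sketch of these resampling steps (seed, `B` and function names are ours; `alr_fit` is the sketch from Section 2.1):

```python
# Pairs bootstrap for the MSE of beta_alr: resample (y_i, x_i) rows with
# replacement, refit, and take the sample MSE of the replicates about
# the estimate from the original data.
import numpy as np

def alr_bootstrap_mse(X, y, B=500, seed=0):
    rng = np.random.default_rng(seed)
    n, k = X.shape
    beta_hat = alr_fit(X, y)[0]
    reps = np.empty((B, k))
    for b in range(B):
        idx = rng.integers(0, n, size=n)   # resample row indices
        reps[b] = alr_fit(X[idx], y[idx])[0]
    dev = reps - beta_hat
    return dev.T @ dev / B                 # k x k bootstrap MSE matrix
```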

For small sample sizes, where approximate normality does not hold, the bootstrap can be used to generate confidence intervals using, for example, Bias Corrected and Accelerated (BCA) intervals, as described in Efron and Tibshirani (1993). In Section 3, we present a simulation study to evaluate the performance of the plug-in method and the resampling method.

3 Simulation Study

3.1 Simulation 1: Simple Linear Regression

In this section, we present simulation results for our estimator. We first consider the model $y_i = x_i\beta + \epsilon_i$, where the $x_i$'s are univariate random variables generated from a Uniform(0, 1) distribution. We take the true value of $\beta$ to be 1.0. The following error distributions are considered for $\epsilon$ (a data-generation sketch follows the list):

1. Uniform(-1, 1)
2. Normal(0, 1)
3. Logistic($\mu = 0$, $\sigma^2 = \pi^2/3$)
4. Double exponential($\mu = 0$, $\sigma^2 = 2$)
5. $t$(df = 3)
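The following sketch (ours) generates one replicate under this design; the parameterizations mirror the list above, and `alr_fit` is the sketch from Section 2.1.

```python
# Error generators for the five distributions above, plus one draw of
# the simple-linear-regression design with beta = 1.
import numpy as np

def draw_errors(dist, n, rng):
    if dist == "uniform":
        return rng.uniform(-1.0, 1.0, n)
    if dist == "normal":
        return rng.standard_normal(n)
    if dist == "logistic":
        return rng.logistic(0.0, 1.0, n)   # variance pi^2 / 3
    if dist == "laplace":
        return rng.laplace(0.0, 1.0, n)    # variance 2
    if dist == "t3":
        return rng.standard_t(3, n)
    raise ValueError(f"unknown distribution: {dist}")

rng = np.random.default_rng(1)
n, beta = 50, 1.0
x = rng.uniform(0.0, 1.0, n)
y = beta * x + draw_errors("t3", n, rng)
beta_alr, w, kappa = alr_fit(x[:, None], y)    # no-intercept design
```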


We consider sample sizes $n = 20$, $50$ and $100$. For each error distribution and sample size we generate 10,000 data sets, compute $\hat{\beta}_{ols}$, $\hat{\beta}_{lad}$ and $\hat{\beta}_{alr}$, and compute $n \times$ (mean squared error) for each of the OLS, LAD and ALR estimators. We also formed an estimator using a slightly modified Harter-Sposito rule:
$$\hat{\beta}_{hs} = 1(\kappa \le 3)\hat{\beta}_{ols} + 1(3 < \kappa < 6)\hat{\beta}_{1.5} + 1(\kappa \ge 6)\hat{\beta}_{lad},$$
where $1(\cdot)$ denotes the indicator function and $\hat{\beta}_{1.5}$ is the $L_p$ regression estimator with $p = 1.5$. The results are displayed in Table 2. It is evident from the results that, in terms of MSE, OLS performs much better than LAD for light tailed distributions, but ALR is very close to OLS in these situations. Also, while HS and ALR are similar, ALR still has a slightly lower MSE. However, for heavy tailed distributions like the Laplace or $t$(df = 3), LAD and ALR perform much better than OLS and HS. It is important to note that ALR performs nearly as well as the best estimator in all cases, even for small sample sizes such as $n = 20$. In fact, it is interesting to note that, for the logistic and $t$(df = 3) distributions, ALR performs slightly better than both OLS and LAD for $n = 50$ and $n = 100$.

3.1.1 MSE Estimation

We now present a simulation study to evaluate the performance of the plug-in estimator and the bootstrap based estimator of the MSE of $\hat{\beta}_{alr}$, as discussed in Section 2.3. We use the same setup as in Section 3.1. For each of the 10,000 generated data sets, we estimate $V_{alr}$ using the plug-in method and the resampling method with $B = 500$, and report the median and median absolute deviation of the estimates across the simulations, along with the MSE of $\hat{\beta}_{alr}$ across all simulations. The results are displayed in Table 3. It is evident that the bootstrap estimator performs better than the plug-in estimator when the errors are generated from heavy tailed distributions. In view of this, we recommend the bootstrap estimator of the MSE if the user can afford the extra computation.

We also form estimators using the weights $w_1(\kappa)$ and $w_2(\kappa)$ from Section 2.1, and compare the resulting MSEs with that of the estimator using $w(\kappa)$. Results are displayed in Table 4. It is evident that $w_1(\kappa)$ performs slightly better for very heavy tailed distributions (4 and 5), while $w_2(\kappa)$ performs slightly better for light tailed distributions (1 and 2), with $w(\kappa)$ in between the two; hence $w(\kappa)$ seems to be the best choice overall.

3.2 Simulation 2: Multiple Linear Regression

Next, we move to multiple regression, where we assume the model $y_i = x_{1i}\beta_1 + x_{2i}\beta_2 + \epsilon_i$, with $x_{1i}$ and $x_{2i}$ generated from independent Uniform(0, 1) distributions. The true values of the parameters are $\beta_1 = \beta_2 = 1$. We measure the performance of OLS, LAD and ALR by comparing $n \times \{\mathrm{MSE}(\hat{\beta}_1) + \mathrm{MSE}(\hat{\beta}_2)\}$ for the three estimators. The results are given in Table 5. It is clear that the conclusions drawn from the previous simulation remain valid for multiple regression: in each case, the ALR estimator performs close to whichever of OLS and LAD is appropriate for that particular distribution.

3.3 Simulation 3: Regression in AR(1) Time Series

In this section, we examine the performance of our estimator when the data are generated from an AR(1) time series. We consider the model
$$y_i = \rho y_{i-1} + \epsilon_i, \qquad (3)$$
where $\epsilon_i$ is generated from a distribution symmetric about zero. Here the parameter of interest is $\rho$. We consider the standard normal and double exponential distributions for $\epsilon$, two sample sizes, $n = 50$ and $n = 100$, and two strengths of dependence, $\rho = 0.2$ and $0.5$. We report the MSE values of the OLS, LAD and ALR estimates in Table 6. OLS, closely followed by ALR, performs better than LAD in the standard normal setup, while LAD and ALR perform better in the double exponential case.
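A sketch of one replicate of this experiment (ours), regressing $y_t$ on $y_{t-1}$ with no intercept as in model (3); the burn-in length and seed are illustrative choices.

```python
# Simulate an AR(1) series under model (3) with double exponential
# errors and estimate rho by ALR (alr_fit is the Section 2.1 sketch).
import numpy as np

def simulate_ar1(rho, n, rng, burn=200):
    e = rng.laplace(0.0, 1.0, n + burn)
    y = np.zeros(n + burn)
    for t in range(1, n + burn):
        y[t] = rho * y[t - 1] + e[t]
    return y[burn:]

rng = np.random.default_rng(2)
y = simulate_ar1(rho=0.5, n=100, rng=rng)
X = y[:-1][:, None]                      # lagged series as predictor
rho_hat, w, kappa = alr_fit(X, y[1:])
```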


Remark 4. In view of the simulation studies presented in Sections 3.1-3.3, we conclude that our ALR estimator performs well. When the errors are generated from light tailed distributions, LAD performs poorly while ALR performs nearly as well as the OLS estimator, losing very little efficiency. In the presence of heavy tailed errors, the ALR estimator performs much better than OLS and nearly as well as the LAD estimator. These results are observed even for a moderate sample size of $n = 20$. We suggest that our ALR estimator provides a general prescription for linear regression in practice.

4 Data Examples

4.1 State Expenditure Data

We now revisit the data example discussed in Section 1. Recall that we take "Per capita state and local public expenditures in dollars" (EX) to be the response variable and "Percent change in population, 1950-1960" (GROW) as the predictor variable. The data set has 48 observations. Figure 1 displays the scatterplot of the two variables and the ordinary least squares regression line. We estimate the parameters by OLS, LAD and ALR, and use $B = 1{,}000$ bootstrap replications to estimate the standard errors of the estimates. The estimates of the effect of GROW and their estimated standard errors are 1.26 and 0.47 for OLS, 2.05 and 0.80 for LAD, and 1.46 and 0.52 for ALR, respectively. The residual kurtosis is estimated as 3.97, so $w(3.97) = 0.76$ and the ALR estimate is closer to that of OLS than to that of LAD.
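As a hypothetical usage sketch (the data themselves must be obtained from the DASL source; the array names EX and GROW are ours, and the intercept column is our assumption of an ordinary simple regression), the analysis reduces to a few lines given the earlier sketches:

```python
# Hypothetical analysis of the state expenditure data, assuming EX and
# GROW have been read into numpy arrays of length 48. Reuses alr_fit
# and alr_bootstrap_mse from Sections 2.1 and 2.3.
import numpy as np

def analyze_state_data(EX, GROW, B=1000):
    X = np.column_stack([np.ones_like(GROW), GROW])
    beta, w, kappa = alr_fit(X, EX)
    se = np.sqrt(np.diag(alr_bootstrap_mse(X, EX, B=B)))
    return beta, se, w, kappa   # slope estimate is beta[1], SE is se[1]
```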

4.2 Hormone Level Data

This data set contains 48 hourly observations of the hormone level of one individual, given in Efron and Tibshirani (1993). Figure 4 displays a plot of the hormone levels, centered about their mean, over time, while Figure 5 plots the centered hormone level at time $t + 1$ versus the centered hormone level at time $t$, for $t = 1, \ldots, 47$. We use the model given by (3). The OLS, LAD and ALR estimates of $\rho$ are 0.58, 0.64 and 0.59, respectively; the corresponding estimated standard errors are 0.11, 0.21 and 0.13. Again, ALR is close to OLS.

References

Efron, B. and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York.

Ekblom, H. and Henriksson, S. (1969). Lp criteria for estimation of location parameters. SIAM Journal on Applied Mathematics, 17, 1130-1141.

Ellis, S. P. (1998). Instability of least squares, least absolute deviation and least median of squares linear regression. Statistical Science, 13, 337-344.

Furno, M. (1998). Estimating the variance of the LAD regression coefficients. Computational Statistics and Data Analysis, 27, 11-26.

Harter, H. L. (1974-1976). The method of least squares and some alternatives, Parts I-VI. International Statistical Review, 42, 141-174, 259-264, 282; 43, 1-44, 125-190, 269-278; 44, 111-157.

Huber, P. J. (1990). Robust Statistical Procedures. SIAM, Philadelphia.

Lawrence, K. D. and Arthur, J. L. (1990). Robust Regression: Analysis and Applications. Dekker, New York.

Maronna, R. A., Martin, D. R. and Yohai, V. J. (2006). Robust Statistics: Theory and Methods. Wiley, London.

McDonald, J. B. and Newey, W. K. (1988). Partially adaptive estimation of regression models via the generalized t distribution. Econometric Theory, 4, 428-457.

Newey, W. K. (1988). Adaptive estimation of regression models via moment restrictions. Journal of Econometrics, 38, 301-339.

Phillips, P. C. B. (1991). A shortcut to LAD estimator asymptotics. Econometric Theory, 7, 450-463.

Pollard, D. (1991). Asymptotics for least absolute deviation regression estimators. Econometric Theory, 7, 186-199.

Rice, J. R. and White, J. S. (1964). Norms for smoothing and estimation. SIAM Review, 6, 243-256.

Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection. Wiley-Interscience, New York.

Sposito, V. A. (1990). Some properties of Lp estimators. In Robust Regression: Analysis and Applications (K. D. Lawrence and J. L. Arthur, eds.), Dekker, New York.

State Public Expenditures Story, The Data and Story Library (DASL). Available at: http://lib.stat.cmu.edu/DASL/Stories/stateexpend.html

Appendix: Proof of Result 1

It is well established that the OLS estimator has the asymptotic expansion
$$n^{1/2}(\hat{\beta}_{ols} - \beta_0) = V^{-1} n^{-1/2} \sum_{i=1}^{n} x_i\epsilon_i + o_p(1).$$
Using results from Phillips (1991), we obtain for the LAD estimator
$$n^{1/2}(\hat{\beta}_{lad} - \beta_0) = V^{-1} n^{-1/2} \sum_{i=1}^{n} x_i\,\mathrm{sgn}(\epsilon_i)/\{2f(0)\} + o_p(1).$$
Combining the two expansions and using the definition of $\hat{\beta}_{alr}$, we obtain
$$n^{1/2}(\hat{\beta}_{alr} - \beta_0) = V^{-1} n^{-1/2} \sum_{i=1}^{n} x_i\left[w_0\epsilon_i + \mathrm{sgn}(\epsilon_i)(1 - w_0)/\{2f(0)\}\right] + o_p(1).$$
Normality of $\hat{\beta}_{alr}$ now follows from the central limit theorem. To compute the asymptotic variance $V_{alr}$, we first note that $E(\epsilon) = E\{\mathrm{sgn}(\epsilon)\} = 0$, $E(\epsilon^2) = \sigma_0^2$, $E\{\mathrm{sgn}(\epsilon)^2\} = 1$ and $E\{\epsilon\,\mathrm{sgn}(\epsilon)\} = E(|\epsilon|)$. Hence, using the expansion of $\hat{\beta}_{alr}$, we obtain
$$\mathrm{var}\{n^{1/2}(\hat{\beta}_{alr} - \beta_0)\} = V^{-1}\Big(n^{-1}\sum_{i=1}^{n} x_i^T x_i\Big)\, E\Big(\left[w_0\epsilon + \mathrm{sgn}(\epsilon)(1 - w_0)/\{2f(0)\}\right]^2\Big)\, V^{-1} + o_p(1)$$
$$= V^{-1} V\, E\big[w_0^2\epsilon^2 + (1 - w_0)^2\,\mathrm{sgn}(\epsilon)^2/\{2f(0)\}^2 + w_0(1 - w_0)\,\mathrm{sgn}(\epsilon)\,\epsilon/f(0)\big]\, V^{-1} + o_p(1)$$
$$= V^{-1}\big[w_0^2\sigma_0^2 + (1 - w_0)^2/\{2f(0)\}^2 + w_0(1 - w_0)E(|\epsilon|)/f(0)\big] + o_p(1).$$
Hence the result follows.

[Table 1: actual kurtosis $\kappa$ and estimated kurtosis $\hat{\kappa}$ by error distribution, for sample sizes $n = 20$, $50$ and $100$.]