Appl. Math. J. Chinese. Univ. 2008, 23(3): 265-272

Testing heteroscedasticity in nonparametric regression models based on residual analysis

ZHANG Lei¹   MEI Chang-lin²

Abstract. The importance of detecting heteroscedasticity in regression analysis is widely recognized because efficient inference for the regression function requires that heteroscedasticity should be taken into account. In this paper, a simple test for heteroscedasticity is proposed in nonparametric regression based on residual analysis. Furthermore, some simulations with a comparison with Dette and Munk’s method are conducted to evaluate the performance of the proposed test. The results demonstrate that the method in this paper performs quite satisfactorily and is much more powerful than Dette and Munk’s method in some cases.

§1 Introduction

In regression analysis, it is commonly assumed that the error terms of a regression model have equal variance. In many real-world problems, however, this assumption cannot be guaranteed. Detecting heteroscedasticity in the error terms is therefore an important issue, because efficient inference requires that heteroscedasticity be taken into account. The problem of testing for heteroscedasticity of the error term in both parametric and nonparametric regression models has been discussed by many researchers (see, for example, [1, 3-6, 8, 10-12]). Most of the graphical procedures and formal tests are based on the residuals after fitting a model with a parametrically specified regression and variance function (see [1-4, 8, 10]). In nonparametric regression analysis in particular, a diagnostic test under the assumptions of smoothness of the regression function and normality of the errors has been proposed by Eubank and Thomas[6]. Dette and Munk[5] proposed a test based on an estimator of the best $L^2$-approximation of the variance function by a constant, and You and Chen[13] extended this method to partially linear regression models.

In this paper, a simple test for heteroscedasticity in nonparametric regression is proposed based on the analysis of the squared residuals. Some simulations are conducted to evaluate the performance of the test and the results are satisfactory. Furthermore, a simulation comparison of our method with Dette and Munk’s method[5] is made and the results demonstrate that our method is more powerful in some cases.

The remainder of this paper is organized as follows. In the next section, the testing procedure is derived. Some simulations, including a comparison with Dette and Munk’s method, are conducted in §3 to evaluate the performance of the test. The paper is then concluded with a brief summary.

Received: 2006-06-16
MR Subject Classification: 62J02, 62G05
Keywords: heteroscedasticity, nonparametric regression, residual analysis
Digital Object Identifier (DOI): 10.1007/s11766-008-1648-0
Supported by the National Natural Science Foundation of China (10531030)

§2 Procedure of testing heteroscedasticity

In this paper, we consider the nonparametric regression model

$$y_i = m(x_i) + \sigma(x_i)\varepsilon_i, \quad i = 1, 2, \cdots, n, \tag{1}$$

where $\varepsilon_i$ is the random error with zero mean and unit variance and $\mathrm{cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$. The hypotheses to be tested are

$$H_0: \sigma(x_i) = \sigma_0 \quad \text{versus} \quad H_1: \sigma(x_i) \text{ is a non-constant function of } x_i, \tag{2}$$

where $\sigma_0$ is a constant. Here, we first use the local linear fitting method to smooth the regression function $m(x)$ in (1), and then test the hypotheses in (2) based on the resulting residuals by employing the theory of ordinary linear regression analysis.

Suppose that the regression function $m(x)$ has a continuous second derivative in the domain, say $D$, of $x$. For any given point $x_0 \in D$, we know from Taylor’s expansion that $m(x)$ can be approximated in the neighborhood of $x_0$ by

$$m(x) \approx m(x_0) + m'(x_0)(x - x_0).$$

According to the local linear fitting procedure, the following weighted least-squares problem is formulated: minimize

$$\sum_{i=1}^{n} \bigl(y_i - m(x_0) - m'(x_0)(x_i - x_0)\bigr)^2 K_h(x_i - x_0)$$

with respect to $m(x_0)$ and $m'(x_0)$, where $K_h(\cdot) = K(\cdot/h)/h$, $K(\cdot)$ is a given kernel function and $h$ is the bandwidth, which can be selected by, for example, the cross-validation method (see [7, 9]). That is, the optimal value of the bandwidth $h$ is chosen to minimize

$$CV(h) = \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - \hat{y}_{(-i)}(h)\bigr)^2$$

with respect to $h$, where $\hat{y}_{(-i)}(h)$ is the estimator of $y_i$ obtained by the above local linear fitting procedure with the $i$th observation removed (for details, see [9], Section 4.2.1). Let

$$X_0 = \begin{pmatrix} 1 & x_1 - x_0 \\ 1 & x_2 - x_0 \\ \vdots & \vdots \\ 1 & x_n - x_0 \end{pmatrix}, \quad Y = (y_1, y_2, \cdots, y_n)^T \quad \text{and} \quad W_0 = \mathrm{diag}\bigl(K_h(x_1 - x_0), \cdots, K_h(x_n - x_0)\bigr). \tag{3}$$

Then, by solving the above weighted least-squares problem, we can express the fitted value $\hat{m}(x_0)$ of $m(x)$ at $x_0$ in matrix notation as

$$\hat{m}(x_0) = e_1^T (X_0^T W_0 X_0)^{-1} X_0^T W_0 Y,$$

where $e_1$ denotes a two-dimensional vector with its first element being 1 and the other being zero. Setting $x_0 = x_1, x_2, \cdots, x_n$ respectively, we obtain the estimators of $m(x)$ at all of the design points as

$$\hat{m}(x_i) = e_1^T (X_i^T W_i X_i)^{-1} X_i^T W_i Y, \quad i = 1, 2, \cdots, n,$$

where $X_i$ and $W_i$ are the matrices obtained by substituting $x_i$ for $x_0$ in $X_0$ and $W_0$ in (3), respectively. Then the residuals of the above fitting are computed by

$$\hat{\varepsilon}_i = y_i - \hat{m}(x_i), \quad i = 1, 2, \cdots, n.$$

Furthermore, we have the squared residuals

$$\hat{\varepsilon}_i^2 = (y_i - \hat{m}(x_i))^2, \quad i = 1, 2, \cdots, n.$$

If $H_0$ is true, which means that the error is homoscedastic, the $\hat{\varepsilon}_i^2$, $i = 1, 2, \cdots, n$, should not have any trend. Therefore, we can test the heteroscedasticity of the error by analyzing the trend of $\hat{\varepsilon}_i^2$ through fitting a simple linear regression model. Based on the new observations $(\hat{\varepsilon}_i^2, x_i)$, $i = 1, 2, \cdots, n$, we fit the linear regression model

$$\hat{\varepsilon}_i^2 = \beta_0 + \beta_1 x_i + \eta_i, \quad i = 1, 2, \cdots, n, \tag{4}$$

where $\eta_i$, $i = 1, 2, \cdots, n$, are random errors which are assumed to be independent and identically distributed as $N(0, \tau^2)$. As mentioned above, if $H_0$ is true, the least-squares estimator of the slope parameter $\beta_1$ in (4) should be close to zero. Otherwise, it will be significantly different from zero. Therefore, testing the hypotheses in (2) can be carried out by testing the following hypotheses in the linear regression model (4):

$$H_{10}: \beta_1 = 0 \quad \text{versus} \quad H_{11}: \beta_1 \neq 0.$$
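The local linear fit at the design points and the cross-validation choice of $h$ described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a Gaussian kernel, evaluates $e_1^T (X_i^T W_i X_i)^{-1} X_i^T W_i Y$ through the closed form of the 2×2 system, and the names `local_linear_fit` and `cv_bandwidth`, the candidate bandwidths and the test curve in the demo are all hypothetical choices.

```python
import numpy as np


def gaussian_kernel(u):
    """Standard Gaussian kernel K(u) (an assumed choice of K)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)


def local_linear_fit(x, y, h, leave_one_out=False):
    """Local linear estimate of m at every design point x_i.

    With leave_one_out=True the i-th observation gets zero weight in the
    fit at x_i, which yields the leave-one-out prediction used in CV(h).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    d = x[None, :] - x[:, None]          # d[i, j] = x_j - x_i
    w = gaussian_kernel(d / h) / h       # w[i, j] = K_h(x_j - x_i)
    if leave_one_out:
        np.fill_diagonal(w, 0.0)
    # Entries of X_i^T W_i X_i and X_i^T W_i Y for each i.
    s0 = w.sum(axis=1)
    s1 = (w * d).sum(axis=1)
    s2 = (w * d**2).sum(axis=1)
    t0 = w @ y
    t1 = (w * d) @ y
    # First component of the solution of the 2x2 normal equations.
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1**2)


def cv_bandwidth(x, y, candidates):
    """Return the candidate h with the smallest leave-one-out CV(h)."""
    y = np.asarray(y, dtype=float)
    scores = [
        np.mean((y - local_linear_fit(x, y, h, leave_one_out=True)) ** 2)
        for h in candidates
    ]
    return candidates[int(np.argmin(scores))]


if __name__ == "__main__":
    x = np.arange(1, 101) / 100.0
    y = np.sin(2.0 * np.pi * x)          # hypothetical noise-free curve
    h = cv_bandwidth(x, y, [0.05, 0.10, 0.15, 0.20])
    residuals = y - local_linear_fit(x, y, h)
    print(h, float(np.max(np.abs(residuals))))
```

A useful sanity check on such a sketch is that a local linear fit reproduces an exactly linear response, for any bandwidth, up to rounding error.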

At this point, the conventional t-test in the framework of ordinary linear regression analysis can be employed. Since this t-test is well known and is a standard output of many statistical software packages, we omit the details of the test procedure.
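For readers who want the omitted step spelled out, the t-test on the slope of model (4) can be sketched as below. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function name `slope_t_test` is hypothetical, the squared residuals are assumed to have been computed already, and the two-sided p-value is taken from the t distribution with $n - 2$ degrees of freedom via `scipy.stats.t.sf`. The demo data are made up.

```python
import numpy as np
from scipy import stats


def slope_t_test(x, squared_residuals):
    """Two-sided t-test of beta_1 = 0 in z_i = beta_0 + beta_1 * x_i + eta_i."""
    x = np.asarray(x, dtype=float)
    z = np.asarray(squared_residuals, dtype=float)
    n = x.size
    xc = x - x.mean()
    sxx = np.sum(xc**2)
    beta1 = np.sum(xc * z) / sxx                  # least-squares slope
    beta0 = z.mean() - beta1 * x.mean()           # least-squares intercept
    rss = np.sum((z - beta0 - beta1 * x) ** 2)    # residual sum of squares
    se_beta1 = np.sqrt(rss / (n - 2) / sxx)       # standard error of the slope
    t_stat = beta1 / se_beta1
    p_value = 2.0 * stats.t.sf(abs(t_stat), df=n - 2)
    return beta1, t_stat, p_value


if __name__ == "__main__":
    x = np.linspace(0.0, 1.0, 50)
    # Made-up "squared residuals" with an upward trend, for illustration only.
    z = 1.0 + 2.0 * x + 0.1 * np.cos(37.0 * np.arange(50))
    beta1, t_stat, p_value = slope_t_test(x, z)
    print(beta1, t_stat, p_value, p_value < 0.05)
```

$H_0$ in (2) would be rejected at level $\alpha$ when the returned p-value is below $\alpha$.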

§3 Simulation studies

We conduct some simulations in this section to examine the performance of the proposed test method. There are several purposes for conducting the simulations. Firstly, the validity of this test procedure will be investigated. Secondly, the influence of bandwidth choice on the validity and power of the test will be examined. Thirdly, we want to evaluate the influence of different types of the error distribution in model (1) on the power of the test. And lastly, we make a simulation comparison of our method with Dette and Munk’s method.

3.1 Design of the experiments

In our simulations, the model for generating the data is

$$y_i = 1.5 + 2x_i^3 + 3\sin(5x_i) + \sigma(x_i)\varepsilon_i, \quad i = 1, 2, \cdots, n. \tag{5}$$

With the above considerations in mind, we take different types of the variance function $\sigma(x_i)$ as follows:
(1) $\sigma(x) = \exp(ax)$;
(2) $\sigma(x) = 1 + ax$;
(3) $\sigma(x) = 1 + a\exp(-2x)\cos(4\pi x)$,
where the constant $a$ will take different values to evaluate the exactness and power of the test. Furthermore, for each kind of $\sigma(x)$, two types of distribution of $\varepsilon_i$ are considered. That is,
(1) $\varepsilon_1, \varepsilon_2, \cdots, \varepsilon_n$ are independently drawn from the standard normal distribution $N(0, 1)$;
(2) $\varepsilon_1, \varepsilon_2, \cdots, \varepsilon_n$ are independently drawn from the uniform distribution $U[-\sqrt{3}, \sqrt{3}]$.
The observations of the explanatory variable $x$ are equidistantly taken on the interval [0, 1], that is, $x_i = i/n$, $i = 1, 2, \cdots, n$, with $n = 100$ and 150 respectively. The Gaussian kernel function $K(x) = (1/\sqrt{2\pi})\exp(-x^2/2)$ is used throughout the simulations for fitting the regression function, and the value of the bandwidth $h$ is selected in two ways, that is, fixing $h = 0.05, 0.10, 0.15, 0.20$ and selecting its value by the cross-validation procedure, to evaluate the influence of different bandwidth values on the performance of the test. For each combination of the type of $\sigma(x)$, the distribution of $\varepsilon_i$, the way of selecting the bandwidth $h$ and the sample size ($n = 100$ or 150), 500 replications are run and the frequencies of rejecting the corresponding null hypothesis under the significance level $\alpha = 0.05$ are reported.

Table 1  The rejection frequencies of 500 replications with $\sigma(x) = \exp(ax)$ under the significance level $\alpha = 0.05$

Distribution   a       n = 100                                  n = 150
of ε_i                 h=0.05  h=0.10  h=0.15  h=0.20  CV       h=0.05  h=0.10  h=0.15  h=0.20  CV
N(0, 1)        0.00    0.068   0.056   0.052   0.036   0.050    0.066   0.060   0.058   0.058   0.060
               0.25    0.162   0.146   0.128   0.116   0.128    0.204   0.204   0.188   0.170   0.200
               0.50    0.454   0.452   0.444   0.430   0.440    0.624   0.614   0.620   0.588   0.616
               0.75    0.758   0.788   0.788   0.792   0.772    0.922   0.930   0.932   0.924   0.930
               1.00    0.944   0.950   0.954   0.948   0.950    0.990   0.996   0.998   0.996   0.994
U[-√3, √3]     0.00    0.078   0.072   0.068   0.046   0.068    0.042   0.042   0.040   0.030   0.042
               0.25    0.226   0.262   0.248   0.196   0.248    0.404   0.434   0.418   0.358   0.406
               0.50    0.740   0.794   0.780   0.738   0.750    0.910   0.942   0.940   0.926   0.920
               0.75    0.936   0.968   0.968   0.958   0.960    0.998   1.000   0.998   0.994   0.998
               1.00    0.992   0.996   0.996   0.996   0.996    1.000   1.000   1.000   1.000   1.000
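The data-generating design of this subsection can be sketched as follows. This is only an illustration of model (5) with the three variance functions and two error distributions listed above; the function names, the string labels `"exp"`, `"linear"`, `"damped_cos"`, `"normal"`, `"uniform"`, and the seed in the demo are hypothetical choices, not part of the original study.

```python
import numpy as np


def mean_function(x):
    """Regression function of model (5)."""
    return 1.5 + 2.0 * x**3 + 3.0 * np.sin(5.0 * x)


# The three variance functions sigma(x) of the design, keyed by made-up labels.
VARIANCE_FUNCTIONS = {
    "exp": lambda x, a: np.exp(a * x),
    "linear": lambda x, a: 1.0 + a * x,
    "damped_cos": lambda x, a: 1.0 + a * np.exp(-2.0 * x) * np.cos(4.0 * np.pi * x),
}


def simulate(n, a, variance="exp", errors="normal", rng=None):
    """Draw one sample (x, y) from model (5) on the design x_i = i / n."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.arange(1, n + 1) / n
    if errors == "normal":
        eps = rng.standard_normal(n)
    elif errors == "uniform":
        # U[-sqrt(3), sqrt(3)] has zero mean and unit variance.
        eps = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=n)
    else:
        raise ValueError("errors must be 'normal' or 'uniform'")
    sigma = VARIANCE_FUNCTIONS[variance](x, a)
    return x, mean_function(x) + sigma * eps


if __name__ == "__main__":
    rng = np.random.default_rng(2008)    # arbitrary seed for reproducibility
    x, y = simulate(100, 0.5, variance="exp", errors="normal", rng=rng)
    print(x[:3], y[:3])
```

Setting `a = 0` gives $\sigma(x) = 1$ for all three variance functions, which is the homoscedastic case used to check the size of the test.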

3.2 Simulation results with analysis

According to the aforementioned design of the experiments, the simulation results are summarized in Tables 1 to 3 for the three forms of the variance function $\sigma(x)$ respectively. We see from the results that, in all of the situations, the rejection frequencies are definitely higher than the significance level $\alpha = 0.05$ if the error term is heteroscedastic (that is, $a \neq 0$) and are close to $\alpha = 0.05$ if it is in fact homoscedastic (that is, $a = 0$). When the error term is indeed heteroscedastic, the rejection frequencies increase quickly as the value of $a$ becomes larger. As expected, the rejection frequencies for $a \neq 0$ also increase with the sample size. Furthermore, we see from the tables that the rejection frequencies of the test are quite stable for different fixed values of the bandwidth, and there is no evident difference between the rejection frequencies obtained with fixed bandwidth values and those obtained with the bandwidth chosen by the cross-validation procedure. This suggests that it may be feasible to choose the bandwidth with the cross-validation method when the proposed test is used in practical problems. In a real-world problem it may still be better practice to perform the test with several values of the bandwidth before deciding whether or not to reject the null hypothesis.

In consideration of the influence of the error distribution on the performance of the test, we see that, under the null hypothesis that the error term is homoscedastic, the rejection frequencies with normally distributed errors are closer to the significance level than those with uniformly distributed errors. Under the alternative hypothesis that the error term is heteroscedastic, however, the power with uniformly distributed errors is somewhat higher than that with normally distributed errors for each given value of $a$. The reason may be that, when the errors follow the uniform distribution, they are confined to a bounded interval, which may lead to a more evident trend in the squared residuals. This demonstrates that the distribution of the error term has some effect on both the accuracy and the power of the test. On the whole, the test shows a satisfactory performance in checking the heteroscedasticity of the error term in nonparametric regression models. We also conducted some simulations for other forms of the regression function and other distributions of the error term, such as the t-distribution and the Laplace distribution. The results are all similar to those in the aforementioned cases and are omitted here because of limited space.

Table 2  The rejection frequencies of 500 replications with $\sigma(x) = 1 + ax$ under the significance level $\alpha = 0.05$

Distribution   a       n = 100                                  n = 150
of ε_i                 h=0.05  h=0.10  h=0.15  h=0.20  CV       h=0.05  h=0.10  h=0.15  h=0.20  CV
N(0, 1)        0.00    0.068   0.056   0.052   0.036   0.050    0.066   0.060   0.058   0.058   0.060
               0.25    0.138   0.118   0.108   0.094   0.106    0.166   0.150   0.154   0.132   0.146
               0.50    0.334   0.306   0.294   0.280   0.298    0.442   0.438   0.422   0.410   0.416
               0.75    0.504   0.504   0.502   0.484   0.488    0.690   0.698   0.698   0.684   0.694
               1.00    0.660   0.692   0.686   0.682   0.652    0.840   0.852   0.854   0.830   0.842
U[-√3, √3]     0.00    0.078   0.072   0.068   0.046   0.068    0.042   0.042   0.040   0.030   0.042
               0.25    0.176   0.210   0.180   0.150   0.180    0.326   0.352   0.332   0.284   0.318
               0.50    0.530   0.586   0.580   0.544   0.552    0.784   0.822   0.822   0.780   0.796
               0.75    0.792   0.848   0.846   0.818   0.818    0.944   0.958   0.968   0.958   0.958
               1.00    0.882   0.934   0.938   0.936   0.918    0.990   0.988   0.988   0.984   0.988

Table 3  The rejection frequencies of 500 replications with $\sigma(x) = 1 + a\exp(-2x)\cos(4\pi x)$ under the significance level $\alpha = 0.05$

Distribution   a       n = 100                                  n = 150
of ε_i                 h=0.05  h=0.10  h=0.15  h=0.20  CV       h=0.05  h=0.10  h=0.15  h=0.20  CV
N(0, 1)        0       0.068   0.056   0.052   0.036   0.050    0.066   0.060   0.058   0.058   0.060
               1       0.144   0.126   0.126   0.120   0.118    0.202   0.198   0.196   0.200   0.186
               2       0.328   0.338   0.338   0.348   0.310    0.526   0.542   0.554   0.556   0.532
               3       0.546   0.560   0.580   0.582   0.532    0.826   0.816   0.814   0.818   0.804
               4       0.760   0.768   0.778   0.782   0.748    0.924   0.936   0.936   0.938   0.930
U[-√3, √3]     0       0.078   0.072   0.068   0.046   0.068    0.042   0.042   0.040   0.030   0.042
               1       0.174   0.146   0.148   0.164   0.148    0.240   0.260   0.278   0.288   0.258
               2       0.502   0.530   0.544   0.550   0.480    0.742   0.780   0.788   0.790   0.758
               3       0.776   0.802   0.822   0.858   0.770    0.954   0.968   0.972   0.970   0.960
               4       0.902   0.938   0.952   0.962   0.918    0.996   0.998   0.998   1.000   0.996
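When reading the $a = 0$ rows of the tables, it may help to keep the Monte Carlo error of a rejection frequency in mind. The sketch below is an addition for orientation only and is not part of the original analysis: it assumes independent replications, so that the number of rejections is binomial, and the two-standard-error band is a common rule of thumb rather than a formal criterion. With 500 replications and a true rejection probability of 0.05, the standard error is roughly 0.0097.

```python
import math


def mc_standard_error(p, replications):
    """Binomial standard error of a rejection frequency with true rate p."""
    return math.sqrt(p * (1.0 - p) / replications)


def two_se_band(p, replications):
    """Rule-of-thumb band p +/- 2 standard errors (an assumed convention)."""
    se = mc_standard_error(p, replications)
    return p - 2.0 * se, p + 2.0 * se


if __name__ == "__main__":
    print(mc_standard_error(0.05, 500))
    print(two_se_band(0.05, 500))
```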

Table 4  Comparison of the rejection frequencies obtained by our method and Dette and Munk’s test

            α = 0.025                  α = 0.05                   α = 0.10
n     a     Dette and Munk  Our test   Dette and Munk  Our test   Dette and Munk  Our test

m(x) = 1 + sin x, σ(x) = σ exp(ax)
50    0.0   0.038           0.024      0.056           0.052      0.101           0.094
50    0.5   0.055           0.136      0.084           0.228      0.132           0.324
50    1.0   0.095           0.476      0.148           0.602      0.223           0.740
100   0.0   0.028           0.028      0.057           0.054      0.093           0.090
100   0.5   0.064           0.332      0.097           0.452      0.151           0.580
100   1.0   0.153           0.894      0.215           0.956      0.313           0.982

m(x) = 1 + x, σ(x) = σ(1 + a sin(10x))^2
50    0.0   0.031           0.028      0.053           0.054      0.100           0.094
50    0.5   0.197           0.040      0.276           0.074      0.390           0.140
50    1.0   0.272           0.032      0.365           0.068      0.481           0.126
100   0.0   0.026           0.028      0.049           0.056      0.089           0.090
100   0.5   0.333           0.036      0.433           0.082      0.568           0.140
100   1.0   0.447           0.044      0.557           0.066      0.674           0.144

m(x) = 1 + x, σ(x) = σ(1 + ax)^2
50    0.0   0.034           0.028      0.054           0.054      0.097           0.094
50    0.5   0.073           0.318      0.113           0.446      0.185           0.572
50    1.0   0.136           0.666      0.198           0.802      0.291           0.902
100   0.0   0.028           0.028      0.053           0.056      0.100           0.090
100   0.5   0.105           0.724      0.158           0.818      0.233           0.902
100   1.0   0.221           0.984      0.304           0.996      0.412           1.000

3.3 Comparison with Dette and Munk’s method

Recently, Dette and Munk[5] also proposed a test for heteroscedasticity in nonparametric regression, in which the test statistic is based on an estimator of the best $L^2$-approximation of the variance function by a constant, and they proved the asymptotic normality of the null distribution of the test statistic. As a part of their simulations, Dette and Munk[5] considered the following three types of regression function and variance function:
(1) $m(x) = 1 + \sin(x)$, $\sigma(x) = \sigma\exp(ax)$ (monotone variance function);
(2) $m(x) = 1 + x$, $\sigma(x) = \sigma(1 + a\sin(10x))^2$ (high frequency variance function);
(3) $m(x) = 1 + x$, $\sigma(x) = \sigma(1 + ax)^2$ (unimodal variance function),
where they took $a = 0, 0.5, 1.0$ and $\sigma^2 = 0.25$. In each corresponding regression model $y_i = m(x_i) + \sigma(x_i)\varepsilon_i$, the $\varepsilon_i$ ($i = 1, 2, \cdots, n$) were independently drawn from the standard normal distribution and $x_i = (i - 1)/(n - 1)$ ($i = 1, 2, \cdots, n$) for $n = 50$, 100 and 200 respectively. For each case, they estimated the rejection probability by the rejection frequency of 5000 replications under the significance levels $\alpha = 0.025$, 0.05 and 0.10 respectively.

Following the above experimental design in [5], we conducted the simulations with our method, in which the bandwidth $h$ was selected by the cross-validation procedure in each replication. Because the results are similar across sample sizes and space is limited, we list our results together with the corresponding results of [5] in Table 4 only for $n = 50$ and 100.


We see from Table 4 that, under $H_0$ (that is, $a = 0$ in the variance functions), the rejection frequencies of both methods are close to the corresponding significance levels for both sample sizes. However, the two methods perform quite differently for different types of variance functions under the alternative hypothesis. Our method is much more powerful than Dette and Munk’s method for the monotone and unimodal variance functions, but it is less sensitive to the high frequency variance function, which has a constant amplitude. This is reasonable because our method is based mainly on the trend of the squared residuals, whereas a high frequency variance function produces no obvious trend in the squared residuals. Nevertheless, given its simplicity of implementation, our method is useful for detecting heteroscedasticity, at least for variance functions with some trend.
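A rough numerical illustration of this explanation is sketched below. It is an addition under stated assumptions, not a computation from the paper: it supposes that the squared residuals fluctuate around $\sigma^2(x_i)$ (ignoring smoothing bias), and it measures the linear trend by the Pearson correlation between $\sigma^2(x_i)$ and $x_i$ on the design $x_i = (i - 1)/(n - 1)$. The constant $\sigma$ cancels in a correlation, so it is set to 1; the function name and the choice $a = 1$, $n = 100$ are hypothetical.

```python
import numpy as np


def trend_correlation(x, variance_values):
    """Pearson correlation between sigma^2(x_i) and x_i."""
    return float(np.corrcoef(x, variance_values)[0, 1])


n, a = 100, 1.0
x = np.arange(n) / (n - 1)               # x_i = (i - 1) / (n - 1)

# sigma^2(x) for the three comparison models, with the constant sigma set to 1.
variance_functions = {
    "monotone": np.exp(a * x) ** 2,
    "high frequency": (1.0 + a * np.sin(10.0 * x)) ** 4,
    "unimodal": (1.0 + a * x) ** 4,
}

correlations = {
    name: trend_correlation(x, values)
    for name, values in variance_functions.items()
}

if __name__ == "__main__":
    for name, value in correlations.items():
        print(name, round(value, 3))
```

Under these assumptions the correlation is close to 1 for the monotone and unimodal cases and much smaller in absolute value for the high frequency case, which matches the loss of power seen in Table 4.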

§4 Conclusion

As mentioned in the introduction, detecting heteroscedasticity is one of the important issues in regression analysis. In this paper, a simple procedure is proposed for checking the heteroscedasticity of the error term in nonparametric regression models. The simulations, including a comparison with Dette and Munk’s method, show that the proposed method performs satisfactorily for many kinds of variance functions and distributions of the error term. In particular, the proposed test is very easy to implement with many existing statistical software packages and can be extended straightforwardly to other kinds of regression models, provided that the residuals can be obtained. From the practical point of view, the proposed method provides a useful and simple way of detecting heteroscedasticity in regression analysis.

References

1 Breusch T S, Pagan A R. A simple test for heteroscedasticity and random coefficient variation, Econometrica, 1979, 47: 1287-1294.
2 Carroll R J, Ruppert D. Transformation and Weighting in Regression, New York: Chapman and Hall, 1988.
3 Cook R D, Weisberg S. Diagnostics for heteroscedasticity in regression, Biometrika, 1983, 70: 1-10.
4 Diblasi A, Bowman A. Testing for constant variance in a linear model, Statist Probab Lett, 1997, 33: 95-103.
5 Dette H, Munk A. Testing heteroscedasticity in nonparametric regression, J Roy Statist Soc B, 1998, 60: 693-708.
6 Eubank R L, Thomas W. Detecting heteroscedasticity in nonparametric regression, J Roy Statist Soc B, 1993, 55: 145-155.
7 Fan J, Gijbels I. Local Polynomial Modelling and Its Applications, London: Chapman and Hall, 1996.
8 Harrison M J, McCabe B P M. A test for heteroscedasticity based on regression least squares residuals, J Amer Statist Assoc, 1979, 74: 494-500.
9 Hart J D. Nonparametric Smoothing and Lack-of-fit Tests, New York: Springer-Verlag, 1997.
10 Koenker R, Bassett G. Robust test for heteroscedasticity based on regression quantiles, Econometrica, 1982, 50: 43-61.
11 Müller H G, Stadtmüller U. Estimation of heteroscedasticity in regression analysis, Ann Statist, 1987, 15: 610-625.
12 Müller H G, Zhao P L. On a semi-parametric variance function model and a test for heteroscedasticity, Ann Statist, 1995, 23: 946-967.
13 You J, Chen G. Testing heteroscedasticity in partially linear regression models, Statist Probab Lett, 2005, 73: 61-70.

¹ School of Science, Xi’an Jiaotong University, Xi’an 710049, China; Xinhua News Agency, Beijing 100803, China.
² School of Science, Xi’an Jiaotong University, Xi’an 710049, China.
