On Model Selection in Robust Linear Regression

Guoqi Qian 1 2 and Hans R. Künsch

Research Report No. 80, November 1996
Seminar für Statistik
Eidgenössische Technische Hochschule (ETH)
CH-8092 Zürich, Switzerland

1 Research supported by Swiss National Science Fund 2100-043210.95.
2 Seminar für Statistik, ETH, and School of Statistics, La Trobe University, Melbourne 3083, Australia.

Abstract

Several model selection criteria, which can generally be classified as penalized robust methods, are studied in this paper. In particular, we derive a criterion based on Rissanen's stochastic complexity. Some asymptotic properties concerning strong consistency of selecting the optimal model by these criteria are given under general conditions. Other features, such as robustness against outliers and the effect of the signal-to-noise ratio, are also discussed. Finally, examples and simulations are presented to evaluate the finite sample performance of the criteria. The robust procedure used in this paper accounts for gross errors in both the response and the independent variables through a generalized Huberization.

Key words and phrases: model selection, robust regression, stochastic complexity, consistency, AIC, BIC, cross-validation, signal-to-noise ratio.

1 Introduction

Choosing a model is often an important goal of a statistical analysis. Clearly, exploratory investigations and subject knowledge are important for the formulation of model classes. But once the possible model classes have been set up, it is most useful to have an efficient and objective criterion to derive model selection from the data alone. In this paper we study the model selection problem in robust linear regression. Compared with the rich literature on robust estimation and testing, only few papers have been devoted to this so far; see Ronchetti (1985), Hampel (1983), Machado (1993), Ronchetti and Staudte (1994), and Ronchetti, Field and Blanchard (1996), where robust versions of AIC, BIC, Cp and cross-validation are proposed. It is widely known that in linear regression model selection based on least squares, AIC and the cross-validation method with a fixed size of the associated validation sample are not consistent in selecting the true model if it can be finitely parameterized: they tend to choose a model with too many independent variables. On the other hand, BIC and the cross-validation method with an increasing validation sample size (cf. Shao (1993)) are consistent if a true model with a finite number of independent variables exists. However, the penalty term in both BIC and AIC depends only on the number of independent

variables. They ignore, for instance, how large the contribution of an independent variable to the predictor is and how closely related the independent variables are. We think that when proposing a model selection procedure in robust linear regression, we should consider at least three issues. First, the criterion proposed should take into account the possibility that both response and predictors of some observations may contain gross errors. Therefore the criterion should not choose a complicated model in order to fit also a small number of outliers. The second issue is that the criterion proposed should be consistent if a finite dimensional true model exists, and it should possess some asymptotic optimality properties if the true model is infinite dimensional. The third one concerns the effect of the signal-to-noise ratio on the empirical performance of the criterion. Here, the effect of the signal-to-noise ratio means that a model selection criterion is not likely to pick out a term in a regression model whose coefficient is relatively small compared to the dispersion parameter of the model.

In this paper, we consider model selection procedures based on penalizing robust fitting errors. We will call them penalized robust deviance procedures. They include many of the procedures considered previously as well as a new one based on the ideas of stochastic complexity of Rissanen (1983, 1989, 1996). In order to describe them, we introduce the following framework. Suppose the observations $(x_1^t, y_1), \ldots, (x_n^t, y_n)$ with $x_i \in R^p$ and $y_i \in R$ are i.i.d. following an unknown distribution $F(x, y)$, such that the usual regression model is satisfied:

$$y_i = x_i^t \beta^* + r_i \quad \text{with } E(r_i \mid x_i) = 0. \tag{1.1}$$

Thus $\mu_i = E(y_i \mid x_i) = x_i^t \beta^*$, and the $r_i$'s are i.i.d. with $E(r_i) = 0$ and $\mathrm{cov}(x_i, r_i) = 0$. Note that we do not assume that $r_i$ is independent of $x_i$. In particular, the variance of $r_i$ may depend on $x_i$. For simplicity of presentation, we assume that the $p$ components of $x$ contain all independent variables available in the data, so that many components of $\beta^*$ may be zero. Then (1.1) also gives the full model. Clearly the set of all possible models can be identified with $A = \{\alpha : \alpha \text{ any non-empty subset of } \{1, \ldots, p\}\}$. Each $\alpha$, of size $p_\alpha$, in $A$ corresponds to a predictor $x_\alpha^t \beta_\alpha$ and vice versa. Here $x_\alpha$ is a sub-vector of $x$ indexed by $\alpha$ and $\beta_\alpha$ is similarly defined. Given a vector $\beta^*$, $A$ can be divided into two subsets:

1. $A_c = \{\alpha : \beta_i^* = 0 \text{ for any } i \notin \alpha\}$;
2. $A_w = \{\alpha : \beta_i^* \neq 0 \text{ for some } i \notin \alpha\}$.

Clearly $A_w$ represents all the wrong predictors and $A_c$ represents all the correct predictors. But many models in $A_c$ include irrelevant variables and thus are too complicated. The optimal model is defined as a correct model in $A_c$ with the smallest dimension. For simplicity we assume that such an optimal model is unique. It is easy to see that this is the case if the components of $x$ are linearly independent.

For the above setup, we study procedures to select a predictor of the following form:

$$\hat\alpha = \arg\min_{\alpha \in A} \left\{ \sum_{i=1}^{n} \rho\left\{ \frac{w(x_i)}{\sigma} \left( y_i - x_{i\alpha}^t \hat\beta_\alpha \right) \right\} + C(n, \alpha) \right\}. \tag{1.2}$$

Here $C(n, \alpha)$ is a penalty term measuring the complexity of a model $\alpha$, $w(x) \in (0, 1]$ is a weight function measuring the outlyingness of the independent variable $x$, $\sigma$ measures the scale of $w(x_i) r_i$, and $\hat\beta_\alpha$ is the M-estimator corresponding to the robust deviation $\rho(\cdot)$. It is defined by

$$\hat\beta_\alpha = \arg\min_{\beta_\alpha \in R^{p_\alpha}} \sum_{i=1}^{n} \rho\left\{ \frac{w(x_i)}{\sigma} \left( y_i - x_{i\alpha}^t \beta_\alpha \right) \right\}. \tag{1.3}$$
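To make (1.2) and (1.3) concrete, the following is a minimal sketch in Python (not from the paper): the Huber M-estimator is computed by iteratively reweighted least squares, the scale $\sigma$ is estimated by the median absolute deviation of the weighted full-model residuals, and a BIC-type penalty $C(n, \alpha) = (p_\alpha/2)\log n$ is used purely as an illustrative choice. All function names and the specific penalty are our assumptions.

```python
import itertools
import numpy as np

def huber_rho(u, c=1.345):
    """Huber's rho: quadratic for |u| <= c, linear beyond."""
    a = np.abs(u)
    return np.where(a <= c, 0.5 * u**2, c * a - 0.5 * c**2)

def huber_fit(X, y, w, sigma, c=1.345, n_iter=50, tol=1e-8):
    """M-estimator (1.3) for one candidate design X, via IRLS."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]      # least-squares start
    for _ in range(n_iter):
        u = w * (y - X @ beta) / sigma               # standardized residuals
        a = np.maximum(np.abs(u), 1e-10)
        omega = np.where(a <= c, 1.0, c / a)         # psi(u)/u, the IRLS weight
        W = omega * w**2                             # fold in outlyingness weights
        beta_new = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * y))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

def select_model(X_full, y, w=None, c=1.345):
    """Minimize (1.2) over all non-empty subsets of columns of X_full.
    X_full should already contain a column of ones if an intercept is wanted."""
    n, p = X_full.shape
    w = np.ones(n) if w is None else w
    # scale sigma: MAD of the weighted full-model residuals (one common robust choice)
    resid = w * (y - X_full @ huber_fit(X_full, y, w, 1.0, c))
    sigma = np.median(np.abs(resid - np.median(resid))) / 0.6745
    best, best_score = None, np.inf
    for k in range(1, p + 1):
        for alpha in itertools.combinations(range(p), k):
            Xa = X_full[:, alpha]
            beta = huber_fit(Xa, y, w, sigma, c)
            dev = np.sum(huber_rho(w * (y - Xa @ beta) / sigma, c))
            penalty = 0.5 * len(alpha) * np.log(n)   # illustrative BIC-type C(n, alpha)
            if dev + penalty < best_score:
                best, best_score = alpha, dev + penalty
    return best, best_score
```

Replacing the penalty line with, e.g., `penalty = len(alpha)` gives an AIC-type rule; the stochastic-complexity-based penalty derived in the paper would be plugged in at the same place.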

Choosing this estimator implies that the first term in (1.2) is the robust fitting error of a model $\alpha$. In particular, it guarantees that this robust fitting error decreases when additional independent variables are included in the model. The problem is then to choose a penalty term $C(n, \alpha)$ measuring the model complexity so that the resulting model selection criterion (1.2) performs satisfactorily with respect to the three issues we just discussed. In practice, $\sigma$ will be estimated with a robust estimate obtained by using essentially Huber's proposal 2 (Huber, 1981, p. 137) or Hampel's median absolute deviation (Hampel, 1974, p. 388) in the full model. The weights $w(x_i)$ will also be based on the full model; see Section 3 below. The robust function $\rho(\cdot)$ used in this paper will be Huber's function, defined as

$$\rho_c(u) = \begin{cases} \frac{1}{2} u^2, & |u| \le c, \\ c|u| - \frac{1}{2} c^2, & |u| > c, \end{cases}$$

with tuning constant $c > 0$.

[...]

... $\{ |w(x_i) r_i| \le \delta_0 \} > 0$. In addition, $E\big( P(|w(x_i) r_i| \le \delta_0 \mid x_i)\, \min(|h_i| \delta_0,\, h_i^2) \big) < \infty$ by the Cauchy-Schwarz inequality and condition (A.1). Thus by the strong law of large numbers, we have a.s.

$$\lim_{n \to \infty} \frac{T_n}{n} = E\big( P(|w(x_i) r_i| \le \delta_0 \mid x_i)\, \min(|h_i| \delta_0,\, h_i^2) \big) > 0.$$

Hence we have proved (4.3). □
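Lemma A.1 belongs to the part of the paper not reproduced in this excerpt. What the next proof takes from it is, in substance, the convexity inequality for Huber's function; we record it here in our own (assumed) phrasing for readability:

$$\rho_c(a) - \rho_c(b) \;\ge\; \psi_c(b)\,(a - b) \quad \text{for all } a, b \in R, \qquad \text{where } \psi_c(u) = \rho_c'(u) = \max\{-c, \min\{c, u\}\}.$$

This is just the supporting-line inequality for the convex function $\rho_c$; applying it with $a = w(x_i)(y_i - x_{i\alpha}^t \hat\beta_\alpha)$ and $b = w(x_i) r_i$ yields the first display of the proof below.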

Proof of Theorem 4.2: By using Lemma A.1, we get

$$\rho_c\{w(x_i)(y_i - x_{i\alpha}^t \hat\beta_\alpha)\} - \rho_c\{w(x_i) r_i\} \;\ge\; \psi_c\{w(x_i) r_i\}\, w(x_i)\, x_{i\alpha}^t (\beta_\alpha^* - \hat\beta_\alpha)$$

for any correct model $\alpha \in A_c$. Note that $x_{i\alpha}^t \beta_\alpha^* = x_i^t \beta^*$ if $\alpha \in A_c$. From this inequality and equations (1.3) and (3.1), it follows that

$$0 \;\le\; \sum_{i=1}^{n} \rho_c\{w(x_i) r_i\} - \sum_{i=1}^{n} \rho_c\{w(x_i)(y_i - x_{i\alpha}^t \hat\beta_\alpha)\}$$
$$\le\; \sum_{i=1}^{n} \psi_c\{w(x_i) r_i\}\, w(x_i)\, x_{i\alpha}^t (\hat\beta_\alpha - \beta_\alpha^*)$$
$$=\; \sum_{i=1}^{n} \big[ \psi_c\{w(x_i) r_i\} - \psi_c\{w(x_i)(y_i - x_{i\alpha}^t \hat\beta_\alpha)\} \big]\, w(x_i)\, x_{i\alpha}^t (\hat\beta_\alpha - \beta_\alpha^*)$$
$$=\; \sum_{i=1}^{n} \tilde\psi_c\{w(x_i) r_i,\ w(x_i) x_{i\alpha}^t (\hat\beta_\alpha - \beta_\alpha^*)\}\, \big( w(x_i) x_{i\alpha}^t (\hat\beta_\alpha - \beta_\alpha^*) \big)^2$$
$$\le\; B^2\, \|\hat\beta_\alpha - \beta_\alpha^*\|^2 \sum_{i=1}^{n} \tilde\psi_c\{w(x_i) r_i,\ w(x_i) x_{i\alpha}^t (\hat\beta_\alpha - \beta_\alpha^*)\},$$

where $\tilde\psi_c\{u, v\} = [\psi_c(u) - \psi_c(u - v)]/v \in [0, 1]$ denotes the difference quotient of $\psi_c$ and $B = \sup_i w(x_i) \|x_i\| < \infty$ by (A.6). The last inequality is a consequence of the Cauchy-Schwarz inequality. Applying similar arguments as used for Lemma A.2, we see that under conditions (A.2), (A.3), (A.6) and (A.7),

$$\frac{1}{n} \sum_{i=1}^{n} \tilde\psi_c\{w(x_i) r_i,\ w(x_i) x_{i\alpha}^t (\hat\beta_\alpha - \beta_\alpha^*)\} \;\to\; E\big[ P(|w(x_1) r_1| \le c \mid x_1) \big] \quad \text{a.s.}$$

In the same way we also have $\|\hat\beta_\alpha - \beta_\alpha^*\| = O\big(\sqrt{(\log\log n)/n}\,\big)$ a.s. under conditions (A.2) to (A.4) and (A.6) to (A.7). This implies that

$$0 \;\le\; \sum_{i=1}^{n} \rho_c\{w(x_i) r_i\} - \sum_{i=1}^{n} \rho_c\{w(x_i)(y_i - x_{i\alpha}^t \hat\beta_\alpha)\} = O(\log\log n) \quad \text{a.s.},$$

which proves (4.4). □
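The proof above rests on two pointwise properties of Huber's functions: the convexity inequality from Lemma A.1 and the bound $0 \le [\psi_c(u) - \psi_c(u - v)]/v \le 1$. Here is a small numeric sanity check of both; it is illustrative only and not part of the paper.

```python
import numpy as np

c = 1.345
rng = np.random.default_rng(0)

def rho(u):  # Huber's rho
    a = np.abs(u)
    return np.where(a <= c, 0.5 * u**2, c * a - 0.5 * c**2)

def psi(u):  # Huber's psi = rho'
    return np.clip(u, -c, c)

a = rng.normal(size=100_000) * 3
b = rng.normal(size=100_000) * 3

# Convexity: rho(a) - rho(b) >= psi(b) * (a - b) for all a, b.
assert np.all(rho(a) - rho(b) >= psi(b) * (a - b) - 1e-12)

# Difference quotient [psi(u) - psi(u - v)] / v lies in [0, 1], since psi
# is nondecreasing and 1-Lipschitz; this bounds the summands in the proof.
u, v = a, np.where(np.abs(b) < 1e-6, 1e-6, b)
q = (psi(u) - psi(u - v)) / v
assert np.all(q >= -1e-12) and np.all(q <= 1 + 1e-12)
print("convexity and difference-quotient bounds hold on all samples")
```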

Acknowledgment: The first author would like to thank Professors C. Field, G. Gabor, F. Hampel and W. Stahel for their stimulating discussions and useful references before and during the writing of this paper.


References

[1] Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. Proc. 2nd Int. Symp. Info. Theory, 267-281, eds. B.N. Petrov and F. Csaki, Akademia Kiado, Budapest.
[2] Akaike, H. (1974). A new look at statistical model identification. IEEE Trans. Automatic Control 19, 716-723.
[3] Efron, B. (1983). Estimating the error rate of a prediction rule: Improvement on cross-validation. J. Amer. Statist. Assoc. 78, 316-331.
[4] Efron, B. (1986). How biased is the apparent error rate of a prediction rule? J. Amer. Statist. Assoc. 81, 461-470.
[5] Hampel, F.R. (1974). The influence curve and its role in robust estimation. J. Amer. Statist. Assoc. 69, 383-393.
[6] Hampel, F.R. (1983). Some aspects of model choice in robust statistics. Proceedings of the 44th Session of ISI, Book 2, Madrid, 767-771.
[7] Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J. and Stahel, W.A. (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley, New York.
[8] Hill, R.W. (1977). Robust regression when there are outliers in the carriers. Ph.D. thesis, Harvard University, Cambridge, Mass.
[9] Huber, P.J. (1964). Robust estimation of a location parameter. Ann. Math. Stat. 35, 73-101.
[10] Huber, P.J. (1981). Robust Statistics. Wiley, New York.
[11] Machado, J.A.F. (1993). Robust model selection and M-estimation. Econ. Theory 9, 478-493.
[12] Mallows, C.L. (1973). Some comments on Cp. Technometrics 15, 661-675.
[13] Maronna, R.A. and Yohai, V.J. (1981). Asymptotic behavior of general M-estimates for regression and scale with random carriers. Z. Wahrsch. Verw. Gebiete 58, 7-20.
[14] Qian, G. (1994). Statistical modeling by stochastic complexity. Ph.D. thesis, Dalhousie University, Halifax, Canada.
[15] Qian, G. and Künsch, H. (1996). Some notes on Rissanen's stochastic complexity. Preprint.
[16] Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length. Ann. Statist. 11, 416-431.
[17] Rissanen, J. (1986). Order estimation by accumulated prediction errors. In Essays in Time Series and Allied Processes (J. Gani and M.B. Priestley, eds.). J. Appl. Probab. 23A, 55-61.
[18] Rissanen, J. (1989). Stochastic Complexity in Statistical Inquiry. World Scientific Publishing Co., Singapore.
[19] Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inform. Theory 42, 40-47.
[20] Ronchetti, E. (1985). Robust model selection in regression. Stat. Prob. Lett. 3, 21-23.
[21] Ronchetti, E., Field, C. and Blanchard, W. (1996). Robust linear model selection by cross-validation. To appear in J. Amer. Statist. Assoc.
[22] Ronchetti, E. and Staudte, R.G. (1994). A robust version of Mallows's Cp. J. Amer. Statist. Assoc. 89, 550-559.
[23] Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6, 461-464.
[24] Shao, J. (1993). Linear model selection by cross-validation. J. Amer. Statist. Assoc. 88, 486-494.
[25] Staudte, R.G. and Sheather, S.J. (1990). Robust Estimation and Testing. Wiley, New York.
[26] Stone, M. (1977). An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. J. R. Statist. Soc. B 39, 44-47.
[27] Wei, C.Z. (1992). On the predictive least squares principle. Ann. Statist. 20, 1-42.


[Figure 3. Empirical distributions of model selection for Model (i): seven histogram panels (Non-robust Stochastic Complexity Criterion; AIC; BIC; Stochastic Complexity Criterion; Ronchetti's Robust AIC; Hampel's Robust AIC; Robust BIC) showing selection frequencies (0-40) of the candidate models b0, b0+b1x1, b0+b2x1^2 and b0+b1x1+b2x1^2 against outlier values y1 from -154 to 146, bin width 3.]

[Figure 4. Empirical distributions of model selection for Model (ii): four histogram panels (Stochastic Complexity Criterion; Ronchetti's Robust AIC; Hampel's Robust AIC; Robust BIC) showing selection frequencies (0-40) against outlier values y1 from -152 to 148, bin width 3.]

[Figure 5. Empirical distributions of model selection for Model (iii): four histogram panels (Stochastic Complexity Criterion; Ronchetti's Robust AIC; Hampel's Robust AIC; Robust BIC) showing selection frequencies (0-40) against outlier values y16 from -140.5 to 159.5, bin width 3.]

[Figure 6. Empirical distributions of model selection for Model (iv): four histogram panels (Stochastic Complexity Criterion; Ronchetti's Robust AIC; Hampel's Robust AIC; Robust BIC) showing selection frequencies (0-40) against outlier values y1 from -130 to 170, bin width 3.]

[Figure 7. Empirical distributions of model selection for Model (v): seven histogram panels (Non-robust Stochastic Complexity Criterion; AIC; BIC; Stochastic Complexity Criterion; Ronchetti's Robust AIC; Hampel's Robust AIC; Robust BIC) showing selection frequencies (0-40) of the candidate models b0+b1x1, b0+b1x1+b2x2, b0+b1x1+b3x1x2, b0+b1x1+b2x2+b3x1x2 and other models against outlier values y1 from -150 to 150, bin width 3.]

[Figure 8. Empirical distributions of model selection for Model (vi): four histogram panels (Stochastic Complexity Criterion; Ronchetti's Robust AIC; Hampel's Robust AIC; Robust BIC) showing selection frequencies (0-40) against outlier values y1 from -149.5 to 150.5, bin width 3.]

[Figure 9. Empirical distributions of model selection for Model (vii): four histogram panels (Stochastic Complexity Criterion; Ronchetti's Robust AIC; Hampel's Robust AIC; Robust BIC) showing selection frequencies (0-40) against outlier values y16 from -136.5 to 163.5, bin width 3.]

[Figure 10. Empirical distributions of model selection for Model (viii): four histogram panels (Stochastic Complexity Criterion; Ronchetti's Robust AIC; Hampel's Robust AIC; Robust BIC) showing selection frequencies (0-40) against outlier values y1 from -126 to 174, bin width 3.]

[Figure 11. Comparison in terms of mean final prediction error: four panels plotting the differences MFPE(RAIC)-MFPE(SC), MFPE(HAIC)-MFPE(SC) and MFPE(RBIC)-MFPE(SC) against outlier values for Model (i) (y1 from -154 to 146), Model (ii) (y1 from -152 to 148), Model (v) (y1 from -150 to 150) and Model (vi) (y1 from -149.5 to 150.5), bin width 3.]

Table 4: Marginal Frequencies of Model Selection Based on 5050 Simulations

                                     non-robust             robust methods
Selected Model                  NSC    AIC    BIC    RSC   RAIC   HAIC   RBIC
Model (i): E(y) = 2 + 3x1
b0 + b1x1                        94     72     93   3193   3270   3334   3679
b0 + b1x1 + b2x1^2             4896   4928   4898   1061    694    626    275
b0 + b2x1^2                      60     50     59    796   1086   1090   1096
Model (ii): E(y) = 2 + 3x1 + 0.5x1^2
b0 + b1x1                       102     76    100   1392   1955   2055   2554
b0 + b1x1 + b2x1^2             4894   4924   4896   2616   1508   1298    288
b0 + b2x1^2                      54     50     54   1042   1587   1697   2208
Model (iii): E(y) = 2 + 3x1, x1,16 = 2.54
b0                             3347   2529   3143      0      0      0      0
b0 + b1x1                      1690   2426   1892   3764   4277   4326   4495
b0 + b1x1 + b2x1^2               11     24     11   1094    298    245     69
b0 + b2x1^2                       2     71      4    192    475    479    486
Model (iv): E(y) = 2 + 3x1, x1,1 = 6.0
b0 + b1x1                        70     62     79   3046   3269   3347   3711
b0 + b1x1 + b2x1^2             4884   4903   4849   1347    712    633    265
b0 + b2x1^2                      96     85    122    657   1069   1070   1074
Model (v): E(y) = 2 + 3x1 + 4x2
other models                   3807   3386   3796      0      0      0      0
b0 + b1x1                         0      0      0      0      0      0      0
b0 + b1x1 + b2x2                227    163    223   4021   3694   3792   4214
b0 + b1x1 + b2x2 + b3x1x2       491   1127    469    455    833    717    231
b0 + b1x1 + b3x1x2              525    374    562    574    523    541    605
Model (vi): E(y) = 2 + 1.5x1 + 1.5x2
other models                   4463   4265   4500     42      7      8     33
b0 + b1x1                        32      2     15    364     47     57    240
b0 + b1x1 + b2x2                122    116    128   2757   2710   2787   2943
b0 + b1x1 + b2x2 + b3x1x2       170    476    145    175    692    599    241
b0 + b1x1 + b3x1x2              263    191    262   1712   1594   1599   1593
Model (vii): E(y) = 2 + 3x1 + 4x2, x1,16 = 2.54
other models                   3891   3401   3776      0      0      0      0
b0 + b1x1                       395    586    436      0      0      0      0
b0 + b1x1 + b2x2                676   1002    802   4521   4379   4454   4747
b0 + b1x1 + b2x2 + b3x1x2        13     20      6    359    535    445    112
b0 + b1x1 + b3x1x2               75     41     30    170    136    151    191
Model (viii): E(y) = 2 + 3x1 + 4x2, x1,1 = 6.0
other models                   3794   3523   3865      0      0      0      0
b0 + b1x1                         3      0      0      0      0      0      0
b0 + b1x1 + b2x2                253    190    247   4022   3728   3845   4269
b0 + b1x1 + b2x2 + b3x1x2       174    657    232    534    857    720    234
b0 + b1x1 + b3x1x2              826    680    706    494    465    485    547

Table 5: Effects of Some Thick-tail Error Distributions for Model y = 2 + 3x1 + 4x2 + r, Based on 1000 Simulations

Candidate Models              Normal   t(3)  Cauchy  Log N  Slash  ε-Normal
Robust Stochastic Complexity Criterion
b0                                 0      0      10      0     35        0
b0 + b1x1                          0      0      33      0    103        0
b0 + b1x1 + b2x2                 887    832     652    843    534      819
b0 + b1x1 + b2x2 + b3x1x2         86     53      33     81      7       88
b0 + b1x1 + b3x1x2                27    115     226     76    237       93
b0 + b2x2                          0      0       1      0      2        0
b0 + b2x2 + b3x1x2                 0      0      17      0     20        0
b0 + b3x1x2                        0      0      28      0     62        0
RAIC
b0                                 0      0       2      0      9        0
b0 + b1x1                          0      0      13      0     41        0
b0 + b1x1 + b2x2                 850    809     635    810    594      767
b0 + b1x1 + b2x2 + b3x1x2        128     96     116    125     81      155
b0 + b1x1 + b3x1x2                22     95     197     65    220       78
b0 + b2x2                          0      0       0      0      3        0
b0 + b2x2 + b3x1x2                 0      0      22      0     35        0
b0 + b3x1x2                        0      0      15      0     17        0
HAIC
b0                                 0      0       3      0     12        0
b0 + b1x1                          0      0      16      0     46        0
b0 + b1x1 + b2x2                 868    821     649    825    600      785
b0 + b1x1 + b2x2 + b3x1x2        108     80      98    108     66      130
b0 + b1x1 + b3x1x2                24     99     196     67    218       85
b0 + b2x2                          0      0       0      0      4        0
b0 + b2x2 + b3x1x2                 0      0      22      0     32        0
b0 + b3x1x2                        0      0      16      0     22        0
RBIC
b0                                 0      0       8      0     25        0
b0 + b1x1                          0      0      26      0     79        0
b0 + b1x1 + b2x2                 941    865     690    892    602      850
b0 + b1x1 + b2x2 + b3x1x2         27     27      41     37     18       53
b0 + b1x1 + b3x1x2                32    108     190     71    202       97
b0 + b2x2                          0      0       1      0      2        0
b0 + b2x2 + b3x1x2                 0      0      21      0     22        0
b0 + b3x1x2                        0      0      23      0     50        0

Table 6: Effects of Some Thick-tail Error Distributions for Model y = 2 + 3x1 + r, Based on 1000 Simulations

Candidate Models              Normal   t(3)  Cauchy  Log N  Slash  ε-Normal
Stochastic Complexity Criterion
b0                                 0      0       2      0     24        0
b0 + b1x1                        754    738     660    740    600      666
b0 + b1x1 + b2x1^2               221    208     150    202    142      249
b0 + b2x1^2                       25     54     188     58    234       85
RAIC
b0                                 0      0       2      0      7        0
b0 + b1x1                        855    840     682    842    648      714
b0 + b1x1 + b2x1^2                69     57      84     58     60      117
b0 + b2x1^2                       76    103     232    100    285      169
HAIC
b0                                 0      0       2      0      8        0
b0 + b1x1                        868    847     697    854    658      730
b0 + b1x1 + b2x1^2                55     50      69     46     46      100
b0 + b2x1^2                       77    103     232    100    288      170
RBIC
b0                                 0      0       2      0     21        0
b0 + b1x1                        910    879     744    889    673      793
b0 + b1x1 + b2x1^2                 9     18      19      5     18       35
b0 + b2x1^2                       81    103     235    106    288      172