CHOICE OF REFERENCE SUBCLASS IN REGRESSION MODELS

Gilbert MacKenzie^{1,2} & Defen Peng^{2,3}

1 ENSAI, Rennes. 2 Centre of Biostatistics, University of Limerick, Ireland. 3 UBC, Vancouver, Canada.

CASI, Templepatrick, Northern Ireland, May 14-16, 2014


[Photo: ENSAI Building, 2nd Int. BIO-SI Workshop, Oct 6/7th 2011]



Outline

This talk is about the choice of reference subclass in parametric regression models with categorical variables, mainly in observational studies.

- Introduction
- Linear Model Setting
- Precision & Multi-collinearity
- Extensions to GLMs
- Conclusions



Introduction

A quotation: ‘There is no statistical justification for choosing one reference category or another. The choice is usually made on subject matter grounds to make the interpretations easier and the choice can easily vary from data analyst to data analyst. So, the need for a reference category can complicate interpretations and the results . . .’ R. Berk (2008).

We show that a judicious choice of reference subclass can improve certain properties of the regression model.



Secondary Criteria

Many model properties are invariant to the choice of reference subclass, so we need secondary criteria:

- Precision of estimates: the total variance, T̂_r = tr[V(β̂_r)].
- Multicollinearity: the condition number, K̂_r.
- Logical considerations.

NB: the third can be evaluated in terms of the first two. We are therefore interested in the pair (T̂_r, K̂_r). We illustrate in terms of the Linear Model (only 15 minutes).



Linear Model Setting

We consider the general linear model
\[ Y = X\beta + \varepsilon \tag{1} \]
where Y is a continuous response variable, X is an n × p design matrix, β is a p × 1 column vector of regression parameters, E(ε) = 0 and E(εε') = σ²I_n. We will also assume that ε_i ~ N(0, σ²) when required, for i = 1, . . . , n. It follows immediately that
\[ \hat{\beta} = (X'X)^{-1}X'Y \tag{2} \]
and that
\[ V(\hat{\beta}) = \sigma^{2}(X'X)^{-1} \tag{3} \]
which implies, under the Gaussian assumption, that the Fisher information matrix is
\[ I(\beta) = (X'X)/\sigma^{2}. \tag{4} \]
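For concreteness, the following numpy sketch (not part of the original slides; the allocation, parameter values and variable names are illustrative assumptions) computes (2) and (3) for a single categorical covariate coded as an intercept plus dummy indicators, and forms the total variance T_r = tr[V(β̂_r)].

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative allocation: three subclasses, subclass 3 taken as reference
n1, n2, n3 = 50, 30, 20
labels = np.repeat([1, 2, 3], [n1, n2, n3])
X = np.column_stack([np.ones(labels.size),          # intercept
                     (labels == 1).astype(float),    # dummy for subclass 1
                     (labels == 2).astype(float)])   # dummy for subclass 2

beta_true = np.array([1.0, 0.5, -0.3])               # arbitrary parameter values
y = X @ beta_true + rng.normal(0.0, 1.0, size=labels.size)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                         # equation (2)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (labels.size - X.shape[1])
V_hat = sigma2_hat * XtX_inv                         # equation (3), sigma^2 estimated
T_r = np.trace(V_hat)                                # total variance, tr[V(beta_hat_r)]

print(beta_hat, T_r)
```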



Form of Design Matrix

If the design matrix X encodes a single categorical variable with p = (k + 1) subclasses, X'X may take one of two main forms:
\[ X'X = \mathrm{diag}(n_1, n_2, \ldots, n_{k+1}) \tag{5} \]
or
\[
X'X =
\begin{pmatrix}
n   & n_1 & n_2 & \cdots & n_k \\
n_1 & n_1 & 0   & \cdots & 0   \\
n_2 & 0   & n_2 & \cdots & 0   \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
n_k & 0   & 0   & \cdots & n_k
\end{pmatrix}.
\tag{6}
\]
In (5) we have included exactly p = (k + 1) binary indicator variables, and in (6) we have included an intercept term and exactly k binary indicator variables.
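A short sketch (illustrative only; the subclass sizes are arbitrary assumptions) that builds both codings for one categorical variable and prints X'X, reproducing the diagonal form (5) and the bordered form (6).

```python
import numpy as np

n_sub = np.array([40, 25, 15])                       # n_1, n_2, n_3 (arbitrary)
labels = np.repeat([1, 2, 3], n_sub)

# Form (5): p = k + 1 indicators, no intercept
X5 = np.column_stack([(labels == j).astype(float) for j in (1, 2, 3)])

# Form (6): intercept plus k indicators; subclass 3 is the reference
X6 = np.column_stack([np.ones(labels.size)] +
                     [(labels == j).astype(float) for j in (1, 2)])

print(X5.T @ X5)   # diag(n_1, n_2, n_3), as in (5)
print(X6.T @ X6)   # bordered form with n in the (1,1) cell, as in (6)
```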



Precision

Suppose we have a sample allocation (n_1, n_2, · · · , n_p), where at least one of the allocated numbers is different from the others. Let r denote the reference category, which may be chosen freely from (1, . . . , p). Then
\[
(X'X)^{-1} = \frac{1}{n_r}
\begin{pmatrix}
1  & -1          & -1          & \cdots & -1 \\
-1 & 1 + n_r/n_1 & 1           & \cdots & 1  \\
-1 & 1           & 1 + n_r/n_2 & \cdots & 1  \\
\vdots & \vdots  & \vdots      & \ddots & \vdots \\
-1 & 1           & 1           & \cdots & 1 + n_r/n_k
\end{pmatrix}
\tag{7}
\]
where n_r = n − \sum_{j=1}^{k} n_j is the allocated number of the reference category.
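As a quick check of (7), a sketch under an arbitrary allocation (not taken from the slides) compares the closed form with a direct numerical inverse, taking the last subclass as the reference r.

```python
import numpy as np

n_sub = np.array([10, 25, 40])                       # n_1, n_2 and n_r (reference last)
labels = np.repeat([1, 2, 3], n_sub)
X = np.column_stack([np.ones(labels.size),
                     (labels == 1).astype(float),
                     (labels == 2).astype(float)])

numeric = np.linalg.inv(X.T @ X)

k, n_r = 2, n_sub[-1]
closed = np.ones((k + 1, k + 1))
closed[0, 1:] = closed[1:, 0] = -1.0                 # first row and column of -1s
closed[np.arange(1, k + 1), np.arange(1, k + 1)] = 1.0 + n_r / n_sub[:k]
closed /= n_r                                        # equation (7)

print(np.allclose(numeric, closed))                  # True
```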



Example 1 - Binary Covariate

With p = 2, X ← (x_0, x_1) implies that category 2 is the reference:
\[
\hat{\beta}_r =
\begin{pmatrix}
\frac{1}{n_r}\sum_{i[r]} y_i \\
-\frac{1}{n_r}\sum_{i[r]} y_i + \frac{1}{n_s}\sum_{i[s]} y_i
\end{pmatrix}
\tag{8}
\]
\[
\hat{\beta}_s =
\begin{pmatrix}
\frac{1}{n_s}\sum_{i[s]} y_i \\
-\frac{1}{n_s}\sum_{i[s]} y_i + \frac{1}{n_r}\sum_{i[r]} y_i
\end{pmatrix}.
\tag{9}
\]
The intercepts differ, but β̂_{1,r} = −β̂_{1,s}. On the diagonals of the (2 × 2) variance-covariance matrices,
\[ \mathrm{diag}\,V(\hat{\beta}_r) = \left[\frac{\sigma^2}{n_r},\ \sigma^2\left(\frac{1}{n_r}+\frac{1}{n_s}\right)\right], \qquad \mathrm{diag}\,V(\hat{\beta}_s) = \left[\frac{\sigma^2}{n_s},\ \sigma^2\left(\frac{1}{n_s}+\frac{1}{n_r}\right)\right], \]
thus Var(β̂_{1,r}) = Var(β̂_{1,s}). Therefore the precision of the regression coefficient is invariant to switching the reference category.
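The invariance can be verified numerically; this is a minimal sketch (the allocation and σ² are illustrative assumptions), building the design matrix under each choice of reference and comparing the slope variances.

```python
import numpy as np

n1, n2 = 30, 70                                      # illustrative subclass sizes
labels = np.repeat([1, 2], [n1, n2])
sigma2 = 1.0                                         # treated as known for the check

def vcov(reference):
    other = 1 if reference == 2 else 2               # the non-reference category
    X = np.column_stack([np.ones(labels.size),
                         (labels == other).astype(float)])
    return sigma2 * np.linalg.inv(X.T @ X)

V_ref2, V_ref1 = vcov(2), vcov(1)
print(np.diag(V_ref2))   # [sigma2/n2, sigma2*(1/n1 + 1/n2)]
print(np.diag(V_ref1))   # [sigma2/n1, sigma2*(1/n1 + 1/n2)]  -> same slope variance
```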



Example 2 - Two binary covariates

With p = 3, X ← (x_0, x_1, x_2) implies that category 3 is the reference. The regression coefficients differ across reference choices (not shown), and the diagonals of the variance-covariance matrices are
\[ \mathrm{diag}\,V(\hat{\beta}_{r=3}) = \sigma^2\left[\frac{1}{n_3},\ \left(\frac{1}{n_3}+\frac{1}{n_1}\right),\ \left(\frac{1}{n_3}+\frac{1}{n_2}\right)\right]' \]
\[ \mathrm{diag}\,V(\hat{\beta}_{r=2}) = \sigma^2\left[\frac{1}{n_2},\ \left(\frac{1}{n_2}+\frac{1}{n_1}\right),\ \left(\frac{1}{n_2}+\frac{1}{n_3}\right)\right]' \]
\[ \mathrm{diag}\,V(\hat{\beta}_{r=1}) = \sigma^2\left[\frac{1}{n_1},\ \left(\frac{1}{n_1}+\frac{1}{n_2}\right),\ \left(\frac{1}{n_1}+\frac{1}{n_3}\right)\right]' \]
So in LMs, T̂_r is minimised when n_r = n_max.
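A direct numerical illustration of this claim, with an assumed unequal allocation and σ² = 1: compute T̂_r for every candidate reference subclass and note that the minimum occurs at the largest subclass.

```python
import numpy as np

n_sub = np.array([15, 35, 50])                       # subclass 3 is the largest
labels = np.repeat([1, 2, 3], n_sub)

def total_variance(reference, sigma2=1.0):
    others = [j for j in (1, 2, 3) if j != reference]
    X = np.column_stack([np.ones(labels.size)] +
                        [(labels == j).astype(float) for j in others])
    return sigma2 * np.trace(np.linalg.inv(X.T @ X))

for r in (1, 2, 3):
    print(r, round(total_variance(r), 4))            # smallest at r = 3 (n_r = n_max)
```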



Multi-collinearity

We use the condition number, K̂_r, to measure multi-collinearity. Belsley (2004) defines the condition number of a square matrix M as
\[ K(M) = \sqrt{\lambda_{\max}/\lambda_{\min}} = \nu_{\max}/\nu_{\min}, \]
where λ_max = maximum(λ_j), λ_min = minimum(λ_j), the λ_j, j = 1, 2, · · · , p, are the eigenvalues of M, and the ν are the singular value decomposition (SVD) numbers. The threshold values for K(M = X'X) are 10 and 30, indicating medium and serious degrees of multi-collinearity respectively. We use K_r to denote K(M_r), where M = X'X and r indicates reference subclass dependence.
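A small helper (a sketch, not from the slides) computes K both from the eigenvalues of M = X'X and from the singular values of X; the two routes agree, as in the definition above.

```python
import numpy as np

def condition_number(X):
    eig = np.linalg.eigvalsh(X.T @ X)                # eigenvalues of M = X'X
    via_eig = np.sqrt(eig.max() / eig.min())
    sv = np.linalg.svd(X, compute_uv=False)          # singular values of X
    via_svd = sv.max() / sv.min()
    return via_eig, via_svd

labels = np.repeat([1, 2, 3], [15, 35, 50])          # illustrative allocation
X = np.column_stack([np.ones(labels.size),
                     (labels == 1).astype(float),
                     (labels == 2).astype(float)])
print(condition_number(X))                           # the two values coincide
```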



MC LM Binary Covariate

The eigenvalues λ of M = X'X, based on the determinant det(X'X − λI) = λ² − (n + n_1)λ + nn_1 − n_1², are
\[ \lambda_{\max} = \frac{n + n_1}{2} + \frac{1}{2}\sqrt{(n - n_1)^2 + 4n_1^2}, \qquad \lambda_{\min} = \frac{n + n_1}{2} - \frac{1}{2}\sqrt{(n - n_1)^2 + 4n_1^2}, \]
where I is the 2 × 2 identity matrix. The condition number is then
\[ K_r(X'X) = \sqrt{\frac{1 + \rho_1 + \sqrt{(1 - \rho_1)^2 + 4\rho_1^2}}{1 + \rho_1 - \sqrt{(1 - \rho_1)^2 + 4\rho_1^2}}}, \tag{10} \]
where ρ_1 = n_1/n.
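A numerical check of (10) against a direct eigenvalue computation, under an assumed allocation (not from the slides).

```python
import numpy as np

n, n1 = 100, 30
rho1 = n1 / n

root = np.sqrt((1 - rho1) ** 2 + 4 * rho1 ** 2)
K_closed = np.sqrt((1 + rho1 + root) / (1 + rho1 - root))    # equation (10)

x1 = np.r_[np.ones(n1), np.zeros(n - n1)]                    # binary covariate
X = np.column_stack([np.ones(n), x1])
eig = np.linalg.eigvalsh(X.T @ X)
K_direct = np.sqrt(eig.max() / eig.min())

print(np.isclose(K_closed, K_direct))                        # True
```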



Relationship between T̂_r and K̂_r

We have examined this in a variety of cases, analytically and via simulation, in LMs and GLMs, and the results are similar.

- The correlation between (T̂_r, K̂_r) is typically 0.95, showing a strong linear relationship (a simulation sketch follows below).
- This means that minimizing T̂_r also minimises K̂_r.
- Thus the stability of the model is improved by selecting n_r = n_max in LMs.
- There is no loss of information from switching reference subclass, as contrasts of interest are invariant to this switch.
- In GLMs things are more complicated when minimizing T̂_r, but the principle is the same.
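The kind of simulation referred to in the first bullet can be sketched as follows; the simulation design (random allocations, σ² = 1, the linear-model case) is an assumption for illustration and is not the authors' own study.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 4                                                # number of subclasses
T_vals, K_vals = [], []

for _ in range(200):
    n_sub = rng.integers(5, 100, size=p)             # random unequal allocation
    labels = np.repeat(np.arange(p), n_sub)
    for r in range(p):
        others = [j for j in range(p) if j != r]
        X = np.column_stack([np.ones(labels.size)] +
                            [(labels == j).astype(float) for j in others])
        M = X.T @ X
        eig = np.linalg.eigvalsh(M)
        T_vals.append(np.trace(np.linalg.inv(M)))    # T_r with sigma^2 = 1
        K_vals.append(np.sqrt(eig.max() / eig.min()))

print(np.corrcoef(T_vals, K_vals)[0, 1])             # strong positive correlation
```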



Lung Cancer Study

A survival study of lung cancer in NI (Wilkinson, 1992): a total of 855 incident cases followed for 2 years, with 50% dead by 6 months. We are interested in who gets active treatment (and why): some 51.5% received no active treatment! This leads to a standard MLF analysis with Y = 1 for treatment and Y = 0 otherwise, using 5 covariates: WHO, Age, Cell type, Metastases and Albumen.




Conclusions

- There is more to see than suggested by Berk.
- Maximising the precision minimizes the multi-collinearity.
- This must be useful in sparse data situations with many categorical covariates.
- There is no loss of information on contrasts of interest.
- For LMs and GLMs (and beyond) the results are similar.
- Overall, we have created some useful tools. We hope their use will improve practice.



Acknowledgements The work in this paper was supported by two Science Foundation Ireland (SFI, www.sfi.ie) project grants. Professor MacKenzie was supported under the Mathematics Initiative, II, via the BIO-SI (www.ul.ie/bio-si) research programme in the Centre of Biostatistics, University of Limerick, Ireland: grant number 07/MI/012. Professor Peng is also supported via a Research Frontiers Programme award, grant number 05/RF/MAT 026.



References

Altman, D. G. & Royston, P. (2006). Statistics notes: The cost of dichotomising continuous variables. British Medical Journal, 332, 1080.

Belsley, D. A., Kuh, E. & Welsch, R. E. (2004). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons, first edition.

Berk, R. (2008). Statistical Learning from a Regression Perspective. Springer, New York.

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.

Cox & Snell (1989). Analysis of Binary Data. Chapman and Hall, second edition.

CRAN R project (2009). Package 'pwr'. Retrieved 2010 from http://cran.r-project.org/web/packages/pwr/pwr.pdf

Elwood, J. H., MacKenzie, G. & Cran, G. (1974). Observations on single births to women resident in Belfast 1962-66: Part I - Factors associated with perinatal mortality. J. Chron. Dis., 27, 517-535.



Feldstein, M. S. (1966). A binary variable multiple regression method of analysing factors affecting peri-natal mortality and other outcomes of pregnancy. Journal of the Royal Statistical Society, Series A, 129, 61-73.

Frøslie, K. F., Røislien, J., Laake, P., Henriksen, T., Qvigstad, E. & Veierød, M. B. (2010). Categorisation of continuous exposure variables revisited. A response to the Hyperglycaemia and Adverse Pregnancy Outcome (HAPO) Study. BMC Medical Research Methodology, 10, 1471-2288.

Isham (1991). Statistical Theory and Modelling, edited by D. V. Hinkley, N. Reid and E. J. Snell. Chapman and Hall.

MacKenzie, G. & Peng, D. (2010). Properties of estimators in interval censored PH regression survival models. Submitted to the Journal of the Royal Statistical Society, Series C.

Nijenhuis, A. & Wilf, H. S. (1978). Combinatorial Algorithms for Computers and Calculators. Academic Press, second edition.

Peng, D. & MacKenzie, G. (2014). Discrepancy and choice of reference subclass in categorical regression models. In: Statistical Modelling in Biostatistics and Bioinformatics. Springer, Munich, 260 pages.



Pocock, S. J., Collier, T. J., Dandreo, K. J., De Stavola, B. L., Goldman, M. B., Kalish, L. A., Linda, E. K. & Valerie, A. M. (2004). Issues in the reporting of epidemiological studies: a survey of recent practice. British Medical Journal, 329, 883-887.

Rao, C. R. & Rao, M. B. (1998). Matrix Algebra and its Applications to Statistics and Econometrics. World Scientific Publishing, Singapore, first edition.

Shapiro, S. S. (1980). How to test normality and other distributional assumptions. Statistical Techniques, 3, 1-78.

Smith, O. K. (1961). Eigenvalues of a symmetric 3 × 3 matrix. Communications of the ACM, 4, 168.

Wissmann, M., Toutenburg, H. & Shalabh (2007). Role of Categorical Variables in Multicollinearity in Linear Regression Model. Technical Report 008, Department of Statistics, University of Munich, Germany.

William, G. J. (2005). Regression III: Advanced Methods. Lecture notes, Department of Political Science, Michigan State University. http://polisci.msu.edu/jacoby/icpsr/regress3



Minimizing the Total Variance

Proof. Let n_r = max(n_1, · · · , n_p), and let s ∈ {1, 2, · · · , p} (s ≠ r) be another choice of reference category, where n_r > n_s. Then, from (17), the corresponding total variances are
\[ V_r = \frac{1}{n_r} + \left(\frac{1}{n_r} + \frac{1}{n_s}\right) + \sum_{j \neq r,s}^{p} \left(\frac{1}{n_r} + \frac{1}{n_j}\right) \]
and
\[ V_s = \frac{1}{n_s} + \left(\frac{1}{n_s} + \frac{1}{n_r}\right) + \sum_{j \neq r,s}^{p} \left(\frac{1}{n_s} + \frac{1}{n_j}\right). \]
Since 1/n_r < 1/n_s, we have V_r < V_s; i.e., choosing n_r = n_max minimizes the total variance.
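A short numerical check of the argument (the allocation is an arbitrary assumption): evaluating the total variance as above with σ² = 1, the smallest value occurs when the largest subclass is the reference.

```python
n_sub = [12, 30, 58]                                  # n_max = 58

def total_var(r):
    others = [n for i, n in enumerate(n_sub) if i != r]
    return 1.0 / n_sub[r] + sum(1.0 / n_sub[r] + 1.0 / nj for nj in others)

print([round(total_var(r), 4) for r in range(3)])     # minimum at the largest subclass
```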



Canonical GLMs

A canonical GLM for independent responses Y_i has E(Y_i) = µ_i = g(θ_i), where
\[ \theta_i = \sum_{u=0}^{k} x_{ui}\,\beta_u \]
is the linear predictor and β_u, u = 0, . . . , k, represents the p regression parameters. Then the observed information matrix for β is
\[ I_o(\beta) = (\nabla_\beta \theta^T)(\nabla_\theta \nabla_\theta K)(\nabla_\beta \theta^T)^T = (\nabla_\beta \mu^T)(\nabla_\theta \nabla_\theta K)^{-1}(\nabla_\beta \mu^T)^T. \tag{11} \]
When β_0 is the intercept, we can re-express this as the (p × p) matrix
\[
I_o(\beta_0, \beta_c) = (X'WX) =
\begin{pmatrix}
\sum_i w_i & \sum_i x_{ci}'\, w_i \\
\sum_i x_{ci}\, w_i & \sum_i x_{ci}\, x_{ci}'\, w_i
\end{pmatrix}.
\tag{12}
\]
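For the logistic (binomial) case, a minimal sketch of (12) computes the canonical weights w_i and assembles X'WX directly; the allocation and parameter value are illustrative assumptions.

```python
import numpy as np

labels = np.repeat([1, 2, 3], [20, 30, 50])           # illustrative allocation
X = np.column_stack([np.ones(labels.size),
                     (labels == 1).astype(float),
                     (labels == 2).astype(float)])
beta = np.array([-0.2, 0.8, 0.4])                     # arbitrary parameter value

eta = X @ beta                                        # linear predictor theta_i
w = np.exp(eta) / (1.0 + np.exp(eta)) ** 2            # logistic canonical weights
I_obs = X.T @ (w[:, None] * X)                        # X'WX, equation (12)
print(I_obs)
```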



Structural Weights

Table 2: Structural weights

| Distribution | Density (mass) function | Link function | w_i |
|--------------|-------------------------|---------------|-----|
| Normal | f(y; µ, σ) = (1/√(2πσ)) e^{−(y−µ)²/(2σ²)} | xβ = µ = θ | σ^{−2} |
| Exponential | f(y; λ) = λ e^{−λy} | xβ = µ^{−1} | (x_i'β)^{−2} |
| IG | f(y; µ, λ) = (λ/(2πy³))^{1/2} e^{−λ(y−µ)²/(2µ²y)} | xβ = µ^{−2} = θ | (λ/4)(x_i'β)^{−3/2} |
| Poisson | f(y; λ) = λ^y e^{−λ}/y! | xβ = log(µ) = θ | exp(x_i'β) |
| Binomial | f(y; n, p) = C(n, y) p^y (1−p)^{n−y} | xβ = log(µ/(1−µ)) = θ | exp(x_i'β)/(1+exp(x_i'β))² |
| Geometric | f(y; p) = (1−p)^{y−1} p | xβ = log(µ/(1−µ)) = θ | 1/(1+exp(x_i'β)) |
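The weight column of Table 2 can be evaluated directly; a sketch (the function name and test values are assumptions) for a few of the families listed above:

```python
import numpy as np

def structural_weight(eta, family, sigma2=1.0):
    """w_i as a function of the linear predictor eta = x_i' beta (Table 2)."""
    if family == "normal":
        return np.full_like(eta, 1.0 / sigma2)
    if family == "poisson":
        return np.exp(eta)
    if family == "binomial":
        return np.exp(eta) / (1.0 + np.exp(eta)) ** 2
    if family == "geometric":
        return 1.0 / (1.0 + np.exp(eta))
    raise ValueError(f"unknown family: {family}")

eta = np.array([-0.5, 0.0, 0.5])
for fam in ("normal", "poisson", "binomial", "geometric"):
    print(fam, structural_weight(eta, fam))
```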



Extension to Canonical GLMs

For a single categorical variate, across GLMs we can:

- show that the optimal choice depends on n_r × ϕ(β̂_0);
- show that we should choose the subclass where n_r × ϕ(β̂_0) is maximal (an illustrative sketch follows below);
- show that choosing n_r = n_max is usually good;
- show that when the observed allocation is uniform (n_1 = n_2 = · · · = n_p), or near uniform, the choice of reference subclass does not matter;
- show there is an index to tell you when you need to worry about lack of uniformity.

These results extend to GLMs with multiple categorical covariates.
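The sketch below illustrates the first two bullets for a logistic model with a single categorical covariate, on the assumption (suggested by the weight table and equation (13), not stated explicitly here) that ϕ(·) denotes the structural weight evaluated at the intercept. Because that model is saturated, the fitted intercept under reference r is logit(p̂_r), so ϕ(β̂_0) = p̂_r(1 − p̂_r); the data are purely illustrative.

```python
import numpy as np

# Illustrative data: binary outcome summarised by subclass
n_sub  = np.array([20, 45, 35])                       # subclass sizes n_r
events = np.array([8, 30, 10])                        # counts of y = 1 per subclass

p_hat = events / n_sub
criterion = n_sub * p_hat * (1.0 - p_hat)             # n_r * phi(beta0_hat) per candidate r

best = int(np.argmax(criterion)) + 1
print(criterion, "-> take subclass", best, "as reference")
```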



Contrasts of Interest

Generally, such contrasts are conducted among the k regression coefficients. Then we have
\[ V(Z) = c'\,V(\beta_r)\,c = \sigma^{2} \sum_{j=1}^{k} c_j^{2}/n_j, \]
where c_0 = 0 and c'1 = 0. Then Z does not depend on β_0 and, accordingly, such contrasts are invariant to the choice of reference subclass.
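A numerical illustration of this invariance (the allocation and σ² are assumed): the variance of a typical contrast of interest, here the difference between the subclass-1 and subclass-2 means, is the same whichever subclass is used as reference, and equals σ²(1/n_1 + 1/n_2).

```python
import numpy as np

n_sub = np.array([15, 35, 50])
labels = np.repeat([1, 2, 3], n_sub)
sigma2 = 1.0

def var_mean_diff(reference):
    """Variance of the estimated mu_1 - mu_2 under the given reference coding."""
    others = [j for j in (1, 2, 3) if j != reference]
    X = np.column_stack([np.ones(labels.size)] +
                        [(labels == j).astype(float) for j in others])
    V = sigma2 * np.linalg.inv(X.T @ X)
    a = np.zeros(3)                                   # contrast in (beta_0, beta_j) terms
    coef = {j: i + 1 for i, j in enumerate(others)}
    for j, sign in ((1, 1.0), (2, -1.0)):             # mu_j = beta_0 + beta_j, beta_ref = 0
        if j != reference:
            a[coef[j]] += sign
    return float(a @ V @ a)

print([round(var_mean_diff(r), 5) for r in (1, 2, 3)])   # identical values
```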



Generalization of V-C matrix

The generalised variance-covariance matrix for GLMs is
\[
I^{-1}(\beta_0, \beta_c) = \frac{1}{n_r\,\varphi(\beta_0)}
\begin{pmatrix}
1  & -1 & -1 & \cdots & -1 \\
-1 & 1 + \tau_1 \frac{n_r}{n_1} & 1 & \cdots & 1 \\
-1 & 1 & 1 + \tau_2 \frac{n_r}{n_2} & \cdots & 1 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
-1 & 1 & 1 & \cdots & 1 + \tau_k \frac{n_r}{n_k}
\end{pmatrix},
\tag{13}
\]
where i[j] means subject i belongs to the jth category, whence x_{ji} = 1 for i in the jth category, τ_j = ϕ(β_0)/ϕ(β_0 + β_j), and n_r and n_j are the allocated numbers in the reference subclass and the jth subclass respectively, j = 1, 2, · · · , k. This matrix structure recurs in other settings (MacKenzie & Peng, 2013; Peng & MacKenzie, 2014).
