A note on logistic regression and logistic kernel machine models Ru Wang1 , Jie Peng1 , Pei Wang2 1

2

Department of Statistics, University of California, Davis, CA, 95616 Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109.

arXiv:1103.0818v1 [stat.AP] 4 Mar 2011

This is a note on logistic regression models and logistic kernel machine models. It contains derivations to some of the expressions in [1]. Logistic regression models and score tests For the ith individual (i = 1,· · · , n), let response yi be 0 if unaffected, and 1 if affected. Let Xi be a q × 1 covariates vector (including an intercept term), zi be a p × 1 vector of SNP genotypes (or summary scores) for a given gene (SNP set) under testing, and si be the environment covariate which is also included in Xi . We consider the logistic regression model with gene-environment interactions, logit(pi ) = XiT β + aT zi + si · bT zi ,

i = 1, · · · , n,

(1)

where pi = P r(yi = 1|Xi , zi ). The goal is to test the null hypothesis H0 : a = b = 0. Consider the score statistic, 22 Z(Y − µ0 ) 0 T T 0 T T SS = (Y − µ ) Z , (S · Y − S · µ ) Z I , (2) Z(S · Y − S · µ0 )

where Y = (y1 , · · · , yn )T , S = (s1 , · · · , sn )T , Z T = (z1T , · · · , znT ), and “·” stands for the element-wise multiplication. In addition, µ0 = (p01 , · · · , p0n )T ,

where p0i ’s are the fitted values of pi ’s under H0 . The information matrix −1 I 22 = (I22 − I21 I11 I12 )−1 ,

where T

T

T

I11 = XD1 X , I12 = (ZD1 X , ZD2 X ) =

T I21 ,

I22 =

ZD1 Z T ZD2 Z T

ZD2 Z T ZD2 Z T

,

and D1 =diag(p0i (1 − p0i )), D2 = diag(si · p0i (1 − p0i )). Under H0 , SS ∼ χ2(n−v) , where ν is the rank of the matrix (Z, S · Z), where S · Z := (s1 · z1T , · · · , sn · znT )T . Logistic kernel machine models Following [2, 3], we now extend (1) to a semiparametric logistic regression model logit(pi ) = XiT β + h(zi ) + si · g(zi ),

i = 1, · · · , n,

(3)

where h(·) and g(·) belong to reproducing kernel Hilbert spaces HK and HKe generated by kernels K(·, ·) and e ·), respectively. Considering penalized likelihood, h(·) and g(·) can be estimated by K(·, ) ( n X 1 1 p i 2 2 ˆ gˆ) = argmax (4) ) + log(1 − pi )) − khkHK − kgkH e . (yi log( (h, h∈HK ,g∈HK e K e 1 − pi λ λ i=1 Following [2], the above solutions have the same form as the Penalized Quasi-Likelihood estimators from the logistic mixed model: logit(pi ) = XiT β + hi + si · gi , i = 1, · · · , n, (5) 1

e ˜ and hi ’s and gi ’s are independent. Denote τ = 1/λ and τ˜ = 1/λ. where hi ∼i.i.d. Nn (0, λ1 K), gi ∼i.i.d. Nn (0, λ1˜ K), Now, testing the null hypothesis of no genetic effects H0 : h(·) = g(·) = 0 in (3) can be reformulated as testing the absence of the variance components H0 : τ = τ˜ = 0 in model (5). As in [2, 3], we consider the following test statistic based on the score statistic of (τ, τ˜): Q=

Qτ Qτ˜

=

− µ0 )T K(Y − µ0 ) , 1 0 T 0 ˜ 2 (Y − µ ) (sK)(Y − µ ) 1 2 (Y

(6)

˜ := (K(z ˜ i , zj )), sK ˜ := (si sj K ˜ ij ). Under H0 , the mean and covariance where the n × n matrices K := (K(zi , zj )), K of Q are, 1 µτ 2 tr(P0 K) , µ= = 1 ˜ µτ˜ 2 tr(P0 sK)

IQ =

στ2 στ˜τ

στ τ˜ στ2˜

=

1 2 tr(P0 KP0 K) 1 ˜ 2 tr(P0 sKP0 K)

1 ˜ 2 tr(P0 KP0 sK) 1 ˜ ˜ 2 tr(P0 sKP0 sK)

where P0 := W0 − W0 X(X T W0 X)−1 X T W0 , and W0 = diag(p0i (1 − p0i )). We then linearly transform Q to make its two components uncorrelated: ∗ Qτ − 21 ∗ = I Q = Q Q Q∗τ˜ −1

and µ∗ = IQ 2 µ. Since the components of Q∗ are quadratic forms, they can be approximated by scaled chiquare distributions κ∗τ χ2(ν ∗ ) , κ∗τ˜ χ2(ν ∗ ) , respectively [2]. Through matching the means and variances, we have kτ∗ = 1/(2µ∗τ ), τ

τ ˜

∗ ∗2 kτ∗˜ = 1/(2µ∗τ˜ ) and ντ∗ = 2µ∗2 τ , ντ˜ = 2µτ˜ . Finally, we construct a combined test statistic

Q∗max = max(

Q∗τ Q∗τ˜ , ) kτ∗ kτ∗˜

(7)

The corresponding p-value is then p-value = 1 − Fχ2 (Q∗max , ντ∗ ) · Fχ2 (Q∗max , ντ˜∗ ), where Fχ2 (·, ν) is the cumulative distribution function of a chisquare distribution with ν degrees of freedom. Note, ˜ are linear kernels, i.e., K(zi , zj ) = K(z e i , zj ) = z T zj , models (1) and (3) have the same form. when both K and K i However, they are treated differently and consequently the corresponding test statistics are different.

References [1] R Wang, J Peng, P Wang: SNP Set Analysis for Detecting Disease Association Using Exon Sequence Data. submitted to BMC proceedings (2011). [2] D Liu, D Ghosh, X Lin: Estimation and Testing for the Effect of a Genetic Parthway on a Disease Outcome using Logistic Kernel Machine Regression via Logistic Mixed Models. BMC Bioinformatics (2008), 9: 292. [3] M Wu, P Kraft, M Epstein, D Taylor, S Chanock, D Hunter, X Lin: Powerful SNP-Set Analysis for CaseControl Genome-wide Association Studies. The American Journal of Human Genetics (2010), 86:929-942.

2

2

Department of Statistics, University of California, Davis, CA, 95616 Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109.

arXiv:1103.0818v1 [stat.AP] 4 Mar 2011

This is a note on logistic regression models and logistic kernel machine models. It contains derivations to some of the expressions in [1]. Logistic regression models and score tests For the ith individual (i = 1,· · · , n), let response yi be 0 if unaffected, and 1 if affected. Let Xi be a q × 1 covariates vector (including an intercept term), zi be a p × 1 vector of SNP genotypes (or summary scores) for a given gene (SNP set) under testing, and si be the environment covariate which is also included in Xi . We consider the logistic regression model with gene-environment interactions, logit(pi ) = XiT β + aT zi + si · bT zi ,

i = 1, · · · , n,

(1)

where pi = P r(yi = 1|Xi , zi ). The goal is to test the null hypothesis H0 : a = b = 0. Consider the score statistic, 22 Z(Y − µ0 ) 0 T T 0 T T SS = (Y − µ ) Z , (S · Y − S · µ ) Z I , (2) Z(S · Y − S · µ0 )

where Y = (y1 , · · · , yn )T , S = (s1 , · · · , sn )T , Z T = (z1T , · · · , znT ), and “·” stands for the element-wise multiplication. In addition, µ0 = (p01 , · · · , p0n )T ,

where p0i ’s are the fitted values of pi ’s under H0 . The information matrix −1 I 22 = (I22 − I21 I11 I12 )−1 ,

where T

T

T

I11 = XD1 X , I12 = (ZD1 X , ZD2 X ) =

T I21 ,

I22 =

ZD1 Z T ZD2 Z T

ZD2 Z T ZD2 Z T

,

and D1 =diag(p0i (1 − p0i )), D2 = diag(si · p0i (1 − p0i )). Under H0 , SS ∼ χ2(n−v) , where ν is the rank of the matrix (Z, S · Z), where S · Z := (s1 · z1T , · · · , sn · znT )T . Logistic kernel machine models Following [2, 3], we now extend (1) to a semiparametric logistic regression model logit(pi ) = XiT β + h(zi ) + si · g(zi ),

i = 1, · · · , n,

(3)

where h(·) and g(·) belong to reproducing kernel Hilbert spaces HK and HKe generated by kernels K(·, ·) and e ·), respectively. Considering penalized likelihood, h(·) and g(·) can be estimated by K(·, ) ( n X 1 1 p i 2 2 ˆ gˆ) = argmax (4) ) + log(1 − pi )) − khkHK − kgkH e . (yi log( (h, h∈HK ,g∈HK e K e 1 − pi λ λ i=1 Following [2], the above solutions have the same form as the Penalized Quasi-Likelihood estimators from the logistic mixed model: logit(pi ) = XiT β + hi + si · gi , i = 1, · · · , n, (5) 1

e ˜ and hi ’s and gi ’s are independent. Denote τ = 1/λ and τ˜ = 1/λ. where hi ∼i.i.d. Nn (0, λ1 K), gi ∼i.i.d. Nn (0, λ1˜ K), Now, testing the null hypothesis of no genetic effects H0 : h(·) = g(·) = 0 in (3) can be reformulated as testing the absence of the variance components H0 : τ = τ˜ = 0 in model (5). As in [2, 3], we consider the following test statistic based on the score statistic of (τ, τ˜): Q=

Qτ Qτ˜

=

− µ0 )T K(Y − µ0 ) , 1 0 T 0 ˜ 2 (Y − µ ) (sK)(Y − µ ) 1 2 (Y

(6)

˜ := (K(z ˜ i , zj )), sK ˜ := (si sj K ˜ ij ). Under H0 , the mean and covariance where the n × n matrices K := (K(zi , zj )), K of Q are, 1 µτ 2 tr(P0 K) , µ= = 1 ˜ µτ˜ 2 tr(P0 sK)

IQ =

στ2 στ˜τ

στ τ˜ στ2˜

=

1 2 tr(P0 KP0 K) 1 ˜ 2 tr(P0 sKP0 K)

1 ˜ 2 tr(P0 KP0 sK) 1 ˜ ˜ 2 tr(P0 sKP0 sK)

where P0 := W0 − W0 X(X T W0 X)−1 X T W0 , and W0 = diag(p0i (1 − p0i )). We then linearly transform Q to make its two components uncorrelated: ∗ Qτ − 21 ∗ = I Q = Q Q Q∗τ˜ −1

and µ∗ = IQ 2 µ. Since the components of Q∗ are quadratic forms, they can be approximated by scaled chiquare distributions κ∗τ χ2(ν ∗ ) , κ∗τ˜ χ2(ν ∗ ) , respectively [2]. Through matching the means and variances, we have kτ∗ = 1/(2µ∗τ ), τ

τ ˜

∗ ∗2 kτ∗˜ = 1/(2µ∗τ˜ ) and ντ∗ = 2µ∗2 τ , ντ˜ = 2µτ˜ . Finally, we construct a combined test statistic

Q∗max = max(

Q∗τ Q∗τ˜ , ) kτ∗ kτ∗˜

(7)

The corresponding p-value is then p-value = 1 − Fχ2 (Q∗max , ντ∗ ) · Fχ2 (Q∗max , ντ˜∗ ), where Fχ2 (·, ν) is the cumulative distribution function of a chisquare distribution with ν degrees of freedom. Note, ˜ are linear kernels, i.e., K(zi , zj ) = K(z e i , zj ) = z T zj , models (1) and (3) have the same form. when both K and K i However, they are treated differently and consequently the corresponding test statistics are different.

References [1] R Wang, J Peng, P Wang: SNP Set Analysis for Detecting Disease Association Using Exon Sequence Data. submitted to BMC proceedings (2011). [2] D Liu, D Ghosh, X Lin: Estimation and Testing for the Effect of a Genetic Parthway on a Disease Outcome using Logistic Kernel Machine Regression via Logistic Mixed Models. BMC Bioinformatics (2008), 9: 292. [3] M Wu, P Kraft, M Epstein, D Taylor, S Chanock, D Hunter, X Lin: Powerful SNP-Set Analysis for CaseControl Genome-wide Association Studies. The American Journal of Human Genetics (2010), 86:929-942.

2