A note on logistic regression and logistic kernel machine models

Ru Wang¹, Jie Peng¹, Pei Wang²

¹ Department of Statistics, University of California, Davis, CA, 95616
² Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109

arXiv:1103.0818v1 [stat.AP] 4 Mar 2011

This is a note on logistic regression models and logistic kernel machine models. It contains derivations of some of the expressions in [1].

Logistic regression models and score tests

For the $i$th individual ($i = 1, \cdots, n$), let the response $y_i$ be 0 if unaffected and 1 if affected. Let $X_i$ be a $q \times 1$ vector of covariates (including an intercept term), $z_i$ be a $p \times 1$ vector of SNP genotypes (or summary scores) for a given gene (SNP set) under testing, and $s_i$ be the environment covariate, which is also included in $X_i$. We consider the logistic regression model with gene-environment interactions,

$$\mathrm{logit}(p_i) = X_i^T \beta + a^T z_i + s_i \cdot b^T z_i, \qquad i = 1, \cdots, n, \tag{1}$$

where $p_i = \Pr(y_i = 1 \mid X_i, z_i)$. The goal is to test the null hypothesis $H_0 : a = b = 0$. Consider the score statistic,

$$SS = \left( (Y - \mu^0)^T Z^T,\; (S \cdot Y - S \cdot \mu^0)^T Z^T \right) I^{22} \begin{pmatrix} Z(Y - \mu^0) \\ Z(S \cdot Y - S \cdot \mu^0) \end{pmatrix}, \tag{2}$$

where $Y = (y_1, \cdots, y_n)^T$, $S = (s_1, \cdots, s_n)^T$, $Z^T = (z_1^T, \cdots, z_n^T)$, and "$\cdot$" stands for element-wise multiplication. In addition,

$$\mu^0 = (p_1^0, \cdots, p_n^0)^T,$$

where the $p_i^0$'s are the fitted values of the $p_i$'s under $H_0$. The information matrix

$$I^{22} = (I_{22} - I_{21} I_{11}^{-1} I_{12})^{-1},$$

where

$$I_{11} = X D_1 X^T, \qquad I_{12} = (Z D_1 X^T,\; Z D_2 X^T) = I_{21}^T, \qquad I_{22} = \begin{pmatrix} Z D_1 Z^T & Z D_2 Z^T \\ Z D_2 Z^T & Z D_2 Z^T \end{pmatrix},$$

and $D_1 = \mathrm{diag}(p_i^0 (1 - p_i^0))$, $D_2 = \mathrm{diag}(s_i \cdot p_i^0 (1 - p_i^0))$. Under $H_0$, $SS \sim \chi^2_{(n-\nu)}$, where $\nu$ is the rank of the matrix $(Z, S \cdot Z)$ and $S \cdot Z := (s_1 \cdot z_1^T, \cdots, s_n \cdot z_n^T)^T$.

Logistic kernel machine models

Following [2, 3], we now extend (1) to a semiparametric logistic regression model

$$\mathrm{logit}(p_i) = X_i^T \beta + h(z_i) + s_i \cdot g(z_i), \qquad i = 1, \cdots, n, \tag{3}$$
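Before turning to the kernel machine version, the score statistic (2) for the parametric model (1) can be sketched numerically. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `fit_null_logistic` and `score_statistic` are hypothetical, the arrays are stored with individuals in rows (so the code's `Z` is the $n \times p$ matrix with rows $z_i^T$ and `X` is $n \times q$), the environment covariate $s_i$ is assumed binary (so that $s_i^2 = s_i$ and the lower-right block of $I_{22}$ equals $Z D_2 Z^T$ as written above), and a pseudo-inverse is used in place of the inverse to tolerate rank deficiency.

```python
import numpy as np

def fit_null_logistic(X, y, n_iter=25):
    """Newton-Raphson fit of logit(p_i) = X_i^T beta; returns the fitted p_i^0."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = p * (1.0 - p)
        # Newton step: beta += (X^T W X)^{-1} X^T (y - p)
        beta = beta + np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - p))
    return 1.0 / (1.0 + np.exp(-(X @ beta)))

def score_statistic(X, Z, s, y):
    """Score statistic (2). X: n x q, Z: n x p (rows z_i^T), s: 0/1 vector, y: 0/1 vector."""
    mu0 = fit_null_logistic(X, y)
    d1 = mu0 * (1.0 - mu0)                 # diagonal of D_1
    G = np.hstack([Z, s[:, None] * Z])     # n x 2p: columns for a, then for b
    U = G.T @ (y - mu0)                    # stacked score vector
    I11 = X.T @ (d1[:, None] * X)
    I21 = G.T @ (d1[:, None] * X)
    I22 = G.T @ (d1[:, None] * G)          # for binary s this is the block matrix I_22
    V = I22 - I21 @ np.linalg.solve(I11, I21.T)
    stat = float(U @ np.linalg.pinv(V) @ U)   # pinv is an assumption for singular V
    return stat, int(np.linalg.matrix_rank(G))

# Toy data (simulated under H0, purely illustrative).
rng = np.random.default_rng(0)
n, p = 200, 3
s = rng.integers(0, 2, size=n).astype(float)
X = np.column_stack([np.ones(n), s])       # intercept plus the environment covariate
Z = rng.integers(0, 3, size=(n, p)).astype(float)
y = rng.binomial(1, 0.4, size=n).astype(float)
stat, nu = score_statistic(X, Z, s, y)
print(stat, nu)
```

The second return value is the rank $\nu$ of the stacked genotype matrix. The statistic itself is unchanged if the genotype columns are rescaled, which is a quick sanity check on the block algebra.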

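where $h(\cdot)$ and $g(\cdot)$ belong to reproducing kernel Hilbert spaces $\mathcal{H}_K$ and $\mathcal{H}_{\tilde K}$ generated by kernels $K(\cdot, \cdot)$ and $\tilde K(\cdot, \cdot)$, respectively. Considering penalized likelihood, $h(\cdot)$ and $g(\cdot)$ can be estimated by

$$(\hat h, \hat g) = \underset{h \in \mathcal{H}_K,\; g \in \mathcal{H}_{\tilde K}}{\mathrm{argmax}} \left\{ \sum_{i=1}^{n} \left( y_i \log\!\left( \frac{p_i}{1 - p_i} \right) + \log(1 - p_i) \right) - \frac{1}{\lambda} \|h\|^2_{\mathcal{H}_K} - \frac{1}{\tilde\lambda} \|g\|^2_{\mathcal{H}_{\tilde K}} \right\}. \tag{4}$$

Following [2], the above solutions have the same form as the Penalized Quasi-Likelihood estimators from the logistic mixed model:

$$\mathrm{logit}(p_i) = X_i^T \beta + h_i + s_i \cdot g_i, \qquad i = 1, \cdots, n, \tag{5}$$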

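where $(h_1, \cdots, h_n)^T \sim N_n(0, \frac{1}{\lambda} K)$, $(g_1, \cdots, g_n)^T \sim N_n(0, \frac{1}{\tilde\lambda} \tilde K)$, and the $h_i$'s and $g_i$'s are independent. Denote $\tau = 1/\lambda$ and $\tilde\tau = 1/\tilde\lambda$. Now, testing the null hypothesis of no genetic effects $H_0 : h(\cdot) = g(\cdot) = 0$ in (3) can be reformulated as testing the absence of the variance components $H_0 : \tau = \tilde\tau = 0$ in model (5). As in [2, 3], we consider the following test statistic based on the score statistic of $(\tau, \tilde\tau)$:

$$Q = \begin{pmatrix} Q_\tau \\ Q_{\tilde\tau} \end{pmatrix} = \begin{pmatrix} \frac{1}{2} (Y - \mu^0)^T K (Y - \mu^0) \\ \frac{1}{2} (Y - \mu^0)^T (s\tilde K) (Y - \mu^0) \end{pmatrix}, \tag{6}$$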

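where the $n \times n$ matrices $K := (K(z_i, z_j))$, $\tilde K := (\tilde K(z_i, z_j))$, and $s\tilde K := (s_i s_j \tilde K_{ij})$. Under $H_0$, the mean and covariance of $Q$ are

$$\mu = \begin{pmatrix} \mu_\tau \\ \mu_{\tilde\tau} \end{pmatrix} = \begin{pmatrix} \frac{1}{2} \mathrm{tr}(P_0 K) \\ \frac{1}{2} \mathrm{tr}(P_0\, s\tilde K) \end{pmatrix}, \qquad I_Q = \begin{pmatrix} \sigma^2_\tau & \sigma_{\tau\tilde\tau} \\ \sigma_{\tilde\tau\tau} & \sigma^2_{\tilde\tau} \end{pmatrix} = \begin{pmatrix} \frac{1}{2} \mathrm{tr}(P_0 K P_0 K) & \frac{1}{2} \mathrm{tr}(P_0 K P_0\, s\tilde K) \\ \frac{1}{2} \mathrm{tr}(P_0\, s\tilde K P_0 K) & \frac{1}{2} \mathrm{tr}(P_0\, s\tilde K P_0\, s\tilde K) \end{pmatrix},$$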

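where $P_0 := W_0 - W_0 X (X^T W_0 X)^{-1} X^T W_0$, and $W_0 = \mathrm{diag}(p_i^0 (1 - p_i^0))$. We then linearly transform $Q$ to make its two components uncorrelated:

$$Q^* = \begin{pmatrix} Q^*_\tau \\ Q^*_{\tilde\tau} \end{pmatrix} = I_Q^{-1/2} Q,$$

and $\mu^* = I_Q^{-1/2} \mu$. Since the components of $Q^*$ are quadratic forms, they can be approximated by scaled chi-square distributions $\kappa^*_\tau \chi^2_{(\nu^*_\tau)}$ and $\kappa^*_{\tilde\tau} \chi^2_{(\nu^*_{\tilde\tau})}$, respectively [2]. Through matching the means and variances, we have $\kappa^*_\tau = 1/(2\mu^*_\tau)$, $\kappa^*_{\tilde\tau} = 1/(2\mu^*_{\tilde\tau})$ and $\nu^*_\tau = 2\mu^{*2}_\tau$, $\nu^*_{\tilde\tau} = 2\mu^{*2}_{\tilde\tau}$. Finally, we construct a combined test statistic

$$Q^*_{\max} = \max\left( \frac{Q^*_\tau}{\kappa^*_\tau},\; \frac{Q^*_{\tilde\tau}}{\kappa^*_{\tilde\tau}} \right). \tag{7}$$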
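The scale and degrees-of-freedom constants entering (7) can be checked directly. The following is a sketch of the moment-matching step, assuming only that the covariance of $Q$ under $H_0$ is $I_Q$ and using the standard mean and variance of a chi-square variable; it applies to either component, so the subscripts $\tau$ and $\tilde\tau$ are dropped.

```latex
% Whitening: Cov(Q^*) = I_Q^{-1/2} I_Q I_Q^{-1/2} = I_2,
% so each component of Q^* has variance 1 and mean \mu^*.
\begin{align*}
E\big[\kappa^* \chi^2_{(\nu^*)}\big] &= \kappa^* \nu^* = \mu^*, &
\mathrm{Var}\big[\kappa^* \chi^2_{(\nu^*)}\big] &= 2 \kappa^{*2} \nu^* = 1 \\
\Rightarrow\quad \kappa^* &= \frac{1}{2\mu^*}, &
\nu^* &= \frac{\mu^*}{\kappa^*} = 2 \mu^{*2}.
\end{align*}
```

Dividing each component of $Q^*$ by its own $\kappa^*$ therefore places both on an unscaled chi-square footing before the maximum is taken.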

The corresponding p-value is then

$$\text{p-value} = 1 - F_{\chi^2}(Q^*_{\max}, \nu^*_\tau) \cdot F_{\chi^2}(Q^*_{\max}, \nu^*_{\tilde\tau}),$$

where $F_{\chi^2}(\cdot, \nu)$ is the cumulative distribution function of a chi-square distribution with $\nu$ degrees of freedom. Note that when both $K$ and $\tilde K$ are linear kernels, i.e., $K(z_i, z_j) = \tilde K(z_i, z_j) = z_i^T z_j$, models (1) and (3) have the same form. However, they are treated differently, and consequently the corresponding test statistics are different.
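The kernel machine test from (6) through (7) and the p-value formula can be sketched end to end as follows. This is a hedged illustration rather than the authors' code: the function name `kernel_machine_test` and the returned dictionary keys are hypothetical, linear kernels are used for both $K$ and $\tilde K$ purely for concreteness, `X` and `Z` are stored with individuals in rows, and the symmetric inverse square root of $I_Q$ is computed by eigendecomposition (one of several valid choices). The sketch raises an error if a transformed mean is not positive, since the chi-square matching is then undefined; the note itself does not discuss that case. In the toy example, the null fit uses the fact that a logistic model with an intercept and one binary covariate reproduces the within-group proportions.

```python
import numpy as np
from scipy.stats import chi2

def kernel_machine_test(X, Z, s, y, mu0):
    """Combined statistic (7) and its p-value, with linear kernels K = K~ = Z Z^T.

    X: n x q covariates, Z: n x p genotypes (rows z_i^T), s: environment
    covariate, y: 0/1 responses, mu0: fitted probabilities under H0.
    """
    K = Z @ Z.T                              # K_ij = z_i^T z_j
    sK = np.outer(s, s) * K                  # (s K~)_ij = s_i s_j K~_ij
    r = y - mu0
    Q = 0.5 * np.array([r @ K @ r, r @ sK @ r])            # equation (6)

    w = mu0 * (1.0 - mu0)                    # diagonal of W_0
    WX = w[:, None] * X
    P0 = np.diag(w) - WX @ np.linalg.solve(X.T @ WX, WX.T)
    A, B = P0 @ K, P0 @ sK
    mu = 0.5 * np.array([np.trace(A), np.trace(B)])
    IQ = 0.5 * np.array([[np.trace(A @ A), np.trace(A @ B)],
                         [np.trace(B @ A), np.trace(B @ B)]])

    # Symmetric inverse square root of I_Q via its eigendecomposition.
    evals, evecs = np.linalg.eigh(IQ)
    IQ_inv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Q_star, mu_star = IQ_inv_half @ Q, IQ_inv_half @ mu
    if np.any(mu_star <= 0):
        raise ValueError("moment matching needs positive transformed means")

    kappa = 1.0 / (2.0 * mu_star)            # scale from moment matching
    nu = 2.0 * mu_star ** 2                  # degrees of freedom
    q_max = float(np.max(Q_star / kappa))    # equation (7)
    p_value = float(1.0 - chi2.cdf(q_max, nu[0]) * chi2.cdf(q_max, nu[1]))
    return {"q_max": q_max, "p_value": p_value,
            "kappa": kappa, "nu": nu, "mu_star": mu_star}

# Toy data (simulated under H0, purely illustrative).
rng = np.random.default_rng(1)
n, p = 200, 3
s = rng.integers(0, 2, size=n).astype(float)
X = np.column_stack([np.ones(n), s])
Z = rng.integers(0, 3, size=(n, p)).astype(float)
y = rng.binomial(1, 0.4, size=n).astype(float)
# Null fit for X = [1, s] with binary s: the within-group proportions of y.
mu0 = np.where(s == 1, y[s == 1].mean(), y[s == 0].mean())
out = kernel_machine_test(X, Z, s, y, mu0)
print(out["q_max"], out["p_value"])
```

The product of the two chi-square distribution functions mirrors the p-value formula above, which treats the two transformed components as if they were independent once they have been made uncorrelated.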

References

[1] R Wang, J Peng, P Wang: SNP Set Analysis for Detecting Disease Association Using Exon Sequence Data. Submitted to BMC Proceedings (2011).
[2] D Liu, D Ghosh, X Lin: Estimation and Testing for the Effect of a Genetic Pathway on a Disease Outcome using Logistic Kernel Machine Regression via Logistic Mixed Models. BMC Bioinformatics (2008), 9:292.
[3] M Wu, P Kraft, M Epstein, D Taylor, S Chanock, D Hunter, X Lin: Powerful SNP-Set Analysis for Case-Control Genome-wide Association Studies. The American Journal of Human Genetics (2010), 86:929-942.
