Modified estimators in semiparametric regression

Journal of Statistical Computation and Simulation

ISSN: 0094-9655 (Print) 1563-5163 (Online) Journal homepage: http://www.tandfonline.com/loi/gscs20

Modified estimators in semiparametric regression models with right-censored data Dursun Aydin & Ersin Yilmaz To cite this article: Dursun Aydin & Ersin Yilmaz (2018): Modified estimators in semiparametric regression models with right-censored data, Journal of Statistical Computation and Simulation, DOI: 10.1080/00949655.2018.1439032 To link to this article: https://doi.org/10.1080/00949655.2018.1439032

Published online: 20 Feb 2018.



Modified estimators in semiparametric regression models with right-censored data

Dursun Aydin and Ersin Yilmaz
Faculty of Science, Department of Statistics, Mugla Sitki Kocman University, Muğla, Turkey

ABSTRACT
In this work we introduce different modified estimators for the vector parameter β and an unknown regression function g in semiparametric regression models when censored response observations are replaced with synthetic data points. The main idea is to study the effects of several covariates on a response variable censored on the right by a random censoring variable with an unknown probability distribution. To provide the estimation procedure for the estimators, we extend the conventional methodology to censored semiparametric regression using different smoothing methods such as smoothing spline (SS), kernel smoothing (KS), and regression spline (RS). In addition to estimating the parameters of the semiparametric model, we also provide a bootstrap technique to make inference on the parameters. A simulation study is carried out to show the performance and efficiency properties of the estimators and analyse the effects of the different censoring levels. Finally, the performance of the estimators is evaluated on a real right-censored data set.

ARTICLE HISTORY
Received 26 September 2017; Accepted 6 February 2018

KEYWORDS
Modified estimators; smoothing spline; kernel smoothing; regression spline; right-censored data

1. Introduction

In the statistics literature, a problem commonly faced by statisticians is the analysis of censored survival data. Examples of such data arise in different applied fields such as medicine, biology, public health, epidemiology, engineering, economics, and demography. Observations in these fields are usually incomplete, especially in medical studies. For instance, some patients may still be alive, or disease-free, at the termination of a medical study. There are two main traditional statistical approaches, parametric and nonparametric, used in the analysis of the relationship between the covariates and the censored response variable, known as the lifetime. Instead, we focus on semiparametric estimation methods, which do not require knowledge of the underlying distribution of the response variable. Let {(yi, xi, ti), i = 1, 2, ..., n} be independent and identically distributed (i.i.d.) random variables satisfying the semiparametric regression model

yi = xi β + g(ti) + εi,  (1)

CONTACT Ersin Yilmaz, Faculty of Science, Department of Statistics, Mugla Sitki Kocman University, Muğla 48000, Turkey. © 2018 Informa UK Limited, trading as Taylor & Francis Group


where the yi's are values of the response variable, xi = (xi1, ..., xip) are known p-vectors of explanatory variables with p ≤ n, the ti's are values of an extra univariate explanatory variable, β = (β1, β2, ..., βp) is an unknown vector of regression parameters, g is an unknown smooth regression function in R, and the εi's are random error terms with E[εi | xi, ti] = 0 and variance σ², independent of the data (xi, ti). This model is also called a partially linear model because of its connection with the classical linear model. In vector and matrix form, model (1) can be rewritten as

Y = Xβ + g + ε,  (2)

where Y = (y1, ..., yn)', X = (x1, ..., xn)', g = (g(t1), ..., g(tn))' and ε = (ε1, ..., εn)'. As in most of the literature, it is supposed that X has full column rank. Note also that the function g represents the smooth part of the model and is assumed to capture the unparameterized functional relationship. Model (1) was discussed by Engle et al. [1] under the assumption of completely observed yi. For more details on model (1), see Wahba [2,3], Speckman [4], and Green and Silverman [5], among others. In our study, we are interested in estimating the vector parameter β and the unknown function g(.) in the semiparametric model (1) when the yi's are observed incompletely and right censored by a random censoring variable ci, i = 1, 2, ..., n, while the xi and ti are observed completely. Therefore, instead of observing yi, we observe the pairs (zi, δi), i = 1, ..., n, where

zi = min(yi, ci)  and  δi = { 1, if yi ≤ ci (ith response is observed); 0, if yi > ci (ith response is censored) }.  (3)

Here, we assume that the ci's are i.i.d. random variables with common distribution G (i.e. the distribution of the censoring observations ci). Notice also that zi and ci are referred to as the observed lifetime and the censoring time, respectively, for the ith survival subject: the zi's are the observed lifetimes, while δi stores the information on whether an observation is censored or uncensored. If an observation is not censored, zi = yi and δi = 1; otherwise, zi = ci and δi = 0. Model (1) is also said to be a right-censored semiparametric regression model when the response variable is observed incompletely and right censored by a random variable. Useful special cases of the censored partially linear model can be obtained under different conditions. For example, if g(t) = 0 in Equation (1), the censored partially linear model reduces to the linear regression model with censored data.
A number of authors have studied the case of the linear regression model with censored data. Examples of such studies include Miller [6], Buckley and James [7], Koul et al. [8], Zheng [9], Leurgans [10] and Lai et al. [11]. On the other hand, if β = 0 in model (1), the mentioned model reduces to the nonparametric regression model with censored data. See studies by Dabrowska [12], Zheng [13], Fan and Gijbels [14], Guessoum and Ould-Saïd [15] for examples. In this paper, the key idea is to estimate the vector parameter β, an unknown function g (.), and the mean vector μ = Xβ + g. Because the values of Y are the censored observations, the censoring distribution G is usually unknown. For this reason, traditional methods for estimating β and g (.) cannot be applied directly here. To overcome this problem, Koul et al. [8] suggested replacing incomplete observations with synthetic data. In our


context, we introduce different modified estimators for the components of the semiparametric model with right-censored data, especially when the censored response variable is replaced by synthetic data. The modified estimators are based on a generalization of the ordinary SS, KS, and RS methods to the case of an unknown censoring distribution G. We also provide bootstrapped confidence intervals for the estimators. It is worth noting that several authors have studied semiparametric regression with censored data. For example, the censored partially linear model in which the censoring distribution is supposed to be known is considered by Qin and Lawless [16] and Qin and Cai [17]. Asymptotic properties of the right-censored semiparametric regression model are discussed by Wang and Zheng [18]. In the last decade, Qin and Jing [19] discussed the asymptotic properties of estimators of partially linear models with censored data. Orbe et al. [20] examined censored partial regression and proposed an estimation procedure based on penalized weighted least squares, using Kaplan–Meier weights. The rest of the paper is outlined as follows. In Section 2, we present the censored partial regression model and provide the details of the estimation procedure. Section 4 describes a method that uses bootstrap resampling techniques to make inferences. Section 6 provides some simulation results that support the adequacy of the methodology in different situations. Section 5 presents the results of the application to real data and, finally, Section 7 presents the conclusions.

2. Preliminaries and computation of estimators

Let F and G be the probability distribution functions of yi and ci, respectively. That is, the unknown distribution function of the response is F(k) = P(yi ≤ k) and that of the censoring times is G(k) = P(ci ≤ k). In order to ensure that the model is identifiable, one needs to make some specification assumptions on the response, the censoring and the explanatory variables and their dependence relationships. For this purpose, we have followed the identifiability assumptions of Stute [21,22]:

Assumption A: yi and ci are independent, conditional on (xi, ti).
Assumption B: P(yi ≤ ci | yi, xi, ti) = P(yi ≤ ci | yi).

These assumptions are common in survival analysis applications. If we use the Kaplan–Meier estimator, assumption A is a standard independence condition that ensures identifiability of the model with censored data. If assumption A is violated, then we need more information about the censoring structure to construct a proper model. Assumption B is required to allow for a dependency between (xi, ti) and ci. In other words, assumption B says that, given the time of death, the covariates provide no further information as to whether the observation is censored or not. See Stute [22–24], Tsiatis [25], Wei et al. [26] and Zhou [27] for additional details on the assumptions of lifetime data analysis. As expressed in the previous section, a formal connection between a linear and a partially linear model can be constructed through a right-censored response variable y. If g(t) = 0 in model (1), this model transforms into the linear model

yi = xi β + εi,  i = 1, ..., n.  (4)

Note that we assume only the response variable is censored. Since the yi ’s are observed incompletely, standard methods which require knowledge of the completely observed yi ’s cannot be used for the censored data. Under censorship, rather than a random variable


yi, we observe {(zi, δi), i = 1, 2, ..., n}, as defined in (3). Under these conditions, Koul et al. [8] showed that if G is continuous and known, it is possible to adjust the lifetime observations zi to yield the unbiased modification

yiG = δi zi / (1 − G(zi)),  i = 1, 2, ..., n.  (5)

The above assumptions A and B are also used to ensure that E[yiG | xi] = E[yi | xi] = xi β. Hence, the ordinary least squares (OLS) estimator of β in model (4) is defined by

β̂ = (X'X)⁻¹ X'YG,  (6)

where YG = (y1G, ..., ynG)' is the n × 1 vector of adjusted responses. However, in most applications G is unknown. To overcome this problem, Koul et al. [8] proposed replacing G in Equation (6) by its Kaplan–Meier [28] estimator Ĝ, given by

1 − Ĝ(k) = ∏_{i=1}^{n} [(n − i)/(n − i + 1)]^{I[z(i) ≤ k, δ(i) = 0]},  k ≥ 0,  (7)

where z(1) ≤ z(2) ≤ ... ≤ z(n) are the ordered observations of z and δ(i) is the censoring indicator associated with z(i). Hence, a feasible estimator of β can be obtained as

β̂_OLS = (X'X)⁻¹ X'Y_Ĝ,  (8)

where Y_Ĝ = (y1Ĝ, ..., ynĜ)' with

yiĜ = δi zi / (1 − Ĝ(zi)),  i = 1, ..., n.  (9)

Here, it should be noted that the yiĜ's are also called synthetic observations, since these values are synthesized from the data (zi, δi) to fit the semiparametric model E[yiĜ | xi, ti] = xi β + g(ti). In this case, in a similar fashion to the linear model based on synthetic data, assumptions A and B ensure that E[yiĜ | xi, ti] = E[yi | xi, ti] = xi β + g(ti). A number of authors have studied synthetic data in dealing with the censored data problem (see, for example, Zheng [29], Leurgans [10], Qin and Jing [19], Delecroix et al. [30], Guessoum and Ould-Saïd [15], and Lemdani and Ould-Saïd [31]). The generalization of the well-known censored linear model to the partially linear model censored from the right will now be stated and discussed. As can be seen from (1), partially linear models combine parametric and nonparametric components, so they are much more flexible than ordinary linear models. In this study, we adapt three different methods to fit (1) when the response variable is replaced by synthetic data. The first is the smoothing spline method, which estimates the vector of parameters β from partial residuals together with the unknown function g. The second is kernel smoothing, based on the Nadaraya [32] and Watson [33] kernel-weighted average. The third is the regression spline method, which is based on penalized spline functions.
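The synthetic-data construction in (7)–(9) can be sketched in a few lines. The helper names `synthetic_responses` and `modified_ols` are ours, and the Kaplan–Meier product of Equation (7) is implemented directly for the no-ties case.

```python
import numpy as np

def synthetic_responses(z, delta):
    """Synthetic observations y_iG-hat of Eq. (9), with 1 - G-hat the
    Kaplan-Meier estimator (7) of the censoring survival function
    (censoring events are the observations with delta == 0)."""
    z, delta = np.asarray(z, dtype=float), np.asarray(delta)
    n = len(z)
    order = np.argsort(z, kind="stable")
    i = np.arange(1, n + 1)
    # the factor (n-i)/(n-i+1) enters only at censored ordered observations
    factors = np.where(delta[order] == 0, (n - i) / (n - i + 1.0), 1.0)
    surv = np.empty(n)
    surv[order] = np.cumprod(factors)        # 1 - G-hat(z_(i))
    # y_iG = delta_i * z_i / (1 - G-hat(z_i)); censored points contribute 0
    return np.where(delta == 1, z / np.maximum(surv, np.finfo(float).tiny), 0.0)

def modified_ols(X, z, delta):
    """Feasible OLS estimator of Eq. (8) based on the synthetic responses."""
    yG = synthetic_responses(z, delta)
    return np.linalg.solve(X.T @ X, X.T @ yG)
```

For example, with z = (2, 3, 5) and δ = (1, 0, 1), the censored middle point contributes a factor 1/2 to 1 − Ĝ, so the synthetic responses are (2, 0, 10): uncensored observations are inflated to compensate for the mass lost to censoring.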


2.1. Smoothing spline

We first introduce the penalized least squares estimates of β and g in model (2) with right-censored data. Let the ordered distinct values among t1, t2, ..., tn be denoted by r1 < r2 < ... < rq. The connection between t1, t2, ..., tn and r1, r2, ..., rq is provided via the n × q incidence matrix N, with elements Nij = 1 if ti = rj and Nij = 0 otherwise. It can be seen that q ≥ 2 from the assumption that the ti's are not all identical. Let g = (g(r1), ..., g(rq))' = (a1, a2, ..., aq)' be the vector of function values at the distinct points. Then, the estimates of β and g, based on the synthetic observations (9), are obtained by minimizing the penalized residual sum of squares criterion

f1(β; g) = (Y_Ĝ − Xβ − Ng)'(Y_Ĝ − Xβ − Ng) + λ ∫_a^b {g''(t)}² dt.  (10)

The first term on the right-hand side measures the goodness of fit to the data, while the second term penalizes curvature in the function g. Note that the curvature (or smoothness of the function) is controlled by the smoothing parameter λ > 0. The idea behind penalized least squares is to obtain a minimum through the combination of the goodness-of-fit and penalty terms. Using the properties of cubic splines, the resulting value of the penalty is g'Kg = ∫{g''(t)}² dt (see [5]). Thus, criterion (10) can be rewritten as

f2(β; g) = (Y_Ĝ − Xβ − Ng)'(Y_Ĝ − Xβ − Ng) + λ g'Kg,  (11)

where K is a symmetric q × q positive-definite penalty matrix satisfying λK = S_λ⁻¹ − I, and S_λ is the well-known positive-definite linear smoother matrix depending on λ, as defined in Equation (16). The elements of K are obtained by means of the knot points r1, ..., rq, with K = Q'R⁻¹Q, where, writing hj = rj+1 − rj, j = 1, 2, ..., q − 1, Q is a tri-diagonal (q − 2) × q matrix with entries Qjj = 1/hj, Qj,j+1 = −(1/hj + 1/hj+1), Qj,j+2 = 1/hj+1, and R is a symmetric tri-diagonal (q − 2) × (q − 2) matrix with Rj−1,j = Rj,j−1 = hj/6 and Rjj = (hj + hj+1)/3. Simple algebra shows that the solution to the minimization problem f2(β; g) in Equation (11) satisfies the block matrix equation:

( X'X      X'N       ) ( β )   ( X' )
( N'X   N'N + λK ) ( g ) = ( N' ) Y_Ĝ.  (12)
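The block system (12) can also be solved numerically. The sketch below assumes the ti are distinct and sorted (so N = I) and, for brevity, substitutes a second-order difference penalty D'D for the cubic-spline matrix K = Q'R⁻¹Q, so it is an illustration under those assumptions rather than the paper's exact estimator.

```python
import numpy as np

def solve_block_system(X, yG, lam):
    """Solve the block system (12) for (beta, g), assuming the t_i are
    distinct and sorted (N = I) and using a second-order difference
    penalty K = D'D as a stand-in for the cubic-spline penalty."""
    n, p = X.shape
    D = np.zeros((n - 2, n))
    for j in range(n - 2):
        D[j, j:j + 3] = [1.0, -2.0, 1.0]
    K = D.T @ D
    N = np.eye(n)
    # assemble the (p + n) x (p + n) coefficient matrix of Eq. (12)
    lhs = np.vstack([np.hstack([X.T @ X, X.T @ N]),
                     np.hstack([N.T @ X, N.T @ N + lam * K])])
    rhs = np.concatenate([X.T @ yG, N.T @ yG])
    sol = np.linalg.solve(lhs, rhs)
    return sol[:p], sol[p:]          # beta_hat, g_hat
```

Eliminating g from the second block row recovers the closed forms stated next, so the block solve and the explicit estimators agree.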

For some constant λ > 0, the corresponding estimators of β and g, based on model (2) with censored data, can be easily obtained as (the subscript SS denotes smoothing spline)

β̂_SS = [X'(I − S_λ)X]⁻¹ X'(I − S_λ)Y_Ĝ  (13)

and

ĝ_SS = (N'N + λK)⁻¹ N'(Y_Ĝ − X β̂_SS).  (14)

Using Equations (13)–(14), the vector of fitted values is

μ̂_SS = X β̂_SS + ĝ_SS = H_λ^SS Y_Ĝ  (= Ê[Y|X, t]),  (15)

with

H_λ^SS = S_λ + (I − S_λ)X[X'(I − S_λ)X]⁻¹ X'(I − S_λ),  (16)

where S_λ = N(N'N + λK)⁻¹N'. Specifically, if the ti's are distinct and ordered, then the incidence matrix N = I; in this case, the matrix S_λ reduces to S_λ = (I + λK)⁻¹. Note also that β̂_SS and ĝ_SS are called the modified smoothing spline regression estimators of the vectors β and g, respectively. Details on the derivation of Equations (13)–(16) can be found in Appendix A1.

2.2. Kernel smoothing

As shown in the previous sections, when the response variable is censored by a random variable, model (1) reduces to the censored model. For simplicity, we use the yiĜ defined in Section 2. Before starting, let us define εiĜ = yiĜ − (xi β + g(ti)), i = 1, ..., n. From this, we have

yiĜ = xi β + g(ti) + εiĜ,  i = 1, ..., n,  (17)

where the εiĜ's are identically distributed but not independent random error terms with unknown constant variance. Conceptually, as n → ∞, E(εĜ) ≈ 0. This information will help us to define estimates of β and g. For convenience, we first assume that β is known. In this case, the relationship between (yiĜ − xi β) and ti can be written as

(yiĜ − xi β) = g(ti) + εiĜ,  i = 1, ..., n.

This can be considered as equivalent to the first part of (17). Then, kernel smoothing can be used as an alternative nonparametric approach to splines to obtain a reasonable estimate of the function g(.). In analogy with (14), this leads to the Nadaraya–Watson estimator proposed by Nadaraya [32] and Watson [33] (the subscript KS denotes kernel smoothing)

ĝ_KS = Σ_{j=1}^{n} w_{jλ}(t)(y_jĜ − x_j β) = W_λ(Y_Ĝ − Xβ),  (18)

where W_λ is a kernel smoother matrix with entries w_{jλ}, given by

w_{jλ}(t) = K((t − t_j)/λ) / Σ_{j=1}^{n} K((t − t_j)/λ),  (19)

where λ is a smoothing (bandwidth) parameter, as noted earlier, and K(u) is a kernel or weight function such that ∫K(u)du = 1 and K(u) = K(−u). The kernel function is selected to give the most weight to observations close to t and the least weight to observations far from t. While the kernel K(u) determines the shape of the regression curves, the smoothing parameter λ determines their width. In this study, the selection of λ


is obtained by minimizing the generalized cross-validation (GCV) criterion (see Craven and Wahba [34]). Using the matrix and vector form of Equation (17), we can obtain the following partial residuals in matrix form,

ε̃_Ĝ = Y_Ĝ − Xβ − ĝ_KS = (I − W_λ)(Y_Ĝ − Xβ) = Ỹ_Ĝ − X̃β,  (20)

where

X̃ = (I − W_λ)X,  with rows  x̃_i = x_i − Σ_{j=1}^{n} w_{jλ}(t_i) x_j,

Ỹ_Ĝ = (I − W_λ)Y_Ĝ,  with elements  ỹ_iĜ = y_iĜ − Σ_{j=1}^{n} w_{jλ}(t_i) y_jĜ.
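The paper cites Craven and Wahba [34] for the GCV criterion but does not display it; the standard form for a linear smoother with hat matrix H_λ is GCV(λ) = n||(I − H_λ)Y_Ĝ||² / [tr(I − H_λ)]², which can be sketched generically as:

```python
import numpy as np

def gcv(H, y):
    """Standard generalized cross-validation score for a linear smoother
    y_hat = H y:  n * ||(I - H) y||^2 / (trace(I - H))^2."""
    n = len(y)
    resid = y - H @ y
    return n * np.sum(resid ** 2) / np.trace(np.eye(n) - H) ** 2
```

In practice one evaluates this score over a grid of candidate bandwidths and keeps the minimizer as λ.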

Thus, we obtain a transformed set of data based on kernel residuals. Considering these partial residuals for the vector β yields the following weighted least squares criterion:

f3(β) = ||(I − W_λ)(Y_Ĝ − Xβ)||².  (21)

The solution minimizing f3(β) in Equation (21) is

β̂_KS = (Σ_{i=1}^{n} x̃_i x̃_i')⁻¹ Σ_{i=1}^{n} x̃_i ỹ_iĜ = (X̃'X̃)⁻¹ X̃'Ỹ_Ĝ.  (22)

Replacing β in Equation (18) with β̂_KS from Equation (22), we obtain an estimator of g,

ĝ_KS = Σ_{j=1}^{n} w_{jλ}(t)(y_jĜ − x_j β̂_KS) = W_λ(Y_Ĝ − X β̂_KS),  (23)

and hence, from Equations (22) and (23), the vector of fitted values is

μ̂_KS = X β̂_KS + ĝ_KS = H_λ^KS Y_Ĝ  (= Ê[Y|X, t]),  (24)

where

H_λ^KS = W_λ + (I − W_λ)X(X'(I − W_λ)'(I − W_λ)X)⁻¹ X'(I − W_λ)'(I − W_λ).  (25)
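A sketch of the modified kernel smoothing estimators (22)–(23); the Gaussian kernel is one admissible choice, since the paper leaves K(u) generic, and the helper names are ours.

```python
import numpy as np

def kernel_weights(t, lam):
    """Nadaraya-Watson smoother matrix W_lambda of Eq. (19),
    here with a Gaussian kernel; each row sums to one."""
    u = (t[:, None] - t[None, :]) / lam
    Kmat = np.exp(-0.5 * u ** 2)
    return Kmat / Kmat.sum(axis=1, keepdims=True)

def modified_ks(X, yG, t, lam):
    """Modified kernel estimators of Eqs. (22)-(23)."""
    W = kernel_weights(t, lam)
    M = np.eye(len(t)) - W
    Xt, yt = M @ X, M @ yG                         # partial residuals, Eq. (20)
    beta = np.linalg.solve(Xt.T @ Xt, Xt.T @ yt)   # Eq. (22)
    g = W @ (yG - X @ beta)                        # Eq. (23)
    return beta, g
```

The parametric part is fitted by least squares on the kernel partial residuals, and the nonparametric part smooths what remains, mirroring the two-step logic of the smoothing spline estimators.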

The vectors β̂_KS and ĝ_KS are known as the modified kernel regression estimators of the vectors β and g, respectively. The implementation details of Equations (22)–(25) are given in Appendix A2.

2.3. Regression spline

Smoothing spline becomes less practical when the sample size n is large, because it uses n knot points. A regression spline is a piecewise polynomial function whose highest-order non-zero derivative takes jumps at fixed 'knots'. Usually, regression splines are smoothed by deleting non-essential knots. Once the knots are selected, a regression spline can be fitted


by ordinary least squares. For further discussion on the selection of knots, see Ruppert et al. [35]. As noted in previous sections, we fit the partially linear model (1) with randomly right-censored data. For this purpose, regression spline can be used as an alternative approach to the others described above. By using the synthetic data in Equation (9), we estimate the components of model (1) so that the sum of squared differences between the censored response observations yiĜ and (xi β + g(ti)) is minimized. Here, the unknown regression function g(ti) is approximated by a qth-degree regression spline with the truncated power basis

g(ti) = b0 + b1 ti + ... + bq ti^q + Σ_{k=1}^{K} b_{q+k}(ti − κk)_+^q + εi,  i = 1, 2, ..., n,  (26)

where b = (b0, b1, ..., bq, b_{q+1}, ..., b_{q+K})' is a vector of unknown coefficients to be estimated, q ≥ 1 is an integer indicating the degree of the regression spline, and (ti − κk)_+^q = (ti − κk)^q when (ti − κk) > 0 and 0 otherwise. Also, κ1 < ... < κK are the specifically selected knots {min(ti) ≤ κ1
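As an illustration of the truncated power basis in (26), a design matrix can be assembled as follows; the degree and knot values in the example are arbitrary choices for the sketch, and the coefficient vector b would then be obtained by least squares on the synthetic responses.

```python
import numpy as np

def tpb_design(t, knots, q=3):
    """Design matrix with columns 1, t, ..., t^q and (t - kappa_k)_+^q
    for each knot kappa_k, following the truncated power basis of Eq. (26)."""
    t = np.asarray(t, dtype=float)
    cols = [t ** d for d in range(q + 1)]                  # polynomial part
    cols += [np.clip(t - k, 0.0, None) ** q for k in knots]  # truncated powers
    return np.column_stack(cols)
```

For instance, with q = 1 and a single knot at 0.4, the row for t = 1 is (1, 1, 0.6), since (1 − 0.4)₊ = 0.6.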