Change-point Estimation via Empirical Likelihood for a Segmented Linear Regression

Zhihua Liu and Lianfen Qian

Department of Mathematical Science, Florida Atlantic University, Boca Raton, FL 33431, USA

Abstract

For a segmented regression system with an unknown change-point over two domains of a predictor, a new empirical likelihood ratio statistic is proposed to test the null hypothesis of no change. Under the null hypothesis, the proposed test statistic is shown empirically to be asymptotically Gumbel distributed, with location and scale parameters that are robust against various parameter settings and error distributions. Under the alternative hypothesis with a change-point, the test statistic is used to estimate the change point between the two domains. The power analysis shows that the proposed test is tractable. An empirical example analyzing the plasma osmolality data is given.

Keywords: Empirical likelihood ratio, Gumbel extreme value distribution, segmented linear regression, change-point.

1 Introduction

In the classical regression setting, the regression model is usually assumed to be of a single parametric form on the whole domain of predictors. However, a piecewise regression model is used to show that the parameters of the model can be different on different domains of the predictors. In the last thirty years, a considerable body of techniques has been developed for hypothesis testing, parameter estimation and related computer programs for detecting change points in piecewise regression models.

∗ Corresponding author. Email: [email protected]

One special and commonly used piecewise regression model is the two-phase linear regression model, whose regression function is piecewise linear. More precisely, let $Y$ be the response variable and $X$ be a univariate predictor so that $(X, Y)$ is a bivariate random vector with $E|Y| < \infty$. Suppose that $\{(X_i, Y_i)\}_{i=1}^{n}$ is a sequence of independent observations of $(X, Y)$ satisfying the following model:

$$Y_i = (\alpha_0 + \alpha_1 X_i)\, I(X_i \le \tau) + (\beta_0 + \beta_1 X_i)\, I(X_i > \tau) + e_i, \tag{1}$$

where $\{e_i\}_{i=1}^{n}$ are independent error terms with mean zero. Let $\{X_{(i)}\}_{i=1}^{n}$ be the order statistics of $\{X_i\}_{i=1}^{n}$. If there is an unknown time $k^*$ such that $X_{(k^*)} \le \tau < X_{(k^*+1)}$, then we call $k^*$ the time of the change and $\tau$ the change point.

Two-phase linear regression models have been applied widely in diverse research areas. For example, in environmental sciences, Piegorsch and Bailer [20] illustrate the usefulness of two-phase linear regression models with a series of examples in their Section 2.2. Lund and Reeves [14] use the model to detect undocumented change points. Qian and Ryu [24] fit the model with termite survival as $Y$ and tree resin dosage as $X$. In the biological sciences, Vieth [31] applies the model to determine the osmotic threshold by fitting arginine vasopressin (AVP) concentration against plasma osmolality in plasma of conscious dogs. In medical science, Smith and Cook [28] use a piecewise linear regression model to fit some renal transplant data. Other applications can be found in epidemiology (Ulm [30], Pastor and Guallar [19]), software engineering (Qian and Yao [23]), econometrics (Chow [3], Koul and Qian [13], Fiteni [8], Zeileis [32]) and so on.

Hawkins [10] classifies the two-phase linear regression model (1) into two types: continuous and discontinuous. The model is continuous if the regression function is continuous at the change point $\tau$; that is, the change point $\tau$ satisfies the following equation:

$$\alpha_0 + \alpha_1 \tau = \beta_0 + \beta_1 \tau. \tag{2}$$
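As an illustration of model (1) under the continuity constraint (2), the following minimal Python sketch generates data from a segmented line. The function names, the seed and the choice of a standard normal predictor with normal errors are assumptions made for the example only; constraint (2) is used to eliminate $\beta_0$, since it implies $\beta_0 = \alpha_0 + (\alpha_1 - \beta_1)\tau$.

```python
import random

def segmented_mean(x, alpha0, alpha1, beta1, tau):
    """Mean response of model (1) when the continuity constraint (2) holds."""
    # Constraint (2) fixes the right-phase intercept.
    beta0 = alpha0 + (alpha1 - beta1) * tau
    if x <= tau:
        return alpha0 + alpha1 * x
    return beta0 + beta1 * x

def simulate(n, alpha0, alpha1, beta1, tau, sigma, seed=0):
    """Draw n ordered predictor values and responses; also return the time of the change."""
    rng = random.Random(seed)
    xs = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
    ys = [segmented_mean(x, alpha0, alpha1, beta1, tau) + rng.gauss(0.0, sigma)
          for x in xs]
    # k* is the number of ordered predictor values that do not exceed tau.
    k_star = sum(1 for x in xs if x <= tau)
    return xs, ys, k_star

xs, ys, k_star = simulate(n=50, alpha0=1.0, alpha1=1.0, beta1=3.0, tau=0.0, sigma=0.1)
print(k_star)
```

Here the change in slope is `beta1 - alpha1`, and `k_star` plays the role of $k^*$ for the ordered sample.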

If equation (2) is not satisfied, the model is discontinuous. The continuous model is also called the segmented linear regression model (Feder [6, 7]).

Before applying the model (1), it is usual to test for the existence of a change point. There are two existing types of likelihood based approaches: the Schwarz Information Criterion (SIC) method (Chen [2]) and the classical parametric likelihood approach (Quandt [25, 26]). The SIC, proposed by Schwarz [27], is a model selection criterion defined as

$$\mathrm{SIC} = -2 \log L(\hat\theta) + k \log n,$$

where $\hat\theta$ is the maximum likelihood estimator of the parameter vector, $L(\hat\theta)$ is the maximized likelihood function, $k$ is the number of free parameters in the model, and $n$ is the sample size. Chen [2] turns the hypothesis testing task into a model selection process by applying the SIC method; see Section 3 for more details.

The classical parametric likelihood approach was first proposed by Quandt [25, 26] to detect the presence of a change point in a simple linear regression model. Quandt assumes that the error terms $\{e_i\}_{i=1}^{n}$ are normally and independently distributed with mean zero and standard deviations $\sigma_1$ if $i \le k^*$ and $\sigma_2$ if $i > k^*$. The likelihood ratio test statistic is

$$\Lambda = \max_{3 \le k \le n-3} \{\lambda(k)\}, \quad \text{with} \quad \lambda(k) = -2 \log \left( \frac{\hat\sigma_1^{\,k}(k)\, \hat\sigma_2^{\,n-k}(k)}{\hat\sigma^{\,n}} \right),$$

where $\hat\sigma$ is the estimator of the standard deviation of the errors for simple linear regression based on all observations; $\hat\sigma_1(k)$ and $\hat\sigma_2(k)$ are the estimators of $\sigma_1$ and $\sigma_2$ for fixed $k$, respectively. Large values of $\Lambda$ suggest the existence of a change point.

Quandt [25] conjectures that the asymptotic distribution of $\lambda(k)$ is $\chi^2_4$ under the null hypothesis of no change ($H_0$) for all $k$ between 2 and $n - 2$. Under $H_0$, assuming $\sigma_1 = \sigma_2$ for the segmented (continuous two-phase) linear regression model, Hinkley [11] and [12] claim that the asymptotic distribution of $\Lambda$ is $\chi^2_1$ and $\chi^2_3$, respectively. Feder [6, 7] comments that the distribution of $\Lambda$ is not asymptotically $\chi^2$. However, Feder indicates that there is evidence for the existence of a limiting distribution of the likelihood ratio. Lund and Reeves [14] point out that the components of $\{\lambda(k)\}$ are not independent. In fact, $\lambda(k)$ and $\lambda(k-1)$ are correlated for a fixed $k$. This correlation makes the proof of the asymptotic distribution difficult. Lund and Reeves conjecture that the asymptotic distribution of $\Lambda$ is related to the Gumbel extreme value distribution.

In this paper, we address the aforementioned asymptotic distribution problem using a recently developed nonparametric empirical likelihood approach. Empirical likelihood (EL), a nonparametric data-driven technique, was first proposed by Owen [17]. EL employs the likelihood function without specifically assuming the distribution of the data. It incorporates side information, through constraints or a prior distribution, which maximizes the efficiency of the method (Owen [18]). First, we propose an EL based test statistic for testing the null hypothesis of no change. Through simulation studies, we have confirmed that, under the null hypothesis, Lund and Reeves's conjecture holds for the empirical likelihood based method, although the original conjecture, made for the classical likelihood method, remains unsolved. Then, if the null hypothesis is rejected, we construct an EL based estimator of the change point for the model (1) under the continuity constraint (2).

The rest of the paper is organized as follows. Section 2 proposes the empirical likelihood ratio test statistic for the segmented linear regression model and defines the estimator of the time of the change if it exists. Section 3 shows empirically that the limiting null distribution of the proposed test statistic is the Gumbel extreme value distribution. We observe that the location and scale parameters of this asymptotic distribution are insensitive to different settings of the parameter vector and to changes in the error distribution. The critical values for various significance levels, and the size and the power performance of the test, are reported. Furthermore, a comparison between the proposed EL based method and Chen's [2] Schwarz information criterion (SIC) method is conducted. Section 4 presents an empirical example analyzing the plasma osmolality data using the proposed ELR method.

2 EL Ratio Test and its Computing Algorithm

Assuming a known change point, Dong [5] derives an empirical likelihood type Wald (ELw) statistic to test the equality of two coefficient vectors from two linear regression models. To be more precise, let $\hat\alpha$ and $\hat\beta$ be the least squares estimators of the two regression coefficient vectors. Under the normality assumption of the errors, Dong's ELw test statistic has the form

$$\mathrm{EL}_w = (\hat\alpha - \hat\beta)^\top \left[ \tilde\sigma_1^2 (X_1^\top X_1)^{-1} + \tilde\sigma_2^2 (X_2^\top X_2)^{-1} \right]^{-1} (\hat\alpha - \hat\beta),$$

where $X_i$ is the design matrix and $\tilde\sigma_i^2$ is the EL estimator of $\sigma_i^2$, the variance of the errors, for the $i$th regression model ($i = 1, 2$). Dong concludes that the ELw test is asymptotically $\chi^2_p$ distributed under the null hypothesis $H_0 : \alpha = \beta \in R^p$.

Instead of assuming a known change point, we first derive an empirical likelihood based test statistic for testing the null hypothesis of no change. If a change point does exist, we construct the EL based estimator for the time of the change ($k^*$), and hence for the change point $\tau$. Unlike the method used by Dong, we neither require the assumption of normality on the errors nor need the time of the change between the two phases to be known. However, we do require that the two phases be continuous at $\tau \in [X_{(k^*)}, X_{(k^*+1)})$. We address the following two important research issues for model (1) with continuity constraint (2):

• To test simple linear regression versus two-phase linear regression with one single unknown change point.
• To estimate the time of the change if it exists.

Let $\alpha = (\alpha_0, \alpha_1)^\top$ and $\beta = (\beta_0, \beta_1)^\top$ in model (1). Then the test of no change is equivalent to the test $H_0 : \alpha = \beta$. Throughout the rest of the paper, we assume that $\{X_i\}_{i=1}^{n}$ are already ordered. For a fixed $k$, we can separate the data into two groups: $\{(X_i, Y_i)\}_{i=1}^{k}$ and $\{(X_i, Y_i)\}_{i=k+1}^{n}$. For each group, we fit a simple linear regression by the ordinary least squares (OLS) method. Let $\hat\alpha(k) = (\hat\alpha_0(k), \hat\alpha_1(k))^\top$ be the OLS estimator of $\alpha$ computed from $\{(X_i, Y_i)\}_{i=1}^{k}$ and $\hat\beta(k) = (\hat\beta_0(k), \hat\beta_1(k))^\top$ be the OLS estimator of $\beta$ computed from $\{(X_i, Y_i)\}_{i=k+1}^{n}$. Then, the estimated errors are

$$\hat e_i(k) = \begin{cases} Y_i - [\hat\alpha_0(k) + \hat\alpha_1(k) X_i], & i = 1, \dots, k; \\ Y_i - [\hat\beta_0(k) + \hat\beta_1(k) X_i], & i = k+1, \dots, n. \end{cases}$$

Under the null hypothesis $H_0 : \alpha = \beta$, $\hat\alpha_0(k)$ and $\hat\beta_0(k)$ should be close to each other, and similarly for $\hat\alpha_1(k)$ and $\hat\beta_1(k)$. Therefore, we propose to switch the roles of the estimated regression coefficient vectors in estimating the errors for these two phases. That is, the estimated errors under $H_0$ can be represented as follows:

$$\tilde e_i(k) = \begin{cases} Y_i - [\hat\beta_0(k) + \hat\beta_1(k) X_i], & i = 1, \dots, k; \\ Y_i - [\hat\alpha_0(k) + \hat\alpha_1(k) X_i], & i = k+1, \dots, n. \end{cases} \tag{3}$$
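The switched residuals of equation (3) can be sketched in a few lines of Python. The helper names `ols_line` and `switched_residuals` are hypothetical, and the sketch assumes that the predictor values are sorted and that each group contains at least two distinct values.

```python
def ols_line(xs, ys):
    """Return (intercept, slope) of the least squares line through the points."""
    m = len(xs)
    xbar = sum(xs) / m
    ybar = sum(ys) / m
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def switched_residuals(xs, ys, k):
    """Residuals of equation (3): each phase is evaluated on the other phase's line."""
    a0, a1 = ols_line(xs[:k], ys[:k])   # left-phase fit, alpha-hat(k)
    b0, b1 = ols_line(xs[k:], ys[k:])   # right-phase fit, beta-hat(k)
    left = [y - (b0 + b1 * x) for x, y in zip(xs[:k], ys[:k])]
    right = [y - (a0 + a1 * x) for x, y in zip(xs[k:], ys[k:])]
    return left + right

# A noiseless kink at x = 2: the left line is y = 0, the right line is y = x - 2.
print(switched_residuals([0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
                         [0.0, 0.0, 0.0, 1.0, 2.0, 3.0], 3))
```

In this toy example all switched residuals are non-negative, which shows how a change in slope pushes the residuals away from mean zero, while under a single line they would scatter around zero.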

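Notice that under $H_0$, $E[\tilde e_i(k)] = 0$ for all $k$. Following Owen [17], when a change does occur, we should reject $H_0$ if the empirical likelihood ratio (ELR)

$$R(k) = \sup \left\{ \prod_{i=1}^{n} n w_i \;\middle|\; \sum_{i=1}^{n} w_i \tilde e_i(k) = 0,\ w_i \ge 0,\ \sum_{i=1}^{n} w_i = 1 \right\} \tag{4}$$

is small. The corresponding logarithm of ELR is

$$-2 \log R(k) = -2 \sup \left\{ \sum_{i=1}^{n} \log(n w_i) \;\middle|\; \sum_{i=1}^{n} w_i \tilde e_i(k) = 0,\ w_i \ge 0,\ \sum_{i=1}^{n} w_i = 1 \right\}. \tag{5}$$

By the Lagrange multiplier method, we write

$$G = \sum_{i=1}^{n} \log(n w_i) - n\lambda \sum_{i=1}^{n} w_i \tilde e_i(k) - \gamma \left( \sum_{i=1}^{n} w_i - 1 \right),$$

where $\lambda$ and $\gamma$ are Lagrange multipliers. Take the derivative of $G$ with respect to $w_i$ and set it equal to zero; multiplying the resulting equation by $w_i$ and summing over $i$ gives $\gamma = n$, and solving for $w_i$ yields

$$w_i(k) = \frac{1}{n[1 + \lambda \tilde e_i(k)]}.$$

It follows that

$$-2 \log R(k, \lambda) = 2 \sum_{i=1}^{n} \log[1 + \lambda \tilde e_i(k)].$$

Define the score function

$$\phi(k, \lambda) = \frac{-2\, \partial \log R(k, \lambda)}{2\, \partial \lambda} = \sum_{i=1}^{n} \frac{\tilde e_i(k)}{1 + \lambda \tilde e_i(k)}. \tag{6}$$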
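For a given set of residuals, the root of the score function (6) can be found numerically. The sketch below is a minimal Python illustration using bisection; it is not the `el.test` routine, and the function name `neg2_log_elr` and the safety margin are assumptions. It relies on two facts that follow from the formulas above: the weights are positive only when $1 + \lambda \tilde e_i(k) > 0$ for all $i$, and $\phi(k, \lambda)$ is strictly decreasing in $\lambda$ on that interval.

```python
import math

def neg2_log_elr(e, iterations=200):
    """-2 log R-hat for the constraint sum(w_i * e_i) = 0, via bisection on the score (6)."""
    lo_e, hi_e = min(e), max(e)
    if not (lo_e < 0.0 < hi_e):
        # Zero lies outside the range of the residuals: the constraint set in (4)
        # is empty, so R = 0 and the statistic is infinite.
        return math.inf
    # Admissible multipliers satisfy 1 + lam * e_i > 0; keep a small margin
    # so that every logarithm and division below stays well defined.
    margin = 1.0 - 1e-10
    a, b = -margin / hi_e, -margin / lo_e
    for _ in range(iterations):
        mid = 0.5 * (a + b)
        score = sum(ei / (1.0 + mid * ei) for ei in e)
        if score > 0.0:
            a = mid   # score is decreasing, so the root lies to the right
        else:
            b = mid
    lam = 0.5 * (a + b)
    return 2.0 * sum(math.log(1.0 + lam * ei) for ei in e)

# For e = (-1, 2) the score equation gives lambda = 1/4, hence 2 * log(1.125).
print(neg2_log_elr([-1.0, 2.0]))
```

When the residuals already average to zero, the root is $\lambda = 0$ and the statistic vanishes, which matches the interpretation that small values support $H_0$.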

Then, we have the profile logarithm of ELR for a fixed $k$:

$$-2 \log \hat R(k) = -2 \log R(k, \hat\lambda),$$

where $\hat\lambda$ is determined by $\phi(k, \hat\lambda) = 0$.

Notice that the true time of the change $k^*$ is unidentifiable under the null hypothesis $H_0$. Large values of $-2 \log \hat R(k)$ are evidence in favor of the two-sided alternative hypothesis. Therefore, we propose the following test statistic:

$$M_n = \max_{3 \le k \le n-3} \{-2 \log \hat R(k)\}. \tag{7}$$

When $-2 \log \hat R(k)$ is small for each possible $k$, $M_n$ will be small, as is the case under the null hypothesis. If a change occurs at $k^*$, then $-2 \log \hat R(k^*)$ and $M_n$ should be statistically large, and thus we reject $H_0$. Notice that $-2 \log \hat R(k)$ is an asymptotic $\chi^2_1$ statistic for each fixed $k$ and the components of $\{-2 \log \hat R(k)\}_{k=3}^{n-3}$ are not independent. In this paper, we show, through a simulation study, that the limiting null distribution of $Z_n = \sqrt{M_n}$ is the Gumbel extreme value distribution, which is similar to the parametric likelihood ratio result; see Csörgő and Horváth [4].

If the null hypothesis is rejected, we need to estimate the change point by maximizing $-2 \log \hat R(k)$. Simulation studies show that $-2 \log \hat R(k)$ is sensitive to outliers when $k$ is too small or too close to the sample size $n$. This phenomenon also exists for the parametric likelihood ratio approach. To overcome this, we adopt the "trimmed" test statistic defined below:

$$M_n' = \max_{L \le k \le U} \{-2 \log \hat R(k)\}, \tag{8}$$

where the choice of $L$ and $U$ is arbitrary. Values of $L$ ranging from $[\ln n]$ to $n^{1/2}$ have been used in the literature for the parametric likelihood ratio approach. For the empirical likelihood method, we have tested various trimming portions. For the sample sizes used in the simulation, tail portions that are too small do not work well. Hence in this paper, we choose $L = [\ln n]^2$ and $U = n - L$, where $[x]$ means the smallest integer not less than $x$. Thus the empirical likelihood estimator of $k^*$ is defined by

$$\hat k^* = \min \{k : M_n' = -2 \log \hat R(k)\} \tag{9}$$

and hence the empirical likelihood estimator of $\tau$ is defined as $\hat\tau = X_{\hat k^*}$.

To test whether $H_0$ is true, we need to compute $M_n$. Without loss of generality, let the ordered predictor values be $X_1 \le X_2 \le \dots \le X_n$. The algorithm for computing $M_n$ contains the following steps:

1. For a fixed $k$, $k = 3, 4, \dots, n-3$, split the data into two groups referred to as the left-phase group $\{(X_i, Y_i)\}_{i=1}^{k}$ and the right-phase group $\{(X_i, Y_i)\}_{i=k+1}^{n}$.
2. For each $k$, fit a linear model to each group to obtain $\hat\alpha_0(k)$, $\hat\alpha_1(k)$ from the left-phase group and $\hat\beta_0(k)$, $\hat\beta_1(k)$ from the right-phase group.
3. Calculate $\tilde e_i(k) = Y_i - [\hat\beta_0(k) + \hat\beta_1(k) X_i]$ for $i = 1, \dots, k$ and $\tilde e_i(k) = Y_i - [\hat\alpha_0(k) + \hat\alpha_1(k) X_i]$ for $i = k+1, \dots, n$.
4. Use $\{\tilde e_i(k)\}_{i=1}^{n}$ as the input to the el.test function in the R package emplik to compute $-2 \log \hat R(k)$.
5. For each possible $k$, repeat steps 1 to 4 to obtain the sequence $\{-2 \log \hat R(k)\}_{k=3}^{n-3}$. The maximum of this sequence is $M_n$. If $Z_n = \sqrt{M_n}$ is larger than the critical value $G_\alpha$, where $P(Z_n \ge G_\alpha) \le \alpha$, $H_0$ is rejected.

If we reject the null hypothesis of no change, the corresponding minimum argument of $M_n'$, $\hat k^*$, is the EL estimator of $k^*$.

Remark: We note that the mean response in a segmented regression model is a single linear piece under the null hypothesis. Hence, step 3 uses the single linear piece property to calculate the residuals, while step 4 combines the residuals from step 3 and sets the expected value of the overall residuals equal to zero through el.test.
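The five steps above, together with the trimmed estimator (8)-(9), can be sketched in Python as follows. This is an illustrative stand-in rather than the authors' R program: the helper `neg2_log_elr` (a bisection solver for the score equation) replaces `el.test`, all function names are hypothetical, and $[x]$ is taken to be the ceiling. When zero falls outside the range of the switched residuals the constraint set is empty and the sketch returns infinity; how `el.test` behaves in that situation is not reproduced here, so numerical values need not match those reported in the paper.

```python
import math

def fit_line(xs, ys):
    """Least squares (intercept, slope); assumes at least two distinct x values."""
    m = len(xs)
    xbar, ybar = sum(xs) / m, sum(ys) / m
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    return ybar - slope * xbar, slope

def neg2_log_elr(e, iterations=200):
    """Step 4 stand-in: -2 log R-hat for the mean-zero constraint on e."""
    lo_e, hi_e = min(e), max(e)
    if not (lo_e < 0.0 < hi_e):
        return math.inf
    a, b = -(1.0 - 1e-10) / hi_e, -(1.0 - 1e-10) / lo_e
    for _ in range(iterations):
        mid = 0.5 * (a + b)
        if sum(ei / (1.0 + mid * ei) for ei in e) > 0.0:
            a = mid
        else:
            b = mid
    lam = 0.5 * (a + b)
    return 2.0 * sum(math.log(1.0 + lam * ei) for ei in e)

def profile_stat(xs, ys, k):
    """Steps 1-4 for one k: split, fit both phases, switch the fits, apply EL."""
    a0, a1 = fit_line(xs[:k], ys[:k])
    b0, b1 = fit_line(xs[k:], ys[k:])
    e = [y - (b0 + b1 * x) for x, y in zip(xs[:k], ys[:k])]
    e += [y - (a0 + a1 * x) for x, y in zip(xs[k:], ys[k:])]
    return neg2_log_elr(e)

def el_change_point(xs, ys):
    """Return (Z_n, k-hat, tau-hat) for data already sorted by x."""
    n = len(xs)
    stats = {k: profile_stat(xs, ys, k) for k in range(3, n - 2)}  # k = 3..n-3
    z_n = math.sqrt(max(0.0, max(stats.values())))                 # step 5
    low = math.ceil(math.log(n)) ** 2                              # L = [ln n]^2
    up = n - low                                                   # U = n - L
    trimmed = {k: v for k, v in stats.items() if low <= k <= up}
    if not trimmed:
        raise ValueError("trimming range [L, U] is empty for this sample size")
    m_trim = max(trimmed.values())
    k_hat = min(k for k, v in trimmed.items() if v == m_trim)      # equation (9)
    return z_n, k_hat, xs[k_hat - 1]
```

A call such as `el_change_point(xs, ys)` returns the statistic to compare with a critical value $G_\alpha$, followed by the estimated time of the change and the estimated change point. Computing all profile values once and then restricting to $[L, U]$ keeps the test statistic (7) and the trimmed estimator (8) consistent with each other.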

3 Empirical Distribution of $Z_n$

The algorithm presented above enables us to conduct simulation studies to show that the empirical distribution of $Z_n$ under the null hypothesis is the Gumbel extreme value distribution, a subfamily of the Generalized Extreme Value (GEV) distribution. The GEV distribution has the following cumulative distribution function:

$$F(x; \mu, \sigma, \zeta) = \exp \left\{ - \left[ 1 + \zeta \left( \frac{x - \mu}{\sigma} \right) \right]^{-1/\zeta} \right\}, \quad \text{for } x \in R \text{ and } 1 + \zeta (x - \mu)/\sigma > 0,$$

where $\mu \in R$ is the location parameter, $\sigma > 0$ is the scale parameter and $\zeta \in R$ is the shape parameter. The shape parameter $\zeta$ governs the tail behavior of the distribution. When $\zeta \to 0$, the limiting distribution of the GEV distribution is the Gumbel (G) extreme value distribution, given by

$$F_G(x; \mu, \sigma) = \exp \left\{ - \exp \left[ \frac{-(x - \mu)}{\sigma} \right] \right\}. \tag{10}$$
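Inverting equation (10) gives the Gumbel quantile $x_p = \mu - \sigma \log(-\log p)$, so a critical value at level $\alpha$ is the quantile at $p = 1 - \alpha$. The short Python sketch below illustrates this; the function names are hypothetical, and the location and scale used in the example are the values reported in Table 1 for $n = 30$ with normal errors.

```python
import math

def gumbel_cdf(x, mu, sigma):
    """Gumbel distribution function of equation (10)."""
    return math.exp(-math.exp(-(x - mu) / sigma))

def gumbel_quantile(p, mu, sigma):
    """Inverse of equation (10) for 0 < p < 1."""
    return mu - sigma * math.log(-math.log(p))

def critical_value(alpha, mu, sigma):
    """Upper-alpha critical value: P(Z >= value) = alpha under the Gumbel law."""
    return gumbel_quantile(1.0 - alpha, mu, sigma)

for alpha in (0.10, 0.05, 0.01):
    print(alpha, round(critical_value(alpha, 8.511, 2.453), 3))
```

With these inputs the three values agree to about three decimals with the $n = 30$, normal-error row of Table 2 (14.031, 15.797 and 19.795).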

For a Gumbel extreme value distribution, the mean and the variance are $\mu + \sigma a$ and $\sigma^2 \pi^2 / 6$, respectively, where $a$ is the Euler-Mascheroni constant 0.57721.... We use the fgev function in the R package evd to estimate the location $\mu$, the scale $\sigma$ and the critical values of $F_G$. The function fgev fits the GEV distribution by maximum likelihood to estimate $\mu$, $\sigma$ and $\zeta$. We obtain estimates of $\mu$ and $\sigma$ for the Gumbel extreme value distribution by setting $\zeta = 0$.

Four simulation studies are reported in this section. We simulate 1000 samples with the sample values of the random predictor $X$ generated from $N(0, 1)$. The sample size $n$ ranges from 30 to 500.

Simulation I tests the robustness of the proposed EL based test statistic and computes the critical values of $Z_n$ for the most popular nominal levels 0.10, 0.05 and 0.01, under the null hypothesis of no change. Data are generated from the simple linear regression

$$Y_i = \gamma_0 + \gamma_1 X_i + e_i, \quad i = 1, \dots, n \tag{11}$$

with $\gamma = (\gamma_0, \gamma_1)^\top = (1, 1)^\top$, and three types of error terms are considered: (i) Normal errors: $\{e_i\}_{i=1}^{n} \sim N(0, 0.1^2)$; (ii) Log-normal errors (heavy-tailed): $\{e_i\}_{i=1}^{n} \sim \log N(0, 0.1^2)$; and (iii) Non-homogeneous errors: $\{e_i\}_{i=1}^{[n/2]} \sim N(0, 0.1^2)$ and $\{e_i\}_{i=[n/2]+1}^{n} \sim N(0, 1.0^2)$.

Table 1 shows that the estimated location and scale parameters of the asymptotic distribution increase as the sample size increases. More importantly, one observes that both the location and scale parameters are robust against changes to the distributions of the errors. This is consistent with the nonparametric nature of the empirical likelihood method.

Table 1: Robustness analysis for the estimated location and scale parameters, $\mu$ and $\sigma$ respectively, of the limiting distribution of $Z_n$ under three types of error distributions.

         (i) Normal         (ii) Log-normal     (iii) Non-homogeneous
  n      µ        σ         µ        σ          µ        σ
  30     8.511    2.453     8.531    2.624      8.299    3.181
  50     12.169   3.338     12.212   3.277      11.554   3.981
  100    18.625   4.535     18.424   4.464      17.674   5.744
  500    46.000   9.105     46.950   9.301      44.985   10.064

The histograms and Q-Q plots of $Z_n$ for the 1000 replicates are shown in Figures 2-4, corresponding to the three types of error distributions. The left panels show the histograms of $Z_n$, where the solid line represents the estimated Gumbel density and the dashed line represents the kernel density estimate with a Gaussian kernel, computed using the density function in R. The right panels show the Q-Q plots against the Gumbel distribution. The critical values of $Z_n$ at the nominal levels $\alpha = 0.10$, 0.05 and 0.01 are obtained as the corresponding quantiles of the estimated Gumbel extreme value distribution. The critical values of $Z_n$ are reported in Table 2.

Table 2: The critical values of $Z_n$ with $\alpha = 0.10$, 0.05, and 0.01.

         (i) Normal                  (ii) Log-normal             (iii) Non-homogeneous
  n      α=0.10   0.05     0.01      α=0.10   0.05     0.01      α=0.10   0.05     0.01
  30     14.031   15.797   19.795    14.436   16.325   20.602    15.457   17.747   22.932
  50     19.681   22.084   27.524    19.586   21.945   27.287    20.513   23.378   29.867
  100    28.830   32.095   39.487    28.470   31.683   38.959    30.600   34.735   44.097
  500    66.490   73.044   87.884    67.880   74.575   89.735    67.632   74.877   91.280

Simulation II was performed to show that the asymptotic distribution is robust against changes in the setting of the parameter vector $\gamma = (\gamma_0, \gamma_1)^\top$ in model (11). This simulation was carried out with five different settings of $\gamma$, a sample size of $n = 100$ and $\{e_i\}_{i=1}^{n} \sim N(0, 0.1^2)$. Table 3 indicates that the estimated location and scale parameters are robust to the settings of $\gamma$.

Simulation III was carried out to compare the performance of the proposed EL based estimator and Chen's SIC estimator of the true time of the change $k^*$. Chen's SIC under $H_0$ is

$$\mathrm{SIC}(n) = -2 \log L_0(\hat\gamma, \hat\sigma^2) + 3 \log n,$$

where $L_0(\hat\gamma, \hat\sigma^2)$ is the maximized likelihood function under the null hypothesis of no change, $\hat\gamma$ is the estimator of $\gamma$ and $\hat\sigma$ is the estimator of the standard deviation of the errors in model (11). For $k$ ranging from 2 to $n - 2$, Chen's SIC under $H_1$ is

$$\mathrm{SIC}(k) = -2 \log L_1(\hat\alpha_0(k), \hat\alpha_1(k), \hat\beta_0(k), \hat\beta_1(k), \hat\sigma^2) + 5 \log n,$$

where $L_1(\hat\alpha_0(k), \hat\alpha_1(k), \hat\beta_0(k), \hat\beta_1(k), \hat\sigma^2)$ is the maximized likelihood function under $H_1$. Therefore, the decision rule for selecting one of the $n - 3$ regression models is: select the model with no change if $\mathrm{SIC}(n) \le \mathrm{SIC}(k)$ for all $k$; select a model with a change at $\hat k^*$ if $\mathrm{SIC}(\hat k^*) = \min\{\mathrm{SIC}(k) : 2 \le k \le n - 2\} < \mathrm{SIC}(n)$.
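The SIC decision rule can be illustrated with a short Python sketch. It assumes normal errors with a common variance in both phases, so that the maximized likelihood satisfies $-2 \log L = n \log(2\pi \hat\sigma^2) + n$ with $\hat\sigma^2$ equal to the residual sum of squares divided by $n$; the function names are hypothetical and the code is not taken from Chen [2].

```python
import math

def rss_line(xs, ys):
    """Residual sum of squares of the least squares line through the points."""
    m = len(xs)
    xbar, ybar = sum(xs) / m, sum(ys) / m
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))

def neg2_loglik(rss, n):
    """-2 log of the Gaussian likelihood maximized over the variance (rss must be > 0)."""
    return n * math.log(2.0 * math.pi * rss / n) + n

def sic_select(xs, ys):
    """Return None for 'no change', otherwise the k minimizing SIC(k); xs must be sorted."""
    n = len(xs)
    sic_null = neg2_loglik(rss_line(xs, ys), n) + 3.0 * math.log(n)
    best_k, best_sic = None, sic_null
    for k in range(2, n - 1):                      # k = 2, ..., n - 2
        rss = rss_line(xs[:k], ys[:k]) + rss_line(xs[k:], ys[k:])
        sic_k = neg2_loglik(rss, n) + 5.0 * math.log(n)
        if sic_k < best_sic:                       # strict: ties favor no change
            best_k, best_sic = k, sic_k
    return best_k

# Toy data with an obvious break after the sixth point.
xs = [float(i) for i in range(12)]
ys = [(0.0 if i < 6 else 10.0) + 0.01 * (-1) ** i for i in range(12)]
print(sic_select(xs, ys))
```

The strict inequality implements the rule that the no-change model is kept whenever $\mathrm{SIC}(n) \le \mathrm{SIC}(k)$ for all $k$.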

Table 3: Robustness analysis of $\mu$ and $\sigma$ with respect to $\gamma$ when $n = 100$.

  γ⊤ = (γ0, γ1)    (-1,1)    (2,-1)    (-3,-3)   (8,3)     (-3,5)
  µ                18.625    18.824    18.629    18.500    18.931
  σ                4.535     4.157     4.263     4.140     4.789

Let $\xi$ be the change in slope between the two phases for the segmented linear regression model (1) with the continuity constraint (2). We examine the effect of three different values of the change in slope under two error settings: $N(0, 0.5^2)$ and $N(0, 0.1^2)$. The three different values of the change in slope are (a) small change of slopes with $\xi = 0.5$; (b) moderate change of slopes with $\xi = 2$; (c) large change of slopes with $\xi = 4$.

When the $X$ values are too close together, it is hard to detect the true time of the change. The acceptable deviation of $\hat k^*$ from $k^*$ depends on the sample size and the range of the $X$ values. In order to compare these two methods, we propose the following fine-tuned acceptable deviation $D$:

$$D = \left[ \frac{U - L}{A} \right],$$

where $A$ is the range of the $X$ values, and $L = [\ln n]^2$ and $U = n - L$ are the trimming bounds in the definition of the trimmed test statistic $M_n'$. For $X \sim N(0, 1.0^2)$, we take $A = 6$. Then when $n = 50$, $D = 3$ with $L = 16$ and $U = 34$.

For the purpose of illustration, we report the simulation results for the sample size $n = 50$ and $k^* = 25$ for the errors generated from $N(0, 0.5^2)$ and $N(0, 0.1^2)$. Let $d = |\hat k^* - k^*|$ be the absolute deviation of the estimate $\hat k^*$ from the true time of the change $k^*$, and let RF be the relative frequency of the deviation $d$ being no more than the acceptable deviation $D = 3$. Table 4 reports the frequency distribution of $d$ and the relative frequency of $d \le 3$.

Table 4: The frequency distribution of $d = |\hat k^* - k^*|$ and the relative frequency RF = (# of $\{d \le 3\}$ / 10)% for ELR and SIC methods, when sample size $n = 50$ and the true time of the change $k^* = 25$.

            N(0, 0.5²)                                  N(0, 0.1²)
            ξ = 0.5      ξ = 2        ξ = 4             ξ = 0.5      ξ = 2        ξ = 4
  d         ELR   SIC    ELR   SIC    ELR   SIC         ELR   SIC    ELR   SIC    ELR   SIC
  0         26    14     57    51     91    81          172   169    193   189    301   299
  1         40    28     155   90     170   165         298   269    317   301    316   317
  2         46    24     118   99     203   198         167   189    198   217    187   196
  3         55    29     102   128    165   172         101   116    119   128    71    68
  4         41    30     84    113    103   128         81    79     78    89     64    55
  5         85    26     82    94     98    103         86    85     54    59     34    32
  6         74    17     79    88     89    64          79    69     30    11     12    24
  ≥7        633   832    323   337    81    89          16    24     11    6      15    9
  RF (%)    16.7  9.5    43.2  36.8   62.9  61.6        73.8  74.3   82.7  83.5   87.5  88.0

The simulation result indicates that the proposed method works slightly better than the SIC method at capturing the true time of the change exactly ($d = 0$). The relative frequency of both methods increases as the standard deviation ($\sigma$) of the errors decreases. When $\sigma = 0.5$ (the ratio of signal to noise is 2), neither method works well, though the proposed method works much better than the SIC method for small to moderate changes of slope. When $\sigma = 0.1$ (the ratio of signal to noise is 10), the two methods are comparable. One also notices that as the change of slopes increases, the absolute deviation gets smaller and the relative frequency increases. The simulation results for various sample sizes ranging from 30 to 500 show a similar pattern. From Table 4, one notices that the acceptable deviation also depends on the signal to noise ratio and the change of slopes. Detection is easier for a large change of slopes than for a small to moderate one, and likewise for a larger signal to noise ratio.
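The acceptable deviation $D$ and the relative frequency RF follow from simple arithmetic, sketched below in Python. The function names are hypothetical, $[x]$ is taken to be the ceiling (which reproduces $L = 16$, $U = 34$ and $D = 3$ for $n = 50$), and the counts are the ELR column of Table 4 for $N(0, 0.5^2)$ errors with $\xi = 0.5$.

```python
import math

def trim_bounds(n):
    """L = [ln n]^2 and U = n - L, with [x] read as the ceiling of x."""
    low = math.ceil(math.log(n)) ** 2
    return low, n - low

def acceptable_deviation(n, x_range):
    """D = [(U - L) / A], where A is the range of the predictor values."""
    low, up = trim_bounds(n)
    return math.ceil((up - low) / x_range)

def relative_frequency(counts, max_dev):
    """Percentage of replicates with d <= max_dev.

    counts[d] is the number of replicates with deviation d; the last entry
    pools all larger deviations, so max_dev must be below len(counts) - 1.
    """
    return 100.0 * sum(counts[: max_dev + 1]) / sum(counts)

elr_counts = [26, 40, 46, 55, 41, 85, 74, 633]   # d = 0, 1, ..., 6, >= 7
d_max = acceptable_deviation(50, 6)
print(trim_bounds(50), d_max, relative_frequency(elr_counts, d_max))
```

The same call with the neighboring SIC column (14, 28, 24, 29, 30, 26, 17, 832) gives 9.5, in agreement with the RF row of the table.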

Simulation IV is performed to study the size and the power performance of the proposed test for two types of error distributions. Table 5 shows the size and the power of the proposed test for a variety of sample sizes and true times of the change. The size is computed by using the estimated critical value of the Gumbel distribution. We simulated 1000 samples from the model (11) with parameter vector $\gamma = (1, 1)^\top$, and the size is estimated by the proportion of samples that falsely reject the null hypothesis, that is, samples for which the test statistic is larger than the critical value. This study indicates that the proposed test statistic is able to control the size and attains a high power when the sample size is large.

Table 5: The size and the power of $Z_n$.

            (i) Normal                           (ii) Log-normal
  k*/n      n=30     50       100      500       n=30     50       100      500
  0%        0.045    0.049    0.050    0.050     0.054    0.054    0.048    0.050
  20%       0.884    0.896    0.969    0.987     0.992    0.993    0.994    0.996
  30%       0.897    0.897    0.968    0.987     0.997    0.995    0.995    0.994
  40%       0.892    0.897    0.969    0.987     0.995    0.990    0.995    0.996
  50%       0.908    0.898    0.960    0.987     0.996    0.982    0.995    0.996

4 Applications

This section applies the proposed ELR method for the segmented linear regression to the plasma osmolality data set. The data set was collected to show arginine vasopressin (AVP) concentration in plasma as a function of plasma osmolality in conscious dogs. Using the parametric maximum likelihood method under a normality assumption for the errors, Vieth [31] uses the segmented linear model to fit AVP concentration against plasma osmolality in plasma of conscious dogs to determine the osmotic threshold. By its nature, our proposed ELR method does not require the normality assumption on the errors.

Figure 1(a) is the scatter plot of the data, overlaid with the fitted segmented regression function, with the estimated change point at $\hat\tau = 302$, corresponding to the osmotic threshold and indicated by the vertical dashed line. We used the ELR method to plot $-2 \log \mathrm{ELR}(k)$ for each possible time of the change between $L$ and $U$, as shown in Figure 1(b).

Figure 1: (a) The scatter plot of AVP (pg/ml) versus plasma osmolality (mOsm/kg) with the fitted segmented linear regression. (b) The plot of $-2 \log \mathrm{ELR}(k)$ versus all the possible times of the change $k$ in $[L, U]$.

The estimated time of the change is $\hat k^* = 42$, highlighted by the solid dot. Then the estimated change point is $\hat\tau = 302$, and the least squares fitted segmented linear regression is

$$\mathrm{AVP} = -0.002 + 0.01 \times \text{plasma osmolality} + 0.52 \times (\text{plasma osmolality} - 302)_+,$$

where $(a)_+ = \max(0, a)$. The corresponding $R^2$ is 73% with an estimated standard deviation of 1.60.
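The fitted segmented regression can be evaluated directly from the reported coefficients, as in the small Python sketch below. The function name is hypothetical and the coefficients are the rounded values quoted above, so predictions are only as precise as that rounding.

```python
def fitted_avp(osmolality, tau=302.0):
    """Fitted AVP (pg/ml) as a function of plasma osmolality (mOsm/kg)."""
    hinge = max(0.0, osmolality - tau)      # (x - tau)_+
    return -0.002 + 0.01 * osmolality + 0.52 * hinge

for x in (295.0, 302.0, 312.0):
    print(x, round(fitted_avp(x), 3))
```

Below the estimated threshold the fitted slope is 0.01, and above it the slope is 0.01 + 0.52 = 0.53, which is the change in slope captured by the hinge term.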

5 Conclusion

This paper proposes a nonparametric empirical likelihood based test statistic for the detection of potential change points in segmented linear regression models. If a change point is detected, then an empirical likelihood based change point estimator is defined along with the estimator of the regression coefficients.

Under the null hypothesis of no change, the simulation studies show that the proposed test statistic is asymptotically Gumbel extreme value distributed. The estimated location and scale parameters of this limiting distribution were shown to be robust under the different settings of the parameter vector and the different types of error terms. The location and scale parameters are increasing functions of the sample size. The simulation studies of size and power show that the proposed test is able to control the size and attain a high power when the sample size is large. Finally, it is shown that the proposed empirical likelihood method performs better than the SIC method in detecting the true time of the change exactly. However, allowing an acceptable deviation, the proposed method and Chen's SIC method are comparable overall. An empirical example analyzing the plasma osmolality data is given. The simulation and data analysis programs in R are available from the first author.

Acknowledgments: The authors wish to thank the Editor and referees for their valuable comments and suggestions that helped to improve the presentation of the paper.

References [1] Berman, N.G., et.al. (1996). Applications of segmented regression models for biomedical studies. American Journal of Physiology , 270, 723-732. [2] Chen, J. (1998). Testing for a change point in linear regression models. Communications in Statistics-Theory and Methods, 27:10, 2481-2493. [3] Chow, G. (1960). Tests of equality between two sets of coefficients in two linear regressions. Econometrica, 28, 591-605. [4] Cs¨org˝o, M. and Horv´ath, L. (1997). Limit theorems in change-point analysis, Wiley Series in Probability and Statistics. [5] Dong, L.B. (2004). Testing for structural change in regression: an empirical likelihood approach, Econometrics Working Paper, 0405. [6] Feder, P.I. (1975a). Asymptotic distribution theory in segmented regression problemsidentified case. The Annals of Statistics, 3, 49-83. 16

[7] Feder, P.I. (1975b). The log likelihood ratio in segmented regimes. The Annals of Statistics, 3, 84-97. [8] Fiteni, I. (2004). τ -estimators of regression models with structural change of unknown location. Journal of Econometrics, 119, 19-44. [9] Gbur,E.E., Thomas,G.L. and Miller,F.R. (1979). The use of segmented regression models in the determination of the base temperature in heat accumulation models. Agronomy Journal , 71, 949-953. [10] Hawkins, D.M. (1980). A Note on Continuous and Discontinuous Segmented Regressions, Technometrics, 22, 443-444. [11] Hinkley, D.V. (1969). Inference about the Intersection in Two-Phase Regression. Biometrika, 56, 495-504. [12] Hinkley, D.V. (1971). Inference in two-phase regression.Journal of the American Statistical Association, 66, 736-743. [13] Koul, L.H. and Qian, L.F. (2002). Asymptotics of maximum likelihood estimator in a two-phase linear regression model. Journal of Statistical Planning and Inference, 108, 99-119. [14] Lund, R. and Reeves, J.(2002). Detection of undocumented change points: A revision of the two-phase regression model. Journal of Climate, 15, 2547-2554. [15] Luwel K., Beem A.L., Onghena P. and Verschaffel L. (2001). Using segmented linear regression models with unknown change points to analyze strategy shifts in cognitive tasks. Behavior Research Methods, Instruments, & Computers. 33, 470-478(9) [16] Muggeo, V.M.R. (2003). Estimating regression models with unknown break-points. Statistics in Medicine. 22, 3055-3071.

17

[17] Owen, A.B. (1991). Empirical likelihood for linear models. The Annals of Statistics, 19, 1725-1747. [18] Owen, A.B. (2001). Empirical Likelihood, Chapman & Hall/CRC. [19] Pastor, R. and Guallar, E. (1998). Use of two-segmented logistic Regression to estimate change-points in epidemiologic studies. American Journal of Epidemiology , 148, 63142. [20] Piegorsch, W. W. and Bailer, A. J. (1997). Statistics for environmental biology and toxicology. Chapman and Hall. [21] Piepho, H. P. and Ogutu, J.O. (2003). Inference for the break point in segmented regression with application to longitudinal data. Biometrical Journal , 45, 591-601. [22] Qian, L.F. (1998). On maximum likelihood estimation for a threshold autoregression. Journal of Statistical Planning and Inference, 75, 21-46. [23] Qian, L.F. and Yao, Q.C. (2002). Software project effort estimation using two-phase linear regression models. Proceeding of The 15th Annual Motorola Software Engineering Symposium (SES). [24] Qian, L.F. and Ryu, S.Y.(2006). Estimating tree resin dose effect on termites. Environmentrics, 17, 183-197. [25] Quandt, R.E. (1958). The estimation of the parameters of a linear regression system obeying two separate regimes. Journal of the American Statistical Association, 53, 873880. [26] Quandt, R.E. (1960). Tests of the hypothesis that a linear regression system obeys two separate regimes. Journal of the American Statistical Association, 55, 324-330. [27] Schwartz, G. (1978). Estimating the dimension og a model. Annuals of Statistics, 6, 461-464. 18

[28] Smith, A.F.M. and Cook, D.G. (1980). Straight lines with a change point: A Bayesian analysis of some renal transplant data. Applied Statistics, 29, 180-189.
[29] Toms, J.D. and Lesperance, M.L. (2003). Piecewise regression: A tool for identifying ecological thresholds. Ecology, 84, 2034-2041.
[30] Ulm, K.W. (1991). A statistical method for assessing a threshold in epidemiological studies. Statistics in Medicine, 10, 341-349.
[31] Vieth, E. (1989). Fitting piecewise linear regression functions to biological responses. Journal of Applied Physiology, 67, 390-396.
[32] Zeileis, A. (2006). Implementing a class of structural change tests: an econometric computing approach. Computational Statistics & Data Analysis, 50, 2987-3008.


[Figure 2 image: four pairs of panels, labeled (i) n=30, (i) n=50, (i) n=100 and (i) n=500. Each pair shows a histogram of Zn (x-axis: Zn; y-axis: Density) and a Gumbel Q-Q plot (x-axis: Theoretical Quantile; y-axis: Sample Quantile).]

Figure 2: The histograms and Q-Q plots of $Z_n$ under $H_0$ with normal errors (i) $\{e_i\}_{i=1}^{n} \sim N(0, 0.1^2)$ for four different sample size settings. The solid line represents the estimated Gumbel density and the dashed line represents the estimated kernel density.
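The Gumbel Q-Q comparison described in the caption can be sketched in a few lines. This is a minimal illustration, not the authors' code. It assumes a method-of-moments Gumbel fit (this part of the paper does not say how the location and scale were estimated), uses plotting positions (i - 0.5)/n, and runs on simulated Gumbel draws, not on the paper's Zn values. All function names are hypothetical.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_moments(sample):
    """Method-of-moments (location, scale) estimates for a Gumbel (maximum) law.

    Uses mean = loc + gamma * scale and variance = (pi * scale)^2 / 6.
    """
    scale = statistics.stdev(sample) * math.sqrt(6.0) / math.pi
    loc = statistics.fmean(sample) - EULER_GAMMA * scale
    return loc, scale


def gumbel_quantile(p, loc, scale):
    """Inverse of the Gumbel CDF F(x) = exp(-exp(-(x - loc) / scale)), 0 < p < 1."""
    return loc - scale * math.log(-math.log(p))


def gumbel_qq_points(sample):
    """Return (theoretical quantile, sample quantile) pairs for a Q-Q plot."""
    loc, scale = fit_gumbel_moments(sample)
    ordered = sorted(sample)
    n = len(ordered)
    # Plotting positions (i - 0.5) / n stay strictly inside (0, 1).
    theoretical = [gumbel_quantile((i - 0.5) / n, loc, scale)
                   for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Illustrative data only (an assumption): if E ~ Exp(1), then -log(E) is
# standard Gumbel, so loc - scale * log(E) is Gumbel(loc, scale).
rng = random.Random(0)
true_loc, true_scale = 10.0, 2.0
sample = [true_loc - true_scale * math.log(rng.expovariate(1.0))
          for _ in range(2000)]

loc_hat, scale_hat = fit_gumbel_moments(sample)
points = gumbel_qq_points(sample)
print(f"fitted location = {loc_hat:.3f}, fitted scale = {scale_hat:.3f}")
```

Plotting the pairs in `points` against the 45-degree line gives the Q-Q display: points close to that line indicate that the fitted Gumbel law describes the sample well, which is the visual check the figures rely on.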

[Figure 3 image: four pairs of panels, labeled (ii) n=30, (ii) n=50, (ii) n=100 and (ii) n=500. Each pair shows a histogram of Zn (x-axis: Zn; y-axis: Density) and a Gumbel Q-Q plot (x-axis: Theoretical Quantile; y-axis: Sample Quantile).]

Figure 3: The histograms and Q-Q plots of $Z_n$ under $H_0$ with log-normal errors (ii) $\{e_i\}_{i=1}^{n} \sim \log N(0, 0.1^2)$ for four different sample size settings. The solid line represents the estimated Gumbel density and the dashed line represents the estimated kernel density.

[Figure 4 image: four pairs of panels, labeled (iii) n=30, (iii) n=50, (iii) n=50 [sic; by analogy with Figures 2 and 3, likely n=100] and (iii) n=500. Each pair shows a histogram of Zn (x-axis: Zn; y-axis: Density) and a Gumbel Q-Q plot (x-axis: Theoretical Quantile; y-axis: Sample Quantile).]

Figure 4: The histograms and Q-Q plots of $Z_n$ under $H_0$ with non-homogeneous errors (iii) $\{e_i\}_{i=1}^{[n/2]} \sim N(0, 0.1^2)$ and $\{e_i\}_{i=[n/2]+1}^{n} \sim N(0, 1.0^2)$ for four different sample size settings. The solid line represents the estimated Gumbel density and the dashed line represents the estimated kernel density.
