TWO-PHASE NONLINEAR REGRESSION WITH ...

1 downloads 0 Views 292KB Size Report
A method of estimating parameters of two-phase nonlinear regression with smooth transition between phases is described. It consists of two stages, both.
TWO-PHASE NONLINEAR REGRESSION WITH SMOOTH TRANSITION Ryszard Walkowiak and Rados•aw Kala Department of Mathematical and Statistical Methods Agricultural University of Poznañ ul. Wojska Polskiego 28 60-637 Pozna!, Poland Key Words: nonlinear regression; two-phase regression; transition point; transition function; least square method. ABSTRACT. A method of estimating parameters of two-phase nonlinear regression with smooth transition between phases is described. It consists of two stages, both utilizing the least square fit. In the first one, each phase is fitted separately and, simultaneously, a transition point is determined. In the second stage, the two phases are joined smoothly by a proper transition function, which depends on the transition point chosen by the grid search. The practical aspects of the method proposed are demonstrated on the data concerning the soil bulk density in dependence on the soil water content.

1. INTRODUCTION The method of multi-phase regression needs to be used when the dependent variable behaves differently in subsequent intervals of the independent variable. Numerous such cases have been described in literature (cf. Sprent, 1961; Hudson, 1966; Hinkley, 1969; Bacon and Watts, 1971; Watts and Bacon, 1974; Kirby, 1974; Lerman, 1980; Seber and Wild, 1989). Sometimes the need of multi-phase regression becomes apparent only after the observations are plotted in the coordinate system. There are also issues, in which it is known, prior to the

1

experiment, that in the observed material a structural change occurs, when the independent variable exceeds a given threshold. The situation is relatively simple when the threshold value, also referred to as the transition point, is known. However, it is usually not the case. Moreover, the experimenter often expects that the transition between phases will be continuous and smooth. In the literature there are known methods of determining a multi-phase linear regression with smooth transition (Seber and Wild, 1989, p. 435; Lerman, 1980), a multi-phase nonlinear regression with continuous transition, but without smoothness (Lerman, 1980), as well as the methods of smoothing the transitions between phases of linear regression (Seber and Wild, 1989, p. 465). Seber and Wild (1989, p. 470) also give the methods of smoothing multi-phase nonlinear functions. But they use the reparametrization which changes the interpretation of parameters involved. The aim of this paper is to present a procedure leading to continuous and smooth two-phase linear or nonlinear regression with an unknown transition point. This procedure uses methods known earlier which, however, were significantly modified to eliminate nonuniqueness of solutions and to ensure smoothness of the resulting regression function. Practical aspects of the approach proposed are discussed with reference to the data concerning the soil bulk density in dependence on the soil water content.

2. FORMALIZATION OF THE PROBLEM The function of a two-phase regression may be presented as ì f ( x , b 1 ), x £ a , f(x,b ) = í 1 î f 2 ( x , b 2 ), x > a ,

(1)

where a is an unknown transition point, while f1 and f2 are specific functions dependent on the unknown parameter vectors b1 and b2. To ensure the continuity of the function (1), phases f1 and f2 must be continuous, and must fulfill the condition

2

f1(a, b 1) = f2(a, b 2).

(2)

Modeling a given phenomenon with two-phase regression consists in estimating unknown parameters, i.e. the transition point a and vectors the b 1 and

b2, on the basis of observations (xi, yi), i = 1, ... , n, where xi is the i-th observation of the independent variable, yi

is the observation of the dependent variable

corresponding to xi, and n is the total number of observations of the investigated process. Adapting the least squares method, it is possible to seek estimates of a,

b1 and b2, which minimize the sum S = åi =1 ( yi - f ( x i , b )) n

2

In case of the function (1), the sum S can be expressed as

S = å xi £a ( yi - f ( x i , b 1 )) + å xi >a ( yi - f ( x i , b 2 )) . 2

2

(3)

If the transition point is known, the minimum of S can be found minimizing each component of (3) separately, on the basis of properly divided set of observations. This, however, does not ensure the continuity of the solution. As was noted by Lerman (1980), in case of the unknown transition point and under the continuity condition (2), it is practically impossible to solve this problem. Therefore he suggested a grid search method, which applied to a given interval (amin , amax) gives the most suitable point a. In results, the estimates of unknown parameters for both phases, can be found. However, this procedure does not guarantee the smoothness of the estimated regression. The problem how to modify the obtained phases in order to achieve continuity and smoothness is discussed in the next Section.

3. SMOOTH TRANSITION BETWEEN PHASES In many cases it is required that the regression function should be continuous and smooth throughout its domain. The condition (2) ensures

3

continuity, but it does not ensure smoothness of the function at the point x = a. A sufficient condition for smoothness of (1) is the equation f1’(a, b 1) = f2’(a, b 2), but it imposes too strong restrictions on the functions f1 and f2 and it may not be always used. Another solution to the problem of continuity and smoothness of two-phase regression can be found by joining phases with the use of an additional transition function. This function should change phases to the smallest possible extent and only in the area closest to the transition point. When both phases of the function (1) are linear, Griffith and Miller (1973) and Seber and Wild (1989, p. 466) suggested a transition function, denoted here by h, which fulfills the following conditions:

" lim ( xh( x,g ) - x ) = 0 ,

(4)

" lim h( x, g ) = sgn( x ) ,

(5)

" h( 0,g ) = 0,

(6)

g x ®±¥

x g ®0

g

where g > 0 is an unknown parameter. Then the two phases may be connected (cf. Seber and Wild, 1989, p. 470) in the form f(x, b ) = [f1(x,b 1) + f2(x,b 2) + { f2(x,b 2) - f1(x,b 1)}h(x - a,g)]/2,

(7)

which is a continuous and smooth function. The condition (4) guarantees that the function (7) will be consistent with the phase f1(x,b 1) to the left of the transition point and consistent with the phase f2(x,b 2) to the right of that point. From the condition (5) it follows that the goodness of fit increases when g approaches zero. The condition (6) implies that the function (7), at the transition point a, is the arithmetic mean of f1(a, b 1) and f2(a, b 2). If the phases fulfill the condition (2), the function (7) crosses also the point of their intersection. The conditions (4) - (6) are fulfilled for example by hyperbolic tangent, i.e. tanh (x/g). Seber and Wild (1989)

4

mentioned also a possibility of using the transition function for nonlinear regressions. However, when at least one of the phases f1 and f2 is nonlinear, the conditions (4) - (6) are not sufficient, in general. In order to formulate conditions ensuring proper joining of two nonlinear phases, we use the convex combination of the form f(x, b ) = [1 - h(x - a,g)]f1(x,b 1) + h(x - a,g)f2(x,b 2).

(8)

This function is a smooth approximation of the two-phase regression, if the transition function h (x,g) is continuous and smooth, i.e. it possesses a continuous first derivative for all x Î R, and if it fulfills the following conditions

" lim (1 - h( x,g ) ) f1( x, b 1 ) = 0,

(9)

" lim (1 - h( x,g ) ) f 2 ( x, b 2 ) = 0,

(10)

" lim h( x,g ) f1( x, b 1 ) = 0,

(11)

" lim h( x, g ) f 2 ( x, b 2 ) = 0,

(12)

ì1, x > 0, " lim h( x,g ) = í x ¹ 0 g ®0 î0, x < 0,

(13)

" h( x,g ) is non - decreasing in its domain.

(14)

g > 0 x ®+¥

g > 0 x ®+¥

g > 0 x ®-¥

g > 0 x ®-¥

g >0

The conditions (9) - (12) imply convergence of (8) to the phase f1(x,b 1) on the left of the point a, and to the phase f2(x,b 2) on the right of this point. The condition (13) ensures the convergence of function (8) to the function (1) throughout the whole domain except the transition point, and condition (14) guarantees gentle and rounded joining between phases. Since phases f1 and f2 are involved in conditions (9) - (12), it is not possible to determine one universal transition function applicable for any two-phase regression. Nevertheless, these conditions ensure continuous and smooth joining of two phases. If the phases additionally fulfill the continuity condition (2), then the function (8), at the transition point, is

5

equal to the common value of phases. If the condition (2) is not met, the transition is still continuous and smooth, but the value of the function (8) at the transition point is the weighted mean of the phases at this point. It should be noted that a large difference between phases in the transition point a results in an increasing effect of one phase on the other in the neighborhood of a, which may deprive vectors b 1 and b 2 of their physical interpretation near the transition point.

4. ESTIMATION In accordance with the discussion presented above, the estimation of parameters of nonlinear two-phase regression will be performed in two stages. In the first one we will search for phases and the interval containing the transition point. In the second stage the two phases will be joined estimating the transition point and the parameter of the transition function. Since the selection of the transition function depends on the phase functions, it is necessary at first to determine the forms of phases and their parametrization. The process of estimating the parameters of the first phase can be carried out sequentially. It begins by estimating the parameters of this phase on the basis of all observations, and next, in the following k-th step, on the basis of n - k first observations. This process should be continued as long as the determination coefficient R2 increases or the sum S decreases. However, it should be remembered that R2 does not always increase monotonically, especially in the initial steps. The step, in which the biggest value of the determination coefficient is obtained, gives estimates of parameters for the first phase. Simultaneously the interval [x1, xn-k] determines the domain of this phase. Repeating this process for the second phase, this time with the elimination of consecutive initial observations, gives, in the l-th step, the estimates of parameters for the second phase and the approximation of its domain [xl, xn]. In consequence it may be expected that the transition point belongs to the interval [amin, amax], where amin = min(xn-k, xl) and amax = max(xn-k, xl).

6

In the second stage, the most suitable transition point in the interval [amin, amax] is determined by the grid search method proposed by Lerman (1980). For each of the potential transition points, the estimate of g is evaluated. We choose the transition point and the estimate of the parameter g, corresponding to the highest value of R2. During the second stage, the phases can be treated as fixed or their parameters can be reestimated together with the parameter g. In the latter case it is necessary to control the effect of the selection of the transition point, as it may have large influence on the phases.

5. APPLICATION The method proposed was developed during the research on soil mechanics, when soil bulk density was analyzed in dependence on compaction soil water content. It was observed that the reaction of bulk density to changes in soil water content is not uniform. In the initial stage, bulk density increases exponentially along with the increase of water content in soil. After the maximum is reached, at the point called the optimum compaction water content or standard water content, it decreases approaching values determined by the line of the maximum pore filling with water. In result, the bulk density was modeled with the use of a twophase regression function. Its first phase is described by an exponential function of the form f1(x,b 1) = b11 + exp(b12 + b13x), where b11 represents the average bulk density at the smallest compaction soil water content, b13 is a relative increment of the density per unit of water content, and b12 is a scaling parameter. The second phase is connected with approaching the maximum pore filling with water. This extreme condition is described with the function

rd = 100rs/(100 + xrs), where rd is the maximum soil bulk density at water content x and the specific

7

density of soil rs (in our case rs was equal to 2.61 g/cm3). In result, the second phase was modeled simply by linear function of rd, i.e. in the form f2(x,b 2) = b 21 + b22 rd = b21 + b22 rs 100/(100 + xrs), where b21 represent a constant shift and b22 is the rate of arriving to the maximum bulk density. The phases were joined by a transition function of the form h( x , g ) = 1 -

1

(

1 + exp x / g 2

)

.

(15)

This function is depicted on the Figure 1 for some parameters g (g = 0.2, 0.5, 0.7, 1). It can be easily noted that the smaller values of g causes the faster change of h in the neighborhood of zero. Actually, h is a shifted and rescaled hyperbolic tangent. The function (15) fulfills the condition (9), when g2 < 1/b13, and the condition (11), when b13 > 0. The other conditions are satisfied without any restrictions. Since the inequality g2 < 1/b13 implies b13 > 0, the only restraint is the former one. It should be controlled during estimation process. According to the method presented in Section 4, the domains of individual phases as well the estimates of unknown parameters were determined by the sequential procedure. Selected results are given in TABLES I and II. TABLE I Estimation of the first phase Step

b11

b12

b13

R2

Domain

5.

-21.677

3.152

0.0002

0.052

[5.51, 19.62]

10.

-37.740

3.672

0.0004

0.373

[5.51, 18.41]

20.

1.377

-1.590

0.0600

0.657

[5.51, 13.84]

25.

1.582

-3.328

0.1500

0.683

[5.51, 14.81]

30.

1.647

-5.445

0.3000

0.730

[5.51, 13.91]

8

Ö

31.

1.659

-6.336

0.3679

0.743

[5.51, 13.44]

32.

1.663

-6.691

0.3960

0.738

[5.51, 13.27]

35.

1.667

-7.214

0.4386

0.685

[5.51, 12.99]

TABLE II Estimation of the second phase R2

b21

b22

5.

1.653

0.050

0.007

[ 5.63, 21.17]

10.

1.435

0.166

0.052

[ 6.05, 21.17]

20.

0.766

0.526

0.303

[ 9.03, 21.17]

25.

0.203

0.834

0.648

[10.81, 21.17]

30.

-0.418

1.175

0.929

[10.95, 21.17]

Ö 31.

-0.550

1.248

0.966

[12.63, 21.17]

32.

-0.553

1.249

0.965

[12.78, 21.17]

35.

-0.560

1.253

0.962

[12.99, 21.17]

Step

Domain

The optimal results for both phases were obtained coincidentally in the same step 31. Comparing the domains of phases f1 and f2, it follows that the transition point belongs to the interval [12.63, 13.44]. In the second stage this interval was searched with the increment 0.01. In the first case, only the parameter g and the transition point a were estimated. The following results were obtained: R2 = 0.856, a$ = 12.63, g$ = 0.573,

b$11 = 1.659, b$12 = -6.336, b$13 = 0.368 and b$21 = -0.550, b$22 =1.248. The resulting functions are depicted on Figure 2. The phases obtained from the initial stage are marked by segmented and dotted lines, respectively, while the joined function (8) is marked by the solid line. In the second case, all parameters bij of regression (8) were reestimated, while estimating g and searching the transition point. The results are as follows:

9

R2 = 0.858, a$ = 12.65, g$ = 0.221,

b$11 = 1.676, b$12 = -9.070, b$13 = 0.599 and b$21 = -0.551, b$22 =1.248, and the corresponding functions are shown on Figure 3. The reestimated phases are marked as on Figure 2, while the first phase from the initial stage is marked by a segmented and dotted line. Observe that in both cases the condition g2 < 1/b13 is satisfied. Moreover, observe that the estimates obtained in the first case, except parameters b12 and b13, concise with those arrived from second one. Thus, in both cases the parameters have retained their physical interpretation and each of the resulting regressions describes similarly the investigated process. On the other hand, from Figure 2 and Figure 3, it can be seen that the transition function affects the phases to some extent. The influence on the first phase is larger than on the second one. This can be explained by larger variability of observations supporting the first phase. We have also estimated the function (8) by searching the transition point throughout the whole range of variability for independent variable. The following results were obtained: R2 = 0.863, a$ = 11.64, g$ = 0.625,

b$11 = -6.967, b$12 = 2.151, b$13 = 0.000944 and b$21 = -0.584, b$22 =1.267. The resulting graphs are presented in Figure 4. It can be seen that the estimate of the phase one differs significantly from the corresponding phases established earlier. On the other hand, we can note quite well fit of the joined regression to the observed points. It is entirely due to the transition function. Therefore, confining the search of the transition point to the shortest possible interval, is advantageous for the final interpretation of resulting regression.

ACKNOWLEDGEMENT

10

The authors are very grateful to the referee for his constructive remarks.

BIBLIOGRAPHY

Bacon, D. W. and D. G. Watts (1971). "Estimating the transition between two intersecting straight lines", Biometrika 58, 525 - 534. Griffiths, D. A. and A. J. Miller (1973). "Hyperbolic regression - a model based on two-phase piecewise linear regression with a smooth transition between regimes", Commun. Stat. 2, 561 - 569. Hinkley, D. V. (1969). "Inference about the intersection in two-phase regression", Biometrika 56, 495 - 504. Hudson, D. J. (1966). "Fitting segmented curves whose joint points have to be estimated", J. Am. Stat. Assoc. 61, 1097 - 1129. Kirby, E., J., M. (1974). "Ear development in spring wheat", J. Agric. Sci., Camb. 82, 437 - 447. Lerman, P. M. (1980). "Fitting segmented regression models by grid search", Appl. Statist. 29, No 1, 77 - 84. Seber, G. A. F. and C. J. Wild (1989). Nonlinear Regression, Wiley. New York. Sprent, P. (1961). "Some hypotheses concerning two phase regression lines", Biometrics 17, 634 - 645. Watts, D. G. and D. W. Bacon (1974). "Using an hyperbola as a transition model to fit two-regime straight-line data", Technometrics 16, 369 - 373.

11

1.00

g = 0.2 g = 0.5 g = 0.7 g=1

0.00 -4

-3

-2

-1

0

12

1

2

3

4

2.00

regression function first phase second phase

Bulk Density

1.80

1.60

1.40 4.00

8.00

12.00

16.00

Soil Water Content (%w/w)

13

20.00

24.00

2.00

regression function first phase the best fit of the first phase second phase

Bulk Density

1.80

1.60

1.40 4.00

8.00

12.00

16.00

Soil Water Content (%w/w)

14

20.00

24.00

2.00 regression function first phase second phase

Bulk Density

1.80

1.60

1.40 4.00

8.00

12.00

16.00

Soil Water Content (%w/w)

15

20.00

24.00

FIG. 1. Transition function (15) with different values of g

FIG. 2. Two-phase regression function

FIG. 3. Two-phase regression function reestimated

FIG. 4. Two-phase regression function estimated over whole domain

16

Suggest Documents