"CROSS—VALIDATING REGRESSION MODELS IN MARKETING ...

8 downloads 0 Views 1MB Size Report
Associate Professor of Marketing, INSEAD, Boulevard de. Constance .... where 0 is a k x 1 vector of regression coefficients and c is an N x 1 vector of errors ...
"CROSS—VALIDATING REGRESSION MODELS IN MARKETING RESEARCH" by Joel STECKEL* and Wilfried VANHONACKER** N° 90/42/MKT

* Associate Professor of Marketing, New York University, New York, U.S.A.

** Associate Professor of Marketing, INSEAD, Boulevard de Constance, Fontainebleau 77305 Cedex, France

Printed at INSEAD, Fontainebleau, France

CROSS-VALIDATING REGRESSION MODELS IN MARKETING RESEARCH

Joel H. Steckel* and

Wilfried R. Vanhonacker**

Revised April 1990

* Associate Professor of Marketing, New York University, New York, U.S.A.
** Associate Professor of Marketing, INSEAD, Fontainebleau, France


CROSS-VALIDATING REGRESSION MODELS IN MARKETING RESEARCH

Abstract

In this paper, a formal test is developed for the cross-validation of regression models under the simple random splitting framework. Analytic as well as simulation results relate the statistical power of the test to the allocation of sample observations to the estimation and validation samples. The results indicate that splitting the data into halves is suboptimal. More observations should be used for estimation than validation. Furthermore, the proportion of the sample optimally devoted to validation decreases as the sample size increases. However, although the 50/50 split is suboptimal, it is not tremendously so in a wide variety of circumstances.

Key words: CROSS-VALIDATION; REGRESSION; RANDOM SPLITTING.


1. INTRODUCTION

Suppose a marketing research analyst is interested in building a regression model to predict a consumer's interest in a new brand. He/she decides that general interest in the brand's product category, previous purchase history, exposure to TV advertising, and demographics would be relevant independent variables. After building the model, the researcher worries that his/her results are not representative and could be attributed to chance variation, and hence that the model may not be correct. After all, he/she reasons, regression always finds the best possible fit to the specific data at hand; it does not guarantee the generality of the estimated relationship.

Probably the most common way of coping with this problem is cross-validation (Green and Tull 1978). Usually, a researcher splits the observations into two parts. He/she then uses one part, called the estimation sample, to estimate the parameters of the model. The resulting equation is used to predict the dependent variable values for the rest of the observations, and the actual values are then compared to the predicted ones in order to examine the "stability" of the relationship.¹ (A brief code sketch of this workflow follows the two questions below.) Although there are other approaches to the problem (see Cooil, Rados, and Winer (1987) for a review), the data-splitting paradigm described above has become dominant in both marketing (Cooil, Rados, and Winer 1987) and psychology (Murphy 1983). The approach can be traced as far back as Larson (1931). It is therefore surprising that no satisfying answers exist to the following two questions:

1. What formal "rules" or "heuristics" exist for deciding whether an estimated model or relationship is stable and governs the validation sample? and 2. What proportions of the observations should be allocated to the estimation and validation samples? This paper attempts to provide such answers.

Related to the first question, a variety of measures exist for assessing the correspondence between the actual and predicted values. Green and Tull (1978, p. 335) suggest comparing the coefficient of determination between these values to the R² of the estimated model in order to examine the "shrinkage" in fit. Picard and Cook (1984) suggest the sum of squared deviations between actual and predicted values, while Rust and Schmittlein (1985) advocate Akaike's information criterion. However, none of these measures has been couched in a formal hypothesis-testing framework in which an explicit accept-reject decision is made with respect to the estimated model. In this paper, we develop such a test. We assume that the analyst is more interested in using regression to predict future observations than in exploring the structure of a given data set (Baskerville and Toogood 1982). Accordingly, the test is based on the vector of differences between the actual and predicted values. The null hypothesis is that the model posited by the researcher is the one that generates the data at hand.

With respect to the second question, no prior logic dictates how many observations to put into each sample. More data in the estimation sample leads to more efficient estimates, while more data in the validation sample leads to a more powerful validity test. The obvious solution has been to split the data randomly into halves (Kerlinger and Pedhazur 1973), although Green and Tull (1978) note that one-quarter to one-third of the data are typically used for validation. Several authors have argued that such random splitting can produce estimation and validation samples which are dissimilar, and hence that validation could be biased. McCarthy (1976), Snee (1977), and Picard and Cook (1984) have proposed procedures which attempt to ensure that the two subsamples are "similar." On the other hand, splitting at random is simple, and it is far and away the most common method used. We therefore recognize that researchers will continue to split this way and, acknowledging its limitations, we attempt to gain insight into how many data points to put into each of the estimation and validation samples. We do this by finding the allocation which maximizes the power of the test constructed in response to question one.

Alternatively, an analyst may have the objective of exploring the structure of a data set via regression. In that case, our test may not be applicable. Presumably, the analyst would then be more interested in a procedure that minimizes the standard errors on a subset of variables; statistical power may not be as much of a concern. We do not study that case.

Cooil, Rados, and Winer (1987) review a variety of approaches to cross-validation other than the data-splitting we focus on here. Among these are two methods that make intensive use of all available data through re-sampling or sample re-use: the jackknife and the bootstrap (Fenwick 1979; Efron 1982; Efron 1983; Efron and Gong 1983). These involve estimating the basic model on a large number of different subsets of the data. The suitability of a given model is assessed not by how well it predicts independent observations, but by the empirical distribution of the parameter estimates over the many trials. The approach Cooil, Rados, and Winer (1987) themselves advocate, however, is one developed independently by Stone (1974) and Geisser (1975). It simultaneously estimates parameters and cross-validates the estimates: coefficients are estimated on some part of the sample subject to a constraint which stipulates (loosely) that they must cross-validate well to the rest of the sample. Li (1987) shows that this method has some very desirable optimality-type properties.

Cooil, Rados, and Winer (1987) weigh the advantages and disadvantages of all these approaches and find that data-splitting comes up short. Nevertheless, the simplicity of the approach means that it is probably here to stay. It is with this in mind that we write this paper: other approaches may be better, but none is more common. We therefore view our task as improving the paradigm rather than abandoning it.

The results in this paper indicate that splitting the data into halves is suboptimal: more observations should be used for estimation than for validation. This suggests that the one-quarter to one-third practice reported by Green and Tull (1978) is better than a 50/50 split. Moreover, the proportion of the sample optimally devoted to validation decreases as the sample size increases. However, though the 50/50 split is suboptimal, it is not tremendously so in a wide variety of circumstances.

2. DERIVATION OF THE TEST STATISTIC

Let $X$ denote the $N \times k$ matrix of independent variables in the true model, where $N$ is the number of data points and $k$ is the number of (true) independent variables. $X$ is related to the $N \times 1$ vector of dependent variables $y$ by the model

$$y = X\beta + \varepsilon \qquad (1)$$

where $\beta$ is a $k \times 1$ vector of regression coefficients and $\varepsilon$ is an $N \times 1$ vector of errors drawn from a normal distribution with mean 0 and scalar covariance matrix $\sigma^2 I$.

The major error which can occur in specifying a true regression model is the omission of relevant independent variables. Inclusion of irrelevant variables does not have the same impact, since they can be assumed to have regression coefficients equal to zero. In principle, misspecification of functional form can also be a serious problem. However, since any continuous function can be approximated by a polynomial, and polynomials are often modeled as distinct independent variables in a linear regression (Draper and Smith 1981), functional form misspecification can also be conceived of as an omitted-variables problem.

Suppose, therefore, that only the first $k_1$ ($< k$) columns of $X$ are used, and that the first $N_1$ ($< N$) rows comprise the estimation sample while the remaining $N_2 = N - N_1$ rows comprise the validation sample. In line with this, we partition $X$ as follows:

$$X = \begin{bmatrix} X_{11} & X_{12} \\ X_{21} & X_{22} \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$$

where $X_{11}$ is the $N_1 \times k_1$ matrix of observations on which the estimation is based; $X_{12}$ is the $N_1 \times k_2$ matrix (with $k_2 = k - k_1$) of values for the omitted variables in the estimation sample; $X_{21}$ is the $N_2 \times k_1$ matrix of values for the included variables in the validation sample; and $X_{22}$ is the $N_2 \times k_2$ matrix of values for the omitted variables in the validation sample. $X_1$ is defined to be the $N_1 \times k$ matrix $[X_{11} \;\; X_{12}]$ and $X_2$ is the $N_2 \times k$ matrix $[X_{21} \;\; X_{22}]$. We further partition $\beta$, $\varepsilon$, and $y$ as

$$\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}, \qquad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}.$$

$\beta$ is partitioned to correspond to the included and omitted variables, while $\varepsilon$ and $y$ are partitioned with respect to the estimation and validation samples.
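In code, this partition is nothing more than block indexing of the design matrix and the response vector. A minimal sketch with hypothetical block sizes:

```python
import numpy as np

N1, N2, k1, k2 = 60, 40, 2, 1            # hypothetical sample and block sizes
N, k = N1 + N2, k1 + k2

rng = np.random.default_rng(1)
X = rng.normal(size=(N, k))              # full design matrix of the true model
y = rng.normal(size=N)                   # placeholder dependent variable

X11, X12 = X[:N1, :k1], X[:N1, k1:]      # estimation rows: included | omitted
X21, X22 = X[N1:, :k1], X[N1:, k1:]      # validation rows: included | omitted
y1, y2 = y[:N1], y[N1:]                  # estimation / validation responses
```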


The estimated regression coefficient is $\hat\beta_1 = (X_{11}' X_{11})^{-1} X_{11}' y_1$. The predicted values of $y_2$ then become $\hat y_2 = X_{21}\hat\beta_1 = X_{21}(X_{11}' X_{11})^{-1} X_{11}' y_1$, and the actual values $y_2$ are equal to $X_2\beta + \varepsilon_2$. Moreover, the residuals equal

$$
\begin{aligned}
\hat u = y_2 - \hat y_2 &= X_2\beta + \varepsilon_2 - X_{21}\hat\beta_1 \\
&= \varepsilon_2 + X_2\beta - X_{21}(X_{11}' X_{11})^{-1} X_{11}' y_1 \\
&= \varepsilon_2 + X_{21}\beta_1 + X_{22}\beta_2 - X_{21}(X_{11}' X_{11})^{-1} X_{11}' [X_1\beta + \varepsilon_1] \\
&= \varepsilon_2 - X_{21}(X_{11}' X_{11})^{-1} X_{11}' \varepsilon_1 + [X_{22} - X_{21}(X_{11}' X_{11})^{-1} X_{11}' X_{12}]\,\beta_2. \qquad (2)
\end{aligned}
$$
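Identity (2) can be checked numerically: the residuals computed directly as $y_2 - \hat y_2$ coincide with the final expression. A small check under assumed dimensions and parameter values:

```python
import numpy as np

rng = np.random.default_rng(2)
N1, N2, k1, k2 = 50, 30, 3, 2                 # hypothetical dimensions

X11, X12 = rng.normal(size=(N1, k1)), rng.normal(size=(N1, k2))
X21, X22 = rng.normal(size=(N2, k1)), rng.normal(size=(N2, k2))
b1, b2 = rng.normal(size=k1), rng.normal(size=k2)
e1, e2 = rng.normal(size=N1), rng.normal(size=N2)

y1 = X11 @ b1 + X12 @ b2 + e1                 # estimation sample
y2 = X21 @ b1 + X22 @ b2 + e2                 # validation sample

G = np.linalg.inv(X11.T @ X11)                # (X11' X11)^{-1}
u_direct = y2 - X21 @ G @ X11.T @ y1          # y2 - yhat2
u_eq2 = e2 - X21 @ G @ X11.T @ e1 + (X22 - X21 @ G @ X11.T @ X12) @ b2

assert np.allclose(u_direct, u_eq2)           # the two routes to (2) agree
```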

These expressions form natural bases for a cross-validation test of predictive validity. If the difference between predicted and actual values is too large, then we should reject a null hypothesis which implies that we have the correct model. In order to construct such a test, we need to find a statistic which is a function of $\hat u$ and whose distribution we know. We follow a very common statistical paradigm. The null hypothesis that we have the correct model can be formally stated as

$$H_0:\; \beta_2 = 0,$$

or equivalently, that there are no omitted variables. The statistic we derive has a noncentral F distribution. Under the null hypothesis the distribution becomes (central) F and a standard F test can be performed.

As a point of departure for our test, note that $\hat\beta_1$ is normally distributed with mean $\beta_1 + (X_{11}' X_{11})^{-1} X_{11}' X_{12}\beta_2$ and covariance matrix $\sigma^2 (X_{11}' X_{11})^{-1}$. $\hat u$ is multivariate normal with mean (since $E(\varepsilon_i) = 0$ for $i = 1, 2$)

$$E(\hat u) = X_{22}\beta_2 - X_{21}(X_{11}' X_{11})^{-1} X_{11}' X_{12}\beta_2$$

from (2), or alternatively,

$$E(\hat u) = [X_{22} - X_{21}(X_{11}' X_{11})^{-1} X_{11}' X_{12}]\,\beta_2.$$


Furthermore, the variance-covariance matrix of $\hat u$ equals

$$E(\hat u \hat u') - E(\hat u)\,E(\hat u)' = \sigma^2\,[\,I + X_{21}(X_{11}' X_{11})^{-1} X_{21}'\,].$$

Consider now the $N$-dimensional random vector $v' = (y_1' \;\; \hat u')$. Note that

$$E(v) = \begin{bmatrix} X_1\beta \\ [X_{22} - X_{21}(X_{11}' X_{11})^{-1} X_{11}' X_{12}]\,\beta_2 \end{bmatrix}$$

and (see Appendix 1)

$$E(vv') - E(v)\,E(v)' = \sigma^2 \begin{bmatrix} I & -X_{11}(X_{11}' X_{11})^{-1} X_{21}' \\ -X_{21}(X_{11}' X_{11})^{-1} X_{11}' & I + X_{21}(X_{11}' X_{11})^{-1} X_{21}' \end{bmatrix}.$$

Consider now the following matrices, both of dimension $N \times N$:

$$A = \frac{1}{\sigma^2}\begin{bmatrix} 0 & 0 \\ 0 & [\,I + X_{21}(X_{11}' X_{11})^{-1} X_{21}'\,]^{-1} \end{bmatrix} \qquad (3)$$

and

$$B = \frac{1}{\sigma^2}\begin{bmatrix} B_{11} & 0 \\ 0 & 0 \end{bmatrix} \qquad (4)$$

where

$$B_{11} = I - X_{11}(X_{11}' X_{11})^{-1} X_{21}'\,[\,X_{21}(X_{11}' X_{11})^{-1} X_{21}'\,]^{-1}\,X_{21}(X_{11}' X_{11})^{-1} X_{11}'.$$
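Since Appendix 3 rests on the idempotency of $B$, a quick numerical check of $B_{11}$ is reassuring. Note that the inner $N_2 \times N_2$ matrix must be nonsingular, so the sketch (with hypothetical dimensions) takes $N_2 \le k_1$:

```python
import numpy as np

rng = np.random.default_rng(4)
N1, N2, k1 = 20, 3, 5                  # hypothetical; N2 <= k1 keeps D invertible

X11 = rng.normal(size=(N1, k1))
X21 = rng.normal(size=(N2, k1))

G = np.linalg.inv(X11.T @ X11)         # (X11' X11)^{-1}
C = X11 @ G @ X21.T                    # N1 x N2
D = X21 @ G @ X21.T                    # N2 x N2
B11 = np.eye(N1) - C @ np.linalg.inv(D) @ C.T

assert np.allclose(B11 @ B11, B11)     # B11 is idempotent
```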

Matrices A in (3) and B in (4) play an instrumental role in the development of our test statistic. The following three lemmas provide the foundation of the statistic:

Lemma 1: $q = v'Av$ follows a noncentral chi-square distribution with $N_2$ degrees of freedom and noncentrality parameter

$$\delta = \frac{1}{2\sigma^2}\,[E(\hat u)]'\,[\,I + X_{21}(X_{11}' X_{11})^{-1} X_{21}'\,]^{-1}\,[E(\hat u)]. \qquad (5)$$

The proof for lemma 1 is given in Appendix 2.

Lemma 2: $s = v'Bv$ follows a central chi-square distribution with $N_1 - k_1$ degrees of freedom.

The proof of lemma 2 is rather straightforward using theorem 8.2 in Johnson and Kotz (1970, p. 177), knowing that B as defined in (4) is an idempotent matrix (for proof, see Appendix 3).

Lemma 3: q = v'Av and s = v'Bv are mutually independent.

The proof of lemma 3 is an application of theorem 8.1 in Johnson and Kotz (1970, p. 176), knowing that $AB = 0$.

We are now ready to state our main result.

Theorem 1: $f = \dfrac{q/N_2}{s/(N_1 - k_1)}$ is distributed $F'(N_2,\, N_1 - k_1,\, \delta)$,

where $F'(N_2, N_1 - k_1, \delta)$ denotes the noncentral F distribution with degrees of freedom $N_2$ and $N_1 - k_1$ and noncentrality parameter $\delta$. The theorem is an immediate consequence of lemmas 1 through 3 and the definition of a noncentral F.

A formal test of the null hypothesis is now straightforward. Under the null hypothesis $\delta = 0$, and $f$ is a (central) F with $N_2$ and $N_1 - k_1$ degrees of freedom. An $\alpha$-level test then involves computing $f$ and comparing it to the upper $\alpha$-level percentage point of a standard F table, $F_{N_2,\,N_1-k_1,\,\alpha}$. If $f$ exceeds that value, the hypothesis, and with it the model, is rejected.
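To make the mechanics of the test concrete, here is a sketch in code. One caveat: since $\sigma^2$ is unknown, the sketch estimates the denominator by the estimation-sample residual variance $\mathrm{RSS}_1/(N_1 - k_1)$ rather than by the quadratic form $s/(N_1 - k_1) = v'Bv/(N_1 - k_1)$; this is our simplification (it estimates the same quantity the denominator targets), not the paper's exact construction, and all names are hypothetical:

```python
import numpy as np
from scipy import stats

def cross_validation_f_test(X_est, y_est, X_val, y_val, alpha=0.05):
    """Data-splitting F test of predictive validity (sketch).

    X_est, y_est: estimation sample (N1 x k1 design and N1 responses).
    X_val, y_val: validation sample (N2 x k1 design and N2 responses).
    """
    N1, k1 = X_est.shape
    N2 = X_val.shape[0]

    G = np.linalg.inv(X_est.T @ X_est)            # (X11' X11)^{-1}
    beta_hat = G @ X_est.T @ y_est                # estimated coefficients
    u = y_val - X_val @ beta_hat                  # validation residuals

    # Stand-in for s/(N1 - k1): estimation-sample residual variance.
    rss1 = np.sum((y_est - X_est @ beta_hat) ** 2)
    sigma2_hat = rss1 / (N1 - k1)

    # Numerator q/N2 with q = u'[I + X21 (X11'X11)^{-1} X21']^{-1} u / sigma^2.
    V = np.eye(N2) + X_val @ G @ X_val.T
    f = (u @ np.linalg.solve(V, u) / N2) / sigma2_hat

    f_crit = stats.f.ppf(1 - alpha, N2, N1 - k1)  # upper alpha point
    return f, f_crit, f > f_crit                  # reject model if f > f_crit
```

Under the null hypothesis this $f$ is compared with the upper $\alpha$ percentage point of the (central) F distribution with $N_2$ and $N_1 - k_1$ degrees of freedom, exactly as described above.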

3. THE NONCENTRALITY PARAMETER

The noncentrality parameter of the test statistic, $\delta$ as defined in (5), plays a critical role in the power of the test and, hence, in the allocation of observations to the estimation and validation samples. Before addressing the issue of statistical power, we expand on the noncentrality parameter itself. By definition,

$$\delta = \frac{1}{2\sigma^2}\,[E(\hat u)]'\,[\,I + X_{21}(X_{11}' X_{11})^{-1} X_{21}'\,]^{-1}\,[E(\hat u)].$$

As shown at the beginning of Appendix 4,

$$\delta = \frac{1}{2\sigma^2}\,\beta_2'\,[X_{22} - X_{21}(X_{11}' X_{11})^{-1} X_{11}' X_{12}]'\,[\,I + X_{21}(X_{11}' X_{11})^{-1} X_{21}'\,]^{-1}\,[X_{22} - X_{21}(X_{11}' X_{11})^{-1} X_{11}' X_{12}]\,\beta_2. \qquad (6)$$
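Once $\delta$ is evaluated, for instance from (6), the power of the $\alpha$-level test follows directly from the noncentral F distribution. One caution when translating to software: (5) and (6) carry the factor $1/(2\sigma^2)$ of Johnson and Kotz's convention, whereas SciPy parameterizes the noncentral F by $\lambda = 2\delta$; that correspondence is our reading and worth verifying in any application:

```python
from scipy import stats

def power_of_test(delta, N2, N1, k1, alpha=0.05):
    """Power of the alpha-level test for a given noncentrality delta (5)/(6).

    Assumes SciPy's noncentral F takes lambda = 2 * delta under the paper's
    1/(2 sigma^2) convention.
    """
    f_crit = stats.f.ppf(1 - alpha, N2, N1 - k1)
    return 1.0 - stats.ncf.cdf(f_crit, N2, N1 - k1, 2.0 * delta)

# Illustrative call with hypothetical numbers: N = 100 split 50/50, k1 = 2.
print(power_of_test(delta=2.0, N2=50, N1=50, k1=2))
```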

Under the hypothesis that we are studying the correct model (i.e., $\beta_2 = 0$), $\delta$ equals zero and the distribution of $q$ in Lemma 1 is (central) chi-square with $N_2$ degrees of freedom. For the special case where $X_{11}$ (and, hence, $X_{21}$) contains a single constant (i.e., $X_{11} = (1\;1\;\cdots\;1)'$) and $X_{12}$ contains a single variable, Appendix 4 also shows that the noncentrality parameter equals

$$\delta = \frac{\beta_2^2}{2N\sigma^2}\left[Nq - p(p + 2k) + \frac{N_2}{N_1}k^2\right] \qquad (7)$$

where $N = N_1 + N_2$,

$$q = \sum_{i=N_1+1}^{N} x_i^2, \qquad p = \sum_{i=N_1+1}^{N} x_i, \qquad k = \sum_{i=1}^{N_1} x_i,$$

with $x_i$ ($i = 1, 2, \ldots, N$) denoting the observations on the single predictor in $X_{12}$ and $X_{22}$. (In this expression $q$, $p$, and $k$ denote these scalar sums only; they should not be confused with the quadratic forms and variable counts of Section 2.)
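Expression (7) can be checked against the general form (6): with a design containing only a constant, the two coincide. A small numerical verification with assumed values:

```python
import numpy as np

rng = np.random.default_rng(3)
N1, N2 = 6, 4
N = N1 + N2
x = rng.normal(size=N)                   # single predictor, all N observations
beta2, sigma2 = 0.7, 1.3                 # hypothetical parameter values

# General form (6): X11, X21 are columns of ones; X12, X22 hold the x values.
X11, X21 = np.ones((N1, 1)), np.ones((N2, 1))
X12, X22 = x[:N1, None], x[N1:, None]
G = np.linalg.inv(X11.T @ X11)
d = (X22 - X21 @ G @ X11.T @ X12) * beta2
V = np.eye(N2) + X21 @ G @ X21.T
delta6 = (d.T @ np.linalg.solve(V, d)).item() / (2 * sigma2)

# Special-case form (7).
q, p, k = np.sum(x[N1:] ** 2), np.sum(x[N1:]), np.sum(x[:N1])
delta7 = beta2**2 / (2 * N * sigma2) * (N*q - p*(p + 2*k) + (N2/N1) * k**2)

assert np.isclose(delta6, delta7)        # (6) and (7) agree in this case
```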

From expression (7), the impact on $\delta$ of the magnitude of the validation sample ($N_2$) and of the pattern of observations on the single predictor can be assessed. Since the partial derivatives $\partial q/\partial N_2 > 0$, $\partial p/\partial N_2 < 0$ for monotonically decreasing series ($> 0$ for monotonically increasing series), and $\partial k/\partial N_2$