Regression Course – Additional Notes I & II

John Maindonald

June 13, 2007

These consolidate the earlier separate sets of additional notes – I & II.

1 Balanced vs Unbalanced Data – What is the Effect?

This section illustrates, in a very simple case, confounding effects that may arise when data are unbalanced. It demonstrates, also, how the estimate for a term may change when a further term is included in the model.

1.1 Cricket bowling averages example

In a game of cricket the bowler’s aim is to give away as few runs as possible, while taking the maximum possible number of wickets. (A wicket is taken when a batsman is given out.) The game is divided into two innings, and the conditions can be very different between the two innings. Here the interest is in comparing the performances of different bowlers.

                     1st Innings      2nd Innings     Overall
  Bowler A           50 runs, 1 w     70 runs, 5 w    120 runs, 6 w
    Runs per wicket  50 r/w           14 r/w          20 r/w
  Bowler B           240 runs, 6 w    40 runs, 4 w    280 runs, 10 w
    Runs per wicket  40 r/w           10 r/w          28 r/w

Thus Bowler B does better than Bowler A in both innings (gives away fewer runs for each wicket taken), but ends up giving away more runs per wicket overall. Observe that Bowler B did more of the bowling in the first innings, when there were more runs per wicket. Thus Bowler B’s runs per wicket average is skewed towards the first innings runs per wicket rate.

Here is the 2 by 2 table of runs per wicket information. Adding the total of runs for each bowler, and dividing in each case by the corresponding total of wickets, is equivalent to taking averages for each bowler of runs per wicket, with number of wickets as weight.

            Innings 1     Innings 2        Average
  A         50 (1 w)      14 (5 w)         32
  B         40 (6 w)      10 (4 w)         25 = 32 - 7
  Average   45            12 = 45 - 33     28.5
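As a numerical check, the weighted and unweighted per-bowler averages can be obtained directly. This is a minimal sketch; the vectors simply transcribe the tables above:

runs   <- c(50, 70, 240, 40)          # Bowler A Inn 1, A Inn 2, B Inn 1, B Inn 2
wkts   <- c(1, 5, 6, 4)
bowler <- c("A", "A", "B", "B")
rpw <- runs/wkts                       # runs per wicket: 50 14 40 10
## unweighted averages of runs per wicket, as in the margin of the 2 by 2 table
tapply(rpw, bowler, mean)              # A 32, B 25
## wicket-weighted averages = total runs / total wickets (the usual bowling average)
tapply(runs, bowler, sum)/tapply(wkts, bowler, sum)   # A 20, B 28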

The default contrasts used by the function lm() take the baseline for all effects (the “intercept”) to be the outcome for Bowler A in Innings 1. With Innings 1 taken as the baseline, the innings effect (Innings 2 - Innings 1) appears as -33, as in the above table. With Bowler A as the baseline, the bowler effect (Bowler B - Bowler A) appears as -7.
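These settings can be inspected directly in R; with the default treatment contrasts, the first level of each factor is absorbed into the intercept. A minimal sketch of the relevant checks:

getOption("contrasts")   # "contr.treatment" for unordered factors, "contr.poly" for ordered
contr.treatment(2)       # one column, coding the second level against the first (the baseline)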


1.2 Analysis Using lm()

> cricket <- data.frame(RunsPerW = c(50, 14, 40, 10), w = c(1, 5, 6, 4),
+                       Bowler = factor(c("A", "A", "B", "B")), Innings = factor(c("Inn1", "Inn2", "Inn1", "Inn2")))
> cricket
  RunsPerW w Bowler Innings
1       50 1      A    Inn1
2       14 5      A    Inn2
3       40 6      B    Inn1
4       10 4      B    Inn2
> cricket.wlm <- lm(RunsPerW ~ Bowler + Innings, weights = w, data = cricket)
> round(coef(cricket.wlm), 2)
(Intercept)     BowlerB InningsInn2
      46.29       -5.67      -31.55

Observe that the estimate of the bowler effect has changed, but is still fair to Bowler B. Now omit consideration of the innings effect:

> cricket1.wlm <- lm(RunsPerW ~ Bowler, weights = w, data = cricket)
> round(coef(cricket1.wlm), 2)
(Intercept)     BowlerB
         20           8

This is equivalent to dividing the overall number of runs, for each bowler, by the overall number of wickets. As above, Bowler B is estimated as costing 8 runs per wicket more than Bowler A. The combination of unequal weights and omission of a relevant factor generates this misleading result.
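For comparison, the unweighted fit (one observation per bowler/innings cell) reproduces the -7 and -33 noted in Section 1.1. A minimal sketch, assuming the cricket data frame above (the name cricket.lm is used here for illustration only):

## unweighted fit, one observation per cell of the 2 by 2 table
cricket.lm <- lm(RunsPerW ~ Bowler + Innings, data = cricket)
coef(cricket.lm)   # (Intercept) 48.5, BowlerB -7, InningsInn2 -33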

1.3 Calculations using Householder reflections

First, form the matrix on which the calculations will be performed.

> Xy
  (Intercept) BowlerB InningsInn2 RunsPerW
1           1       0           0       50
2           1       0           1       14
3           1       1           0       40
4           1       1           1       10
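One way to set up a matrix of this form from the cricket data frame of Section 1.2 is sketched below; this need not be the construction used in the original code.

## model matrix for the unweighted model, with the response appended as a final column
Xy <- cbind(model.matrix(~ Bowler + Innings, data = cricket),
            RunsPerW = cricket$RunsPerW)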

Now use my function house() for the calculation.

> house(Xy, showzeros=3)
  (Intercept) BowlerB InningsInn2 RunsPerW
1           2       1           1       57
2           0       1           0       -7
3           0       0           1      -33
4           0       0           0        3

To obtain estimates of the model parameters, solve the triangular system $X\hat\beta = b$ for $\hat\beta$, where
\[
X = \begin{pmatrix} 2 & 1 & 1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad
b = \begin{pmatrix} 57 \\ -7 \\ -33 \end{pmatrix}.
\]

Here are the calculations for the weighted analysis:

> house(Xy*sqrt(cricket$w), showzeros=3)
  (Intercept)  BowlerB InningsInn2   RunsPerW
1           4 2.500000   2.2500000 100.000000
2           0 1.936492  -0.8391464  15.491933
3           0 0.000000   1.7981472 -56.725056
4           0 0.000000   0.0000000  -4.718903

Now solve $X\hat\beta = b$ with
\[
X = \begin{pmatrix} 4 & 2.500000 & 2.2500000 \\ 0 & 1.936492 & -0.8391464 \\ 0 & 0 & 1.7981472 \end{pmatrix}, \qquad
b = \begin{pmatrix} 100.000000 \\ 15.491933 \\ -56.725056 \end{pmatrix}.
\]

A key difference is that the R matrix no longer has a zero in the (2, 3) position. The consequence is that the estimate of the bowler effect changes (and changes in sign) if there is no allowance for Innings.
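The two triangular systems can be checked with base R’s backsolve(); a minimal sketch, transcribing the matrices shown above:

## unweighted system: recovers the coefficients 48.5, -7, -33
X <- matrix(c(2, 1, 1,
              0, 1, 0,
              0, 0, 1), nrow = 3, byrow = TRUE)
b <- c(57, -7, -33)
backsolve(X, b)

## weighted system: recovers (to rounding) 46.29, -5.67, -31.55, as from the weighted lm() fit
Xw <- matrix(c(4, 2.500000,  2.2500000,
               0, 1.936492, -0.8391464,
               0, 0.000000,  1.7981472), nrow = 3, byrow = TRUE)
bw <- c(100.000000, 15.491933, -56.725056)
backsolve(Xw, bw)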

1.4 A further example

The table shows, separately for males and females, the effect of pentazocine on post-operative pain profiles (average VAS scores), with (mbac and fbac) and without (mpl and fpl) preoperatively administered baclofen. Pain scores are recorded every 20 minutes, from 10 minutes to 170 minutes. Results are shown for 30 minutes and 50 minutes only. 15 females were given baclofen, as against 3 males; 7 females received the placebo, as against 16 males.

  Time (min)   mbac   mpl    fbac   fpl    avbac   avplac
  30           1.31   1.65   3.48   4.15   3.12    2.74
  50           0.05   0.67   3.13   3.66   2.61    1.98

(m = male, f = female; bac = baclofen, pl = placebo; avbac and avplac are averages over all participants, ignoring sex.)

Averages for the two treatments (baclofen/placebo), taken over all trial participants and ignoring sex, are misleading. Notice that, both for males and for females, the scores are lower when baclofen is administered. Reliance on the overall average would suggest that the scores are higher when baclofen is administered.

2 Bootstrap & Permutation Resampling

Here our interest will be in comparing residual sums of squares for models that are not nested. The methodology will be demonstrated using the fruitohms data in the DAAG package.


2.1 Strategy

For this initial discussion, denote the models as M1 and M2. We first fit the model M1 to the original data, obtaining fitted values and residuals. Also, we fit M2, and note the change as measured by a statistic $\tilde F_{12}$ that is calculated just as for the usual F-statistic (an explicit formula is given below).

If M2 is not giving an improvement, then $\tilde F_{12}$ will differ only by statistical error from the $F_{12}$ observed when a random set of residuals is added. It will, in effect, be a random sample from the distribution obtained when repeated random sets of residuals are added on to the fitted values, and $F_{12}$ is calculated for each such new set of “observations”. We can then locate $\tilde F_{12}$ within this distribution. If $\tilde F_{12}$ falls above its 95th percentile, then we will judge that, at the 5% significance level, M2 improves on M1. The change when the residuals are correctly attached is, on this criterion, more than statistical variation.

With slight modification, the argument applies if the residuals are obtained by permuting the original residuals. If model M2 is not improving on M1, then the change when new “observations” are derived by adding randomly permuted residuals back onto the fitted values will be attributable to statistical variation. It also applies, again in slightly modified form, if repeated bootstrap samples of the original residuals are used.

Note that we are not, here, deriving bootstrap estimates of some parameter. While there are theoretical issues even with this use of the bootstrap, we do not have the issues of bias that commonly arise when the bootstrap is used to estimate variance-like statistics. Note also that the three distributions contemplated above (from random, permuted and bootstrapped residuals) are different. There will, accordingly, be differences of power between the three approaches.

The methodology can be varied or extended in various ways:
• We can choose M2 if it appears, on average, to improve on M1, i.e., choose M2 if the p-value is less than 0.5.
• This same style of approach can be used if M2 is selected from a wider class of models. The model selection step must then be repeated for each new bootstrap or permutation sample.
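Explicitly, if $\mathrm{RSS}_1$ and $\mathrm{RSS}_2$ denote the residual sums of squares for M1 and M2, with residual degrees of freedom $\nu_1$ and $\nu_2$, the statistic referred to above can be taken to be
\[
F_{12} = \frac{(\mathrm{RSS}_1 - \mathrm{RSS}_2)/(\nu_1 - \nu_2)}{\mathrm{RSS}_2/\nu_2},
\]
which has the same form as the usual F-statistic for comparing a smaller model with a larger one; here, however, the two models need not be nested.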

2.2 Code

Here are functions that can handle the calculations: funF
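As a rough indication of what such functions do, here is a sketch in the spirit of Section 2.1. It is not the author’s funF(); the function name permFcompare and the pair of models in the final commented line are illustrative choices only.

library(DAAG)      # for the fruitohms data
library(splines)   # for ns(), used in the illustrative call below

## Sketch only. Compare models M1 (form1) and M2 (form2) for the same response,
## using new "observations" formed by adding permuted M1 residuals to the M1 fit.
permFcompare <- function(form1, form2, data, nsim = 999) {
    rss <- function(obj) sum(resid(obj)^2)
    Fstat <- function(obj1, obj2)
        ((rss(obj1) - rss(obj2))/(df.residual(obj1) - df.residual(obj2))) /
            (rss(obj2)/df.residual(obj2))
    m1 <- lm(form1, data = data)
    m2 <- lm(form2, data = data)
    Fobs <- Fstat(m1, m2)                 # the observed statistic, from the original data
    fits <- fitted(m1)
    res <- resid(m1)
    yname <- all.vars(form1)[1]           # name of the response variable
    Fsim <- numeric(nsim)
    for (i in seq_len(nsim)) {
        simdata <- data
        simdata[[yname]] <- fits + sample(res)   # permute the M1 residuals
        Fsim[i] <- Fstat(lm(form1, data = simdata),
                         lm(form2, data = simdata))
    }
    ## proportion of the reference distribution at or above the observed statistic
    c(Fobs = Fobs, pvalue = (1 + sum(Fsim >= Fobs))/(nsim + 1))
}

## Possible use with the fruitohms data (illustrative choice of models):
## permFcompare(ohms ~ juice, ohms ~ ns(juice, 4), data = fruitohms)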