The importance of graphing the data: Anscombe's regression examples

Bruce Weaver
Northern Health Research Conference
Nipissing University, North Bay
May 30-31, 2008

The Objective

To demonstrate that good graphs are an essential part of linear regression analysis.

Not this kind of regression analysis

This kind of regression analysis

A very brief primer on simple linear regression

Simple linear regression
• A model in which X is used to predict Y.
• Y is a continuous variable with interval-scale properties.
• In the prototypical case, X is also a continuous variable with interval-scale properties.
• Example:
    Y = distance in a 6-minute walk test
    X = FEV1 (forced expiratory volume in 1 second)
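As a quick illustration (not part of the original talk), here is how one might fit such a model in Python; the FEV1 and walk-distance values below are invented purely for the sketch:

```python
# Fit a simple linear regression Y' = bX + a to hypothetical data.
from scipy import stats

fev1 = [1.2, 1.8, 2.1, 2.5, 3.0, 3.4]      # X: FEV1 in litres (made up)
walk_m = [310, 380, 405, 460, 500, 540]    # Y: 6-minute walk distance in metres (made up)

fit = stats.linregress(fev1, walk_m)
print(f"Y' = {fit.slope:.1f}(X) + {fit.intercept:.1f}, r = {fit.rvalue:.2f}")
```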

Back to high school: Equation for a straight line

Y = bX + a

• b = the slope of the line = the rise over the run
• a = the intercept = the value of Y when X = 0

Example of a straight line
• Gym membership
• Annual fee = $100
• Fee per visit = $2
• Let X = the number of visits to the gym
• Let Y = the total cost

Y = 2X + 100

• Let X = 200 visits to the gym
• Total cost = 2(200) + 100 = $500
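The same line, expressed as a tiny Python function to mirror the arithmetic above (a trivial sketch; the function name is mine):

```python
def total_cost(visits, fee_per_visit=2, annual_fee=100):
    """Y = bX + a, with slope b = $2 per visit and intercept a = $100 annual fee."""
    return fee_per_visit * visits + annual_fee

print(total_cost(200))  # 2(200) + 100 = 500
```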

What if the relationship is imperfect?

Straight line for a perfect relationship:

Y = bX + a

Straight line for an imperfect relationship:

Y′ = bX + a    or    Ŷ = bX + a

(Y′ and Ŷ are two different symbols for the "predicted value of Y".)

R-squared
• R-squared = the proportion of variability in Y that is accounted for by the explanatory variables in the model.
• For a simple linear regression model (i.e., one predictor variable), R-squared = the proportion of the variability in Y that can be accounted for by the linear relationship between X and Y.
• The adjusted R-squared corrects for upward bias in R-squared.
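For a model with n observations and k predictors, the adjustment can be computed from R-squared itself; this sketch uses the standard formula, with numbers chosen to match the Anscombe results shown on the next slides:

```python
def adjusted_r_squared(r2, n, k):
    """Correct for upward bias: 1 - (1 - R^2)(n - 1)/(n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# R-squared = 0.67 with n = 11 observations and k = 1 predictor:
print(round(adjusted_r_squared(0.67, 11, 1), 2))  # 0.63
```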

Anscombe's examples (1973)
• Frank Anscombe devised 4 sets of X-Y pairs.
• He performed simple linear regression for each data set.
• Here are the results.
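For readers who want to reproduce the results, here is a minimal sketch using the quartet values published in Anscombe (1973); the set numbering below follows the paper, and is only assumed to match the numbering used in this talk:

```python
# Fit the same simple linear regression to each of Anscombe's four data sets.
from scipy import stats

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # sets 1-3 share the same X values
quartet = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in quartet.items():
    fit = stats.linregress(x, y)
    print(f"Set {name}: Y' = {fit.slope:.2f}(X) + {fit.intercept:.2f}, "
          f"r = {fit.rvalue:.2f}, R-squared = {fit.rvalue ** 2:.2f}")
```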

Means & Standard Deviations

Data Set   N    X Mean   X SD    Y Mean   Y SD
1          11   7.50     2.03    9.00     3.32
2          11   7.50     2.03    9.00     3.32
3          11   7.50     2.03    9.00     3.32
4          11   7.50     2.03    9.00     3.32

The means and SDs for the 4 data sets are identical to two decimals.

Correlations between X and Y

Data Set   Pearson r   R-squared   Adj. R-sq   SE
1          0.82        0.67        0.63        1.24
2          0.82        0.67        0.63        1.24
3          0.82        0.67        0.63        1.24
4          0.82        0.67        0.63        1.24

Correlations, R-squared, adjusted R-squared, and standard errors are all identical to two decimals.

ANOVA Summary Tables

Data Set   Source       SS       df   MS       F        p
1          Regression   27.490    1   27.490   18.003   0.002
1          Residual     13.742    9    1.527
1          Total        41.232   10
2          Regression   27.470    1   27.470   17.972   0.002
2          Residual     13.756    9    1.528
2          Total        41.226   10
3          Regression   27.500    1   27.500   17.966   0.002
3          Residual     13.776    9    1.531
3          Total        41.276   10
4          Regression   27.510    1   27.510   17.990   0.002
4          Residual     13.763    9    1.529
4          Total        41.273   10

The Regression Coefficients

Data Set   Term       B      SE      t      p       95% CI Lower   95% CI Upper
1          Constant   3.00   1.124   2.67   0.026   0.459          5.544
1          X          0.50   0.118   4.24   0.002   0.233          0.766
2          Constant   3.00   1.124   2.67   0.026   0.459          5.546
2          X          0.50   0.118   4.24   0.002   0.233          0.766
3          Constant   3.00   1.125   2.67   0.026   0.455          5.547
3          X          0.50   0.118   4.24   0.002   0.233          0.767
4          Constant   3.00   1.125   2.67   0.026   0.456          5.544
4          X          0.50   0.118   4.24   0.002   0.233          0.767

For all 4 models, Y′ = 0.5(X) + 3.

Which Model is Best?
• Judging by everything we've just seen, it appears that the models are all equally good.
• But if that were true, I wouldn't be doing this talk!
• It is well known that good graphs are an essential part of data analysis (Tukey, 1977; Tufte, 1997).
• Let's look at some graphs that show the relationship between X and Y.
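A sketch of how the four scatter-plots can be drawn with matplotlib, reusing the `quartet` dictionary from the earlier sketch; each panel also shows the common fitted line Y′ = 0.5(X) + 3:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True, sharey=True)
for ax, (name, (x, y)) in zip(axes.flat, quartet.items()):
    ax.scatter(x, y)
    xs = [min(x), max(x)]
    ax.plot(xs, [0.5 * v + 3 for v in xs])   # same fitted line in every panel
    ax.set_title(f"Data Set {name}")
fig.tight_layout()
plt.show()
```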

Scatter-plot for Data Set 1

[Scatter-plot: 10 data points plus one influential point]

Not a good model.

Scatter-plot for Data Set 2

[Scatter-plot: a perfect linear relationship except for one outlier]

A better model than for Data Set 1, but still not great.

Scatter-plot for Data Set 3

[Scatter-plot: a clearly curvilinear relationship between X and Y]

Wrong model! The relationship between X and Y is curvilinear, not linear. The model should include both X and X² as predictors.

Scatter-plot for Data Set 4

[Scatter-plot]

This is a good-looking plot: no influential points, and a straight line provides a good fit.

Summary
• The usual summary statistics for the 4 regression models were virtually identical.
• Scatter-plots revealed that only one of the 4 data sets gave us a good model.
• Appropriate graphs are an essential part of data analysis.

What about multivariable models?
• Scatter-plots are useful for simple linear regression models (i.e., only one predictor variable).
• But often we have multiple, or multivariable, regression models (i.e., 2 or more predictor variables).
• In that case, it is more common to assess the fit of the model by looking at residual plots.

What is a residual?
• In linear regression, a residual is an error in prediction:

Residual = (Y − Y′) = (actual score − predicted score)
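In code, the residuals are just the element-wise differences between observed and predicted scores; a sketch, reusing `quartet` and the common fitted line from earlier:

```python
x, y = quartet[1]
predicted = [0.5 * xi + 3 for xi in x]                  # Y' = 0.5(X) + 3
residuals = [yi - pi for yi, pi in zip(y, predicted)]   # Y - Y'
print([round(r, 2) for r in residuals])
```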

Set 1: Scatter-plot vs. Residual Plot

[Left: scatter-plot of Y vs. X. Right: residual plot of the residuals vs. the predicted value of Y]

Set 2: Scatter-plot vs. Residual Plot

[Left: scatter-plot. Right: residual plot of the residuals vs. the predicted value of Y]

Set 3: Scatter-plot vs. Residual Plot

[Left: scatter-plot. Right: residual plot showing runs of same-sign residuals]

Set 4: Scatter-plot vs. Residual Plot

[Left: scatter-plot. Right: residual plot of the residuals vs. the predicted value of Y]

Summary
• The usual summary statistics for the 4 regression models were virtually identical.
• Scatter-plots revealed that only one of the 4 data sets gave us a good model.
• Residual plots reveal the same thing, and have the advantage of being applicable to multivariable regression models.
• Appropriate graphs are an essential part of data analysis.

Questions?

[Cartoon caption: "I think you should be more explicit here in step 2."]

References

Anscombe FJ. (1973). Graphs in statistical analysis. The American Statistician, 27, 17-21.

Tufte ER. (1997). Visual explanations: Images and quantities, evidence and narrative (3rd ed.). Cheshire, CT: Graphics Press.

Tukey JW. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.

Extra Slides

Just as one would expect!
• The experimentalist comes running excitedly into the theorist's office, waving a graph taken off his latest experiment.
• "Hmmm," says the theorist, "that's exactly where you'd expect to see that peak. Here's the reason (long logical explanation follows)."
• In the middle of it, the experimentalist says "Wait a minute," studies the chart for a second, and says, "Oops, this is upside down." He fixes it.
• "Hmmm," says the theorist, "you'd expect to see a dip in exactly that position. Here's the reason..."

Best-fitting line: Least squares criterion
• Many lines could be placed on the scatter-plot, but only one of them is considered the best-fitting line.
• The most common criterion for the best fit is that the sum of the squared errors in prediction is minimized.
• This is called the least-squares criterion.

Illustration of Least Squares

[Scatter-plot with a vertical segment marking each error in prediction]

Illustration of Least Squares

[The same plot with a square drawn on each error in prediction; one point has an error of 0, so it gets no square]

Illustration of Least Squares

[The sum of squared errors = the sum of the areas of all of these squares]

For any other regression line, the sum of the squared errors would be greater.
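This claim is easy to check numerically: nudge the least-squares line in any direction and the sum of squared errors grows. A sketch, using Data Set 1 from the earlier `quartet` dictionary:

```python
def sse(x, y, b, a):
    """Sum of squared errors in prediction for the line Y' = bX + a."""
    return sum((yi - (b * xi + a)) ** 2 for xi, yi in zip(x, y))

x, y = quartet[1]
print(round(sse(x, y, 0.50, 3.0), 2))  # the fitted line (coefficients rounded to 2 dp)
print(round(sse(x, y, 0.55, 3.0), 2))  # steeper slope: larger SSE
print(round(sse(x, y, 0.50, 2.5), 2))  # lower intercept: larger SSE
```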

What is a residual plot?
• A scatter-plot with:
    X = the fitted (or predicted) value of Y
    Y = the residual (i.e., the error in prediction)
• Residuals should be independent of the fitted value of Y.
• There should be no serial correlation in the residuals (e.g., long runs of same-sign residuals).
• Both of these problems (plus some others) can be detected via residual plots.
• Advantage of residual plots: they can be used in multivariable (i.e., multi-predictor) regression models.
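A sketch of one residual plot, with the fitted value on the horizontal axis and the residual on the vertical axis (again reusing `quartet`):

```python
import matplotlib.pyplot as plt

x, y = quartet[2]
predicted = [0.5 * xi + 3 for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

plt.scatter(predicted, residuals)
plt.axhline(0)                        # residuals should straddle zero with no pattern
plt.xlabel("Predicted value of Y")
plt.ylabel("Residual")
plt.show()
```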

Examples of residual plots

[Three residual plots of the residual vs. the predicted Y, illustrating a curvilinear relationship, an outlier, and heteroscedasticity]

Example of a good residual plot

[Residual plot with no systematic pattern in the residuals]

Example of a zig-zag pattern

[Residual plot with residuals that alternate in sign from point to point]

You do not want to see this kind of zig-zag pattern in the residual plot.

Simple linear regression & correlation
• Pearson r = the correlation coefficient.
• It measures the direction and strength of the linear association between X and Y.
• It ranges from -1 to +1.

Direction of the linear relationship

[Left: positive relationship (as X increases, Y increases). Right: negative relationship (as X increases, Y decreases)]

Perfect vs. Imperfect Relationship

[Left: a perfect relationship. Right: an imperfect relationship]

r-squared
• The square of Pearson r is a measure of how well the regression model fits the observed data.
• It gives the proportion of variability in Y that is accounted for by the linear relationship between X and Y.
• E.g., let r = 0.6 (or -0.6); then r² = 0.36.
• So 36% of the variability in the Y-scores is accounted for by the linear relationship between X and Y.
