The importance of graphing the data: Anscombe's regression examples

Bruce Weaver
Northern Health Research Conference
Nipissing University, North Bay
May 30-31, 2008
B. Weaver, NHRC 2008
The Objective
To demonstrate that good graphs are an essential part of linear regression analysis.
Not this kind of regression analysis
This kind of regression analysis
A very brief primer on simple linear regression
Simple linear regression

• A model in which X is used to predict Y.
• Y is a continuous variable with interval-scale properties.
• In the prototypical case, X is also a continuous variable with interval-scale properties.
• Example:
  Y = distance in a 6-minute walk test
  X = FEV1
Back to high school: Equation for a straight line

Y = bX + a

SLOPE: b = slope of the line = the rise over the run
INTERCEPT: a = the value of Y when X = 0
Example of a straight line

• Gym membership
• Annual fee = $100
• Fee per visit = $2
• Let X = the number of visits to the gym
• Let Y = the total cost

Y = 2X + 100

• Let X = 200 visits to the gym
• Total cost = 2(200) + 100 = $500
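The gym example can be sketched in a few lines of code (a minimal illustration; the function name `total_cost` is mine, not the slide's):

```python
def total_cost(visits, fee_per_visit=2, annual_fee=100):
    """Total yearly gym cost: Y = bX + a, with slope b = $2/visit and intercept a = $100."""
    return fee_per_visit * visits + annual_fee

print(total_cost(200))  # 2 * 200 + 100 = 500
```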
What if the relationship is imperfect?

Straight line for a perfect relationship:
Y = bX + a

Straight line for an imperfect relationship:
Y′ = bX + a  or  Ŷ = bX + a
(two different symbols for the "predicted value of Y")
R-squared

• R-squared = the proportion of variability in Y that is accounted for by explanatory variables in the model.
• For a simple linear regression model (i.e., one predictor variable), R-squared = the proportion of the variability in Y that can be accounted for by the linear relationship between X and Y.
• The adjusted R-squared corrects for upward bias in R-squared.
Anscombe's examples (1973)

• Frank Anscombe devised 4 sets of X-Y pairs
• He performed simple linear regression for each data set
• Here are the results
Means & Standard Deviations

Data Set    N    Mean(X)   SD(X)   Mean(Y)   SD(Y)
    1      11      9.00     3.32     7.50     2.03
    2      11      9.00     3.32     7.50     2.03
    3      11      9.00     3.32     7.50     2.03
    4      11      9.00     3.32     7.50     2.03

The means and SDs for the 4 data sets are identical to two decimals.
Correlations between X and Y

Data Set   Pearson r   R-squared   Adj. R-sq    SE
    1        0.82        0.67        0.63      1.24
    2        0.82        0.67        0.63      1.24
    3        0.82        0.67        0.63      1.24
    4        0.82        0.67        0.63      1.24

Correlations, R-squared, adjusted R-squared, and standard errors are all identical to two decimals.
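Anscombe's published values reproduce these summary statistics (a sketch using only the Python standard library; the data are hard-coded from Anscombe, 1973, and his set numbering may not match the ordering used elsewhere in these slides):

```python
from statistics import mean, stdev

# Anscombe's (1973) four data sets; sets 1-3 share the same X values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4   = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

def pearson_r(x, y):
    """Pearson correlation computed from deviation cross-products."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Each row prints 9.0 3.32 7.5 2.03 0.82 -- identical to two decimals
for x, y in [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]:
    print(round(mean(x), 2), round(stdev(x), 2),
          round(mean(y), 2), round(stdev(y), 2),
          round(pearson_r(x, y), 2))
```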
ANOVA Summary Tables

Data Set   Source        SS       df     MS       F        p
    1      Regression    27.490    1    27.490   18.003   0.002
           Residual      13.742    9     1.527
           Total         41.232   10
    2      Regression    27.470    1    27.470   17.972   0.002
           Residual      13.756    9     1.528
           Total         41.226   10
    3      Regression    27.500    1    27.500   17.966   0.002
           Residual      13.776    9     1.531
           Total         41.276   10
    4      Regression    27.510    1    27.510   17.990   0.002
           Residual      13.763    9     1.529
           Total         41.273   10
The Regression Coefficients

Data Set   Term        B      SE      t      p       95% CI (Lower, Upper)
    1      Constant   3.00   1.124   2.67   0.026    0.459, 5.544
           X          0.50   0.118   4.24   0.002    0.233, 0.766
    2      Constant   3.00   1.124   2.67   0.026    0.459, 5.546
           X          0.50   0.118   4.24   0.002    0.233, 0.766
    3      Constant   3.00   1.125   2.67   0.026    0.455, 5.547
           X          0.50   0.118   4.24   0.002    0.233, 0.767
    4      Constant   3.00   1.125   2.67   0.026    0.456, 5.544
           X          0.50   0.118   4.24   0.002    0.233, 0.767

For all 4 models, Y′ = 0.5X + 3
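Fitting the least-squares line to Anscombe's first data set recovers these coefficients (standard OLS formulas; data hard-coded from Anscombe, 1973):

```python
# Anscombe set 1 (his 1973 numbering)
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

def fit_line(x, y):
    """OLS slope and intercept for Y' = bX + a: b = Sxy / Sxx, a = mean(Y) - b * mean(X)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return b, my - b * mx

b, a = fit_line(x, y)
print(round(b, 2), round(a, 2))  # 0.5 3.0
```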
Which Model is Best?

• Judging by everything we've just seen, it appears that the models are all equally good
• But if that were true, I wouldn't be doing this talk!
• It is well known that good graphs are an essential part of data analysis (Tukey, 1977; Tufte, 1997)
• Let's look at some graphs that show the relationship between X and Y
Scatter-plot for Data Set 1
[Figure: scatter-plot with 10 data points plus one influential point. Not a good model.]
Scatter-plot for Data Set 2
[Figure: a perfect linear relationship except for one outlier. Better model than for Data Set 1, but still not great.]
Scatter-plot for Data Set 3
[Figure: the relationship between X and Y is curvilinear, not linear. Wrong model! The model should include both X and X² as predictors.]
Scatter-plot for Data Set 4
[Figure: a good looking plot. No influential points; the straight line provides a good fit.]
Summary

• The usual summary statistics for the 4 regression models were virtually identical
• Scatter-plots revealed that only one of the 4 data sets gave us a good model
• Appropriate graphs are an essential part of data analysis
What about multivariable models?

• Scatter-plots are useful for simple linear regression models (i.e., only one predictor variable)
• But often we have multivariable regression models (i.e., 2 or more predictor variables)
• In that case, it is more common to assess the fit of the model by looking at residual plots
What is a residual?

• In linear regression, a residual is an error in prediction

Residual = (Y − Y′) = (actual score − predicted score)
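A residual is just the vertical gap between an observed point and the fitted line. A minimal sketch using Anscombe's first data set and the fitted line Y′ = 0.5X + 3:

```python
# Anscombe set 1 and its fitted line Y' = 0.5X + 3
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

predicted = [0.5 * xi + 3 for xi in x]                  # Y' for each X
residuals = [yi - pi for yi, pi in zip(y, predicted)]   # Y - Y'

print(round(residuals[0], 2))  # first point: 8.04 - 8.00 = 0.04
```

With the exact least-squares coefficients the residuals would sum to zero; here they sum to almost zero only because 0.5 and 3 are rounded.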
Set 1: Scatter-plot vs. Residual Plot
[Figure: scatter-plot of Y vs. X beside a residual plot of the residuals vs. the predicted value of Y.]
Set 2: Scatter-plot vs. Residual Plot
[Figure: scatter-plot of Y vs. X beside a residual plot of the residuals vs. the predicted value of Y.]
Set 3: Scatter-plot vs. Residual Plot
[Figure: scatter-plot of Y vs. X beside a residual plot of the residuals vs. the predicted value of Y. Note the runs of same-sign residuals.]
Set 4: Scatter-plot vs. Residual Plot
[Figure: scatter-plot of Y vs. X beside a residual plot of the residuals vs. the predicted value of Y.]
Summary

• The usual summary statistics for the 4 regression models were virtually identical
• Scatter-plots revealed that only one of the 4 data sets gave us a good model
• Residual plots reveal the same thing, and have the advantage of being applicable to multivariable regression models
• Appropriate graphs are an essential part of data analysis
Questions?
[Cartoon caption: "I think you should be more explicit here in step 2."]
References

Anscombe FJ. (1973). Graphs in statistical analysis. The American Statistician, 27, 17-21.
Tufte ER. (1997). Visual explanations: Images and quantities, evidence and narrative (3rd Ed.). Cheshire, CT: Graphics Press.
Tukey JW. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.
Extra Slides
Just as one would expect!

• The experimentalist comes running excitedly into the theorist's office, waving a graph taken off his latest experiment.
• "Hmmm," says the theorist, "that's exactly where you'd expect to see that peak. Here's the reason (long logical explanation follows)."
• In the middle of it, the experimentalist says "Wait a minute," studies the chart for a second, and says, "Oops, this is upside down." He fixes it.
• "Hmmm," says the theorist, "you'd expect to see a dip in exactly that position. Here's the reason..."
Best-fitting line: Least squares criterion

• Many lines could be placed on the scatter-plot, but only one of them is considered the best-fitting line.
• The most common criterion for best-fitting is that the sum of the squared errors in prediction is minimized.
• This is called the least-squares criterion.
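The criterion can be checked numerically: the least-squares line Y′ = 0.5X + 3 for Anscombe's first data set yields a smaller sum of squared errors than any other line (one arbitrary alternative line shown as a sketch):

```python
# Anscombe set 1
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

def sse(b, a):
    """Sum of squared prediction errors for the line Y' = bX + a."""
    return sum((yi - (b * xi + a)) ** 2 for xi, yi in zip(x, y))

# The least-squares line beats an arbitrarily chosen alternative line
print(sse(0.5, 3.0) < sse(0.6, 2.0))  # True
```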
Illustration of Least Squares
[Figure: scatter-plot with the regression line; a vertical segment marks one error in prediction.]
Illustration of Least Squares
[Figure: each error in prediction drawn as a square (the squared error). Error = 0 for one point, so no square.]
Illustration of Least Squares
[Figure: the sum of squared errors = the sum of the areas of all these squares. For any other regression line, the sum of the squared errors would be greater.]
What is a residual plot?

• A scatter-plot with:
  X = the fitted (or predicted) value of Y
  Y = the residual (i.e., the error in prediction)
• Residuals should be independent of the fitted value of Y
• There should be no serial correlation in the residuals (e.g., long runs of same-sign residuals)
• Both of these problems (plus some others) can be detected via residual plots
• Advantage of residual plots: they can be used in multivariable (i.e., multi-predictor) regression models
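The same-sign-runs check can be automated. Applied to the data set that is curvilinear in Anscombe's paper (his set 2; the slide numbering may differ), the run is conspicuously long:

```python
# Anscombe's curvilinear set, sorted by X so residuals are ordered by predicted Y
x = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
y = [3.10, 4.74, 6.13, 7.26, 8.14, 8.77, 9.14, 9.26, 9.13, 8.74, 8.10]
residuals = [yi - (0.5 * xi + 3) for xi, yi in zip(x, y)]

def longest_sign_run(res):
    """Length of the longest run of same-sign residuals (a crude serial-correlation check)."""
    best = run = 1
    for prev, cur in zip(res, res[1:]):
        run = run + 1 if (prev >= 0) == (cur >= 0) else 1
        best = max(best, run)
    return best

print(longest_sign_run(residuals))  # 7 consecutive points on the same side of the line
```

A run that long in only 11 points is a strong hint that a straight line is the wrong model for these data.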
Examples of residual plots
[Figure: residual vs. predicted-Y plots illustrating a curvilinear relationship, an outlier, and heteroscedasticity.]
Example of a good residual plot
Example of a zig-zag pattern
You do not want to see this kind of zig-zag pattern in the residual plot.
Simple linear regression & correlation

• Pearson r = the correlation
• It measures the direction and strength of the linear association between X and Y
• It ranges from −1 to +1
Direction of the linear relationship

Positive relationship: as X increases, Y increases.
Negative relationship: as X increases, Y decreases.
Perfect vs. Imperfect Relationship
[Figure: scatter-plots of a perfect relationship and an imperfect relationship.]
r-squared

• The square of Pearson r is a measure of how well the regression model fits the observed data
• It gives the proportion of variability in Y that is accounted for by the linear relationship between X and Y
• E.g., let r = 0.6 (or −0.6)
• r² = 0.36
• So 36% of the variability in the Y-scores is accounted for by the linear relationship between X and Y
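The arithmetic on this slide, using nothing beyond the slide's own numbers:

```python
r = 0.6
r_squared = r ** 2
print(round(r_squared, 2))  # 0.36 -> 36% of the variability in Y explained

# The sign of r does not change r-squared: r = -0.6 gives the same 0.36
assert (-r) ** 2 == r ** 2
```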