The importance of graphing the data: Anscombe's regression examples

Bruce Weaver
Northern Health Research Conference
Nipissing University, North Bay
May 30-31, 2008
B. Weaver, NHRC 2008
The Objective
To demonstrate that good graphs are an essential part of linear regression analysis.
Not this kind of regression analysis
This kind of regression analysis
A very brief primer on simple linear regression
Simple linear regression

• A model in which X is used to predict Y.
• Y is a continuous variable with interval-scale properties.
• In the prototypical case, X is also a continuous variable with interval-scale properties.
• Example:
  Y = distance in a 6-minute walk test
  X = FEV1
Back to high school: Equation for a straight line

Y = bX + a

SLOPE: b = slope of the line = the rise over the run
INTERCEPT: a = the value of Y when X = 0
Example of a straight line

• Gym membership
• Annual fee = $100
• Fee per visit = $2
• Let X = the number of visits to the gym
• Let Y = the total cost

Y = 2X + 100

• Let X = 200 visits to the gym
• Total cost = 2(200) + 100 = $500
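The gym example can be sketched in a few lines of code (a minimal illustration; the function name `total_cost` is mine, not the slide's):

```python
def total_cost(visits, fee_per_visit=2, annual_fee=100):
    """Total yearly gym cost: Y = bX + a, with slope b = $2/visit and intercept a = $100."""
    return fee_per_visit * visits + annual_fee

print(total_cost(200))  # 2 * 200 + 100 = 500
```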
What if the relationship is imperfect?

Straight line for a perfect relationship:
Y = bX + a

Straight line for an imperfect relationship:
Y′ = bX + a  or  Ŷ = bX + a
(two different symbols for the "predicted value of Y")
R-squared

• R-squared = the proportion of variability in Y that is accounted for by explanatory variables in the model.
• For a simple linear regression model (i.e., one predictor variable), R-squared = the proportion of the variability in Y that can be accounted for by the linear relationship between X and Y.
• The adjusted R-squared corrects for upward bias in R-squared.
Anscombe's examples (1973)

• Frank Anscombe devised 4 sets of X-Y pairs
• He performed simple linear regression for each data set
• Here are the results
Means & Standard Deviations

Data Set    N    Mean(X)   SD(X)   Mean(Y)   SD(Y)
    1      11      9.00     3.32     7.50     2.03
    2      11      9.00     3.32     7.50     2.03
    3      11      9.00     3.32     7.50     2.03
    4      11      9.00     3.32     7.50     2.03

The means and SDs for the 4 data sets are identical to two decimals.
Correlations between X and Y

Data Set   Pearson r   R-squared   Adj. R-sq    SE
    1        0.82        0.67        0.63      1.24
    2        0.82        0.67        0.63      1.24
    3        0.82        0.67        0.63      1.24
    4        0.82        0.67        0.63      1.24

Correlations, R-squared, adjusted R-squared, and standard errors are all identical to two decimals.
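Anscombe's published values reproduce these summary statistics (a sketch using only the Python standard library; the data are hard-coded from Anscombe, 1973, and his set numbering may not match the ordering used elsewhere in these slides):

```python
from statistics import mean, stdev

# Anscombe's (1973) four data sets; sets 1-3 share the same X values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4   = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

def pearson_r(x, y):
    """Pearson correlation computed from deviation cross-products."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Each row prints 9.0 3.32 7.5 2.03 0.82 -- identical to two decimals
for x, y in [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]:
    print(round(mean(x), 2), round(stdev(x), 2),
          round(mean(y), 2), round(stdev(y), 2),
          round(pearson_r(x, y), 2))
```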
ANOVA Summary Tables

Data Set   Source        SS       df     MS       F        p
    1      Regression    27.490    1    27.490   18.003   0.002
           Residual      13.742    9     1.527
           Total         41.232   10
    2      Regression    27.470    1    27.470   17.972   0.002
           Residual      13.756    9     1.528
           Total         41.226   10
    3      Regression    27.500    1    27.500   17.966   0.002
           Residual      13.776    9     1.531
           Total         41.276   10
    4      Regression    27.510    1    27.510   17.990   0.002
           Residual      13.763    9     1.529
           Total         41.273   10
The Regression Coefficients

Data Set   Term        B      SE      t      p       95% CI (Lower, Upper)
    1      Constant   3.00   1.124   2.67   0.026    0.459, 5.544
           X          0.50   0.118   4.24   0.002    0.233, 0.766
    2      Constant   3.00   1.124   2.67   0.026    0.459, 5.546
           X          0.50   0.118   4.24   0.002    0.233, 0.766
    3      Constant   3.00   1.125   2.67   0.026    0.455, 5.547
           X          0.50   0.118   4.24   0.002    0.233, 0.767
    4      Constant   3.00   1.125   2.67   0.026    0.456, 5.544
           X          0.50   0.118   4.24   0.002    0.233, 0.767

For all 4 models, Y′ = 0.5X + 3
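Fitting the least-squares line to Anscombe's first data set recovers these coefficients (standard OLS formulas; data hard-coded from Anscombe, 1973):

```python
# Anscombe set 1 (his 1973 numbering)
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

def fit_line(x, y):
    """OLS slope and intercept for Y' = bX + a: b = Sxy / Sxx, a = mean(Y) - b * mean(X)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return b, my - b * mx

b, a = fit_line(x, y)
print(round(b, 2), round(a, 2))  # 0.5 3.0
```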
Which Model is Best?

• Judging by everything we've just seen, it appears that the models are all equally good
• But if that were true, I wouldn't be doing this talk!
• It is well known that good graphs are an essential part of data analysis (Tukey, 1977; Tufte, 1997)
• Let's look at some graphs that show the relationship between X and Y
Scatter-plot for Data Set 1
[Figure: scatter-plot with 10 data points plus one influential point. Not a good model.]
Scatter-plot for Data Set 2
[Figure: a perfect linear relationship except for one outlier. Better model than for Data Set 1, but still not great.]
Scatter-plot for Data Set 3
[Figure: the relationship between X and Y is curvilinear, not linear. Wrong model! The model should include both X and X² as predictors.]
Scatter-plot for Data Set 4
[Figure: a good looking plot. No influential points; the straight line provides a good fit.]
Summary

• The usual summary statistics for the 4 regression models were virtually identical
• Scatter-plots revealed that only one of the 4 data sets gave us a good model
• Appropriate graphs are an essential part of data analysis
What about multivariable models?

• Scatter-plots are useful for simple linear regression models (i.e., only one predictor variable)
• But often we have multivariable regression models (i.e., 2 or more predictor variables)
• In that case, it is more common to assess the fit of the model by looking at residual plots
What is a residual?

• In linear regression, a residual is an error in prediction

Residual = (Y − Y′) = (actual score − predicted score)
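A residual is just the vertical gap between an observed point and the fitted line. A minimal sketch using Anscombe's first data set and the fitted line Y′ = 0.5X + 3:

```python
# Anscombe set 1 and its fitted line Y' = 0.5X + 3
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

predicted = [0.5 * xi + 3 for xi in x]                  # Y' for each X
residuals = [yi - pi for yi, pi in zip(y, predicted)]   # Y - Y'

print(round(residuals[0], 2))  # first point: 8.04 - 8.00 = 0.04
```

With the exact least-squares coefficients the residuals would sum to zero; here they sum to almost zero only because 0.5 and 3 are rounded.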
Set 1: Scatter-plot vs. Residual Plot
[Figure: scatter-plot of Y vs. X beside a residual plot of the residuals vs. the predicted value of Y.]
Set 2: Scatter-plot vs. Residual Plot
[Figure: scatter-plot of Y vs. X beside a residual plot of the residuals vs. the predicted value of Y.]
Set 3: Scatter-plot vs. Residual Plot
[Figure: scatter-plot of Y vs. X beside a residual plot of the residuals vs. the predicted value of Y. Note the runs of same-sign residuals.]
Set 4: Scatter-plot vs. Residual Plot
[Figure: scatter-plot of Y vs. X beside a residual plot of the residuals vs. the predicted value of Y.]
Summary

• The usual summary statistics for the 4 regression models were virtually identical
• Scatter-plots revealed that only one of the 4 data sets gave us a good model
• Residual plots reveal the same thing, and have the advantage of being applicable to multivariable regression models
• Appropriate graphs are an essential part of data analysis
Questions?
[Cartoon caption: "I think you should be more explicit here in step 2."]
References

Anscombe FJ. (1973). Graphs in statistical analysis. The American Statistician, 27, 17-21.
Tufte ER. (1997). Visual explanations: Images and quantities, evidence and narrative (3rd Ed.). Cheshire, CT: Graphics Press.
Tukey JW. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.
Extra Slides
Just as one would expect!

• The experimentalist comes running excitedly into the theorist's office, waving a graph taken off his latest experiment.
• "Hmmm," says the theorist, "that's exactly where you'd expect to see that peak. Here's the reason (long logical explanation follows)."
• In the middle of it, the experimentalist says "Wait a minute," studies the chart for a second, and says, "Oops, this is upside down." He fixes it.
• "Hmmm," says the theorist, "you'd expect to see a dip in exactly that position. Here's the reason..."
Best-fitting line: Least squares criterion

• Many lines could be placed on the scatter-plot, but only one of them is considered the best-fitting line.
• The most common criterion for best-fitting is that the sum of the squared errors in prediction is minimized.
• This is called the least-squares criterion.
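The criterion can be checked numerically: the least-squares line Y′ = 0.5X + 3 for Anscombe's first data set yields a smaller sum of squared errors than any other line (one arbitrary alternative line shown as a sketch):

```python
# Anscombe set 1
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

def sse(b, a):
    """Sum of squared prediction errors for the line Y' = bX + a."""
    return sum((yi - (b * xi + a)) ** 2 for xi, yi in zip(x, y))

# The least-squares line beats an arbitrarily chosen alternative line
print(sse(0.5, 3.0) < sse(0.6, 2.0))  # True
```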
Illustration of Least Squares
[Figure: scatter-plot with the regression line; a vertical segment marks one error in prediction.]
Illustration of Least Squares
[Figure: each error in prediction drawn as a square (the squared error). Error = 0 for one point, so no square.]
Illustration of Least Squares
[Figure: the sum of squared errors = the sum of the areas of all these squares. For any other regression line, the sum of the squared errors would be greater.]
What is a residual plot?

• A scatter-plot with:
  X = the fitted (or predicted) value of Y
  Y = the residual (i.e., the error in prediction)
• Residuals should be independent of the fitted value of Y
• There should be no serial correlation in the residuals (e.g., long runs of same-sign residuals)
• Both of these problems (plus some others) can be detected via residual plots
• Advantage of residual plots: they can be used in multivariable (i.e., multi-predictor) regression models
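The same-sign-runs check can be automated. Applied to the data set that is curvilinear in Anscombe's paper (his set 2; the slide numbering may differ), the run is conspicuously long:

```python
# Anscombe's curvilinear set, sorted by X so residuals are ordered by predicted Y
x = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
y = [3.10, 4.74, 6.13, 7.26, 8.14, 8.77, 9.14, 9.26, 9.13, 8.74, 8.10]
residuals = [yi - (0.5 * xi + 3) for xi, yi in zip(x, y)]

def longest_sign_run(res):
    """Length of the longest run of same-sign residuals (a crude serial-correlation check)."""
    best = run = 1
    for prev, cur in zip(res, res[1:]):
        run = run + 1 if (prev >= 0) == (cur >= 0) else 1
        best = max(best, run)
    return best

print(longest_sign_run(residuals))  # 7 consecutive points on the same side of the line
```

A run that long in only 11 points is a strong hint that a straight line is the wrong model for these data.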
Examples of residual plots
[Figure: residual vs. predicted-Y plots illustrating a curvilinear relationship, an outlier, and heteroscedasticity.]
Example of a good residual plot
Example of a zig-zag pattern
You do not want to see this kind of zig-zag pattern in the residual plot.
Simple linear regression & correlation

• Pearson r = the correlation
• It measures the direction and strength of the linear association between X and Y
• It ranges from −1 to +1
Direction of the linear relationship

Positive relationship: as X increases, Y increases.
Negative relationship: as X increases, Y decreases.
Perfect vs. Imperfect Relationship
[Figure: scatter-plots of a perfect relationship and an imperfect relationship.]
r-squared

• The square of Pearson r is a measure of how well the regression model fits the observed data
• It gives the proportion of variability in Y that is accounted for by the linear relationship between X and Y
• E.g., let r = 0.6 (or −0.6)
• r² = 0.36
• So 36% of the variability in the Y-scores is accounted for by the linear relationship between X and Y
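The arithmetic on this slide, using nothing beyond the slide's own numbers:

```python
r = 0.6
r_squared = r ** 2
print(round(r_squared, 2))  # 0.36 -> 36% of the variability in Y explained

# The sign of r does not change r-squared: r = -0.6 gives the same 0.36
assert (-r) ** 2 == r ** 2
```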