Review Exercises for Chapter 8

Chapter 8 -- Review Exercises
Statistics 1040 -- Dr. McGahagan

Correlation

Simple warmup exercises: Do these by hand, and check your computation on a computer. Have the computer draw the plots: (plot y x), (plot z x) and so on.

Begin with:
(bind x (list 1 2 3 4 5))
(bind y (list 5 4 3 2 1))
(bind z (list 5 3 4 2 1))

Then compute the correlations: Do (corr x y) and (corr x z) by hand before running the command. What would you expect (corr y z) to be (before you do any computations)? What is it?

Review Exercises (p. 134-139)

Problem 1. IQ of husbands and wives.

                 Mean IQ    SD
(X) Husbands       100      15
(Y) Wives          100      15

Rho = 0.60
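To get a feel for what a correlation of 0.60 looks like next to a much stronger one, pairs of scores with a chosen rho can be simulated. The following is only a sketch in Python, not the course software: the function names simulate_pairs and corr, the sample size of 5000, and the seed are illustrative assumptions, and the scores are assumed to be normally distributed with mean 100 and SD 15.

```python
import math
import random


def corr(xs, ys):
    """Correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_pairs(rho, n, mean=100.0, sd=15.0, seed=0):
    """Draw n (x, y) pairs whose population correlation is rho.

    Assumption: both variables are normal with the same mean and SD.
    y is built from x plus independent noise so that corr(x, y) = rho.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        xs.append(mean + sd * z1)
        ys.append(mean + sd * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2))
    return xs, ys


husbands, wives = simulate_pairs(0.60, 5000)
r_060 = corr(husbands, wives)

xs_090, ys_090 = simulate_pairs(0.90, 5000)
r_090 = corr(xs_090, ys_090)

print("sample r for rho = 0.60:", round(r_060, 2))
print("sample r for rho = 0.90:", round(r_090, 2))
```

Plotting the two simulated clouds with any plotting tool shows the rho = 0.90 cloud hugging a line much more tightly than the rho = 0.60 cloud.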

Which of the plots on p. 135 represents this data?

Eliminate (A) because in that plot the mean of X and the mean of Y are both 50, not 100.

Eliminate (B) because mean +/- 1 SD would cover all the data; this is improbable for almost any realistic distribution.

Eliminate (C) because the correlation looks much higher than 0.60 (probably around 0.95); generate a few examples with (scatter 0.90) and (scatter 0.60) to get a sense of the difference. Also, it looks as if it would take 3 SD in either direction to cover all the data; this is possible but not very likely with a normal distribution.

We are left with (D), which seems to have the right means for X and Y in the center of the ellipse, to have a correlation of about 0.60, and to have most of the data covered within 2 SD of the means of both variables.

Problem 2. Cars, age and gas mileage.

(a) Correlation between age of car and gas mileage would be expected to be negative: as age increases, gas mileage decreases.

(b) Correlation between owner's income and gas mileage would be expected to be positive, even if the cars of the rich were identical in age, model and odometer miles: the rich are more likely to pay for regular maintenance.

Problem 3. Men always marry women exactly 10 percent shorter.

Let HH = height of husband and HW = height of wife. Then HW = 0.90 HH (a line with intercept 0 and slope 0.90) will be the equation of the line enabling us to predict the height of the wife, and every point representing a husband-wife pair will be exactly on that line -- a man 60 inches tall will have a wife 54 inches tall, a man 70 inches tall will have a wife 63 inches tall, a man 500 inches tall will have a wife 450 inches tall, and so on. The correlation coefficient will be +1.

IMPORTANT LESSON: the correlation coefficient does not give the slope of the line, but is determined by the presence or absence of scatter about the line.

Problem 4. Estimate the likely correlation between the height of husbands and wives.

Some positive correlation would be expected, but 0.95 is surely too high. I would guess 0.6 is the most likely value, but would think 0.3 a reasonable guess as well.
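The warmup correlations and the lesson of Problem 3 can both be checked with a few lines of code. The sketch below is in Python rather than the course software, and the helper name corr is simply chosen here to mirror the (corr x y) command. The last example uses a hypothetical slope of 0.5, which is not part of the exercise, to show that changing the slope leaves the correlation at +1.

```python
import math


def corr(xs, ys):
    """Correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


# Warmup lists from the start of the handout.
x = [1, 2, 3, 4, 5]
y = [5, 4, 3, 2, 1]
z = [5, 3, 4, 2, 1]

r_xy = corr(x, y)
r_xz = corr(x, z)
r_yz = corr(y, z)

# Problem 3: the three husband/wife pairs quoted in the solution.
husbands = [60, 70, 500]
wives = [54, 63, 450]
r_heights = corr(husbands, wives)

# Hypothetical variation (not in the exercise): wives exactly half as tall.
wives_half = [0.5 * h for h in husbands]
r_half = corr(husbands, wives_half)

print(r_xy, r_xz, r_yz)
print(r_heights, r_half)
```

Working the warmup by hand gives -1 for x and y, -0.9 for x and z, and +0.9 for y and z. Since y is just x reversed, the correlation of y with z is the same size as that of x with z, with the opposite sign.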

Problem 5. Guess the correlation. Choices: -0.5, 0.0, 0.3, 0.6, 0.95

(a) Between freshman and sophomore GPA. Positive but certainly not perfect. Many students who have initial trouble improve in second year. Best guess: 0.6 [incidentally, this is on the order of the correlation between SAT scores and freshman grades]

(b) Between freshman and senior grades: Still positive, but with more time comes more variation. Best guess: 0.30

(c) Between weight and length of 2x4 pieces of pine. Best guess: 0.95. There is still likely to be variation if moisture conditions or number of knotholes differ.

Problem 6. Is correlation of 0.90 "Twice as Linear" as correlation of 0.45?

The phrase "twice as linear" is nonsense -- all correlations measure the strength of a linear relation. You can say "twice as strong," but it is best just to report the number.

Problem 7. Associate scatter diagrams with correlation coefficients.

Grid of correlation coefficients corresponding with the plots should be:

+0.62 (positive, not very strong)
-1.00 (perfectly negatively correlated)
-0.85 (negative, pretty strong)
+0.95 (positive and very strong)
+0.06 (no real pattern, and not even very clear that the relation is positive)
-0.38 (weak, but pretty definitely negative; compare with the plot to the left to decide which of the final two is which)

Problem 8. Human Growth Study.

Plot of heights at age 18 against height at age 4 shows fairly strong but not perfect correlation.

Average height at age 4 seems about 41 inches -- at about the point where the two lines cross. SD of height at age 4 = about 1.5 inches, because 41 +/- 2 (1.5) will cover almost all points (put an index card vertically at 44 and 38 inches to see this).

Average height at age 18 seems about 71 inches (again the point at which the two lines cross). SD of height at age 18 = about 2.5 inches, because 71 +/- 2 (2.5) will cover almost all points (put an index card horizontally at 66 and 76 inches to see this).

The correlation coefficient is about 0.75 to 0.80 -- high but not perfect correlation.

The SD line will be the solid line; we will soon meet the dashed line as the regression line. If you xerox the page (on higher magnification) and draw the horizontal and vertical lines as above, you will see that the SD box corners should pretty much fall on the solid line.
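The claim that the corners of the SD box fall on the SD line can be checked with the eyeballed numbers from Problem 8. This is a small Python sketch under the assumption that the SD line passes through the point of averages with slope SD(y) / SD(x), which holds when the correlation is positive; the function name sd_line is chosen here for illustration.

```python
# Eyeballed values from the Problem 8 solution (inches).
mean_x, sd_x = 41.0, 1.5   # height at age 4
mean_y, sd_y = 71.0, 2.5   # height at age 18


def sd_line(x):
    """Height at 18 on the SD line for a given height at 4.

    A point z SDs from the mean in x is z SDs from the mean in y.
    """
    z = (x - mean_x) / sd_x
    return mean_y + z * sd_y


# Corners of the box drawn at 2 SD on either side of each mean.
low_corner = (mean_x - 2 * sd_x, mean_y - 2 * sd_y)    # the 38 and 66 inch card positions
high_corner = (mean_x + 2 * sd_x, mean_y + 2 * sd_y)   # the 44 and 76 inch card positions

for corner_x, corner_y in (low_corner, high_corner):
    print(corner_x, corner_y, sd_line(corner_x))
```

For each corner, the last two printed numbers agree, which is just another way of saying the corner lies on the SD line.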

Problem 9. Correlation coefficient calculation exercise.

             Mean    SD    Corr.
(a)   X        2      1    -0.8
      Y        3      2
(b)   X        2      1    +0.3
      Y        2      1
(c)   X        2      1    +1.0 (note that Y = 2 X)
      Y        4      2
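Part (c) can be reproduced step by step, since its Y values follow from Y = 2 X. The sketch below is in Python rather than the course software and uses the X list given in the computer notes for this problem. It assumes the SD is computed with divisor n (not n - 1), which is what reproduces the table values, and it computes r as the average product of the standard units. The Y lists for parts (a) and (b) are not reproduced in this handout, so they are not used here.

```python
import math

# X values from the computer notes for this problem.
x = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
# Part (c): Y = 2 X.
y = [2 * value for value in x]


def mean(values):
    return sum(values) / len(values)


def sd(values):
    """SD with divisor n, matching the table for this problem."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def standard_units(values):
    m, s = mean(values), sd(values)
    return [(v - m) / s for v in values]


mean_x, sd_x = mean(x), sd(x)
mean_y, sd_y = mean(y), sd(y)

# r = average of the products of the standard units.
products = [zx * zy for zx, zy in zip(standard_units(x), standard_units(y))]
r = mean(products)

print("X: mean", mean_x, "SD", sd_x)
print("Y: mean", mean_y, "SD", sd_y)
print("r:", r)
```

The same three helper functions work for parts (a) and (b) once the Y list from the textbook is substituted.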

Computer notes:
(bind x (list 1 1 1 1 2 2 2 3 3 4))   X will stay the same for all three parts.
(bind y (list ....))   Y will change with each part.
(plot y x)   Use the PLOT menu to adjust the axis range so X and Y start with zero.
You can jitter the plots with the capital J key, distinguishing overlapping points.

Problem 10. IQ scores from two different tests. Can one predict the other?

The given diagram shows that points are more widely scattered at high values than at low; hence predictions made at low scores will have smaller errors than those made at high scores.

The problem is often met in economic forecasting: firms with small sales have small profits, and the possible range of variation is small. Those with large sales tend to have large profits, but the range of variation is quite large. In regression, the problem is known as heteroscedasticity.
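The pattern described in Problem 10 can be imitated with made-up numbers. The following Python sketch is purely illustrative: the score range of 70 to 130, the rule that the error SD grows as 0.1 times (score - 60), the sample size, and the seed are all assumptions invented for the example, not values from the textbook diagram.

```python
import math
import random

rng = random.Random(1)


def sd(values):
    """SD with divisor n."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


low_errors = []    # prediction errors for first-test scores below 100
high_errors = []   # prediction errors for first-test scores of 100 or more

for _ in range(4000):
    score = rng.uniform(70.0, 130.0)
    # Hypothetical error model: the spread widens as the score rises.
    error = rng.gauss(0.0, 0.1 * (score - 60.0))
    if score < 100.0:
        low_errors.append(error)
    else:
        high_errors.append(error)

sd_low = sd(low_errors)
sd_high = sd(high_errors)

print("SD of errors at low scores: ", round(sd_low, 2))
print("SD of errors at high scores:", round(sd_high, 2))
```

A single overall figure for the size of the prediction errors would overstate the error at low scores and understate it at high scores, which is the practical cost of heteroscedasticity.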

Problem 11. Quiz in statistics.

With 10 questions on a quiz, an average of 6.4 right, and an SD of 2.0 for the number right, let R = number of right answers and W = number of wrong answers. Then W = 10 - R, so every point lies exactly on a line with slope -1. There cannot possibly be any point off the SD line, so the correlation coefficient will be minus 1.

Simulation:
(bind right (rnd 20 10))   Create 20 uniform random integers from 0 to 10, using the Mersenne Twister
(bind wrong (- 10 right))
(stats right wrong)
(corr right wrong)
(plot right wrong)
(abline 10 -1)

Problem 12. Count the dots -- with a little bit of help from your friends.

Down a bit from the middle of the columns on p. 139, note that three students all chose 91 dots the first time, and all chose 82 the second time -- this seems a bit suspicious. Just below that, two students chose 85 each time. This is not quite as obvious collusion, but add in the student in the fifth row who also chose 85 each time, and you might be suspicious.

Important lesson: when errors have patterns, something is going on.

The sign of the correlation is difficult to tell by eyeballing -- there seems a more common tendency to guess lower the second time, which might suggest a negative correlation. Calculation shows a negative correlation of -0.2851; plotting the points shows this seems entirely due to student 7, who chose 85 the first time and 89 the second time, bucking the general trend to lower the estimate on the second count.
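The Problem 11 simulation can also be sketched outside the course software. The Python version below is an approximation of the commands in that problem, not a translation of them: the sample size of 20 and the seed are assumptions, and the helper corr is written out by hand. Python's random module also uses the Mersenne Twister.

```python
import math
import random


def corr(xs, ys):
    """Correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(2)

# Number right on a 10-question quiz: uniform integers from 0 to 10 inclusive.
right = [rng.randint(0, 10) for _ in range(20)]
# Number wrong is fully determined by number right.
wrong = [10 - r for r in right]

r_quiz = corr(right, wrong)

# Every point lies on the line right = 10 - wrong (intercept 10, slope -1).
on_line = all(r == 10 - w for r, w in zip(right, wrong))

print("correlation:", r_quiz)
print("all points on the line:", on_line)
```

Whatever scores the random draw produces, the correlation comes out at -1 (up to floating-point rounding), because the relation between right and wrong answers has no scatter at all.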