Review Exercises for Chapter 11

Review Questions -- Chapter 11 -- RMS Error
Statistics 1040 -- Dr. McGahagan

Problem 1. Formula for RMS error.
RMS error = sqrt(1 - R-squared) * SDy
Problem 2. Expected value of GPA in college given GPA in high school.
A program that reports an RMSE of 3.12 is not very good. Supposing the average GPA in college is 2.5, that RMSE says that someone with an average HS GPA will have a college GPA of 2.5 plus or minus 3 or so, that is, from -0.5 to 5.5. We didn't need a computer to tell us this, and indeed one doubts the SD of the college GPA could be as large as 3.12. Since sqrt(1 - R-squared) can never exceed 1, the RMSE of a regression can never exceed the SD of the predicted variable, so an RMSE this large indicates an error in programming.

Problem 3. California measuring.
Average height at age 6 = 46 inches; SD at age 6 = 1.7 inches.
Average height at age 18 = 70 inches; SD at age 18 = 2.5 inches.
Correlation of heights = 0.80.
(a) Supposing the response variable is age 18 height, and the regression is to establish E [H18 | H6], or the expected value of height at 18 given height at 6:
RMSE = sqrt(1 - 0.64) * SD18 = sqrt(0.36) * 2.5 = 0.6 * 2.5 = 1.5 inches.
(b) If the response variable is age 6 height, and the regression is to establish E [H6 | H18], then
RMSE = sqrt(1 - 0.64) * SD6 = sqrt(0.36) * 1.7 = 0.6 * 1.7 = 1.02 inches.

* Problem 4. Midterm and Final Scores, Part I.
Average grade on midterm = 50; SD on midterm = 25.
Average grade on final = 55; SD on final = 15.
Correlation of midterm and final = 0.60.
(a) For about one-third of students, prediction of final will be off by more than --- RMSEs.
For about 2/3 of the students, then, the prediction will be off by less than that same unknown value. Under the normal curve, the central area from -1 to +1 is about 2/3, so the blank should be filled by 1 RMSE. This is given by the Problem 1 formula as sqrt(1 - 0.36) * 15 = 0.8 * 15 = 12 points.
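The RMSE arithmetic in Problems 3 and 4(a) can be checked with a few lines of Python. This is only a sketch of the Problem 1 formula; the function name rms_error and the variable names are illustrative choices, not part of the original handout.

```python
import math

def rms_error(r, sd_y):
    """RMS error of the regression line: sqrt(1 - r^2) times the SD of the response."""
    return math.sqrt(1 - r ** 2) * sd_y

# Problem 3: heights at ages 6 and 18, correlation 0.80
rmse_h18 = rms_error(0.80, 2.5)   # predicting age-18 height from age-6 height
rmse_h6 = rms_error(0.80, 1.7)    # predicting age-6 height from age-18 height

# Problem 4(a): midterm and final, correlation 0.60, SD of final = 15
rmse_final = rms_error(0.60, 15)

print(round(rmse_h18, 2), round(rmse_h6, 2), round(rmse_final, 2))
```

Note that the SD passed in is always that of the variable being predicted, which is why the two parts of Problem 3 give different answers from the same correlation.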
(b) Given a midterm of 80, the student scored (80 - 50) / 25 = 30 / 25 = 1.2 SDs above the mean for the midterm. He is therefore expected to score 0.6 * 1.2 = 0.72 SDs above the mean on the final. The standardized score of the final is computed as (X - 55) / 15 = 0.72; hence X = 55 + 0.72 * 15, and we would predict a score of 65.8. In our notation, E [final | midterm = 80] = 65.8.
(c) If all assumptions are met (normal distribution of residuals, no pattern to residuals), this prediction has about a 50 percent chance of being within 0.7 RMSEs of the actual score (see the normal table; the entry for Z = 0.70 gives a central area of 51.61 percent, the closest to 50). That is, the actual score has about an even chance of falling within 0.7 * 12 = 8.4 points of the predicted value.

* Problem 5. Midterm and Final Scores, Part II.
Assuming normality, the percentage of students scoring above 80 on the final can be found without regression:
Z-score of 80 points = (80 - 55) / 15 = 25 / 15 = 5/3 = 1.67.
Corresponding central area (for Z = 1.65, closest value in table) = 90.11 percent.
Two-tail area = 100 - 90.11 = 9.89 percent, so the one-tail area (above 80 points) is about 4.9 percent.
Given a score of 80 on the midterm, the expected score on the final is 65.8 (see last problem). RMSE is 12 points, so a score of 80 would be (80 - 65.8) / 12 = 1.18 RMSEs above the predicted value. The central area for Z = +/- 1.2 (closest to 1.18) is 76.99 percent, so about 23 percent of these students would be outside this area, and half of that, about 11.5 percent, would score above 80 on the final.
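The predictions and tail areas in Problems 4 and 5 can also be reproduced without a printed normal table. The sketch below assumes Python 3.8 or later for statistics.NormalDist; the helper names predict and share_above are illustrative, not from the handout.

```python
import math
from statistics import NormalDist

def predict(x, mean_x, sd_x, mean_y, sd_y, r):
    """Regression estimate of y given x, worked in standard units."""
    z_x = (x - mean_x) / sd_x
    return mean_y + r * z_x * sd_y

def share_above(cutoff, center, spread):
    """Fraction of a normal curve with the given center and spread above cutoff."""
    return 1 - NormalDist(center, spread).cdf(cutoff)

rmse = math.sqrt(1 - 0.60 ** 2) * 15             # 12 points
predicted = predict(80, 50, 25, 55, 15, 0.60)    # 65.8 on the final

# Problem 5: share of finals above 80
all_students = share_above(80, 55, 15)           # everyone: roughly 5 percent
midterm_80 = share_above(80, predicted, rmse)    # midterm = 80: roughly 12 percent

# Problem 4(c): half-width of the band with an even chance of holding the final score
even_chance = NormalDist().inv_cdf(0.75) * rmse  # roughly 8 points
```

The exact normal values differ slightly from the table lookups in the text, which round Z to the nearest tabulated entry (1.65, 1.2 and 0.70).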

* Note on Problems 4 and 5. The regression line (see chapter 12 for more like this).
We can make the regression calculation more direct:
Slope of regression line giving E [final | midterm] = 0.60 * 15 / 25 = 0.36.
This line has the form Y = a + 0.36 X, where X is the midterm score and Y the final score.
Since the line must pass through the means of X and Y, we know that:
Mean(final) = a + 0.36 * Mean(midterm), or 55 = a + 0.36 * 50, or a = 55 - 18 = 37.
The regression line is therefore Y = 37 + 0.36 X.
We can apply that to the case of the student with a score of 80 on the midterm, and predict that the student will get 37 + 0.36 (80) = 37 + 28.8 = 65.8 on the final.

Problem 6. Homework and achievement tests.
There is a correlation between time spent doing homework and scores on achievement tests. It is not necessarily the case that spending more time on homework causes high scores on achievement tests. There could easily be a common cause for both homework and high scores, such as the greater diligence of the student or a greater emphasis on academic achievement at home.

* Problem 7. Correlation and regression (Extended).
There is a correlation between math and physics test scores; both have a mean of 60 and equal SDs. We would expect students who scored an above-average 75 on the math test to score above average on the physics test as well, but, due to the regression effect, not as much above average. So the expected score will be more than 60 but less than 75.
Suppose the SD of each test is 20, and the correlation coefficient is 0.90. Then the score of 75 was (75 - 60) / 20 = 15 / 20 = 0.75 SDs above average on the math test. We would expect a score of 0.9 * 0.75 = 0.675 SDs above average on the physics test. This Z-score would have been computed as 0.675 = (X - 60) / 20, so X - 60 = 0.675 * 20 = 13.5 and X = 73.5.
Note that due to the high correlation, we don't expect the new score to be much less than 75.

Problem 8. Joseph Berkson and the Bends.
Berkson (see the Encyclopedia of Biostatistics for an article on Berkson by W. Michael O'Fallon) was right to demand replication of the test: the regression effect made it likely that those most affected by the first trial would not be quite as much affected on the next test. Olympic competition and college courses use multiple trials to get an overall average.

Problem 9. Rookie-of-the-year "Sophomore Slumps".
Sophomore slumps could be due to "regression to the mean" rather than distractions. Perhaps the initial award was due in part to random variation as well.

Problem 10. Stock price forecasting.
The data could not have come from a simple regression on past price, since the forecasts are not linearly related to the previous price. Taking the points A (10, 8) and C (12, 13), the equation of the line connecting them would be Y = -17 + 2.5 * X, which does not fit point D (the forecast would be -17 + 2.5 (14) = 18, not 12) or E.

Problem 11. Income/education graph.
We would expect the vertical stripes, since education is reported as 9, 12, 16 years and so on. But we would expect the most crowded stripe to come at 12 years of education (HS diploma) and another crowded one at 16 years (bachelor's). These in fact come at 13 and 17 years; maybe someone was counting kindergarten.

Problem 12. Blood pressure and education.
Correlation = -0.1 (very weak).
18 years of education is (18 - 13) / 3 = 5/3 SD above the mean; we expect blood pressure to be 0.1 * 5/3 = 5/30 SD below the mean; this is (5/30) * 14 = 2.33 mm below the mean. Our man had blood pressure only 1 mm below the mean, so he was a bit higher than expected. But RMSE is sqrt(1 - (0.1 * 0.1)) * 14 = 13.93 mm, so this is hardly a huge discrepancy.
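The arithmetic in Problems 10 and 12 can be checked the same way. This is a minimal sketch using only the numbers quoted above; the variable names are illustrative, and point D is taken to be (14, 12) as the Problem 10 solution implies.

```python
import math

# Problem 10: line through A (10, 8) and C (12, 13), then its forecast at X = 14
slope = (13 - 8) / (12 - 10)                  # 2.5
intercept = 8 - slope * 10                    # -17
forecast_at_14 = intercept + slope * 14       # 18, whereas point D shows 12

# Problem 12: education mean 13, SD 3; blood pressure SD 14; correlation -0.1
r = -0.1
z_education = (18 - 13) / 3                   # 5/3 SDs above the mean
expected_gap = r * z_education * 14           # about -2.33 mm relative to the mean
rmse = math.sqrt(1 - r ** 2) * 14             # about 13.93 mm
observed_gap = -1.0                           # the man was 1 mm below the mean
residual_in_rmse = (observed_gap - expected_gap) / rmse

print(forecast_at_14, round(expected_gap, 2), round(rmse, 2))
```

The residual works out to roughly a tenth of an RMSE, which puts a number on "hardly a huge discrepancy".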