Review Exercises for Chapter 9

Chapter 9 -- Exercise Sets A and B. Statistics 1040 -- Dr. McGahagan

In the chapter: Exercise sets A and B make one point which is worth taking a bit of extra time over:

Problem 2. Corr (x, y) = Corr (y, x)
X = (1 2 3 4 5) and Y = (2 3 1 5 6)
Plot the points and note that except for the middle point of the two lists (3, 1), all lie on a straight line: Y = 1 + X. Add the line to the plot with the command (abline 1 1), which generates a line of the form Y = a + b X.
Set the computer for 8 decimal places with the command (decimals 8), so you can see the results are not likely due to accident:
    (corr y x) = 0.76249285
    (corr x y) = 0.76249285
The order in which the terms are entered into the command does not make any difference -- of course, since
    (x - (mean x)) * (y - (mean y)) = (y - (mean y)) * (x - (mean x))
implies that the covariance will be the same, and we will divide that by SDx * SDy if X comes first, and SDy * SDx if Y comes first.

Problem 3. Add 3 to each element of Y.
Quick answer: adding 3 to each element also raises the mean by 3, so the deviations don't change. We saw in the last chapter that this would not change the standard deviation; it will also leave the covariance unchanged, since its calculation depends only on the deviations of y from its mean.
Calculation: (bind y+3 (+ 3 y))
    (corr x y+3) = 0.76249285 -- the same as (corr x y).

Problem 4. Double each value of X.
Quick answer: the deviations of X from its mean are doubled; therefore the covariance will double -- but it will be divided by a standard deviation of X which is also doubled, so the correlation will remain the same.
Calculation: (bind x*2 (* 2 x))
    (sd x) = 1.41421356
    (sd x*2) = 2.82842712

Problem 5. Interchange the last two values of Y.
Unless the last two values of X are equal, this will change the covariance, and hence the correlation.
    (bind y5 (list 2 3 1 6 5))
    (plot y5 x), and follow this with (abline 1 1) and also (abline 0 1 blue).
What has changed? Would you expect the correlation to increase or decrease or to stay the same?
    (corr x y5) = ?? will tell you whether you were right or not.

Problem 6. Corr (x, y) = 0.73. What happens if you multiply all values of y by -1?
Try (scatter .73) and look at the plot. You will have defined variables called (depending on the version of EcLS) X1 and X2 or X and Y.
    (corr x1 x2) = .7300
    (bind x2neg (- x2))
    (corr x1 x2neg) = ??
You might compare the plots (plot x2 x1) and (plot x2neg x1).

Working through the rest of the problems in Exercise sets A and B will drive home the lesson that correlations do NOT depend on units of measurement, so they are insensitive to a change of scale (for example, Fahrenheit to Celsius). See Set B, p. 146, problem 3.
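For readers without EcLS, the invariance results of Problems 2 to 4 can be checked with a short Python sketch. The function name corr below is a hypothetical stand-in for the EcLS command of the same name, and it assumes the SD is computed by dividing by n (which matches the value (sd x) = 1.41421356 quoted above). The Fahrenheit-to-Celsius formula is applied to Y purely as an example of a change of scale; the Y values are not really temperatures.

```python
from math import sqrt

def corr(xs, ys):
    """Correlation: average product of deviations divided by SDx * SDy."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / n
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in xs) / n)
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in ys) / n)
    return cov / (sd_x * sd_y)

x = [1, 2, 3, 4, 5]
y = [2, 3, 1, 5, 6]

r_xy = corr(x, y)                                    # baseline
r_yx = corr(y, x)                                    # Problem 2: swap the arguments
r_shift = corr(x, [v + 3 for v in y])                # Problem 3: add 3 to each y
r_double = corr([2 * v for v in x], y)               # Problem 4: double each x
r_rescaled = corr(x, [(v - 32) * 5 / 9 for v in y])  # change of scale (F to C formula)

for label, value in [
    ("corr(x, y)", r_xy),
    ("corr(y, x)", r_yx),
    ("corr(x, y + 3)", r_shift),
    ("corr(2 * x, y)", r_double),
    ("corr(x, rescaled y)", r_rescaled),
]:
    print(f"{label:<22}{value:.8f}")
```

All five printed values should agree to 8 decimal places. Problems 5 and 6 are left out on purpose, so the "??" entries above remain an exercise.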

Chapter 9 -- Review Exercises Statistics 1040 -- Dr. McGahagan

Problem 1. Graphs for 1 or 2 variables.
One variable: use a histogram or density plot or boxplot to see the distribution.
Two variables: use a scatterplot to examine their correlation. Note that a density plot or side-by-side boxplots may be useful in comparing their distributions.

Problem 2. True or False?
(a) FALSE. If the correlation coefficient is -0.8, below-average values of the dependent variable are associated with ABOVE-average values of the independent variable. Try (scatter -0.8) and compare to (scatter 0.8).
(b) FALSE. If y is always less than x, the correlation coefficient may be positive or negative.
Try (bind x (list 10 20 30 40 50)), (bind y (/ x 10))
    (< y x) = (T T T T T) so y is always less than x.
    (corr x y) = 1.00
Then (bind y (- (/ x 10))) and repeat:
    (< y x) = (T T T T T) so y is always less than x.
    (corr x y) = -1.00

Problem 3. Which correlation is higher?
(a) Correlation of height at age 4 and age 18 will be lower than correlation of height at age 16 and age 18; there is more time for the heights to diverge.
(b) Correlation of height at age 4 and age 18 will be higher than correlation of weight at age 4 and age 18; height depends more on genetics (at least in a well-nourished society) and weight on eating habits.
(c) Height and weight at age 4 will be more highly correlated than height and weight at age 18 -- weight depends more on eating habits, and at age 4 they are likely to be more similar than at age 18.

Problem 4. Height and weight of college students.
Correlation of height and weight is 0.60 for both the men and the women. If we ignore gender in computing the correlation, will it be higher, lower or the same?
Surprisingly, HIGHER. Why? When you combine the data, the SD line becomes LONGER, but the scatter of points around the line is no greater than before. The technical note on p. 146 implies that this will lead to a higher value for the correlation coefficient.

The problem can be simulated by EcLS:
    (scatter 0.6 50 70 144 3 21)
will generate 50 observations on male height and weight with the means and SDs given and with correlation of about 0.6. Save these variables by
    (bind mh (copy-list x)) and (bind mw (copy-list y))
(Copying the list is needed because the next scatter command will redefine x and y -- and also any variables which depend on x and y.)

To generate the female height and weight observations:
    (scatter 0.6 50 64 120 3 21)
and again re-bind them:
    (bind fh (copy-list x)) and (bind fw (copy-list y))
Then define the pooled heights and weights:
    (bind height (combine mh fh))
    (bind weight (combine mw fw))
Also define an indicator to distinguish men from women: since the first 50 observations are men, we indicate them by 0 and women by 1 in the following.
    (bind indicator (combine (makelist 50 0) (makelist 50 1)))
The command (plot weight height indicator) will plot weight against height, with men in red and women in blue.

[Plot: weight against height for the pooled data, men in red and women in blue.]

    (corr height weight) = 0.7375.
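The same experiment can be sketched in Python for readers without EcLS. This is only a rough equivalent: it assumes the arguments of (scatter 0.6 50 70 144 3 21) are, in order, the correlation, the number of points, the two means and the two SDs, and it uses a standard construction for correlated normal draws rather than whatever EcLS does internally. The names corr, simulate_group and pooled_demo are hypothetical, and the group size is raised to 5000 so that the result is stable from run to run.

```python
import random
from math import sqrt

def corr(xs, ys):
    """Correlation: average product of deviations divided by SDx * SDy."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / n
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in xs) / n)
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in ys) / n)
    return cov / (sd_x * sd_y)

def simulate_group(rng, n, rho, mean_h, mean_w, sd_h, sd_w):
    """Draw n (height, weight) pairs whose population correlation is rho."""
    heights, weights = [], []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        z2 = rng.gauss(0, 1)
        heights.append(mean_h + sd_h * z1)
        # Mixing z1 and z2 this way gives correlation rho with the height draw.
        weights.append(mean_w + sd_w * (rho * z1 + sqrt(1 - rho ** 2) * z2))
    return heights, weights

def pooled_demo(n, seed=1):
    """Return (corr for men, corr for women, corr for the pooled data)."""
    rng = random.Random(seed)
    mh, mw = simulate_group(rng, n, 0.6, 70, 144, 3, 21)   # men
    fh, fw = simulate_group(rng, n, 0.6, 64, 120, 3, 21)   # women
    return corr(mh, mw), corr(fh, fw), corr(mh + fh, mw + fw)

r_men, r_women, r_pooled = pooled_demo(5000)
print(f"men    {r_men:.3f}")
print(f"women  {r_women:.3f}")
print(f"pooled {r_pooled:.3f}")
```

With these means and SDs and equal group sizes, the pooled population correlation works out to about 0.72, so the pooled value should come out above both within-group values. Calling pooled_demo(50) mirrors the handout's sample size but gives noisier results, which is consistent with the 0.7375 reported above.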

Problem 5. Can X and Y be perfectly correlated?
Given the following sets of numbers, can you fill in the final number to ensure rho = +1?
(a) X = (1 2 2 4) Y = (1 3 3 --)
All points given lie along a line through the points (1, 1) and (2, 3).
The slope of the line is +2, so the equation of the line would be Y = a + 2 X.
To fit the other points, we must have:
    1 = a + 2 (1) so a = -1
    3 = a + 2 (2) so a = -1
The required point will be Y = -1 + 2 (4) = 7.
You may check your solution by defining the lists:
    (bind x (list 1 2 2 4))
    (bind y (list 1 3 3 7))
Then (corr x y) = 1.0000 will show that you have solved the problem.
(b) X = (1 2 3 4) Y = (1 3 4 --)
The first two points lie on a line through the points (1, 1) and (2, 3).
The equation of that line was Y = -1 + 2 X, as found in part (a).
The third point (3, 4) does NOT lie on that line: Y = -1 + 2 (3) = 5, so an X value of 3 corresponds to a Y value of 5, not 4. It is therefore impossible for all 4 points to lie on the same line.

Problem 6. Is the computer right?
The correlation coefficient of the two sets of numbers is reported differently -- but since the X values are identical in both, and the second set of Y values is the first set plus 3, it should be the same: the value of the correlation coefficient is unchanged by adding a constant (or, more generally, by any linear transformation with positive slope).
To check this:
    (bind x (seq 1 7))
    (bind y1 (list 2 1 4 3 7 5 6))
    (bind y2 (+ y1 3)) ;; check to see you have in fact obtained the second set of y values.
(decimals 8) may help convince you the correlation is not just approximate:
    (corr x y1) = 0.82142857
    (corr x y2) = 0.82142857
    (corr y1 y2) = 1.00000000

Problem 7. Hiram Johnson in the 1910 California gubernatorial race.
Johnson ran more strongly in California counties with higher percentages of native-born citizens than immigrant citizens -- if X = percentage of native born, and Y = percentage of vote for Johnson, (corr x y) = 0.5, as calculated by M. Rogin and John Shover, Political Change in California, 1970.
Is this a "fair measure of the extent to which 'Johnson received native, as opposed to immigrant support'"?
There is the possibility of a misleading ecological correlation here, unless there is data at the individual level. (The possibility is very remote, since on taking office Johnson passed the California Alien Land Act, which basically said that Asian non-citizen immigrants could not own land!)
A more solid example of a misleading ecological correlation is in Andrew Gelman, Red State, Blue State, Rich State, Poor State: Why Americans Vote the Way They Do, Princeton University Press, 2008. Gelman points out that poor states (Mississippi, Kansas) vote Republican and rich states (New York, California) vote Democratic -- but we should not conclude that the rich are Democrats, since within each state, the rich are much more likely to vote Republican and the poor to vote Democratic. See Gelman's website devoted to the book: http://redbluerichpoor.com.
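The general point, that a correlation computed from group averages need not match the correlation among individuals, can be illustrated with a small Python sketch. Everything here is synthetic and assumed for illustration only: the five group centers, the group size and the noise level are made-up numbers, not data from the Johnson election or from Gelman's book, and corr is a hypothetical helper.

```python
import random
from math import sqrt

def corr(xs, ys):
    """Correlation: average product of deviations divided by SDx * SDy."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / n
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in xs) / n)
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in ys) / n)
    return cov / (sd_x * sd_y)

rng = random.Random(0)
group_centers = [0, 1, 2, 3, 4]   # synthetic group-level averages (assumed)
people_per_group = 200            # assumed group size
noise_sd = 2.0                    # individual scatter within each group (assumed)

all_x, all_y = [], []             # individual-level observations
avg_x, avg_y = [], []             # one pair of averages per group

for center in group_centers:
    # Within a group, x and y scatter independently around the same center.
    gx = [center + rng.gauss(0, noise_sd) for _ in range(people_per_group)]
    gy = [center + rng.gauss(0, noise_sd) for _ in range(people_per_group)]
    all_x += gx
    all_y += gy
    avg_x.append(sum(gx) / len(gx))
    avg_y.append(sum(gy) / len(gy))

r_individual = corr(all_x, all_y)   # based on 1000 individuals
r_ecological = corr(avg_x, avg_y)   # based on 5 group averages

print(f"individual-level correlation: {r_individual:.2f}")
print(f"group-average correlation:    {r_ecological:.2f}")
```

Averaging removes the individual scatter, so the group-level correlation comes out close to 1 while the individual-level correlation stays modest. This sketch shows only the overstatement effect; it does not reproduce the sign reversal in Gelman's example.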

Problem 8. Age and education.
Correlation of age and education level = -0.28. Does this mean that as we become older, we become less educated?
Not likely -- no one re-examines you at age 40 or 50 or 60 and takes away your high school diploma or Ph.D. if you don't pass an examination. The correlation is due to earlier generations typically having fewer years of school, and has nothing to do with individuals.

Problem 9. Student evaluations of assistants and exam performance.
Calculate the correlations:
    (bind assistant (list 3.3 2.9 4.1 3.3 2.7 3.4 2.8 2.1 3.7 3.2 2.4))
    (bind course (list 3.5 3.2 3.1 3.3 2.8 3.5 3.6 2.8 2.8 3.3 3.3))
    (bind final (list 70 64 47 63 69 69 69 63 53 65 64))
The correlations are:
    (corr assistant course) = 0.1243; not much relation. (b) is true.
    (corr assistant final) = -0.5704; definite NEGATIVE correlation. Hence (a) is false. The more students liked the assistant, the worse they did on the final. Hypothesis: the well-liked assistants were entertaining, but not demanding.
    (corr course final) = 0.4642. Positive relation here, though it is unclear from the correlation whether students put forward effort because they liked the course, or liked the course because they realized they were doing well in it.

Problem 10. Math SAT scores.
Correlation between percentage of high school students taking the SAT and average score of the state = -0.86.
The strong negative correlation indicates that test scores are lower where a larger percentage of students take the test, so part (a) is true. However, this does not mean that higher test scores tell you very much about how well the school system is performing -- New York may well encourage more students to take the test than Wyoming, and we would have to know more about what percentage of students in each state take the test before we could draw any conclusion.

Problem 11. Average verbal SAT and math SAT by state.
Correlation by state = 0.97 (very, very high). The correlation for individuals would be expected to be much lower; see Figure 6 in the section on ecological correlations, which tend to overstate the strength of individual correlations.

Problem 12. Education of husbands and wives.
Correlation of about 0.8, at a guess.
(a) Vertical/horizontal stripes occur because few people have, and even fewer report, 13.2134 years of school -- the data are discrete, not continuous.
(b) Few points appear because they overlap. If the graph were in EcLS, the capital J key would jitter the points (add some random noise so they did not overlap).
(c) Which plot's shaded area indicates that:
    (i) Wife completed exactly 16 years of schooling. Plot C.
    (ii) Wife completed more schooling than husband. No plot shows this; given the actual plot, you could draw a 45-degree line (y = x) with the command (abline 0 1). Points above and to the left of that line would indicate wife's education > husband's education.
    (iii) Husband completed more than 16 years of schooling. Plot B.
    (iv) Husband completed exactly 12 years, and wife completed fewer than 12 years. Plot A.