Correlation, Regression and Test of Significance in R
B. N. Mandal
ICAR-IASRI, Library Avenue, Pusa, New Delhi 110 012
[email protected]

1 Introduction
When an investigator has completed data collection for an investigation, the data may contain several variables observed on a number of objects or individuals. Several questions may arise depending on the objective of the investigation and the nature of the data. Some common questions are: i) is there any correlation between two given variables; ii) does one variable depend on several other variables; and iii) are two or more groups significantly different from each other with respect to one or more particular variables. These questions can be answered by three broad classes of statistical methods, namely correlation analysis, linear regression analysis and tests of significance. This note shows how to perform correlation analysis, linear regression analysis and tests of significance in R with some data.
2 Correlation analysis
Correlation measures the linear association between two variables. Given observations on two or more variables on the same individuals or objects, the correlation between any two variables can be inspected visually with a scatter plot. In a scatter plot, the values of two variables X and Y are plotted on the X and Y axes respectively. We consider the data given in Table 1. The scatter plot of the data in Table 1 can be obtained in R with the following command.

> pairs(mydata)

Here mydata is the R object which holds the data given in Table 1. Running the above code produces the scatter plot shown in Figure 1. It may be seen that for the variables X1 and X2, the points show an upward trend, which indicates a positive correlation between the two variables. The points for X1 and X3 show a downward trend, which is indicative of a negative correlation between these two variables. The plot of X1 against X4 shows no definite pattern; the points are scattered over the entire space. Such a pattern indicates that there is no correlation between the two variables.
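The examples in this note assume that the data of Table 1 are available in an R object named mydata with columns X1, X2, X3 and X4. As a minimal sketch, assuming the values have been saved in a comma-separated file named correlation.csv (a hypothetical file name), the object could be created as follows.

> mydata=read.csv("correlation.csv")   # file name is an assumption; any method giving columns X1-X4 will do
> str(mydata)                          # check that the four numeric columns were read correctly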
Table 1: Data for correlation and regression analysis

  X1      X2      X3      X4        X1      X2      X3      X4
190.00  119.08  237.00  257.00    248.00  145.00  180.00  245.00
196.00  120.00  225.00  205.00    250.00  143.00  178.00  226.00
215.00  155.58  215.00  263.00    244.00  140.00  181.00  217.00
223.00  157.00  207.00  267.00    277.00  188.16  143.00  246.00
232.00  162.33  194.00  209.00    271.00  187.16  155.00  221.00
236.00  165.00  179.00  273.00    282.00  166.00  133.00  287.00
281.00  161.25  139.00  257.00    287.00  160.00  134.00  289.00
279.00  162.36  140.00  214.00    251.00  163.25  165.00  244.00
255.00  169.58  170.00  275.00    255.36  161.32  167.64  247.00
261.36  171.58  155.64  258.00    290.00  162.00  124.00  257.00
275.00  163.25  149.00  203.00    286.00  165.00  139.00  270.00
281.00  165.00  137.00  289.00    260.42  168.50  161.58  249.00
246.00  147.24  180.00  231.00    265.32  165.50  146.68  220.00

2.1 Pearson's correlation coefficient
Though a scatter plot gives an idea about whether two variables have positive, negative or zero correlation, the correlation coefficient is used to quantify the magnitude of the correlation. Pearson's correlation coefficient measures the correlation between two continuous variables. Given two continuous variables X and Y, the population correlation coefficient between them is denoted by ρ. Since in most practical cases data on the entire population are not available, the correlation coefficient is computed from sample values of the two variables. The sample correlation coefficient is computed as

r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}

where x_i and y_i are the ith observations on the two variables X and Y on the same individual or object, and \bar{x} and \bar{y} are the sample means of the two variables. The value of r lies between -1 and 1. A value of r = 0 means there is no correlation between the two variables based on the sample data. To test the hypothesis H0 : ρ = 0 versus H1 : ρ ≠ 0, the following test statistic is used:

t = \frac{r\sqrt{n-2}}{\sqrt{1 - r^2}}.

The test statistic follows a t distribution with n − 2 degrees of freedom. If the computed value of t exceeds the tabulated value of t with n − 2 degrees of freedom at the α level of significance, the null hypothesis is rejected. R can be used to compute Pearson's correlation coefficient and to test its significance. For example, to compute the correlation coefficient between X1 and X2, use the following code.

> attach(mydata)
Figure 1: Scatter plot matrix of the four variables X1, X2, X3 and X4
> cor(X1,X2)
[1] 0.6936527
> cor.test(X1,X2)

        Pearson's product-moment correlation

data:  X1 and X2
t = 4.7177, df = 24, p-value = 8.513e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4188372 0.8520652
sample estimates:
      cor
0.6936527

> detach(mydata)

Thus, the correlation coefficient between X1 and X2 is 0.694, and the null hypothesis that ρ = 0 is rejected, the p-value being 8.513e-05. Clearly, the population correlation is not zero at the 1% and 5% levels of significance. Left- and right-tailed tests can be performed by passing the arguments alternative="less" and alternative="greater" to the cor.test() function.
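The correlation matrix of all four variables can be obtained in a single call, and the t statistic given above can be computed by hand as a check. A minimal sketch; the last line should reproduce the t value reported by cor.test().

> cor(mydata)                  # correlation matrix of all four variables
> r=cor(mydata$X1,mydata$X2)
> n=nrow(mydata)
> r*sqrt(n-2)/sqrt(1-r^2)      # same t value as reported by cor.test() above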
2.2 Spearman's rank correlation coefficient
Often the measurements on individuals or objects on two or more variables are in the form of ranks. Pearson's correlation coefficient cannot be used to obtain the correlation in such situations. The appropriate measure is Spearman's rank correlation coefficient, which is given by

r = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}

where d_i is the difference between the ranks given to the ith individual or object. A data set is given in Table 2 in which two judges rank 10 individuals.

Table 2: Ranks of individuals by two judges

Individual   Judge 1   Judge 2
    1           4         7
    2           6         9
    3           2         3
    4           1         4
    5           3         5
    6           5         8
    7           8         2
    8          10        10
    9           9         6
   10           7         1
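To reproduce the analysis below, the ranks in Table 2 can be entered directly as vectors; the object names Judge.1 and Judge.2 are chosen to match the code that follows.

> Judge.1=c(4,6,2,1,3,5,8,10,9,7)   # ranks given by Judge 1
> Judge.2=c(7,9,3,4,5,8,2,10,6,1)   # ranks given by Judge 2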
The rank correlation between the ranks given by the two judges may be obtained, and the test of significance performed, using the code given below.

> cor(Judge.1,Judge.2,method="spearman")
[1] 0.2606061
> cor.test(Judge.1,Judge.2,method="spearman")

        Spearman's rank correlation rho

data:  Judge.1 and Judge.2
S = 122, p-value = 0.4697
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho
0.2606061

Clearly, the correlation between the ranks given by the two judges is not significant at the 5% level.
3 Regression analysis
Regression analysis is concerned with studying the relationship between a dependent variable and one or more predictor variables. The linear regression model is given by

Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon

where Y denotes the dependent variable, X_1, ..., X_p are the predictor variables and \epsilon is the error term. The coefficients \beta_0, \beta_1, ..., \beta_p are the regression coefficients, which are estimated by the method of least squares. The above model is often written in matrix notation as y = X\beta + \epsilon. The estimated coefficients are given by \hat{\beta} = (X'X)^{-1} X'y. This estimator is the best linear unbiased estimator of \beta. To test the null hypothesis H0 : \beta_1 = \beta_2 = \cdots = \beta_p = 0 versus H1 : \beta_j \neq 0 for at least one j, it is assumed that the errors are independently and identically normally distributed with mean 0 and variance \sigma^2. For this test, the total sum of squares (TSS) is partitioned into the regression sum of squares (SSreg) and the residual sum of squares (SSres), i.e., TSS = SSreg + SSres, where

TSS = y'y - (1'y)^2/n, \quad SS_{reg} = \hat{\beta}'X'y - (1'y)^2/n, \quad SS_{res} = y'y - \hat{\beta}'X'y.

The test statistic is given by

F = \frac{SS_{reg}/p}{SS_{res}/(n - p - 1)}
which follows an F distribution with p and n − p − 1 degrees of freedom. The above information is conveniently displayed in the form of an Analysis of Variance (ANOVA) table.

Source of variation   Degrees of freedom   Sum of squares   Mean square                  F
Predictors            p                    SSreg            MSreg = SSreg/p              MSreg/MSres
Residual              n − p − 1            SSres            MSres = SSres/(n − p − 1)
Total                 n − 1                TSS
If the calculated F value exceeds the tabulated F value at the given level of significance α, the null hypothesis is rejected. Once it is established that the model has at least one non-zero regression coefficient, one may wish to test which of the regression coefficients are actually significant. For this, we test individually H0 : \beta_j = 0 versus H1 : \beta_j \neq 0 for each j = 1, 2, ..., p. The test statistic for this test is given by

t = \frac{\hat{\beta}_j}{se(\hat{\beta}_j)}

where se(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 C_{jj}}, with C_{jj} being the diagonal element of (X'X)^{-1} corresponding to \beta_j. The test statistic follows a t distribution with n − p − 1 degrees of freedom. Thus the null hypothesis is rejected at the α level of significance if the calculated value of t is greater than the tabulated value at n − p − 1 degrees of freedom.
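The matrix expressions above can be verified directly in R. The following is a minimal sketch using the data of Table 1 with X1 as the dependent variable; the resulting estimates, standard errors and t statistics should agree with the summary() output of the fitted model shown below.

> y=mydata$X1
> X=cbind(1,mydata$X2,mydata$X3,mydata$X4)    # design matrix with a column of ones for the intercept
> beta.hat=solve(t(X)%*%X)%*%t(X)%*%y         # least squares estimates (X'X)^{-1} X'y
> n=nrow(X); p=ncol(X)-1
> SSres=sum((y-X%*%beta.hat)^2)               # residual sum of squares
> sigma2.hat=SSres/(n-p-1)                    # residual mean square, estimate of sigma^2
> se=sqrt(sigma2.hat*diag(solve(t(X)%*%X)))   # standard errors of the estimated coefficients
> beta.hat/se                                 # t statistics for testing each beta_j = 0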
A very useful measure of how well the predictor variables explain the dependent variable is R^2, given by

R^2 = \frac{SS_{reg}}{TSS}

which lies between 0 and 1. Since R^2 can always be increased by adding extra variables, the adjusted R^2 is often used instead. The adjusted R^2 is computed as

R^2_{adj} = 1 - \frac{SS_{res}/(n - p - 1)}{TSS/(n - 1)}.
The value of the adjusted R^2 increases only if the residual mean square decreases on addition of a new variable. Now we perform regression analysis on the data given in Table 1. For this purpose we assume that X1 is the dependent variable and the other variables are the predictors. Use the following code to perform the regression analysis.

> lm1=lm(X1~X2+X3+X4,data=mydata)
> summary(lm1)

Call:
lm(formula = X1 ~ X2 + X3 + X4, data = mydata)

Residuals:
   Min     1Q Median     3Q    Max
-8.629 -2.615  0.533  3.015  6.092

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 395.787409  19.181299  20.634 6.93e-16 ***
X2            0.050607   0.075707   0.668    0.511
X3           -0.881510   0.040319 -21.863  < 2e-16 ***
X4           -0.006295   0.034186  -0.184    0.856
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Residual standard error: 4.379 on 22 degrees of freedom
Multiple R-squared: 0.9774,  Adjusted R-squared: 0.9744
F-statistic: 317.7 on 3 and 22 DF,  p-value: < 2.2e-16

The F statistic value is 317.7 on 3 and 22 degrees of freedom, with p-value < 2.2e-16, so it is highly significant. Therefore, at least one of the regression coefficients is non-zero. The coefficient estimates are also shown along with their standard errors, t values and significance levels. It is seen that only the coefficient of X3 is significantly different from zero. The predictors explain 97.74% of the variability in the dependent variable. The contribution of each predictor to the regression sum of squares can be obtained with the following code.
> anova(lm1)
Analysis of Variance Table

Response: X1
          Df Sum Sq Mean Sq  F value    Pr(>F)
X2         1 8994.4  8994.4 469.1397 2.502e-16 ***
X3         1 9276.5  9276.5 483.8564 < 2.2e-16 ***
X4         1    0.7     0.7   0.0339    0.8556
Residuals 22  421.8    19.2
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

It may be pointed out here that the object lm1 contains much other information useful for further diagnostic checking of the fitted model.
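For instance, the following standard extractor functions for lm objects may be used to examine the fitted model further; a brief sketch.

> coef(lm1)        # estimated regression coefficients
> confint(lm1)     # 95% confidence intervals for the coefficients
> residuals(lm1)   # residuals, useful for diagnostic checking
> fitted(lm1)      # fitted values
> plot(lm1)        # standard diagnostic plots (residuals vs fitted values, normal Q-Q plot, etc.)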
4 Tests of significance
A test of significance is concerned with testing a hypothesis based on a given set of observations. The hypothesis to be tested is called the null hypothesis. An appropriate test statistic is developed to test the null hypothesis, and its sampling distribution is then obtained. Acceptance or rejection of the null hypothesis depends on the sampling distribution of the test statistic and the level of significance. In this section, we consider some commonly used hypothesis tests in R.
4.1 One sample t-test
Given a sample of independent observations on a variable X, one often wishes to know whether the population mean of the variable is equal to a given value. In other words, one wishes to test H0 : µ = µ0 versus the alternative H1 : µ ≠ µ0, or H1 : µ > µ0, or H1 : µ < µ0, where µ denotes the unknown population mean and µ0 denotes a given value. An appropriate test statistic for testing the above hypothesis is

t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}

where \bar{x} is the sample mean computed from the data and s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2. This test statistic follows a t distribution with n − 1 degrees of freedom. We reject the null hypothesis when the calculated value of the t statistic is greater than the tabulated value of t at the level of significance α. To illustrate the one sample t-test in R, we consider the data given in Table 1. Suppose we want to test the hypothesis that the population mean of the variable X1 is 300. This test can be performed in R using the following code.

> t.test(X1,mu=300)
        One Sample t-test

data:  X1
t = -8.3377, df = 25, p-value = 1.097e-08
alternative hypothesis: true mean is not equal to 300
95 percent confidence interval:
 244.2422 266.3317
sample estimates:
mean of x
 255.2869

Note that the null hypothesis is rejected and hence the true mean of X1 is not 300. The argument mu=300 is used to specify that µ0 = 300.
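A one-sided test can be carried out by passing the alternative argument to t.test(); for example, to test against the alternative H1 : µ < 300:

> t.test(X1,mu=300,alternative="less")   # left-tailed one sample t-test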
4.2 Two independent samples t-test
Given two random samples on two variables which are not paired or matched, an investigator may be interested in whether the difference between the population means of the two variables equals a given constant. Succinctly, the investigator wishes to test H0 : µ1 − µ2 = d versus H1 : µ1 − µ2 ≠ d, where µ1 and µ2 denote the unknown population means of the two variables and d denotes a given value. For d = 0, the test becomes a test of whether the population means of the two variables are the same. Under the assumption of equal variances of the two variables, an appropriate test statistic is

t = \frac{\bar{x}_1 - \bar{x}_2 - d}{s\sqrt{1/n_1 + 1/n_2}}

where \bar{x}_1 and \bar{x}_2 are the sample means of the two variables,

s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}, \quad s_i^2 = \frac{1}{n_i - 1}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_i)^2, \quad i = 1, 2,

and x_{ij} denotes the jth observation on the ith variable. The test statistic follows a t distribution with n_1 + n_2 − 2 degrees of freedom. For illustration, again consider the data in Table 1. Let us test the hypothesis that the population means of X1 and X2 are the same. We can perform this test in R with the following code.
> t.test(X1,X2,var.equal=TRUE)

        Two Sample t-test

data:  X1 and X2
t = 15.507, df = 50, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  83.81083 108.75225
sample estimates:
mean of x mean of y
 255.2869  159.0054
The argument var.equal=TRUE is used to specify that the two samples are drawn from populations with the same variance. From the output, we can see that the null hypothesis is rejected at the 5% level, and therefore the two variables do not have the same population mean.
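If equal population variances cannot be assumed, the var.equal=TRUE argument may simply be dropped; t.test() then applies Welch's approximation for the degrees of freedom, which is its default behaviour.

> t.test(X1,X2)   # Welch two sample t-test, no equal-variance assumption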
4.3 Paired samples t-test
When the two samples are matched or related, the paired samples t-test is used. In this case the null hypothesis is again H0 : µ1 − µ2 = d versus H1 : µ1 − µ2 ≠ d, where µ1 and µ2 denote the unknown population means of the two variables and d denotes a given value. For d = 0, the test becomes a test of whether the population means of the two variables are the same. The test statistic is

t = \frac{\bar{d} - d}{s/\sqrt{n}}

where \bar{d} = \frac{1}{n}\sum_{j=1}^{n}(x_{1j} - x_{2j}) and s^2 = \frac{1}{n-1}\sum_{j=1}^{n}(x_{1j} - x_{2j} - \bar{d})^2. The test statistic follows a t distribution with n − 1 degrees of freedom. For illustration, consider the data in Table 1 and assume that the observations are matched or paired and that the variables are related. To test the equality of the population means of X1 and X2, the paired t-test can be performed using the code

> t.test(X1,X2,paired=TRUE)

        Paired t-test

data:  X1 and X2
t = 24.638, df = 25, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  88.23332 104.32975
sample estimates:
mean of the differences
               96.28154

Here the argument paired=TRUE specifies that the variables are related. The output clearly indicates that the population means of the two variables are not the same.
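As a quick check, the paired t-test is equivalent to a one sample t-test applied to the differences between the paired observations.

> t.test(X1-X2,mu=0)   # gives the same t, df and p-value as the paired test above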
4.4 Testing equality of several population means
To test the equality of the population means of more than two independent samples, an F-test is used. The null hypothesis is H0 : µ1 = µ2 = ... = µp versus H1 : at least two of the µ's are different. For this purpose, the total sum of squares in the data is partitioned into two components: the between-group sum of squares and the within-group sum of squares; this is why the test is also called the ANOVA F-test. The test statistic is given by

F = \frac{SS_{BG}/(p - 1)}{SS_{WG}/(n - p)}
where SSBG denotes the between-group sum of squares and SSWG denotes the within-group sum of squares. The test statistic follows an F distribution with p − 1 and n − p degrees of freedom, n being the total number of observations. A very common application of this test is in designed experiments, where SSBG becomes the sum of squares due to the treatments in the experiment and SSWG is the residual sum of squares. We illustrate the ANOVA F-test using the data in Table 1 by testing the equality of the population means of the four variables. The following R code may be used for this purpose.

> obs=c(X1,X2,X3,X4)
> group=c(rep(1,length(X1)),rep(2,length(X2)),rep(3,length(X3)),rep(4,length(X4)))
> mydata2=data.frame(obs,group)
> output=aov(obs~factor(group),data=mydata2)
> anova(output)
Analysis of Variance Table

Response: obs
               Df Sum Sq Mean Sq F value    Pr(>F)
factor(group)   3 203992   67997  104.42 < 2.2e-16 ***
Residuals     100  65118     651
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1
In the above code, the observations on the four variables are first stacked into a single vector obs, and a grouping variable group is created to indicate which variable each observation belongs to. The output shows that the null hypothesis is rejected at the 5% level of significance.
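Once the F-test rejects the null hypothesis, pairwise comparisons of the group means can be obtained from the fitted aov object, for example using Tukey's honest significant difference procedure.

> TukeyHSD(output)   # pairwise differences between the four group means with adjusted p-values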