Correlation and Regression

Correlation and Regression Scatterplots Correlation Explanatory and response variables Simple linear regression

General Principles of Data Analysis

First plot the data, then add numerical summaries

Look for overall patterns and deviations from those patterns

When overall pattern is regular, use a compact mathematical model to describe it

1

General Principles of Data Analysis Plot your data

To understand the data, always start with a series of graphs

Interpret what you see

Look for overall pattern and deviations from that pattern

Numerical summary?

Choose an appropriate measure to describe the pattern and deviation

Mathematical model?

If the pattern is regular, summarize the data in a compact mathematical model

Univariate Data Analysis Plot your data


For continuous variable, use stemplot, histogram For categorical variable, use bar chart, dot plot

Describe distribution in terms of shape, center, spread, outliers

Numerical summary?

Select measure of central tendency and spread: mean, standard deviation, five-number summary

Mathematical model?

If appropriate, use Normal distribution as model for overall pattern

2

Bivariate Data Analysis For two quantitative variables, use a scatterplot Plot your data


Describe the direction, form, and strength of the relationship

Numerical summary?

If pattern is roughly linear, summarize with correlation, means, and standard deviations

Mathematical model?

Regression gives a compact model of overall pattern, if relationship is roughly linear

Explanatory and Response Variables

Response variable measures outcome of a study

Explanatory variable explains or influences change in response variable

Response variables often called dependent variables, explanatory independent

“Independent” and “dependent” have other meanings in statistics, so prefer to avoid

But remember that calling one variable explanatory and other response doesn’t necessarily imply cause

3

Scatterplot

Scatterplot shows relationship between two quantitative variables measured on same cases

By convention, explanatory variable on horizontal axis, response on vertical axis

Each case appears as point fixed by values of both variables

Wine Consumption and Heart Attacks

Some evidence that drinking moderate amounts of wine my help prevent heart attacks

We have the following data from 19 countries

Yearly wine consumption (liters of alcohol from wine, per person) Yearly deaths (per 100,000 people) from heart disease

In Stata, use -twoway scatter-

Interpret a scatterplot

Overall pattern and striking deviations Form, direction, and strength Look for values that fall outside overall pattern of relationship (outlier)

4

Heart Disease Deaths (per 100,000 people) 50 100 150 200 250 300 0

2 4 6 8 Alcohol with Wine (liters per person per year)

10

Measuring Linear Association: Correlation

Correlation measures direction and strength of linear relationship between two quantitative variables

Usually written as r

In Stata, use -correlate- or -pwcorr-

Find and interpret correlation for wine and heart disease example

5

Using and Interpreting Correlation

Ranges from -1 to 1; values closer to |1| indicate stronger linear relationship

Positive values indicate positive association

Does not distinguish between explanatory and response variables

Requires that both variables be quantitative

It has no unit of measurement – because it uses standardized values, it is scale free

Measures strength only of linear relationships

Like mean and standard deviation, strongly affected by outlying observations

Least-Squares Regression Line

Correlation measures direction and strength of linear relationships

A regression line summarizes relationship between explanatory, x, and response variable, y

We can use regression line to predict value of y for a given value of x

These predictions have error, called residuals

The least-squares regression line of y is the line that minimizes residuals

6

Heart disease death rate (per 100,000) 50 100 150 200 250 300

Vertical distances from least-squares line are residuals

A good line for prediction makes these distances small

0


10

In Stata, obtain this graph with twoway (lfit heartdis alcohol) (scatter heartdis alcohol)

Regression in Stata regress regress heartdis heartdis alcohol alcohol Source SS df MS Source || SS df MS -------------+------------------------------------------+-----------------------------Model | 59813.5718 1 59813.5718 Model | 59813.5718 1 59813.5718 Residual 17 Residual || 24391.3756 24391.3756 17 1434.7868 1434.7868 -------------+------------------------------------------+-----------------------------Total | 84204.9474 18 4678.05263 Total | 84204.9474 18 4678.05263

Number 19 Number of of obs obs == 19 F( 17) F( 1, 1, 17) == 41.69 41.69 Prob > F = Prob > F = 0.0000 0.0000 R-squared = 0.7103 R-squared = 0.7103 Adj Adj R-squared R-squared == 0.6933 0.6933 Root MSE = Root MSE = 37.879 37.879

slope, b

----------------------------------------------------------------------------------------------------------------------------------------------------------heartdis Coef. tt P>|t| [95% heartdis || Coef. Std. Std. Err. Err. P>|t| [95% Conf. Conf. Interval] Interval] -------------+----------------------------------------------------------------------------+---------------------------------------------------------------alcohol | -22.96877 3.55739 -6.46 0.000 -30.4742 -15.46333 alcohol | -22.96877 3.55739 -6.46 0.000 -30.4742 -15.46333 _cons 18.83 231.3733 289.7534 _cons || 260.5634 260.5634 13.83536 13.83536 18.83 0.000 0.000 231.3733 289.7534 ----------------------------------------------------------------------------------------------------------------------------------------------------------y-intercept, a

y^ = a + bx Estimated heart disease death rate = 260.56 + (-22.97)(Per capita alcohol consumption)

7

Using and Interpreting Regression

Distinction between explanatory and response variables is essential in regression

There is a close connection between correlation and slope of least-squares line

Regression of y on x ≠ regression of x on y

b = r(sy / sx) Change of 1 sd in x corresponds to a change of r standard deviations in y

Square of correlation, r2, is fraction of variation in values of y explained by x (PRE)

Conditions for Inference

Observations are independent

True relationship is linear

Standard deviation of response about the true line is same everywhere

Response varies normally about true regression line

¾

Analysis of residuals is key to diagnosing violations of these conditions

8

-50

Residuals 0

50

Residuals-versus-Fitted Plot

50

100

150 Fitted values

200

250

In Stata, obtain this plot after regress with rvfplot, yline(0)

-50

Residuals 0

50

Residuals-versus-Predictor Plot

0


10

In Stata, obtain this plot after regress with rvpplot alcohol, yline(0)

9

300 200 100 0

Heart disease deaths per 100,000 people

95% Confidence Interval for Least-Squares Line

0


10

Obtain this graph with twoway (lfitci heartdis alcohol) (scatter heartdis alcohol)

10

Correlation and Regression

Correlation and Regression

Suggest Documents

Regression and Correlation

The Correlation and Linear Regression Analysis ...

Penalized Regression with Correlation Based Penalty

Correlation-Regression Analysis for Understanding Dominant Height

Applied Multiple Regression/Correlation Analysis for the

Read Applied Multiple Regression/Correlation ... - PDFKUL.COM

Nonparametric Regression for Analyzing Correlation between ... - AGH

Using Correlation and Regression: Mediation, Moderation, and More

Correlation, regression and path analysis among yield and yield traits ...

Power 3.1: Tests for correlation and regression anal

Tests for correlation and regression analyses - G*Power - Heinrich ...

RRreg: An R Package for Correlation and Regression Analyses of ...

Correlation, Regression and Path Analyses of Seed Yield ...

statistical diagnostics for bivariate correlation and regression analysis ...

Linear correlation and regression between the ... - Thermal Science

Scatter Plot, Correlation, and Regression on the TI-83/84

A Simplified Introduction to Correlation and Regression K. L. Weldon ...

Grey Correlation Analysis and Regression Estimation of Mining ... - Core

364-2008: Introduction to Correlation and Regression Analysis - SAS

POL 681 Problem Set 1: Regression and Correlation Review ...

Correlation and Regression in the Training of Teachers - Springer Link

Correlation, Regression and Test of Significance in R

REGRESSION AND CORRELATION ANALYSIS OF GRID SOIL DATA ...

Correlation and regression for prediction of wheat leaf ...