Intermediate IBM SPSS - Flinders University

Intermediate IBM SPSS
Correlation and Multiple Regression
Pawel Skuza
Statistical Consultant
eResearch@Flinders

• Please note that the workshop is intended as a brief introduction to the topic, and this PowerPoint is primarily designed to support the flow of the workshop. It cannot be seen as either an exclusive or exhaustive resource on the statistical concepts introduced in this course. You are encouraged to refer to the peer-reviewed books or papers listed throughout the presentation.
• It is acknowledged that a number of slides have been adapted from presentations produced by the previous statistical consultant (Kylie Lange) and a colleague with whom I worked in the past (Dr Kelvin Gregory).

Pawel Skuza 2013

Statistical Consulting Website: http://www.flinders.edu.au/library/research/eresearch/statistics-consulting/ or go to the Flinders University Website A-Z Index → S → Statistical Consultant

Introductory Level
• Introduction to IBM SPSS
• Introduction to Statistical Analysis

IBM SPSS - Intermediate Level
• Understanding Your Data (Descriptive Statistics, Graphs and Custom Tables)
• Correlation and Multiple Regression
• Logistic Regression and Survival Analysis
• Basic Statistical Techniques for Difference Questions
• Advanced Statistical Techniques for Difference Questions
• Longitudinal Data Analysis - Repeated Measures ANOVA
• Categorical Data Analysis

IBM SPSS - Advanced Level
• Structural Equation Modelling using Amos
• Linear Mixed Models
• Longitudinal Data Analysis - Mixed and Latent Variable Growth Curve Models
• Scale Development
• Complex Sample Survey Design / ABS and FaHCSIA Confidentialised Datasets


??? SPSS / PASW / IBM SPSS ???

• SPSS – Statistical Package for the Social Sciences
• PASW – Predictive Analytics Software
• IBM SPSS Statistics
• In late 2009, SPSS Inc. was taken over by IBM, and the software changed its official name twice within one year. From SPSS it was relabelled PASW (Predictive Analytics Software) and later IBM SPSS. Consequently, books, online resources, etc. may use any of these names while referring to the same software.

SPSS / PASW / IBM SPSS

(1) How to check?
• START SOFTWARE → HELP → ABOUT

(2) How to cite? (Examples with APA Style)
• SPSS Inc. Released 2007. SPSS for Windows, Version 16.0. Chicago: SPSS Inc.
• SPSS Inc. Released 2008. SPSS Statistics for Windows, Version 17.0. Chicago: SPSS Inc.
• SPSS Inc. Released 2009. PASW Statistics for Windows, Version 18.0. Chicago: SPSS Inc.
• IBM Corp. Released 2010. IBM SPSS Statistics for Windows, Version 19.0. Armonk, NY: IBM Corp.
• IBM Corp. Released 2011. IBM SPSS Statistics for Windows, Version 20.0. Armonk, NY: IBM Corp.
• IBM Corp. Released 2012. IBM SPSS Statistics for Windows, Version 21.0. Armonk, NY: IBM Corp.

IBM SPSS on Flinders University

• Flinders University holds a licence for a number of IBM SPSS products (versions 19, 20, 21) covering the following modules:
– IBM SPSS Statistics Base
– IBM SPSS Regression
– IBM SPSS Advanced Statistics
– IBM SPSS Complex Samples
– IBM SPSS Categories
– IBM SPSS Exact Tests
– IBM SPSS Missing Values
– IBM SPSS Forecasting
– IBM SPSS Custom Tables
– IBM SPSS Conjoint
– IBM SPSS Statistics Programmability Extension and AMOS
• For details explaining the various modes of obtaining access to the software, go to http://www.flinders.edu.au/library/research/eresearch/statistics-consulting/spsslicenses-and-technical-support/licenses-for-university-and-home.cfm

Levels of Measurement and Measurement Scales

• Ratio Data: differences between measurements, true zero exists
– Examples: Height, Age, Weekly Food Spending
• Interval Data: differences between measurements but no true zero
– Examples: Temperature in Celsius, Standardized exam score
• Ordinal Data: ordered categories (rankings, order, or scaling)
– Examples: Service quality rating, Student letter grades
• Nominal Data: categories (no ordering or direction)
– Examples: Marital status, Type of car owned, Gender/Sex

Selection of statistical methods

• Example 1: Figure 4.11 from Dancey, C. P., & Reidy, J. (2004). Statistics without maths for psychology : using SPSS for Windows (3rd ed.). New York: Prentice Hall.
• Example 2: Table from Pallant, J. (2007). SPSS Survival Manual : A step by step guide to data analysis using SPSS for Windows (SPSS Version 15) (3rd ed.). Maidenhead, Berkshire. U.K. ; New York, NY: Open University Press.
• Example 3: Flowchart from http://gjyp.nl/marta/Flowchart%20(English).pdf
• Similar ones in other resources …

Selection of an Appropriate Inferential Statistic for Basic, Two Variable, Associational Questions or Hypotheses

RELATE: Two Variables or Scores for the Same or Related Subjects

Level (Scale) of Measurement of Both Variables, with the matching statistic and SPSS menu path:
• Variables Are Both Normal/Scale and Assumptions Not Markedly Violated (Parametric Statistics, MEANS)
– PEARSON CORRELATION / BIVARIATE REGRESSION
– Analyze → Correlate → Bivariate
• Both Variables at Least Ordinal Data or Assumptions Markedly Violated (Nonparametric Statistics, RANKS)
– KENDALL'S TAU-B or SPEARMAN'S RANK ORDER CORRELATION (RHO)
– Analyze → Correlate → Bivariate
• One Variable Is Normal/Scale and One Is Nominal
– ETA
– Analyze → Descriptive Statistics → Crosstabs
• Both Variables Are Nominal or Dichotomous (Nonparametric Statistics, COUNTS)
– PHI or CRAMER'S V
– Analyze → Descriptive Statistics → Crosstabs

Reproduced from (Leech, Barrett, & Morgan, 2008, p. 75 and p. 81)

Data Used
• Simplified data from the PISA 2003 Study – Australia & Indonesia (The Programme for International Student Assessment)
• http://www.pisa.oecd.org

Concept of Correlation
• Measures of correlation
– Used to describe the relationship between two variables
• Does mathematics achievement co-vary with attitude towards mathematics?
– Poor attitude, poor achievement
– Good attitude, good achievement
• A coefficient of correlation is a statistical summary of the degree and direction of relationship or association

Concept of Correlation
• Correlation coefficients measure the strength of association between two continuous variables
• Of interest is whether one variable generally increases as the second increases, whether it decreases as the second increases, or whether their patterns of variation are totally unrelated.
• Correlation measures observed co-variation
• It does not provide evidence for a causal relationship between the two variables.
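As a concrete instance of "a statistical summary of the degree and direction of relationship", the most widely used coefficient, Pearson's r, can be written as below. This is the standard textbook definition rather than a formula taken from the slides; the symbols n (number of paired observations) and the sample means are notation assumed here.

```latex
r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}
         {\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\;\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}},
\qquad -1 \le r \le 1
```

The sign of r gives the direction of the relationship, and its absolute value gives the degree: values near 0 indicate little linear co-variation, values near 1 in absolute value indicate a nearly perfect linear relationship.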

Monotonic or Linear Correlation
• Data may be correlated in either a linear or nonlinear fashion.
• When y generally increases or decreases as x increases, the two variables are said to possess a monotonic correlation.
– This correlation may be nonlinear, with exponential patterns, piecewise linear patterns, or patterns similar to power functions when both variables are non-negative.
[Figure: three scatterplot panels labelled "Monotonic, linear", "Monotonic, non-linear" and "Nonmonotonic"]

Measures of Correlation
• Three measures of correlation are in common use
– Kendall's tau
– Spearman's rho
– Pearson's r
• The first two are based on ranks, and measure all monotonic relationships
– They are also resistant to effects of outliers
• The more commonly-used Pearson's r is a measure of linear correlation
– One specific type of monotonic correlation.
• None of the measures will detect nonmonotonic relationships, where the pattern doubles back on itself.

Monotonic or Linear Correlation
[Figure: three X-Y scatterplot panels labelled "Linear, use Pearson’s r", "Monotonic, non-linear, use Kendall tau or Spearman rho" and "Not suitable for correlation"]
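The contrast between the three measures can be sketched in a few lines of plain Python. This is an illustrative sketch only: the data are made up, the function names are hypothetical, ties are not handled, and the Kendall variant shown is tau-a (SPSS reports tau-b, which additionally corrects for ties).

```python
import math

def pearson_r(x, y):
    """Pearson's r: co-variation of x and y scaled by their spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks 1..n (assumes no tied values, to keep the sketch short)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant pairs) / all pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)

# Made-up illustration data
x = list(range(1, 11))
y_monotonic = [math.exp(v) for v in x]        # monotonic, non-linear
y_nonmonotonic = [(v - 5.2) ** 2 for v in x]  # U-shaped, doubles back

for label, y in [("monotonic, non-linear", y_monotonic),
                 ("nonmonotonic", y_nonmonotonic)]:
    print(label,
          round(pearson_r(x, y), 3),
          round(spearman_rho(x, y), 3),
          round(kendall_tau(x, y), 3))
```

On the exponential data the two rank-based measures equal 1 while Pearson's r is noticeably lower; on the U-shaped data all three are close to zero even though y depends on x exactly.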

Some Examples
• If the height of a person increases, what happens to the weight?
• If a coach increases the training schedule, what happens to the fitness level of the team?
• If the time working with SPSS increases, what happens to the proficiency in statistics?

Scatterplots
• Often a good understanding of the correlation between two variables can be obtained using a scatterplot
• The main purpose of the scatterplot is to study the relationship between two variables
– But keep in mind that scatterplots can be deceptive

Graphical Examples
[Figures: scatterplots of Variable Y (Dependent) against Variable X (Independent), both axes running from Low to High, one panel per pattern]
• Perfect positive
• Perfect negative
• No relationship
• Positive, less than 1.00
• Negative, greater than -1.00

Correlation Between Height and Weight
[Figure: scatterplot of Mass (kg) against Height (m), with two points marked "Outlier?"]
• Imagine drawing an oval around the points
– Capture most (all) of the points with a single oval

Consequences of Not Ignoring Outliers
• Outliers may distort correlation
• Circle shape
– No relationship between height and weight
[Figure: scatterplot against Height (m)]

Consequences of Ignoring Outliers
• Outliers may distort correlation
• Strong relationship where none exists
[Figure: scatterplot against Height (m)]
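How strongly a single point can distort Pearson's r can be sketched with a small made-up data set. The heights, masses and the extreme observation below are hypothetical values chosen for illustration, not workshop data.

```python
import math

def pearson_r(x, y):
    """Pearson's r computed from first principles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: at every height both a light and a heavy person,
# so height and mass are unrelated in this small sample.
heights = [1.60, 1.60, 1.70, 1.70, 1.80, 1.80]   # metres
masses = [60.0, 80.0, 60.0, 80.0, 60.0, 80.0]    # kilograms

r_without = pearson_r(heights, masses)

# Add one extreme (hypothetical) observation: very tall and very heavy.
r_with = pearson_r(heights + [2.10], masses + [140.0])

print("r without the extreme point:", round(r_without, 3))
print("r with the extreme point:   ", round(r_with, 3))
```

Without the extreme point the correlation is zero; with it, r jumps to a large positive value, which is why a scatterplot should be inspected before a coefficient is interpreted.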

Outliers in More Depth
• What should you do with outliers?
• Many statistical techniques aim to describe patterns in the data
– They describe relationships
• Outliers affect these relationships
• Should outliers be
– Deleted?
– Investigated?
– Or simply left alone…

Dealing with Outliers
• Outlier may be due to a data entry problem
– Check data entry
– Implement quality control measures at data entry stage
– “Garbage in, garbage out”
• Outlier may be from a different population or sample
– Check sample
• Outlier may be due to unusual behavior
– For example, illness or upset relating to a test
– May be excluded only after thorough checking and documentation

Correlations
• Zero-order correlation
– Simple correlation between predictor and dependent, ignoring any other variables
• Partial correlation
– Contribution of other predictors is removed from both the predictor and the dependent
• Part (semi-partial) correlation
– Contribution of other predictors is removed from the predictor only
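For the case of one additional predictor, the three kinds of correlation can be computed from the pairwise zero-order correlations using the standard first-order formulas. The sketch below assumes a predictor x, a dependent variable y and one other predictor z; the function names and the three input correlations are hypothetical.

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of x and y with z removed from BOTH x and y."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def part_corr(r_xy, r_xz, r_yz):
    """Part (semi-partial) correlation: z removed from the predictor x only."""
    return (r_xy - r_xz * r_yz) / math.sqrt(1 - r_xz ** 2)

# Hypothetical zero-order correlations
r_xy, r_xz, r_yz = 0.6, 0.5, 0.7

zero_order = r_xy
partial = partial_corr(r_xy, r_xz, r_yz)
part = part_corr(r_xy, r_xz, r_yz)

print("zero-order:", zero_order)
print("partial:   ", round(partial, 3))
print("part:      ", round(part, 3))
```

With these inputs both adjusted coefficients are smaller than the zero-order correlation, and the part correlation is smaller in magnitude than the partial correlation because its denominator leaves the variance of y untouched.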

Selection of the Appropriate Complex Associational Statistic for Predicting a Single Dependent/Outcome Variable from Several Independent Variables

SEVERAL INDEPENDENT OR PREDICTOR VARIABLES

One Dependent or Outcome Variable: Normal/Scale (Continuous)
• All Normal / Scale: MULTIPLE REGRESSION
• Some Normal, Some or all Dichotomous (2 categories): MULTIPLE REGRESSION or GENERAL LINEAR MODEL
• Some or all Nominal (Categorical with more than 2 categories): GENERAL LINEAR MODEL
• Normal and/or Dichotomous, with at least one random and/or nested variable: LINEAR MIXED MODELS

One Dependent or Outcome Variable: Dichotomous
• All Normal / Scale: DISCRIMINANT ANALYSIS
• Some Normal, Some or all Dichotomous (2 categories): LOGISTIC REGRESSION
• Some or all Nominal (Categorical with more than 2 categories): LOGISTIC REGRESSION
• Normal and/or Dichotomous, with at least one random and/or nested variable: Generalized Estimating Equations

Reproduced from (Leech, Barrett, & Morgan, 2008, p. 75)

Selection of the Appropriate Complex Associational Statistic for Predicting a Single Dependent/Outcome Variable from Several Independent Variables (SPSS menu paths)

SEVERAL INDEPENDENT OR PREDICTOR VARIABLES

One Dependent or Outcome Variable: Normal/Scale (Continuous)
• All Normal / Scale: Analyze → Regression → Linear
• Some Normal, Some or all Dichotomous (2 categories): Analyze → Regression → Linear or Analyze → General Linear Model → Univariate
• Some or all Nominal (Categorical with more than 2 categories): Analyze → General Linear Model → Univariate
• Normal and/or Dichotomous, with at least one random and/or nested variable: Analyze → Mixed Models → Linear

One Dependent or Outcome Variable: Dichotomous
• All Normal / Scale: Analyze → Classify → Discriminant
• Some Normal, Some or all Dichotomous (2 categories): Analyze → Regression → Binary Logistic
• Some or all Nominal (Categorical with more than 2 categories): Analyze → Regression → Binary Logistic
• Normal and/or Dichotomous, with at least one random and/or nested variable: Analyze → Generalized Linear Models → Generalized Estimating Equations

Introduction to Regression Analysis
• Regression analysis is used to:
– Predict the value of a dependent variable based on the value of at least one independent variable
– Explain the impact of changes in an independent variable on the dependent variable
• Dependent variable: the variable we wish to predict or explain
• Independent variable: the variable used to explain the dependent variable

Simple Linear Regression Model
• Only one independent variable, X
• Relationship between X and Y is described by a linear function
• Changes in Y are assumed to be caused by changes in X

Types of Relationships
[Figure: four X-Y scatterplot panels illustrating linear relationships and curvilinear relationships]

Simple Linear Regression Model

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

• $Y_i$: Dependent Variable
• $\beta_0$: Population Y intercept
• $\beta_1$: Population Slope Coefficient
• $X_i$: Independent Variable
• $\varepsilon_i$: Random Error term
• Linear component: $\beta_0 + \beta_1 X_i$
• Random Error component: $\varepsilon_i$

Simple Linear Regression Model
[Figure: the line $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ plotted as Y against X, with Intercept = $\beta_0$ and Slope = $\beta_1$; at a given $X_i$ the Observed Value of Y and the Predicted Value of Y differ by $\varepsilon_i$, the Random Error for this $X_i$ value]

Simple Linear Regression Equation (Prediction Line)
The simple linear regression equation provides an estimate of the population regression line

$$\hat{Y}_i = b_0 + b_1 X_i$$

• $\hat{Y}_i$: Estimated (or predicted) Y value for observation i
• $b_0$: Estimate of the regression intercept
• $b_1$: Estimate of the regression slope
• $X_i$: Value of X for observation i

The individual random error terms $e_i$ have a mean of zero

Least Squares Method
• $b_0$ and $b_1$ are obtained by finding the values of $b_0$ and $b_1$ that minimize the sum of the squared differences between $Y$ and $\hat{Y}$:

$$\min \sum (Y_i - \hat{Y}_i)^2 = \min \sum \left(Y_i - (b_0 + b_1 X_i)\right)^2$$

Interpretation of the Slope and the Intercept
• b0 is the estimated average value of Y when the value of X is zero
• b1 is the estimated change in the average value of Y as a result of a one-unit change in X
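The least squares estimates for a single predictor have a closed form, which the sketch below applies to a small made-up data set. The data values and function names are hypothetical; the formulas (slope as co-variation over variation in X, intercept from the means) are the standard ones and are not quoted from the slides.

```python
def least_squares(x, y):
    """Return (b0, b1) minimising the sum of squared residuals."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b1 = sxy / sxx                 # estimated slope
    b0 = mean_y - b1 * mean_x      # estimated intercept
    return b0, b1

def sse(x, y, b0, b1):
    """Sum of squared differences between observed and predicted Y."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Made-up illustration data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print("b0 =", round(b0, 3), "b1 =", round(b1, 3))
print("mean residual =", round(sum(residuals) / len(residuals), 10))
print("SSE at the estimates:", round(sse(x, y, b0, b1), 4))
print("SSE with b1 lowered by 0.1:", round(sse(x, y, b0, b1 - 0.1), 4))
```

Any other choice of intercept or slope gives a larger sum of squared residuals than the least squares estimates, and the residuals of the fitted line average to zero.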

Linear Regression
• Unstandardised coefficients (B, b)
– The effect a 1 unit change in the predictor has on the outcome
• Standardised coefficients (β)
– What the coefficients would be if all variables were standardised to the same mean and standard deviation
– Allows comparison of relative importance of predictors measured in different units
• Conditional on all other variables in the model

Variable selection
• What variables should be included in the model?
– All theoretically relevant variables
– Interesting variables identified during preliminary analyses
• Methods of variable selection
– Enter (block): all variables entered simultaneously
– Forward / Backward: sequentially insert OR eliminate variables
– Stepwise: sequentially insert AND eliminate variables
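Returning to the unstandardised and standardised coefficients described under Linear Regression, the following sketch shows why only the standardised version is comparable across measurement units. It assumes a single predictor, where the standardised coefficient can be obtained by rescaling the slope with the two standard deviations; the data and function names are hypothetical.

```python
import statistics

def slope(x, y):
    """Unstandardised least squares slope b for one predictor."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

def standardised_beta(x, y):
    """Standardised coefficient: b rescaled by SD(x) / SD(y)."""
    return slope(x, y) * statistics.stdev(x) / statistics.stdev(y)

# Made-up data: the same heights expressed in metres and in centimetres
height_m = [1.60, 1.65, 1.70, 1.75, 1.80, 1.85]
height_cm = [100 * h for h in height_m]
mass_kg = [55.0, 62.0, 64.0, 70.0, 77.0, 80.0]

b_m, b_cm = slope(height_m, mass_kg), slope(height_cm, mass_kg)
beta_m = standardised_beta(height_m, mass_kg)
beta_cm = standardised_beta(height_cm, mass_kg)

print("b per metre:     ", round(b_m, 3))
print("b per centimetre:", round(b_cm, 3))
print("beta (metres):     ", round(beta_m, 3))
print("beta (centimetres):", round(beta_cm, 3))
```

Changing the unit of the predictor changes b by a factor of 100 but leaves the standardised coefficient unchanged. With several predictors the coefficients are additionally conditional on the other variables in the model, which this single-predictor sketch does not show.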

Variable selection
• Automatic variable selection methods (forward, backward, stepwise) are problematic
– Multiple testing
– Capitalises on chance relationships
– Overfits to the sample data and won’t replicate in other samples
• Needs a larger sample
• Cross-validation to check results

Variable selection
• Sequential / stepped / hierarchical entry
– Specify the order that variables should be entered into the model
• Eg:
– Block 1: confounders
– Block 2: intervention variables
• Does the intervention contribute anything in addition to the confounding variables?

Linear Regression
• Categorical predictors
– Dummy variables needed if more than 2 categories
– Choose a reference category and create 0/1 variables for each other category
– Tests whether Cat X differs significantly in the outcome compared to the reference category
– Other schemes possible for other comparisons (eg, ordinal variables)

Dummy-Variable Example (with 2 Levels)

Holiday: $$\hat{Y} = b_0 + b_1 X_1 + b_2(1) = (b_0 + b_2) + b_1 X_1$$
No Holiday: $$\hat{Y} = b_0 + b_1 X_1 + b_2(0) = b_0 + b_1 X_1$$

[Figure: Y (sales) against X1 (Price); two parallel lines with the same slope and different intercepts, b0 + b2 for Holiday and b0 for No Holiday]
• If H0: β2 = 0 is rejected, then “Holiday” has a significant effect on pie sales
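The dummy-variable idea can be sketched in two small pieces: building 0/1 variables against a reference category, and seeing that a 0/1 "Holiday" variable shifts the intercept while leaving the slope alone. The coefficient values, category labels and function names below are hypothetical and chosen only for illustration; they are not estimates from the pie sales example.

```python
def dummy_code(values, reference):
    """One 0/1 variable per category, except the reference category."""
    categories = sorted(set(values) - {reference})
    return {c: [1 if v == c else 0 for v in values] for c in categories}

# A three-category predictor needs two dummy variables.
status = ["single", "married", "divorced", "married"]
dummies = dummy_code(status, reference="single")
print(dummies)

# Hypothetical coefficients for: sales = b0 + b1 * price + b2 * holiday
b0, b1, b2 = 300.0, -30.0, 15.0

def predict_sales(price, holiday):
    """Predicted sales; holiday is 1 (Holiday) or 0 (No Holiday)."""
    return b0 + b1 * price + b2 * holiday

for price in (5.0, 8.0):
    gap = predict_sales(price, 1) - predict_sales(price, 0)
    print("price", price, "holiday effect", gap)
```

The gap between the Holiday and No Holiday predictions equals b2 at every price, which is what "different intercept, same slope" means in the plot.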

Model fit
• R-square / Adjusted R-square
• No set values for “good” models
• Compare to other models on similar data

Multiple Regression Assumptions

Errors (residuals) from the regression model: