Intermediate IBM SPSS Correlation and Multiple Regression Pawel Skuza Statistical Consultant eResearch@Flinders
• Please note that this workshop is intended as a brief introduction to the topic, and this PowerPoint is primarily designed to support the flow of the workshop. It cannot be seen as either an exclusive or exhaustive resource on the statistical concepts introduced in this course. You are encouraged to refer to the peer-reviewed books or papers listed throughout the presentation.
• It is acknowledged that a number of slides have been adapted from presentations produced by the previous statistical consultant (Kylie Lange) and a colleague with whom I worked in the past (Dr Kelvin Gregory).
Pawel Skuza 2013
Statistical Consulting Website
http://www.flinders.edu.au/library/research/eresearch/statistics-consulting/
or go to the Flinders University website A-Z Index, then S, then Statistical Consultant
Introductory Level
• Introduction to IBM SPSS
• Introduction to Statistical Analysis
IBM SPSS - Intermediate Level
• Understanding Your Data (Descriptive Statistics, Graphs and Custom Tables)
• Correlation and Multiple Regression
• Logistic Regression and Survival Analysis
• Basic Statistical Techniques for Difference Questions
• Advanced Statistical Techniques for Difference Questions
• Longitudinal Data Analysis - Repeated Measures ANOVA
• Categorical Data Analysis
IBM SPSS - Advanced Level
• Structural Equation Modelling using Amos
• Linear Mixed Models
• Longitudinal Data Analysis - Mixed and Latent Variable Growth Curve Models
• Scale Development
• Complex Sample Survey Design / ABS and FaHCSIA Confidentialised Datasets
??? SPSS / PASW / IBM SPSS ???
• SPSS – Statistical Package for the Social Sciences
• PASW – Predictive Analytics Software
• IBM SPSS Statistics
• In late 2009 SPSS Inc. was taken over by IBM, and the software changed its official name twice over the period of one year. From SPSS it was relabelled to PASW (Predictive Analytics Software) and later to IBM SPSS. Consequently, there may be books, online resources, etc. that use any of those names but in fact refer to the same software.
IBM SPSS at Flinders University
• Flinders University has a licence for a number of IBM SPSS products (versions 19, 20, 21) covering the following modules:
– IBM SPSS Statistics Base
– IBM SPSS Regression
– IBM SPSS Advanced Statistics
– IBM SPSS Complex Samples
– IBM SPSS Categories
– IBM SPSS Exact Tests
– IBM SPSS Missing Values
– IBM SPSS Forecasting
– IBM SPSS Custom Tables
– IBM SPSS Conjoint
– IBM SPSS Statistics Programmability Extension and AMOS
• For details explaining the various modes of obtaining access to the software go to
http://www.flinders.edu.au/library/research/eresearch/statistics-consulting/spsslicenses-and-technical-support/licenses-for-university-and-home.cfm
SPSS / PASW / IBM SPSS
(1) How to check?
• START SOFTWARE > HELP > ABOUT
(2) How to cite? (Examples with APA Style)
• SPSS Inc. Released 2007. SPSS for Windows, Version 16.0. Chicago: SPSS Inc.
• SPSS Inc. Released 2008. SPSS Statistics for Windows, Version 17.0. Chicago: SPSS Inc.
• SPSS Inc. Released 2009. PASW Statistics for Windows, Version 18.0. Chicago: SPSS Inc.
• IBM Corp. Released 2010. IBM SPSS Statistics for Windows, Version 19.0. Armonk, NY: IBM Corp.
• IBM Corp. Released 2011. IBM SPSS Statistics for Windows, Version 20.0. Armonk, NY: IBM Corp.
• IBM Corp. Released 2012. IBM SPSS Statistics for Windows, Version 21.0. Armonk, NY: IBM Corp.
Levels of Measurement and Measurement Scales
• Ratio Data
– Differences between measurements, true zero exists
– Examples: Height, Age, Weekly Food Spending
• Interval Data
– Differences between measurements but no true zero
– Examples: Temperature in Celsius, Standardized exam score
• Ordinal Data
– Ordered Categories (rankings, order, or scaling)
– Examples: Service quality rating, Student letter grades
• Nominal Data
– Categories (no ordering or direction)
– Examples: Marital status, Type of car owned, Gender/Sex
Selection of statistical methods
• Example 1: Figure 4.11 from Dancey, C. P., & Reidy, J. (2004). Statistics without maths for psychology : using SPSS for Windows (3rd ed.). New York: Prentice Hall.
• Example 2: Table from Pallant, J. (2007). SPSS Survival Manual : A step by step guide to data analysis using SPSS for Windows (SPSS Version 15) (3rd ed.). Maidenhead, Berkshire, U.K.; New York, NY: Open University Press.
• Example 3: Flowchart from http://gjyp.nl/marta/Flowchart%20(English).pdf
• Similar ones in other resources …
Selection of an Appropriate Inferential Statistic for Basic, Two Variable, Associational Questions or Hypotheses
Two Variables or Scores for the Same or Related Subjects, by Level (Scale) of Measurement of Both Variables:
• Variables Are Both Normal /Scale and Assumptions Not Markedly Violated (Parametric Statistics)
– RELATE: MEANS
– Statistic: PEARSON CORRELATION / BIVARIATE REGRESSION
– SPSS: Analyze > Correlate > Bivariate
• Both Variables at Least Ordinal Data or Assumptions Markedly Violated (Nonparametric Statistics)
– RELATE: RANKS
– Statistic: KENDALL'S TAU-B or SPEARMAN'S RANK ORDER CORRELATION (RHO)
– SPSS: Analyze > Correlate > Bivariate
• One Variable Is Normal /Scale and One Is Nominal
– Statistic: ETA
– SPSS: Analyze > Descriptive Statistics > Crosstabs
• Both Variables Are Nominal or Dichotomous (Nonparametric Statistics)
– RELATE: COUNTS
– Statistic: PHI or CRAMER'S V
– SPSS: Analyze > Descriptive Statistics > Crosstabs
Reproduced from (Leech, Barrett, & Morgan, 2008, pp. 75, 81)
Data Used
• Simplified data from PISA 2003 Study – Australia & Indonesia (The Programme for International Student Assessment)
• http://www.pisa.oecd.org
Concept of Correlation
• Measures of correlation
– Used to describe the relationship between two variables
• Does mathematics achievement co-vary with attitude towards mathematics?
– Poor attitude, poor achievement
– Good attitude, good achievement
• A coefficient of correlation is a statistical summary of the degree and direction of relationship or association
Concept of Correlation
• Correlation coefficients measure the strength of association between two continuous variables
• Of interest is whether one variable generally increases as the second increases, whether it decreases as the second increases, or whether their patterns of variation are totally unrelated.
• Correlation measures observed co-variation
• It does not provide evidence for a causal relationship between the two variables.
Monotonic or Linear Correlation
• Data may be correlated in either a linear or nonlinear fashion.
• When y generally increases or decreases as x increases, the two variables are said to possess a monotonic correlation.
– This correlation may be nonlinear, with exponential patterns, piecewise linear patterns, or patterns similar to power functions when both variables are non-negative.
Monotonic or Linear Correlation
[Figure: three scatterplot panels labelled "Monotonic, linear", "Monotonic, non-linear" and "Nonmonotonic"]
Measures of Correlation
• Three measures of correlation are in common use
– Kendall's tau
– Spearman's rho
– Pearson's r
• The first two are based on ranks, and measure all monotonic relationships
– They are also resistant to effects of outliers
• The more commonly-used Pearson's r is a measure of linear correlation
– One specific type of monotonic correlation.
• None of the measures will detect nonmonotonic relationships, where the pattern doubles back on itself
Monotonic or Linear Correlation
[Figure: three panels of Y against X labelled "Linear, use Pearson's r", "Monotonic, non-linear, use Kendall tau or Spearman rho" and "Not suitable for correlation"]
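The difference between the rank-based measures and Pearson's r can be illustrated with a small sketch. The data (y = 2^x) and the helper names `pearson_r`, `average_ranks`, `spearman_rho` and `kendall_tau` are illustrative assumptions, not part of the workshop material. The Kendall function is the simple tau-a form, which agrees with the tau-b reported by SPSS only when there are no tied values.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r: strength of the linear relationship between x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs. Assumes no ties."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)

# Made-up monotonic but non-linear data: y doubles with every unit of x.
x = list(range(1, 11))
y = [2 ** v for v in x]

print("Pearson r:   ", round(pearson_r(x, y), 3))
print("Spearman rho:", spearman_rho(x, y))
print("Kendall tau: ", kendall_tau(x, y))
```

Because y always increases with x, both rank-based measures report a perfect monotonic relationship, while Pearson's r is noticeably lower because the points do not lie on a straight line.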
Some Examples
• If height of a person increases, what happens to the weight?
• If a coach increases the training schedule, what happens to the fitness level of the team?
• If the time working with SPSS increases, what happens to the proficiency in statistics?
Scatterplots
• Often a good understanding of the correlation between two variables can be obtained using a scatterplot
• The main purpose of the scatterplot is to study the relationship between two variables
– But keep in mind that scatterplots can be deceptive
Graphical Examples
[Figures: scatterplots of Variable Y (Dependent) against Variable X (Independent), both axes running from Low to High]
• Perfect positive
• Perfect negative
• No relationship
• Positive, less than 1.00
• Negative, greater than -1.00
Correlation Between Height and Weight
[Figure: scatterplot with axes Height (m) and Mass (kg); some points are marked "Outlier?"]
• Imagine drawing an oval around the points
– Capture most (all) of the points with a single oval
Consequences of Not Ignoring Outliers
• Outliers may distort correlation
• Circle shape
– No relationship between height and weight
Consequences of Ignoring Outliers
• Outliers may distort correlation
• Strong relationship where none exists
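How a single outlier can create a "strong relationship where none exists" can be sketched with a few made-up numbers. The values and the helper name `pearson_r` below are assumptions chosen for illustration only; they are not the height and weight data shown on the slides.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r computed from sums of squares and cross-products."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Six made-up points with no linear relationship at all.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 3, 3, 1, 2]
r_without = pearson_r(x, y)

# The same points plus one extreme observation far from the rest.
x_with = x + [20]
y_with = y + [15]
r_with = pearson_r(x_with, y_with)

print("r without the outlier:", round(r_without, 3))
print("r with the outlier:   ", round(r_with, 3))
```

The six original points are uncorrelated, yet adding one distant point pushes r above 0.9, which is why the scatterplot should be inspected before a correlation coefficient is interpreted.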
Outliers in More Depth
• What should you do with outliers?
• Many statistical techniques aim to describe patterns in the data
– They describe relationships
• Outliers affect these relationships
• Should outliers be
– Deleted?
– Investigated?
– Or simply left alone…
Dealing with Outliers
• Outlier may be due to a data entry problem
– Check data entry
– Implement quality control measures at data entry stage
– “Garbage in, garbage out”
• Outlier may be from a different population or sample
– Check sample
• Outlier may be due to unusual behavior
– For example, illness or upset relating to a test
– May be excluded only after thorough checking and documentation
Correlations
• Zero-order correlation
– Simple correlation between predictor and dependent, ignoring any other variables
• Partial correlation
– Contribution of other predictors is removed from the relationship (from both the predictor and the dependent)
• Part (semi-partial) correlation
– Contribution of other predictors is removed from the predictor only
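For the simplest case of one additional predictor, the three kinds of correlation can be related through the standard first-order formulas. The sketch below assumes a predictor x, a dependent variable y and one other predictor z; the function names and the example correlations are hypothetical.

```python
from math import sqrt

def partial_r(r_xy, r_xz, r_yz):
    """Partial correlation of x and y: z is removed from both x and y."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def part_r(r_xy, r_xz, r_yz):
    """Part (semi-partial) correlation: z is removed from the predictor x only."""
    return (r_xy - r_xz * r_yz) / sqrt(1 - r_xz ** 2)

# Hypothetical zero-order correlations among x, y and z.
r_xy, r_xz, r_yz = 0.5, 0.5, 0.5

print("zero-order:", r_xy)
print("partial:   ", round(partial_r(r_xy, r_xz, r_yz), 4))
print("part:      ", round(part_r(r_xy, r_xz, r_yz), 4))
```

When z is unrelated to both x and y, all three values coincide; otherwise the part correlation is never larger in absolute value than the partial correlation, because its denominator leaves the variance of y untouched.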
Selection of the Appropriate Complex Associational Statistic for Predicting a Single Dependent/Outcome Variable from Several Independent Variables
One Dependent or Outcome Variable: Normal/Scale (Continuous)
• Several independent or predictor variables, All Normal / Scale
– MULTIPLE REGRESSION
– SPSS: Analyze > Regression > Linear
• Some Normal, Some or all Dichotomous (2 categories)
– MULTIPLE REGRESSION or GENERAL LINEAR MODEL
– SPSS: Analyze > Regression > Linear, or Analyze > General Linear Model > Univariate
• Some or all Nominal (Categorical with more than 2 categories)
– GENERAL LINEAR MODEL
– SPSS: Analyze > General Linear Model > Univariate
• Normal and/or Dichotomous, with at least one random and/or nested variable
– LINEAR MIXED MODELS
– SPSS: Analyze > Mixed Models > Linear
One Dependent or Outcome Variable: Dichotomous
• Several independent or predictor variables, All Normal / Scale
– DISCRIMINANT ANALYSIS
– SPSS: Analyze > Classify > Discriminant
• Some Normal, Some or all Dichotomous (2 categories)
– LOGISTIC REGRESSION
– SPSS: Analyze > Regression > Binary Logistic
• Some or all Nominal (Categorical with more than 2 categories)
– LOGISTIC REGRESSION
– SPSS: Analyze > Regression > Binary Logistic
• Normal and/or Dichotomous, with at least one random and/or nested variable
– Generalized Estimating Equations
– SPSS: Analyze > Generalized Linear Models > Generalized Estimating Equations
Reproduced from (Leech, Barrett, & Morgan, 2008, p. 75)
Introduction to Regression Analysis
• Regression analysis is used to:
– Predict the value of a dependent variable based on the value of at least one independent variable
– Explain the impact of changes in an independent variable on the dependent variable
• Dependent variable: the variable we wish to predict or explain
• Independent variable: the variable used to explain the dependent variable
Simple Linear Regression Model
• Only one independent variable, X
• Relationship between X and Y is described by a linear function
• Changes in Y are assumed to be caused by changes in X
Types of Relationships
[Figure: four panels of Y against X showing linear relationships and curvilinear relationships]
Simple Linear Regression Model
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
• $Y_i$: Dependent Variable
• $\beta_0$: Population Y intercept
• $\beta_1$: Population Slope Coefficient
• $X_i$: Independent Variable
• $\varepsilon_i$: Random Error term
• $\beta_0 + \beta_1 X_i$ is the linear component; $\varepsilon_i$ is the random error component
Simple Linear Regression Model
[Figure: Y against X with the line $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, marking Intercept = $\beta_0$, Slope = $\beta_1$, the observed value of Y for $X_i$, the predicted value of Y for $X_i$, and the random error $\varepsilon_i$ for this $X_i$ value]
Simple Linear Regression Equation (Prediction Line)
The simple linear regression equation provides an estimate of the population regression line
$$\hat{Y}_i = b_0 + b_1 X_i$$
• $\hat{Y}_i$: Estimated (or predicted) Y value for observation i
• $b_0$: Estimate of the regression intercept
• $b_1$: Estimate of the regression slope
• $X_i$: Value of X for observation i
The individual random error terms $e_i$ have a mean of zero
Least Squares Method
• $b_0$ and $b_1$ are obtained by finding the values of $b_0$ and $b_1$ that minimize the sum of the squared differences between $Y$ and $\hat{Y}$:
$$\min \sum (Y_i - \hat{Y}_i)^2 = \min \sum \bigl(Y_i - (b_0 + b_1 X_i)\bigr)^2$$
Interpretation of the slope and the Intercept
• $b_0$ is the estimated average value of Y when the value of X is zero
• $b_1$ is the estimated change in the average value of Y as a result of a one-unit change in X
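For one predictor, the least squares criterion has a well-known closed-form solution: b1 is the sum of cross-products divided by the sum of squares of X, and b0 places the line through the point of means. The sketch below assumes made-up data and a hypothetical helper name `fit_line`; SPSS reports the same estimates as the unstandardised coefficients.

```python
def fit_line(x, y):
    """Return (b0, b1) minimising the sum of squared residuals of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx                 # estimated slope
    b0 = mean_y - b1 * mean_x      # estimated intercept
    return b0, b1

# Made-up observations that lie close to a straight line.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)
predicted = [b0 + b1 * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

print("b0 =", round(b0, 3), " b1 =", round(b1, 3))
print("mean residual =", round(sum(residuals) / len(residuals), 10))
```

The mean of the residuals comes out as zero (up to rounding), which is the sample counterpart of the statement that the error terms have a mean of zero.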
Linear Regression
• Unstandardised coefficients (B, b)
– The effect a 1 unit change in the predictor has on the outcome
– Conditional on all other variables in the model
• Standardised coefficients (β)
– What the coefficients would be if all predictors had the same mean and standard deviation
– Allows comparison of relative importance of predictors measured in different units
Variable selection
• What variables should be included in the model?
– All theoretically relevant variables
– Interesting variables identified during preliminary analyses
• Methods of variable selection
– Enter (block): all variables entered simultaneously
– Forward / Backward: sequentially insert OR eliminate variables
– Stepwise: sequentially eliminate AND insert variables
Variable selection
• Automatic variable selection methods (forward, backward, stepwise) are problematic
– Multiple testing
– Capitalises on chance relationships
– Overfits to the sample data and won’t replicate in other samples
• Needs a larger sample
• Cross-validation to check results
Variable selection
• Sequential / stepped / hierarchical entry
– Specify the order that variables should be entered into the model
• Eg:
– Block 1: confounders
– Block 2: intervention variables
• Does the intervention contribute anything in addition to the confounding variables?
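The relationship between unstandardised and standardised coefficients can be sketched for the one-predictor case, where the standardised slope equals the unstandardised slope multiplied by the ratio of the standard deviations. The data and the names `slope` and `standardized_slope` are assumptions for illustration; with several predictors each coefficient is also conditional on the others, which this sketch does not cover.

```python
from statistics import mean, stdev

def slope(x, y):
    """Unstandardised slope b of y on a single predictor x."""
    mean_x, mean_y = mean(x), mean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

def standardized_slope(x, y):
    """Standardised slope: b rescaled by sd(x) / sd(y)."""
    return slope(x, y) * stdev(x) / stdev(y)

# Made-up data: study time recorded in hours, then the same time in minutes.
hours = [1, 2, 3, 4, 5]
score = [10, 30, 20, 50, 40]
minutes = [h * 60 for h in hours]

print("b per hour:    ", slope(hours, score))
print("b per minute:  ", round(slope(minutes, score), 4))
print("beta (hours):  ", round(standardized_slope(hours, score), 4))
print("beta (minutes):", round(standardized_slope(minutes, score), 4))
```

Changing the unit of the predictor changes the unstandardised coefficient but leaves the standardised one untouched, which is what makes standardised coefficients comparable across predictors measured in different units.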
Linear Regression
• Categorical predictors
– Dummy variables needed if more than 2 categories
– Choose a reference category and create 0/1 variables for each other category
– Tests whether Cat X differs significantly in the outcome compared to the reference category
– Other schemes possible for other comparisons (eg, ordinal variables)
Dummy-Variable Example (with 2 Levels) (continued)
• Holiday: $\hat{Y} = b_0 + b_1 X_1 + b_2(1) = (b_0 + b_2) + b_1 X_1$
• No Holiday: $\hat{Y} = b_0 + b_1 X_1 + b_2(0) = b_0 + b_1 X_1$
[Figure: Y (sales) against X1 (Price); two parallel lines with intercepts $b_0 + b_2$ (Holiday) and $b_0$ (No Holiday), showing a different intercept and the same slope]
If H0: β2 = 0 is rejected, then “Holiday” has a significant effect on pie sales
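Dummy coding and the "different intercept, same slope" result can be sketched as follows. The helper names `dummy_code` and `predict_sales`, the category labels and the coefficient values are all hypothetical; they are not estimates from the pie sales example.

```python
def dummy_code(categories, reference):
    """Create one 0/1 variable for every category except the reference."""
    levels = sorted(set(categories) - {reference})
    return {
        level: [1 if c == level else 0 for c in categories]
        for level in levels
    }

def predict_sales(price, holiday, b0, b1, b2):
    """Prediction line with one scale predictor and one 0/1 dummy."""
    return b0 + b1 * price + b2 * holiday

# A three-category predictor needs two dummy variables.
region = ["low", "mid", "high", "low"]
dummies = dummy_code(region, reference="low")
print(dummies)

# Hypothetical coefficients: intercept, price slope, holiday effect.
b0, b1, b2 = 300.0, -30.0, 15.0
for price in (5, 8):
    with_holiday = predict_sales(price, 1, b0, b1, b2)
    no_holiday = predict_sales(price, 0, b0, b1, b2)
    print(price, with_holiday, no_holiday, with_holiday - no_holiday)
```

At every price the two predictions differ by exactly b2, so the dummy variable shifts the intercept while both groups share the slope b1.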
Model fit
• R-square / Adjusted R-square
• No set values for “good” models
• Compare to other models on similar data
Multiple Regression Assumptions
Errors (residuals) from the regression model: