Statistics and Research Methodology
Statistics Part
By Dr. Samir Safi, Associate Professor of Statistics
Spring 2012

Lecture #1
Introduction
Numerical Descriptive Techniques
Types of Statistics
• Statistics: the branch of mathematics that transforms data into useful information for decision makers.
• Descriptive Statistics: collecting, summarizing, and describing data.
• Inferential Statistics: drawing conclusions and/or making decisions concerning a population based only on sample data.
Descriptive Statistics
• Collect data, e.g., a survey
• Present data, e.g., tables and graphs
• Characterize data, e.g., the sample mean = (Σ Xi) / n
Inferential Statistics
Drawing conclusions about a large group of individuals based on a subset of that group.
• Estimation, e.g., estimate the population mean weight using the sample mean weight
• Hypothesis testing, e.g., test the claim that the population mean weight is 120 pounds
Basic Vocabulary of Statistics
VARIABLE: A variable is a characteristic of an item or individual.
DATA: Data are the different values associated with a variable.
POPULATION: A population consists of all the items or individuals about which you want to draw a conclusion.
SAMPLE: A sample is the portion of a population selected for analysis.
PARAMETER: A parameter is a numerical measure that describes a characteristic of a population.
STATISTIC: A statistic is a numerical measure that describes a characteristic of a sample.
Types of Data
Numerical (Quantitative)
• Numerical (quantitative) variables have values that represent quantities
• Values are real numbers
• All calculations are valid
Categorical (Qualitative)
• Categorical (qualitative) variables have values that can only be placed into categories, such as "yes" and "no."
Types of Data (continued)
Categorical data can be classified into:
Ordinal
• Values represent the ranked order of the data
• Calculations based on an ordering process are valid
Nominal
• Values are arbitrary numbers that represent categories
• Only calculations based on the frequencies of occurrence are valid
Types of Data
Data is divided into Categorical and Numerical; Numerical is further divided into Discrete and Continuous.
• Categorical examples: marital status, political party, eye color (defined categories)
• Discrete examples: number of children, defects per hour (counted items)
• Continuous examples: weight, age (measured characteristics)
Numerical Descriptive Techniques
• Measures of Central Tendency (Location): mean, median
• Measures of Variability (Spread): quartiles (quantiles), variance and standard deviation
Measures of Central Tendency: The Mean
• The arithmetic mean (often just called the "mean") is the most common measure of central tendency
• For a sample of size n:

    X̄ = (Σ from i=1 to n of Xi) / n = (X1 + X2 + ⋯ + Xn) / n

where X̄ (pronounced "x-bar") is the sample mean, n is the sample size, and Xi is the ith observed value.
Measures of Central Tendency: The Median
• In an ordered array, the median is the "middle" number (50% above, 50% below)
• Not affected by extreme values
[Two dot plots over the scale 0 1 2 3 4 5 6 7 8 9 10: with and without an extreme value, Median = 3 in both cases]
Measures of Central Tendency: The Mode
• Value that occurs most often
• Not affected by extreme values
• Used for either numerical or categorical data
• There may be no mode
• There may be several modes
[Dot plots: one data set on the scale 0 to 14 with Mode = 9; one on the scale 0 to 6 with no mode]
Mean, Median, and Mode: Which Statistic for Which Type of Data
• Mean: interval ("numerical") data without extreme observations
• Median: ordinal or interval data with extreme observations
• Mode: nominal, ordinal, or interval data
In The Presence Of Outliers
Q: Do outliers affect the mean and the median?
Consider the list of numbers from 1 through 9: 1, 2, 3, 4, 5, 6, 7, 8, 9
The mean is 5. The median is 5.
What if we put the number 100 at the end of the list: 1, 2, 3, 4, 5, 6, 7, 8, 9, 100
The mean is 14.5. The median is 5.5.
A: Outliers affect the mean much more than the median!
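The comparison above can be reproduced with Python's standard `statistics` module (a small sketch for illustration; the course itself uses SPSS):

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
mean_clean, median_clean = statistics.mean(data), statistics.median(data)

with_outlier = data + [100]                 # append one extreme value
mean_out, median_out = statistics.mean(with_outlier), statistics.median(with_outlier)

print(mean_clean, median_clean)             # 5 5
print(mean_out, median_out)                 # 14.5 5.5
```

The single outlier drags the mean from 5 up to 14.5 while the median barely moves (5 to 5.5), which is exactly the point of the slide.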
Measures of Central Tendency: Which Measure to Choose? The mean is generally used, unless extreme values (outliers) exist. The median is often used, since the median is not sensitive to extreme values. For example, median home prices may be reported for a region; the median is less sensitive to outliers. In some situations it makes sense to report both the mean and the median.
Measures of Variation
Measures of variation give information on the spread (variability, dispersion) of the data values: the range, variance, standard deviation, and coefficient of variation. Two data sets can have the same center but different variation.
Measures of Variation: The Range
• Simplest measure of variation
• Difference between the largest and the smallest values:

    Range = Xlargest - Xsmallest

Example: for data whose smallest value is 1 and largest value is 13, Range = 13 - 1 = 12
Measures of Variation: The Variance
• Average (approximately) of squared deviations of values from the mean
• Sample variance:

    S² = (Σ from i=1 to n of (Xi - X̄)²) / (n - 1)

where X̄ = arithmetic mean, n = sample size, and Xi = ith value of the variable X.
Measures of Variation: The Standard Deviation
• Most commonly used measure of variation
• Shows variation about the mean
• Is the square root of the variance
• Has the same units as the original data
• Sample standard deviation:

    S = sqrt( (Σ from i=1 to n of (Xi - X̄)²) / (n - 1) )
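The two formulas above can be implemented directly and checked against the standard library (a minimal sketch; the data values are hypothetical):

```python
import math
import statistics

data = [4, 8, 6, 2]                      # small illustrative sample (hypothetical values)
n = len(data)
xbar = sum(data) / n                     # arithmetic mean

# Sample variance: sum of squared deviations from the mean, divided by n - 1
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)
s = math.sqrt(s2)                        # sample standard deviation

print(s2, s)
```

Here the mean is 5, the squared deviations sum to 20, so S² = 20/3 ≈ 6.667 and S ≈ 2.582, matching `statistics.variance` and `statistics.stdev`.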
Measures of Variation: Summary Characteristics The more the data are spread out, the greater the range, variance, and standard deviation. The more the data are concentrated, the smaller the range, variance, and standard deviation. If the values are all the same (no variation), all these measures will be zero. None of these measures are ever negative.
Quartile Measures
• Quartiles split the ranked data into 4 segments with an equal number of values (25%) per segment: Q1, Q2, Q3
• The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
• Q2 is the same as the median (50% of the observations are smaller and 50% are larger)
• Only 25% of the observations are greater than the third quartile, Q3
Quartile Measures: The Interquartile Range (IQR) • The IQR is Q3 - Q1 and measures the spread in the middle 50% of the data • The IQR is also called the midspread because it covers the middle 50% of the data • The IQR is a measure of variability that is not influenced by outliers or extreme values • Measures like Q1, Q3, and IQR that are not influenced by outliers are called resistant measures
The Five Number Summary
The five numbers that help describe the center, spread, and shape of data are:
Xsmallest, First Quartile (Q1), Median (Q2), Third Quartile (Q3), Xlargest
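A sketch of the five number summary and IQR in Python. Note that quartile conventions differ between software packages; `statistics.quantiles` with `method='inclusive'` matches the common textbook positions for this small ordered example:

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]           # already ordered

# Quartiles; method='inclusive' interpolates between data points
# (different software uses slightly different quartile conventions)
q1, q2, q3 = statistics.quantiles(data, n=4, method='inclusive')

five_number = (min(data), q1, q2, q3, max(data))
iqr = q3 - q1                                 # spread of the middle 50% of the data
print(five_number, iqr)
```

For these nine values the summary is (1, 3, 5, 7, 9) with IQR = 4; Q2 agrees with `statistics.median(data)`.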
Which Measure To Use?
Q: When is the mean better than the median? When is the five number summary better than the standard deviation?
Rules of Thumb
A1: If outliers appear, or if your distribution is skewed, then the mean could be affected, so use the median and the five number summary.
A2: If the distribution is reasonably symmetric and free of outliers, then the mean and standard deviation should be used.
Lecture #2
Introduction to Statistical Inference
• Introduction
• T-Tests
Introduction to Statistical Inference
Statistical inference involves using data collected in a sample to make statements (inferences) about unknown population parameters. Two types of statistical inference are:
• estimation of parameters (point and confidence estimation);
• statistical tests about parameters (hypothesis testing).
Tests of Significance
There are two common types of formal statistical inference:
1) Confidence intervals: appropriate when our goal is to estimate a population parameter.
2) Hypothesis testing: to assess the evidence provided by the data in favor of some claim about the population.
General Terms: What is a Hypothesis?
• A hypothesis is a claim (assertion) about a population parameter:
– population mean. Example: the mean monthly cell phone bill in this city is µ = $42
– population proportion. Example: the proportion of adults in this city with cell phones is π = 0.68
General Terms: Hypotheses
The null hypothesis, denoted by H0, is a conjecture about a population parameter that is presumed to be true. It is usually a statement of no effect or no change.
Example: The population is all students taking the Research Methodology course. The parameter of interest is the mean Research Methodology score (µ = mean Methodology score). Suppose we believe that the mean Methodology score is 75. Then H0: µ = 75.
The Null Hypothesis, H0 (continued)
• Begin with the assumption that the null hypothesis is true, similar to the notion of innocent until proven guilty
• Always contains the "=", "≤", or "≥" sign
• May or may not be rejected
General Terms and Characteristics
The alternative (or research) hypothesis, denoted by Ha or H1, is a conjecture about a population parameter that the researcher suspects or hopes is true.
Example: A new course has been developed which will hopefully improve students' scores on the Research Methodology. We want to test to see if there is an improvement. Then Ha: µ > 75.
The Alternative Hypothesis, H1 (continued)
• Is the opposite of the null hypothesis, e.g., the average number of TV sets in Gaza homes is not equal to 2 (H1: µ ≠ 2)
• Never contains the "=", "≤", or "≥" sign
• May or may not be proven
• Is generally the hypothesis that the researcher is trying to prove
General Terms: Hypotheses
The research hypothesis Ha will contain either a greater than sign, a less than sign, or a not equal to sign.
Greater than (>): results if the problem says increases, improves, better, result is higher, etc.
Less than (<): results if the problem says decreases, reduces, worse than, result is lower, etc.
Not equal to (≠): results if the problem says different from, no longer the same, changes, etc.
General Terms: Decision
When we carry out the test we assume the null hypothesis is true. Hence the test will result in one of two decisions.
(i) Reject H0: we have sufficient evidence to conclude that the alternative hypothesis is true. Such a test is said to be significant.
(ii) Fail to reject H0: we do not have sufficient evidence to conclude that the alternative hypothesis is true. Such a test is said to be not significant.
General Terms: Decision
If a significance level α is specified, we make a decision about the significance of the test by comparing the Sig. (p-value) directly to α.
If Sig. < α, then we reject H0 and conclude that there is sufficient evidence in favor of the alternative hypothesis.
If Sig. > α, then we fail to reject H0 and conclude that there is not sufficient evidence in favor of the alternative hypothesis.
Typical choices are α = 0.01, 0.02, …, 0.05. Small Sig. values favor the alternative hypothesis.
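The comparison rule can be sketched as a tiny Python helper (the function name and structure are my own, for illustration; SPSS reports the Sig. value directly):

```python
def decide(sig: float, alpha: float = 0.05) -> str:
    """Compare the reported Sig. (p-value) with the significance level alpha."""
    return "reject H0" if sig < alpha else "fail to reject H0"

# Illustrative values at alpha = 0.05
print(decide(0.019))        # reject H0
print(decide(0.380))        # fail to reject H0
```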
General Testing Procedure
1. State the null and alternative hypothesis.
2. Carry out the experiment, collect the data, verify the assumptions.
3. Compute the value of the test statistic and Sig. by SPSS.
4. Make a decision on the significance of the test (reject or fail to reject H0).
5. Make a conclusion statement in the words of the original problem.
Nonparametric Tests
For T-tests and ANOVAs:
1) the dependent variable has to be a continuous, numeric variable;
2) the assumptions of these tests are that the variables are normally distributed and the populations have equal variances.
Non-parametric tests
• Tests that do not make assumptions about the population distribution. These tests are also called distribution-free tests.
• Common situations that result in non-normal distributions: 1) skewed distributions; 2) significant outliers.
• Data measured on any scale (ratio or interval, ordinal "ranks", or nominal)
• Examples: Mann-Whitney, Wilcoxon, Kruskal-Wallis, Friedman, …
Choosing Between Parametric and Nonparametric Tests
• Definitely choose a parametric test if you are sure that your data were sampled from a population that follows a Normal distribution (at least approximately).
• Definitely select a nonparametric test if the outcome is a rank or a score and the population is clearly not Normal, e.g., class ranking of students, or a Likert scale: Strongly disagree (1), Disagree (2), Neutral (3), Agree (4), Strongly agree (5).
Normality Test
• We can examine the normality assumption both graphically and by using a formal statistical test.
• The Kolmogorov-Smirnov and Shapiro-Wilk tests assess whether there is a significant departure from normality in the population distribution for the data of interest.
• The Lilliefors correction is appropriate when the mean and variance are estimated from the data.
• The Shapiro-Wilk test is more sensitive to outliers in the data and is computed only when the sample size is less than 50.
• The Kolmogorov-Smirnov test is used when the sample size is at least 50.
• For a normality test, the null hypothesis states that the population distribution is normal.
Normality Test (continued)
• Both tests indicate departure from Normality (p = 0.018 and p = 0.031 for the two tests, respectively).

Tests of Normality (vitamin)
  Kolmogorov-Smirnov (a): Statistic = .189, df = 26, Sig. = .018
  Shapiro-Wilk: Statistic = .913, df = 26, Sig. = .031
a. Lilliefors Significance Correction
One Sample T-Test Example: Banana Prices
The average retail price for bananas in 1998 was 51¢ per pound, as reported by the U.S. Department of Agriculture in Food Cost Review. Recently, a random sample of 15 markets gave the following prices for bananas in cents per pound:
56 53 55 53 50 57 58 54 48 47 57 57 51 55 50
At the 0.05 level, can you conclude that the current mean retail price for bananas is different from the 1998 mean of 51¢ per pound?
SPSS Output
One-Sample Statistics
  Banana's price: N = 15, Mean = 53.4000, Std. Deviation = 3.50102, Std. Error Mean = .90396
One-Sample Test (Test Value = 51)
  Banana's price: t = 2.655, df = 14, Sig. (2-tailed) = .019, Mean Difference = 2.40000, 95% Confidence Interval of the Difference: Lower = .4612, Upper = 4.3388
Sig. (P-value) = 0.019
Decision: Reject H0
Conclusion: There is sufficient evidence that the current mean retail price for bananas is different from the 1998 mean of 51¢ per pound.
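The SPSS numbers above can be reproduced by hand from the t-statistic formula (a pure-Python sketch; SPSS remains the course tool and also supplies the p-value):

```python
import math
import statistics

prices = [56, 53, 55, 53, 50, 57, 58, 54, 48, 47, 57, 57, 51, 55, 50]
mu0 = 51                                    # hypothesized 1998 mean (cents per pound)

n = len(prices)
xbar = statistics.mean(prices)              # sample mean
s = statistics.stdev(prices)                # sample standard deviation
se = s / math.sqrt(n)                       # standard error of the mean
t = (xbar - mu0) / se                       # one-sample t statistic, df = n - 1

print(round(xbar, 4), round(s, 5), round(se, 5), round(t, 3))
```

This recovers the table exactly: mean 53.4, standard deviation 3.50102, standard error 0.90396, and t = 2.655 on 14 degrees of freedom.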
Comparing Two Means, Related Samples: T-Test Example 1
The water diet requires one to drink two cups of water every half hour from when one gets up until one goes to bed, but otherwise allows one to eat whatever one likes. Four adult volunteers agree to test the diet. They are weighed prior to beginning the diet and after six weeks on the diet. The weights (in pounds) are:

Person   Weight before the diet   Weight after six weeks
1        180                      170
2        125                      130
3        240                      215
4        150                      152

For the population of all adults, assume that the weight loss after six weeks on the diet (weight before beginning the diet - weight after six weeks on the diet) is normally distributed. Does the diet lead to weight loss?
SPSS Output
Paired Samples Statistics (Pair 1)
  Weight before (pound): Mean = 173.7500, N = 4, Std. Deviation = 49.56057, Std. Error Mean = 24.78028
  Weight after (pound):  Mean = 166.7500, N = 4, Std. Deviation = 36.08670, Std. Error Mean = 18.04335
Paired Samples Test (Pair 1: Weight before (pound) - Weight after (pound))
  Mean = 7.00000, Std. Deviation = 13.63818, Std. Error Mean = 6.81909,
  95% Confidence Interval of the Difference: Lower = -14.7014, Upper = 28.70139,
  t = 1.027, df = 3, Sig. (2-tailed) = .380
Sig. (P-value) = 0.380
Decision: Do not reject H0
Conclusion: There is not a significant change in the weight before and after the diet.
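The paired test works on the differences, so the SPSS figures above follow from a one-sample t computation on those differences (a stdlib sketch for illustration):

```python
import math
import statistics

before = [180, 125, 240, 150]
after = [170, 130, 215, 152]

diffs = [b - a for b, a in zip(before, after)]   # weight before - weight after
n = len(diffs)
dbar = statistics.mean(diffs)                    # mean difference
sd = statistics.stdev(diffs)                     # std deviation of the differences
t = dbar / (sd / math.sqrt(n))                   # paired t statistic, df = n - 1

print(dbar, round(sd, 5), round(t, 3))
```

The differences are 10, -5, 25, -2, giving a mean difference of 7, a standard deviation of 13.63818, and t = 1.027 on 3 degrees of freedom, matching the table.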
Example 2
A company wanted to know if attending a course on "how to be a successful salesperson" can increase the average sales of its employees. The company sent six of its salespersons to attend this course. The following table gives the weekly sales of these salespersons before and after they attended this course.

Before   12 18 25  9 14 16
After    18 24 24 14 19 20

Using the 1% significance level, can you conclude that the mean weekly sales for all salespersons increase as a result of attending this course? Assume that the population of paired differences has a normal distribution.
SPSS Output
Paired Samples Statistics (Pair 1)
  Before: Mean = 15.4868, N = 302, Std. Deviation = 4.06322, Std. Error Mean = .23381
  After:  Mean = 19.9536, N = 302, Std. Deviation = 2.86307, Std. Error Mean = .16475
Paired Samples Test (Pair 1: Before - After)
  Mean = -4.46689, Std. Deviation = 1.99265, Std. Error Mean = .11466,
  95% Confidence Interval of the Difference: Lower = -4.69253, Upper = -4.24124,
  t = -38.956, df = 301, Sig. (2-tailed) = .000
Sig. (P-value) = 0.000
Decision: Reject H0
Conclusion: There is sufficient evidence that the mean weekly sales increase as a result of attending this course.

Wilcoxon Signed Ranks Test Example: workers' mood in silence vs. with background music
Test Statistics (b)
  silence - music: Z = -1.357 (a), Asymp. Sig. (2-tailed) = .175
a. Based on positive ranks.
b. Wilcoxon Signed Ranks Test
Wilcoxon Test - Interpretation
Null Hypothesis: Background music does not affect the mood of factory workers.
P-Value = 0.175
Decision: Do not reject at α = .05
Conclusion: Workers' mood appears to be unaffected by the presence or absence of background music.
The Kruskal-Wallis Test
• Used to test differences between three or more treatment conditions, using a separate group for every treatment
• The Kruskal-Wallis test investigates differences between three or more separate samples by combining all the sample scores and giving them an overall rank.
• Requirement: the data must be independent samples from populations with the same shape (but not necessarily normal).
Kruskal-Wallis Test - Example
Investigating estimates of duration for three different tempos of classical music: slow, medium, and fast. Three separate groups of participants each estimated the length (in seconds) of one piece of music. Each piece lasted for 45 seconds.

Group 1: Slow   Group 2: Medium   Group 3: Fast
44              39                33
25              26                29
35              35                20
51              34                24
32              40                36
45              22                21
38              27                15
37              28                19
41              31                23
47              37                18
Kruskal-Wallis Example Test - SPSS Output
Ranks (length, by groups)
  Slow:   N = 10, Mean Rank = 22.70
  Medium: N = 10, Mean Rank = 15.90
  Fast:   N = 10, Mean Rank = 7.90
  Total:  N = 30
Test Statistics (a,b) (length)
  Chi-Square = 14.169, df = 2, Asymp. Sig. = .001
a. Kruskal Wallis Test
b. Grouping Variable: groups
Kruskal-Wallis Example Test - Interpretation
Null Hypothesis: There is no difference between the mean estimated lengths for the three tempos of classical music: slow, medium, and fast.
P-Value = 0.001
Decision: Reject at α = .05
Conclusion: There was a significant difference in the mean estimates given for the three pieces. The mean rank for the slow classical piece was 22.7, with 15.9 for the medium piece and 7.9 for the fast piece. The mean estimated length for the slow classical piece is statistically greater than for the medium and fast classical pieces.
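The SPSS mean ranks and chi-square value above can be reproduced in pure Python. This is a sketch of the Kruskal-Wallis H statistic with the usual tie correction, using the duration data from the example:

```python
from collections import Counter

# Duration estimates (seconds) for the three tempos, from the example
slow = [44, 25, 35, 51, 32, 45, 38, 37, 41, 47]
medium = [39, 26, 35, 34, 40, 22, 27, 28, 31, 37]
fast = [33, 29, 20, 24, 36, 21, 15, 19, 23, 18]
groups = [slow, medium, fast]

# Rank all N observations together; tied values share their average rank
pooled = sorted(x for g in groups for x in g)
rank = {}
i = 0
while i < len(pooled):
    j = i
    while j < len(pooled) and pooled[j] == pooled[i]:
        j += 1
    rank[pooled[i]] = (i + 1 + j) / 2        # average of ranks i+1 .. j
    i = j

N = len(pooled)
mean_ranks = [sum(rank[x] for x in g) / len(g) for g in groups]

# Kruskal-Wallis H, then divide by the tie-correction factor
h = 12 / (N * (N + 1)) * sum(sum(rank[x] for x in g) ** 2 / len(g) for g in groups) - 3 * (N + 1)
ties = Counter(pooled)
h /= 1 - sum(t ** 3 - t for t in ties.values()) / (N ** 3 - N)

print(mean_ranks, round(h, 3))
```

This gives mean ranks 22.7, 15.9, and 7.9 and H = 14.169, matching the SPSS chi-square value on 2 degrees of freedom.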
Friedman ANOVA Test
• Used to test whether k related samples could probably have come from the same population with respect to mean rank.
Ranks (N = 503)
  Social Worker: Mean Rank = 1.88
  Doctor:        Mean Rank = 2.45
  Lawyer:        Mean Rank = 1.68
Test Statistics (a)
  Chi-Square = 204.241, df = 2, Asymp. Sig. = .000
a. Friedman Test
P-Value = 0.000
Decision: Reject at α = .05
Conclusion: There was a significant difference in the mean ranks. Doctors have the highest mean rank.
Kolmogorov-Smirnov (K-S) Test
• Compares the distribution of a variable with a uniform, normal, Poisson, or exponential distribution.
• Null hypothesis: the observed values were sampled from a distribution of that type.
K-S Test Example - SPSS Output
P-Value = 0.000
Decision: Reject at α = .05
Conclusion: The distribution of the data is NOT normally distributed at the 0.05 level of significance.
Lecture #5
Measures of Relationship
• Correlation Coefficient
• Simple Linear Regression
• Chi-Square Tests
Measures of Relationship
• If we are doing a study which involves more than one variable, how can we tell if there is a relationship between two (or more) of the variables?
• Dependent (Response) Variable: a dependent variable measures an outcome of a study.
• Independent (Explanatory) Variable: an independent variable explains or causes changes in the response variable.
Correlation Coefficient
• Correlation analysis is used to measure the strength of the association (linear relationship) between two numerical variables.
• A scatter plot can be used to show the relationship between two variables; it displays the form, direction, and strength of the relationship.
Correlation Coefficient
• Correlation is usually denoted by r, and is a number between -1 and 1.
• The +/- sign denotes a positive or negative association.
• The numeric value shows the strength: if the relationship is strong, then r will be close to 1 or -1; if it is weak, then r will be close to 0.
Question: Will outliers affect the correlation? YES
Measuring Relationship
Three types:
• Pearson correlation coefficient: for numerical data that is normally distributed.
• Spearman correlation coefficient: for numerical data that is not normally distributed, or for ordinal data.
• Chi-square test of independence: at least one variable is nominal data and the other is either a) ordinal data or b) numerical data coded as categorical data.
Features of the Coefficient of Correlation
• The sample coefficient of correlation has the following features:
– Unit free
– Ranges between -1 and 1
– The closer to -1, the stronger the negative linear relationship
– The closer to 1, the stronger the positive linear relationship
– The closer to 0, the weaker the linear relationship
Scatter Plots of Sample Data with Various Coefficients of Correlation
[Scatter plots of Y against X illustrating r = -1, r = -.6, r = 0, r = +.3, and r = +1]
Correlation Coefficient Example: Real estate agent
• A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)
• A random sample of 10 houses is selected
– Dependent variable (Y) = house price in $1000s
– Independent variable (X) = square feet
Correlation - Example: Data (continued)

House Price in $1000s (Y)   Square Feet (X)
245                         1400
312                         1600
279                         1700
308                         1875
199                         1100
219                         1550
405                         2350
324                         2450
319                         1425
255                         1700
Correlation Example: Scatter Plot (continued)
[Scatter plot of the house price model: House Price ($1000s) vs Square Feet for the 10 sampled houses]
SPSS Output (continued)
• Pearson Correlation Coefficient = 0.762, Sig. (P-value) = 0.010
• Decision: Reject H0: there is no significant relationship
• Conclusion: There is sufficient evidence of a positive relationship between the selling price of a home and its size at α = 0.05
SPSS Output (continued)
• Spearman Correlation Coefficient = 0.705, Sig. (P-value) = 0.023
• Decision: Reject H0: there is no significant relationship
• Conclusion: There is sufficient evidence of a positive relationship between the selling price of a home and its size at α = 0.05
Introduction to Regression Analysis
• Regression analysis is used to:
– Predict the value of a dependent variable based on the value of at least one independent variable
– Explain the impact of changes in an independent variable on the dependent variable
Dependent variable: the variable we wish to predict or explain.
Independent variable: the variable used to predict or explain the dependent variable.
Simple Linear Regression Model
• Only one independent variable, X
• Relationship between X and Y is described by a linear function
• Changes in Y are assumed to be related to changes in X
Types of Relationships
[Scatter plots of Y against X illustrating linear relationships and curvilinear relationships]
Types of Relationships (continued)
[Scatter plots of Y against X illustrating strong relationships and weak relationships]

Types of Relationships (continued)
[Scatter plots of Y against X illustrating no relationship]
Simple Linear Regression Model

    Yi = β0 + β1 Xi + εi

where Yi is the dependent variable, Xi is the independent variable, β0 is the population Y intercept, β1 is the population slope coefficient, and εi is the random error term. β0 + β1 Xi is the linear component and εi is the random error component.
Simple Linear Regression Model (continued)
[Diagram of the line Yi = β0 + β1 Xi + εi: the observed value of Y for Xi, the predicted value of Y for Xi, the random error εi for this Xi value, intercept = β0, and slope = β1]
Simple Linear Regression Equation (Prediction Line)
The simple linear regression equation provides an estimate of the population regression line:

    Ŷi = b0 + b1 Xi

where Ŷi is the estimated (or predicted) Y value for observation i, b0 is the estimate of the regression intercept, b1 is the estimate of the regression slope, and Xi is the value of X for observation i.
Finding the Equation
• The coefficients b0 and b1, and other regression results in this chapter, will be found using SPSS.
• Formulas are shown in the text for those who are interested.
Interpretation of the Slope and the Intercept
• b0 is the estimated mean value of Y when the value of X is zero
• b1 is the estimated change in the mean value of Y as a result of a one-unit change in X
Simple Linear Regression Example
• A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)
• A random sample of 10 houses is selected
– Dependent variable (Y) = house price in $1000s
– Independent variable (X) = square feet
Simple Linear Regression Example: Data

House Price in $1000s (Y)   Square Feet (X)
245                         1400
312                         1600
279                         1700
308                         1875
199                         1100
219                         1550
405                         2350
324                         2450
319                         1425
255                         1700
Simple Linear Regression Example: Scatter Plot
[Scatter plot of the house price model: House Price ($1000s) vs Square Feet]
Example: Interpretation of b0
house price = 98.24833 + 0.10977 (square feet)
• b0 is the estimated mean value of Y when the value of X is zero (if X = 0 is in the range of observed X values)
• Because a house cannot have a square footage of 0, b0 has no practical application here.
Example: Interpreting b1
house price = 98.24833 + 0.10977 (square feet)
• b1 estimates the change in the mean value of Y as a result of a one-unit increase in X.
– Here, b1 = 0.10977 tells us that the mean value of a house increases by 0.10977($1000) = $109.77, on average, for each additional square foot of size.
Example: Making Predictions
Predict the price for a house with 2000 square feet:
house price = 98.25 + 0.1098 (sq. ft.) = 98.25 + 0.1098(2000) = 317.85
The predicted price for a house with 2000 square feet is 317.85($1000s) = $317,850.
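The fitted coefficients and the prediction can be reproduced from the data with the least-squares formulas (a stdlib sketch; note the slide's 317.85 comes from using the rounded coefficients 98.25 and 0.1098, while full precision gives about 317.78):

```python
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]            # Y: $1000s
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]   # X: square feet

n = len(price)
xbar = sum(sqft) / n
ybar = sum(price) / n

# Least-squares slope and intercept: b1 = Sxy / Sxx, b0 = ybar - b1 * xbar
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(sqft, price)) / \
     sum((x - xbar) ** 2 for x in sqft)
b0 = ybar - b1 * xbar

pred = b0 + b1 * 2000    # predicted price for a 2000 sq. ft. house, in $1000s
print(round(b0, 5), round(b1, 5), round(pred, 2))
```

This recovers b0 = 98.24833 and b1 = 0.10977, the coefficients quoted on the interpretation slides.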
Example: Making Predictions
• When using a regression model for prediction, only predict within the relevant range of data.
[Scatter plot: the relevant range for interpolation spans the observed Square Feet values]
• Do not try to extrapolate beyond the range of observed X's.
Coefficient of Determination, r²
• The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable.
• The coefficient of determination is also called r-squared and is denoted r².
Note: 0 ≤ r² ≤ 1
Example: Coefficient of Determination, r²
58.08% of the variation in house prices is explained by variation in square feet.
Chi-Square Test of Independence
• The chi-square statistic is typically used whenever we are interested in examining the relationship between two categorical variables summarized in a two-way table with r rows and c columns.
H0: The two categorical variables are independent (not related), i.e., there is no relationship between them.
H1: The two categorical variables are dependent (related), i.e., there is a relationship between them.
(Assumption: at least 80% of the cells in the contingency table have an expected frequency of at least 5.)
Reject H0 if p-value < α; this means that the two variables are dependent (related).
Example (1)
• The meal plan selected by 200 students is shown below:

Class Standing   20/week   10/week   none   Total
Fresh.           24        32        14     70
Soph.            22        26        12     60
Junior           10        14        6      30
Senior           14        16        10     40
Total            70        88        42     200
Example (1) (continued)
• The hypothesis to be tested is:
H0: Meal plan and class standing are independent (i.e., there is no relationship between them)
H1: Meal plan and class standing are dependent (i.e., there is a relationship between them)
Example (1): Expected Cell Frequencies (continued)
Observed frequencies:

Class Standing   20/wk   10/wk   none   Total
Fresh.           24      32      14     70
Soph.            22      26      12     60
Junior           10      14      6      30
Senior           14      16      10     40
Total            70      88      42     200

Expected cell frequencies if H0 is true:

    fe = (row total x column total) / n

Example for one cell: fe = 30 x 70 / 200 = 10.5

Class Standing   20/wk   10/wk   none   Total
Fresh.           24.5    30.8    14.7   70
Soph.            21.0    26.4    12.6   60
Junior           10.5    13.2    6.3    30
Senior           14.0    17.6    8.4    40
Total            70      88      42     200
Example: The Test Statistic (continued)
• The test statistic value is:

    χ²STAT = Σ over all cells of (fo - fe)² / fe
           = (24 - 24.5)²/24.5 + (32 - 30.8)²/30.8 + ⋯ + (10 - 8.4)²/8.4
           = 0.709

• The critical value is χ²(0.05) = 12.592, from the chi-squared distribution with (4 - 1)(3 - 1) = 6 degrees of freedom. Since 0.709 < 12.592, we do not reject H0.
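The expected frequencies and the test statistic can be computed directly from the observed table (a stdlib sketch of the formulas on the two slides above):

```python
# Observed counts: rows = Fresh., Soph., Junior, Senior; cols = 20/wk, 10/wk, none
observed = [
    [24, 32, 14],
    [22, 26, 12],
    [10, 14, 6],
    [14, 16, 10],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency for each cell: (row total x column total) / n
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

chi2 = sum((o - e) ** 2 / e
           for orow, erow in zip(observed, expected)
           for o, e in zip(orow, erow))

print(round(chi2, 3))    # 0.709
```

The Junior/20-per-week cell gives the slide's worked example, 30 x 70 / 200 = 10.5, and the statistic comes out at 0.709, well below the critical value 12.592.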
SPSS Output
[SPSS crosstab and chi-square output for Example (1)]
Example (2): Are avid readers more likely to wear glasses than those who read less frequently? 300 men in the Korean army were selected at random and characterized as to whether they wore glasses and whether the amount of reading they did was above average, average, or below average. The results are presented in the following table.

Amount of Reading   Wear Glasses: Yes   No
Above Average       47                  26
Average             48                  80
Below Average       31                  70

Test the null hypothesis that there is no association between the amount of reading you do and whether you wear glasses.
SPSS Output
Amount of Reading * Wear Glasses Crosstabulation (counts, with % within Wear Glasses)

Amount of Reading   Yes            No             Total
Above Average       47 (37.3%)     26 (14.8%)     73 (24.2%)
Average             48 (38.1%)     80 (45.5%)     128 (42.4%)
Below Average       31 (24.6%)     70 (39.8%)     101 (33.4%)
Total               126 (100.0%)   176 (100.0%)   302 (100.0%)

Chi-Square Tests
  Pearson Chi-Square: Value = 21.409 (a), df = 2, Asymp. Sig. (2-sided) = .000
  Likelihood Ratio: Value = 21.354, df = 2, Asymp. Sig. (2-sided) = .000
  Linear-by-Linear Association: Value = 18.326, df = 1, Asymp. Sig. (2-sided) = .000
  N of Valid Cases = 302
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 30.46.

Chi-Square = 21.409, Sig. (P-value) = 0.000
Decision: Reject H0
Conclusion: There is a significant association between the amount of reading and wearing glasses.