Journal of Applied Statistics

ISSN: 0266-4763 (Print) 1360-0532 (Online) Journal homepage: http://www.tandfonline.com/loi/cjas20

Application of multinomial logistic regression to educational factors of the 2009 General Household Survey in South Africa

Simon Monyai, 'Maseka Lesaoana, Timotheus Darikwa & Philimon Nyamugure

To cite this article: Simon Monyai, 'Maseka Lesaoana, Timotheus Darikwa & Philimon Nyamugure (2015): Application of multinomial logistic regression to educational factors of the 2009 General Household Survey in South Africa, Journal of Applied Statistics, DOI: 10.1080/02664763.2015.1077941

To link to this article: http://dx.doi.org/10.1080/02664763.2015.1077941

Published online: 06 Oct 2015.


Journal of Applied Statistics, 2015 http://dx.doi.org/10.1080/02664763.2015.1077941

Downloaded by [Nat'l Univ of Sci & Tech] at 23:29 27 October 2015

Application of multinomial logistic regression to educational factors of the 2009 General Household Survey in South Africa

Simon Monyai (a), 'Maseka Lesaoana (a), Timotheus Darikwa (a) and Philimon Nyamugure (b)*

(a) Department of Statistics and Operations Research, University of Limpopo, Polokwane, South Africa; (b) Department of Statistics and Operations Research, National University of Science and Technology, Bulawayo, Zimbabwe

(Received 16 September 2014; accepted 27 July 2015)

This paper combines factor analysis and multinomial logistic regression (MLR) to examine the relationship between extracted factors of quality of life pertaining to education and variables of five key areas of the levels of development in the context of the South African 2009 General Household Survey. MLR was used to analyse the educational factors identified by factor analysis and to determine the extent to which these factors affect educational level outcomes across South Africa. The overall classification accuracy rate was 73.0%, which is greater than the proportional by chance accuracy criterion of 57.0%. The model therefore improves on the by chance accuracy rate by 25.0% or more, so the criterion for classification accuracy is satisfied and the model is adequate. The evidence is that being historically disadvantaged, absence of parental care, violence in schools and the perception that fees were too high generally have a negative influence on educational attainment. The results of this paper compare well with other household surveys conducted by other researchers.

Keywords: factors; variables; household survey; multinomial logistic regression

1. Introduction

The General Household Survey (GHS) is a household survey that has been conducted annually by Statistics South Africa (Stats SA) since 2002. The survey is designed to determine the level of development in South Africa and to measure the performance of programmes and projects on a regular basis. It also compares multiple facets of the living conditions of South African households and the quality of service delivery in a number of key service areas [31]. The GHS covers six key service sectors, namely: education; health; social development; housing; household access to services and facilities; and food security and agriculture. Data on 185 variables across the six core areas have been collected through the GHS. In this paper the researchers used the unobserved variables identified by factor analysis (see [24]) to explore the variables that have an impact on the five key areas of development in South Africa. Multinomial logistic regression (MLR) was used to explore relevant factors that have an effect on educational level.

Quality of life (QoL) consists of different components [5]. The components of interest in this study are the objective and subjective indicators as outlined by Brown et al. [6]. According to [6], the objective indicators include the standard of living, health and longevity, housing standards and neighbourhood characteristics. In addition, Brown et al. [6] stated that subjective indicators include life satisfaction and psychological well-being, morale, individual satisfaction and happiness, among others.

MLR is used when the dependent or outcome variable has more than two categories [11]. It classifies subjects or outcomes based on the values of a set of predictors. Unlike multiple linear regression, MLR does not assume normality, linearity or homoscedasticity of the data. MLR was used in this study to analyse the identified QoL factors related to the selected five key areas. It was also used to determine the extent to which the relationship between variables in the GHS varies across the whole of South Africa.

This paper is divided into five sections. Section 1 introduces the paper; Section 2 looks at some related literature; Section 3 describes the methodology used; Section 4 describes how the data were analysed; and Section 5 concludes and gives some recommendations to the South African government.

*Corresponding author. Email: [email protected]

© 2015 Taylor & Francis

2. Literature review

Tesfazghi et al. [34] presented a case study in which urban QoL was measured at small scale for Kirkos sub-city of Addis Ababa, Ethiopia. The researchers addressed the variability of QoL at small scale and the relationship between subjective and objective QoL, which according to them is not well known. They applied factor analysis to secondary household survey data to establish an index of objective QoL, that is, the external conditions of life or observable facts derived from secondary data such as level of education, household characteristics, crime and others. Their study was useful when creating an index of QoL for the GHS using factor scores and when testing the significance of the factors extracted from the GHS 2009 data. The study established that regression upon factor analysis permits systematic analysis of data in situations where there is multicollinearity or singularity.

Education is defined by Maliki et al. [19] as the cornerstone of social development and a principal means of improving the character and pace of individual welfare. The study by Maliki et al. [19] showed that higher levels of education and enlarged access lead to productivity gains and more income. This, according to [19], resulted in reduced inequality and poverty. A study conducted by Hassan et al. [10] showed that the following unobserved factors affected learning style: (1) students’ attitude before and after class; (2) strategies used to comprehend the lecture; (3) the importance of a lecture; (4) class size and its condition; (5) effort put in by students outside class; (6) classroom convenience; and (7) importance of listening to the lecture. Majors and Sedlacek [18] used factor analysis to identify factors that affect students’ services. Few studies have determined factors related to either high or low educational attainment. The level attained by a student in secondary school was found to be highly correlated with the way the student adjusts or settles in at university [20, 30].

MLR has been applied successfully to several other national household surveys in different countries (see [1–3, 7–9, 13–16, 21, 23, 25, 26, 29, 30]).

3. Methodology

MLR is an extension of binary (or dichotomous) logistic regression and is used when the outcome variable is polytomous. The role played by an independent variable in differentiating the groups of the dependent variable is similar to its role in a binary logistic regression. MLR has the added advantage that it provides multiple interpretations for an independent variable. As with binary logistic regression, one category of the dependent variable is specified as the reference category, and regression coefficients are estimated for each independent variable for each of the remaining categories. Justification of the analysis used in this paper is given in Appendix 1.
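To make the role of the reference category concrete, the following minimal Python sketch turns baseline-category logits into category probabilities. It is only an illustration under stated assumptions: the function name and the coefficient values are hypothetical and are not taken from the fitted model, and the reference category is given a linear predictor of zero.

```python
import math

def baseline_category_probs(x, coefs):
    """Return category probabilities for one case.

    x     : list of predictor values (for example, factor scores)
    coefs : one (intercept, [slopes]) pair per non-reference category
    The reference category comes first and has linear predictor 0.
    """
    # Log odds of each category against the reference category.
    etas = [0.0] + [b0 + sum(b * xi for b, xi in zip(bs, x)) for b0, bs in coefs]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(etas)
    weights = [math.exp(e - m) for e in etas]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical coefficients: three non-reference categories, two predictors.
example_coefs = [(0.5, [1.0, -0.5]), (0.2, [0.3, 0.1]), (-1.0, [0.8, 0.4])]
probs = baseline_category_probs([0.1, -0.2], example_coefs)
print(probs)
```

The ratio of any category probability to the reference probability equals the exponential of that category's linear predictor, which is why each coefficient is read as a log odds against the reference category.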


4. Data analysis

A total of 25,361 households were sampled across the entire country, giving rise to 94,263 responses, and a total of 185 variables [31] across the five core areas were considered in this paper. The data were collected across all nine provinces of South Africa and cover cross-cutting variables such as gender, age group, population group, marital status and highest level of education, among others. Of the key areas considered in the GHS, only the results on education are presented in this paper; the remaining key areas can be analysed in much the same way.

4.1 Applying MLR to education factors

MLR is used to regress the education outcomes against the education factors identified by factor analysis [24]. A multinomial variable, originally encoded on a scale from grade 0 = 1 to tertiary level = 29, with no schooling = 98, was recoded into four categories: ‘No Schooling’, ‘Less grade 12’, ‘Grade 12’ and ‘Above grade 12’. For this modelling the reference category chosen is ‘No Schooling’. The independent variables were FAC1_7 (Fees too high), FAC2_7 (Violence factor), FAC3_7 (Absence of parental care) and FAC4_7 (Historically advantaged). The results are shown in the sections that follow.

4.2 Overall test of relationship

The first step is to examine the overall relationship between the education level and the independent factors identified by factor analysis. Table 1 is used to test for the presence of such a relationship based on the chi-square distribution. The null hypothesis for this test is that there is no difference between the null model (i.e. the model without the independent variables) and the final model (i.e. the model that includes the independent variables); the alternative hypothesis is that the two models differ.

Table 1. Model fitting information for education.

         Model fitting criteria    LRTs
Model    −2 log likelihood         Chi-square        df    Sig.
Null     65,374,295.141
Final    41,554,698.417            23,819,596.724    12    .000

Table 2. Pseudo-R2 for education.

Cox and Snell    0.509
Nagelkerke       0.593

In Table 1, the initial −2 log likelihood value (65,374,295.141) is a measure of the null model (with constant or intercept only, but no independent variables). The final −2 log likelihood value (41,554,698.417) is a measure of the final model and is computed after all of the independent variables have been entered into the logistic regression. The difference between these two measures is the model chi-square value, 23,819,596.724 (65,374,295.141 − 41,554,698.417), which is tested for statistical significance. The test statistic value, 23,819,596.724, has a significance level of less than 0.05 (reported as .000). Thus, the null hypothesis that there is no difference between the null and final models is rejected, and it is concluded that there is evidence of a relationship between the levels of education outcomes and the identified independent factors.
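The arithmetic behind this test can be checked with a short Python sketch. It uses only the −2 log likelihood values and the degrees of freedom reported in Table 1; the constant names are illustrative, and the 5% critical value for 12 degrees of freedom (21.026) is the standard tabulated value, hard-coded here so that no statistics library is assumed.

```python
# -2 log likelihood values reported in Table 1.
NEG2LL_NULL = 65_374_295.141
NEG2LL_FINAL = 41_554_698.417
DF = 12  # 4 factors x 3 non-reference outcome categories

# Upper 5% critical value of the chi-square distribution with 12 df
# (standard table value, hard-coded to avoid external dependencies).
CHI2_CRIT_05_DF12 = 21.026

def model_chi_square(neg2ll_null, neg2ll_final):
    """Likelihood ratio statistic: the drop in -2 log likelihood."""
    return neg2ll_null - neg2ll_final

lr_stat = model_chi_square(NEG2LL_NULL, NEG2LL_FINAL)
reject_null = lr_stat > CHI2_CRIT_05_DF12
print(f"model chi-square = {lr_stat:,.3f} on {DF} df, reject H0: {reject_null}")
```

Because the statistic exceeds the critical value by many orders of magnitude, the p-value is effectively zero, which is consistent with the .000 shown in Table 1.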

4.3 Strength of MLR relationships

The strength of the relationship between the dependent and independent variables in the MLR model is assessed in this section by considering the pseudo-R2 measures, the classification accuracy measures, the likelihood ratio tests (LRTs) and the Wald test.

4.4 Pseudo-R2

The correlation measure provided by MLR analysis is the pseudo-R2. The pseudo-R2 values in Table 2 approximate the amount of variation in the outcome variable that is explained by the independent variables. The Cox and Snell and the Nagelkerke pseudo-R2 in Table 2 suggest that the variation in the level of education outcomes explained by the education factors ranges between 50.9% and 59.3%. Thus, a relatively high level of variation is explained by the model.
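As a rough check on Table 2, the sketch below recomputes both measures from the −2 log likelihood values in Table 1 using the usual Cox and Snell and Nagelkerke definitions. Two assumptions are made that the paper does not state: the reported −2 log likelihoods are treated as full log likelihoods, and n is taken to be the weighted case total obtained by summing the cells of Table 3. The function and constant names are illustrative.

```python
import math

NEG2LL_NULL = 65_374_295.141   # Table 1, null model
NEG2LL_FINAL = 41_554_698.417  # Table 1, final model
# Assumption: n is the weighted case total obtained by summing Table 3.
N_WEIGHTED = 33_463_151.7

def cox_snell_r2(neg2ll_null, neg2ll_final, n):
    """Cox and Snell: 1 - exp(-(model chi-square) / n)."""
    return 1.0 - math.exp(-(neg2ll_null - neg2ll_final) / n)

def nagelkerke_r2(neg2ll_null, neg2ll_final, n):
    """Nagelkerke: Cox and Snell divided by its maximum attainable value."""
    max_r2 = 1.0 - math.exp(-neg2ll_null / n)
    return cox_snell_r2(neg2ll_null, neg2ll_final, n) / max_r2

cs = cox_snell_r2(NEG2LL_NULL, NEG2LL_FINAL, N_WEIGHTED)
nk = nagelkerke_r2(NEG2LL_NULL, NEG2LL_FINAL, N_WEIGHTED)
print(f"Cox and Snell = {cs:.3f}, Nagelkerke = {nk:.3f}")
```

Under these assumptions the two values come out close to those reported in Table 2, which suggests that the statistics were computed on the weighted data.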

4.5 Evaluating the usefulness of the MLR model

The pseudo-R2 provides a measure of the extent of association between the independent and dependent variables of education, but it does not indicate the accuracy of the model or the errors inherent in it. This is what the classification accuracy measures do. Classification accuracy, which is also a measure of the strength of the relationship between the independent and dependent variables, assesses the model by comparing its predicted values with the observed values. To calculate the accuracy of classification, we first consider the marginal frequencies for ‘No Schooling’, ‘Less grade 12’, ‘Grade 12’ and ‘Above grade 12’, which are 6.4%, 60.2%, 29.1% and 4.3%, respectively. These are then used to calculate the proportional by chance accuracy rate, which was found to be 45.3% (i.e. 0.064^2 + 0.602^2 + 0.291^2 + 0.043^2 = 0.453). The benchmark used to characterise an MLR model as useful is a 25.0% improvement over the rate of accuracy achievable by chance alone. Thus, the classification accuracy rate should be at least 25.0% more than the proportional by chance accuracy rate of 45.3%, that is, it must be at least 57.0% (1.25 × 45.3% ≈ 56.6%, rounded up) for the MLR model to be adequate.

Table 3. Classification of accuracy for education.

                      Predicted
Observed              No schooling          Less grade 12          Grade 12              Above grade 12     Per cent correct (%)
No schooling          1,993,290.80055103    55,870.78500309        101,415.23682665      1444.34908080      92.6
Less grade 12         6207.18783736         19,205,913.62670121    923,700.03412630      2412.39669195      95.4
Grade 12              48,386.75876271       6,420,652.92802807     3,214,222.01375571    46,601.01303766    33.0
Above grade 12        37,598.02957889       531,015.76882438       861,051.56959624      13,369.22594909    0.9
Overall percentage    6.2%                  78.3%                  15.2%                 0.2%               73.0

Table 4. LRT for education.

             Model fitting criteria                  LRTs
Effect       −2 log likelihood of reduced model      Chi-square        df    Sig.
Intercept    85,560,035.236                          44,005,336.819    3     .000
FAC1_7       51,671,794.690                          10,117,096.273    3     .000
FAC2_7       50,486,798.649                          8,932,100.232     3     .000
FAC3_7       54,138,915.640                          12,584,217.223    3     .000
FAC4_7       46,083,545.172                          4,528,846.754     3     .000

Notes: The chi-square statistic is the difference in −2 log likelihoods between the final model and a reduced model. The reduced model is formed by omitting an effect from the final model. The null hypothesis is that all parameters of that effect are 0. FAC1_7, fees too high; FAC2_7, violence factor; FAC3_7, absence of parental care; FAC4_7, historically advantaged factor.

Table 3 shows the comparison of the observed and the predicted levels of education outcomes and the extent to which they can be correctly predicted. Two groups have high levels of accurate prediction (92.6% for ‘No Schooling’ and 95.4% for ‘Less grade 12’). Correct classification is only 33.0% for ‘Grade 12’ and 0.9% for ‘Above grade 12’. The correctly classified cases are on the diagonal of Table 3. The overall correct classification for all cases is 73.0%, and the groups of the dependent variable with the strongest predictions are ‘No Schooling’ and ‘Less grade 12’. This means that the model is more useful for those individuals whose highest level of education is ‘No Schooling’ or ‘Less grade 12’. This makes sense, since beyond matriculation (the last year of schooling in South Africa) some factors, such as violence, do not apply. The overall classification accuracy rate displayed in Table 3 is 73.0%, which is greater than the proportional by chance accuracy criterion of 57.0%. The model therefore improves on the by chance accuracy rate by 25.0% or more, so the criterion for classification accuracy is satisfied and the model is adequate.
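The classification figures quoted above can be recomputed directly from the cell counts in Table 3. The Python sketch below is illustrative only (the function and variable names are not from the paper); it derives the overall accuracy, the proportional by chance accuracy and the per-category accuracy from the weighted counts.

```python
# Weighted classification counts from Table 3 (rows: observed, columns: predicted).
LABELS = ["No schooling", "Less grade 12", "Grade 12", "Above grade 12"]
TABLE3 = [
    [1993290.80055103, 55870.78500309, 101415.23682665, 1444.34908080],
    [6207.18783736, 19205913.62670121, 923700.03412630, 2412.39669195],
    [48386.75876271, 6420652.92802807, 3214222.01375571, 46601.01303766],
    [37598.02957889, 531015.76882438, 861051.56959624, 13369.22594909],
]

def classification_summary(table):
    """Return (overall accuracy, by-chance accuracy, per-row accuracy)."""
    row_totals = [sum(row) for row in table]
    n = sum(row_totals)
    # Correctly classified cases sit on the diagonal.
    overall = sum(table[i][i] for i in range(len(table))) / n
    # Proportional by-chance accuracy: sum of squared marginal proportions.
    by_chance = sum((t / n) ** 2 for t in row_totals)
    per_row = [table[i][i] / row_totals[i] for i in range(len(table))]
    return overall, by_chance, per_row

overall, by_chance, per_row = classification_summary(TABLE3)
criterion = 1.25 * by_chance  # 25% improvement over chance
for label, acc in zip(LABELS, per_row):
    print(f"{label}: {acc:.1%}")
print(f"overall = {overall:.1%}, chance = {by_chance:.1%}, criterion = {criterion:.1%}")
```

The comparison `overall > criterion` is the adequacy check described in the text; the per-category rates make it clear that the overall figure is driven by the two largest groups.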

4.6 Relationship between independent and dependent variables

The LRT in Table 4 presents the significance of each of the factors individually. It tests the improvement in model fit attributable to each factor. Each of the factors in Table 4 has a p-value reported as .000, which is less than .05. This means that there is a relationship between the dependent variable and each of the independent education factors; hence all four education factors should be included in the model. The LRT confirms that the levels of education outcome are associated with each of the education factors, but this does not necessarily mean that each factor is statistically significant in distinguishing between any two of the classified education level outcomes. The Wald test, discussed in Section 3.5.4 in Appendix 1, is used to make this distinction.
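The internal consistency of Table 4 can be verified in a few lines. The sketch below, with illustrative names, subtracts the final-model −2 log likelihood (Table 1) from each reduced-model value and compares the result with the reported chi-square; the 5% critical value for 3 degrees of freedom (7.815) is the standard tabulated value and is hard-coded.

```python
NEG2LL_FINAL = 41_554_698.417  # final model, Table 1

# Effect: (-2 log likelihood of reduced model, reported chi-square), Table 4.
TABLE4 = {
    "Intercept": (85_560_035.236, 44_005_336.819),
    "FAC1_7": (51_671_794.690, 10_117_096.273),
    "FAC2_7": (50_486_798.649, 8_932_100.232),
    "FAC3_7": (54_138_915.640, 12_584_217.223),
    "FAC4_7": (46_083_545.172, 4_528_846.754),
}
CHI2_CRIT_05_DF3 = 7.815  # upper 5% point of chi-square with 3 df

def effect_chi_square(neg2ll_reduced, neg2ll_final=NEG2LL_FINAL):
    """Chi-square for one effect: reduced-model minus final-model -2LL."""
    return neg2ll_reduced - neg2ll_final

recomputed = {name: effect_chi_square(red) for name, (red, _) in TABLE4.items()}
for name, value in recomputed.items():
    print(f"{name}: {value:,.3f} significant: {value > CHI2_CRIT_05_DF3}")
```

Each recomputed value agrees with the reported chi-square to within rounding in the last decimal place.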

Table 5. Parameter estimates for education.

                                                                                                          95% confidence interval for Exp(B)
Highest level of education(a)    Effect       B           Std. error    Wald             df    Sig.    Exp(B)          Lower bound     Upper bound
Less grade 12                    Intercept    49.350      .040          1,506,481.141    1     .000
                                 FAC1_7       −58.140     .051          1,279,200.104    1     .000    1.000E−013      1.000E−013      1.000E−013
                                 FAC2_7       −221.511    .191          1,347,016.518    1     .000    1.000E−013      1.000E−013      1.000E−013
                                 FAC3_7       −10.009     .007          1,824,724.287    1     .000    4.501E−005      4.436E−005      4.567E−005
                                 FAC4_7       −.928       .003          104,447.745      1     .000    .395            .393            .397
Grade 12                         Intercept    38.359      .040          918,495.793      1     .000
                                 FAC1_7       −44.041     .051          742,303.797      1     .000    1.000E−013      1.000E−013      1.000E−013
                                 FAC2_7       −163.485    .189          745,649.133      1     .000    1.000E−013      1.000E−013      1.000E−013
                                 FAC3_7       −8.729      .007          1,394,707.255    1     .000    .000            .000            .000
                                 FAC4_7       .064        .003          504.227          1     .000    1.066           1.060           1.072
Above grade 12                   Intercept    24.509      .041          349,461.903      1     .000
                                 FAC1_7       −27.879     .053          276,232.015      1     .000    8.802E−013      8.031E−013      9.656E−013
                                 FAC2_7       −102.819    .196          274,146.677      1     .000    1.000E−013      1.000E−013      1.000E−013
                                 FAC3_7       −6.580      .008          748,729.929      1     .000    .001            .001            .001
                                 FAC4_7       .920        .003          97,136.100       1     .000    2.509           2.495           2.524

Note: (a) The reference category is: No Schooling.
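The Exp(B) column in Table 5 is the exponential of the coefficient B, and a Wald-type 95% interval is commonly obtained as exp(B ± 1.96 × SE). The following Python sketch applies this to the FAC4_7 rows; the function name is illustrative, and the interval formula is an assumption about how the bounds were produced, since the paper does not state it.

```python
import math

# Outcome: (B, standard error, reported Exp(B)) for FAC4_7 in Table 5.
FAC4_7 = {
    "Less grade 12": (-0.928, 0.003, 0.395),
    "Grade 12": (0.064, 0.003, 1.066),
    "Above grade 12": (0.920, 0.003, 2.509),
}

def odds_ratio_with_ci(b, se, z=1.96):
    """Odds ratio exp(B) with a Wald-type interval exp(B +/- z * SE)."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)

results = {k: odds_ratio_with_ci(b, se) for k, (b, se, _) in FAC4_7.items()}
# An odds ratio below 1 can be read in the other direction via its reciprocal.
inverse_less_grade12 = 1.0 / FAC4_7["Less grade 12"][2]

for outcome, (odds, lower, upper) in results.items():
    print(f"{outcome}: OR = {odds:.3f}, 95% CI ({lower:.3f}, {upper:.3f})")
print(f"reciprocal of 0.395 = {inverse_less_grade12:.2f}")
```

Small differences in the third decimal place of the bounds are to be expected, because B and the standard errors are reported to only three decimals.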

4.7 Test for statistically significant factors and parameter estimation

The results of fitting the MLR model to the four education factors are given in Table 5, which shows the coefficient estimates (B), the standard errors, the Wald statistic, the odds ratio (OR), represented by Exp(B), and the corresponding 95% confidence intervals for each level of education outcome. The Wald statistic, along with its significance value (p-value), is used to test the significance of each factor. In this case it tests whether the education factor can significantly distinguish each education level from the reference level, which is ‘No Schooling’. The Wald test is used to measure the improvement gained by adding each education factor to the intercept-only (null) model. All the Wald test p-values in Table 5 are reported as .000 and are hence all less than .05, which means that all the education factors are statistically able to differentiate between each of the education level outcomes and the reference category of ‘No Schooling’. Thus, all the factors must be retained in the model.

4.8 Check for multicollinearity and numerical errors

The standard errors of the coefficient estimates (B) are used to check for numerical errors or multicollinearity in the solution of the MLR. A standard error greater than 2.0 indicates numerical problems such as multicollinearity among the independent education factors. A model loses precision when the confidence interval is wider; for the significant variables, narrow confidence intervals suggest greater precision. The confidence intervals displayed in Table 5 are very narrow, suggesting great precision of the factors. Confidence intervals for Exp(B) that include 1 indicate that there is no significant relationship between educational level and the factor scores. We also have to account for the increased Type I error that results from the large number of statistical tests being run. This means that we should avoid using the standard critical value of p < .05. The adjusted value is obtained by dividing .05 by the total number of predictors (a Bonferroni correction). For our model, statistical significance should therefore be determined at p < .0125, obtained by dividing .05 by 4.

4.9 Interpretation of the MLR results

Looking at FAC4_7 (the historically advantaged factor), it can be deduced that, holding other factors constant, the odds of someone who is historically advantaged proceeding to complete tertiary education (above grade 12), rather than remaining with no formal schooling, are 2.51 times those of a historically disadvantaged person. Holding other factors constant, the historically disadvantaged are 2.53 (1/0.395) times more likely than the historically advantaged to have no formal schooling at all rather than at most a grade 11 education. The odds of completing grade 12 rather than having no formal schooling are almost the same, though slightly higher (odds ratio 1.07), for the historically advantaged. Respondents are significantly less likely to have any form of education, compared with no schooling at all, on FAC1_7 (Fees too high), FAC2_7 (Violence) and FAC3_7 (Absence of parental care). This is because the odds ratios for the three groups (‘Less grade 12’, ‘Grade 12’ and ‘Above grade 12’), when compared with ‘No schooling’, are close to zero on these factors. Therefore, there is a large difference between these three groups and the ‘No schooling’ group on these three factors.

5. Conclusion

This study focused on understanding the relationship between extracted factors of QoL and education level in the context of South Africa using the 2009 GHS. Analyses of the parameter estimates have shown that ‘High school fees’, ‘Violence’ and ‘Absence of parental care’ have a negative influence on levels of education. As expected, respondents are significantly less likely to have any form of education, compared with ‘No schooling at all’, on these factors. This is because the odds ratios for the three groups (‘Less grade 12’, ‘Grade 12’ and ‘Above grade 12’), when compared with ‘No schooling’, are close to zero on these factors.

MLR requires a larger sample size for accurate estimation of parameters. Models tend to be difficult or impossible to interpret when there are many (i.e. four or more) groups to compare in the dependent variable. In MLR the coefficient estimates do not maximise any goodness-of-fit measure. This study ignored the goodness-of-fit test because of the many cells with zero frequencies. In the current study the goodness-of-fit test significance level is small (0.00 < 0.005), which may imply that the model does not adequately fit the data. However, evidence from other research studies suggests that this measure can be misleading and that analysis can proceed without compromising results.

Although we used different techniques for data analysis, our results compare well with those obtained by Bowling and Windsor [5], Hassan et al. [10], Maliki et al. [19] and Tesfazghi et al. [34]. Slight differences from the results observed in [1, 6] can be attributed to differences in the historical background of the countries involved. It is therefore recommended that the government of South Africa review the fee structure and try to assist those students who are unable to pay fees. Some of the governments with the highest literacy rates in Africa have introduced free education, especially at primary level, so that everyone in those countries has some basic education.

Disclosure statement

No potential conflict of interest was reported by the authors.

References

[1] M.Z. Al-Agili, M. Mamat, L. Abdullah, and H.A. Maad, The factors influence students’ achievement in mathematics: A case for Libyan’s students, World Appl. Sci. J. 17 (2012), pp. 1224–1230.
[2] O.O. Alawode and A.M. Lawal, Income inequality and self rated health in rural Nigeria, Agricultural Science 2 (2014), pp. 36–45.
[3] M. Ariani, D. Kaluge, and D.S. Pratomo, Does vocational education matter for the labour market? (A case study in mining sector in East Kalimantan–Indonesia), J. Econ. Sust. Develop. 5 (2014), pp. 111–120.
[4] V.K. Borooah, Logit and Probit: Ordered and Multinomial Models (No. 138), Sage, Newbury Park, CA, 2002.
[5] A. Bowling and J. Windsor, Towards the good life: A population survey of dimensions of quality of life, J. Happin. Stud. 2 (2001), pp. 55–82.
[6] J. Brown, A. Bowling, and T. Flynn, Models of Quality of Life: A Taxonomy, Overview and Systematic Review of the Literature, European Forum on Population Ageing Research, 2004. Available at: https://lemosandcrane.co.uk/resources/European%20Forum%20on%20Population%20Ageing%20Research%20-%20Models%20of%20Quality%20of%20Life.pdf (Accessed 15 February 2013).
[7] A. Exavery, A.M. Kanté, M. Njozi, K. Tani, H.V. Doctor, A. Hingora, and J.F. Phillips, Predictors of mistimed, and unwanted pregnancies among women of childbearing age in Rufiji, Kilombero, and Ulanga districts of Tanzania, Reprod. Heal. 11(63) (2014), pp. 1–9.
[8] M.Z. Faridi and A.B. Basit, Factors determining rural labour supply: A micro analysis, Pakistan Econ. Soc. Rev. 49 (2011), pp. 91–108.
[9] M. Hammill, Income poverty and unsatisfied basic needs. Social Development Unit, Subregional Headquarters of ECLAC/Mexico, 2009.
[10] S. Hassan, N. Ismail, W.Y.W. Jaafar, K. Ghazali, K. Budin, D. Gabda, and A.S.A. Samad, Using factor analysis on survey study of factors affecting students’ learning style, Int. J. Appl. Math. Inform. 1 (2012), pp. 33–113.
[11] D.W. Hosmer and S. Lemeshow, Applied Logistic Regression, J Wiley & Sons, New York, NY, 1989.
[12] D.W. Hosmer and S. Lemeshow, Applied Logistic Regression, 2nd ed., John Wiley & Sons, New York, NY, 2000.
[13] V.D. Joshi, Y.M. Chen, and J.F.Y. Lim, Public perceptions of the factors that constitute a good healthcare system, Singapore Med. J. 50(10) (2009), pp. 982–989.
[14] A. Khorshidi and M. Rezaloo, Recognizing effective in creating prevalent high-schools in Tehran, J. Appl. Environ. Biol. Sci. 1 (2011), pp. 688–694.
[15] M. King, L. Marston, S. McManus, T. Brugha, H. Meltzer, and P. Bebbington, Religion, spirituality, and mental health: Results from a national study of English households, Brit. J. Psychiat. 202 (2013), pp. 68–73.
[16] C. Kumar, S.P. Singh, and D.K. Nauriyal, Correlates and issues of academic course-selection in post-secondary education in India: Evidence from national sample survey, 2007, 08, Open Access Lib. J. 1 (2014), pp. 1–22.
[17] M.H. Kutner, C.J. Nachtsheim, J. Neter, and W. Li, Applied Linear Statistical Models, 5th ed., Mcgraw-Hill Irwin, Toronto, ON, 2005.
[18] M.S. Majors and W.E. Sedlacek, Using factor analysis to organize student services, J. Coll. Student Dev. 42 (2001), pp. 272–278.
[19] S.B. Maliki, A. Benhabib, and A. Bouteldja, Quantification of the poverty-education relationship in Algeria: A multinomial econometric approach, Topics from Middle Eastern and African Economies, 2012, 14. Available at: www.luc.edu/orgs/meea/volume4/PDFS/Maliki-Benhabib-Bouteldja%20(2011).pdf (Accessed 15 February 2013).
[20] S.J. Mann, Alternative perspectives on the student experience: Alienation and engagement, Stud. High. Edu. 26 (2001), pp. 7–19.
[21] C.J. Mba, Gender disparities in living arrangements of older people in Ghana: Evidence from the 2003 Ghana Demographic and Health Survey, J. Int. Women Stud. 9 (2007), pp. 153–166.
[22] S. Menard, Applied logistic regression analysis, Sage, Thousand Oaks, CA, 2000.
[23] M. Mowafi, Z. Khadr, I. Kawachi, S.V. Subramanian, A. Hill, and G.G. Bennett, Socioeconomic status and obesity in Cairo, Egypt: A heavy burden for all, J. Epidemiol. Glob. Heal. 4 (2014), pp. 13–21.
[24] P. Nyamugure, M. Lesaoana, and S. Monyai, Application of factor analysis to the 2009 General Household Survey of South Africa, ICASTOR J. Math. Sci. 5 (2011), pp. 133–150.
[25] K. Odhav, South African post-apartheid Higher Education policy and its marginalisations: 1994–2002, SA-Edu. J. 6 (2009), pp. 33–57.
[26] C.O. Odimegwu and J. Kekovole (eds.), Continuity and Change in Sub-Saharan African Demography, Routledge, New York, 2014. Available at: https://books.google.co.za/books?hl=en&lr=&id=3fcABAAAQBAJ&oi=fnd&pg=PP1&dq=%22continuity+and+change+in+sub-saharan+african+demography%22&ots=78RqdbeHFd&sig=TJ62i0pLA6SKGCzyQfvXK8xyoRM#v=onepage&q=%22continuity%20and%20change%20in%20sub-saharan%20african%20demography%22&f=false (Accessed 22 February 2013).
[27] C.J. Petrucci, A primer for social worker researchers on how to conduct a multinomial logistic regression, J. Soc. Ser. Res. 35 (2009), pp. 193–205.
[28] S.P. Reise, Using multilevel logistic regression to evaluate person-fit in IRT models, Multivar. Behav. Res. 35 (2000), pp. 543–568.
[29] M. Sarstedt, M. Schwaiger, C.M. Ringle, and S. Gudergan, Satisfaction with services: An impact-performance analysis for soccer-fan satisfaction judgements. In Australian and New Zealand Academic Conference, 2009. Available at: http://www.duplication.net.au/ANZMAC09/papers/ANZMAC2009-577.pdf (Accessed 7 February 2012).
[30] J. Sennett, G. Finchilescu, K. Gibson, and R. Strauss, Adjustment of black students at a historically white South African university, Educ. Psychol. 23 (2003), pp. 107–116.
[31] Statistics South Africa General Household Survey, 2009, Statistical release P0318, 2010.
[32] B.G. Tabachnick and L.S. Fidell, Using Multivariate Statistics, 5th ed., Pearson Education, Inc., Boston, MA, 2007. (Accessed 07 February 2012).
[33] B.G. Tabachnick, L.S. Fidell, and S.J. Osterlind, Using Multivariate Statistics. Allyn and Bacon, Boston, MA, 2001.
[34] E.S. Tesfazghi, J.A. Martinez, and J.J. Verplanke, Variability of quality of life at small scales: Addis Ababa, Kirkos Sub-City, Soc. Indicat. Res. 98 (2010), pp. 73–88.

Appendix 1

A.1 Evaluating the usefulness of logistic regression models

The benchmark used to characterise an MLR model as useful is a 25% improvement over the rate of accuracy achievable by chance alone [27, 32]. According to [22], it is possible to have a model that fits the data well while doing poorly in predicting category membership. The proportional change in error measure of accuracy of prediction for selection of the model (standard formula for a 2 × 2 prediction table with 1 degree of freedom) is given by Menard [22] as:

\[ \phi_p = \frac{ad - bc}{0.5(a + b)(b + d) + (c + d)(a + c)}, \tag{A1} \]

where a, b, c and d are the elements of the 2 × 2 prediction table.

The best options for analysing the prediction (classification) tables provided by logistic regression packages involve proportional change in error measures of the form:

\[ \text{predictive efficiency} = \frac{(\text{error without model}) - (\text{error with model})}{(\text{error without model})}. \tag{A2} \]

For a classification model, an appropriate definition of the expected error without the model is: Errors without model =

N  fi (N − fi ) , N

(A3)

t=1

Downloaded by [Nat'l Univ of Sci & Tech] at 23:29 27 October 2015

where N is the sample size and $f_i$ is the number of cases observed in category i. If the model improves our prediction of the dependent variable, the formula for predictive efficiency is the same as a proportional reduction in error formula. When the model performs worse than prediction without it, the predictive efficiency is negative, indicating that the model increases the error [22].
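As a rough illustration of (A2) and (A3), the sketch below computes the expected errors without the model from the observed category counts, and the predictive efficiency from a classification table. The function names and the toy table are hypothetical, and the sum in (A3) is assumed here to run over the categories of the dependent variable.

```python
def errors_without_model(counts):
    """Expected errors without the model, Equation (A3).

    counts: observed number of cases f_i in each category (assumed input).
    """
    n = sum(counts)
    return sum(f * (n - f) / n for f in counts)


def predictive_efficiency(table):
    """Proportional change in error, Equation (A2).

    table[i][j]: number of cases observed in category i and predicted
    as category j (a hypothetical classification table).
    """
    observed = [sum(row) for row in table]
    n = sum(observed)
    correct = sum(table[i][i] for i in range(len(table)))
    errors_with = n - correct
    errors_without = errors_without_model(observed)
    return (errors_without - errors_with) / errors_without


# Toy 3-category classification table (made-up numbers).
toy_table = [[40, 5, 5],
             [10, 20, 10],
             [5, 5, 0]]
print(round(predictive_efficiency(toy_table), 3))
```

A negative return value would correspond to the case described above, in which the model predicts worse than the no-model baseline.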

A.2

Measuring Strength of Association (Pseudo-R²)

The pseudo-R² is defined by Borooah [4] as $1 - LL_{R+F}/LL_R$ and is bounded from below by 0 and from above by 1. A value of 0 corresponds to all the slope coefficients being zero, and a value of 1 corresponds to perfect prediction (i.e. to $LL_{R+F} = 0$). $LL_R$ is the value of the log-likelihood function when the only explanatory variable is the constant term, and $LL_{R+F}$ is its value when all the explanatory variables are included. Three R² statistics are commonly used to measure the strength of association between the dependent variable and the explanatory (independent) variables: (a) Cox and Snell, (b) Nagelkerke and (c) McFadden [28, 33]. MLR computes these correlation-type measures (pseudo-R² measures) to estimate the strength of the relationship.
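The following sketch shows how the three statistics might be computed from the two log-likelihoods. The ratio form given above is the one usually attributed to McFadden; the Cox and Snell and Nagelkerke expressions are the usual textbook forms and are not written out in this appendix, so they should be read as assumptions. The function names and the numerical inputs are hypothetical.

```python
import math


def mcfadden_r2(ll_reduced, ll_full):
    """1 - LL_{R+F} / LL_R, the ratio form given in the text."""
    return 1.0 - ll_full / ll_reduced


def cox_snell_r2(ll_reduced, ll_full, n):
    """Usual textbook form: 1 - exp(2 (LL_R - LL_{R+F}) / n)."""
    return 1.0 - math.exp(2.0 * (ll_reduced - ll_full) / n)


def nagelkerke_r2(ll_reduced, ll_full, n):
    """Cox and Snell value rescaled by its maximum attainable value."""
    max_value = 1.0 - math.exp(2.0 * ll_reduced / n)
    return cox_snell_r2(ll_reduced, ll_full, n) / max_value


# Made-up log-likelihoods for an intercept-only and a full model.
ll_r, ll_rf, n_obs = -100.0, -80.0, 100
print(mcfadden_r2(ll_r, ll_rf))
print(cox_snell_r2(ll_r, ll_rf, n_obs))
print(nagelkerke_r2(ll_r, ll_rf, n_obs))
```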

A.3

Baseline-category logits for nominal response

For nominal response variables, an extension of logistic regression forms logits by pairing each category with a baseline category, and each logit equation has its own parameters. Y is a categorical (polytomous) response variable with J categories, taking on values 0, 1, …, J − 1. The assumptions for MLR are as follows: (1) the observations Y are statistically independent of each other, (2) the observations Y are a random sample from a population in which Y has a multinomial distribution with probability parameters $\{\pi_0(x), \ldots, \pi_{J-1}(x)\}$, and (3) one category has to be set aside as a base category (hence J − 1 sets of parameters). For grouped data it is convenient to introduce auxiliary random variables representing counts of responses in the various categories. For instance, if there are J response categories, then for the ith observation there are J binary response variables, $Y_{i0}, \ldots, Y_{i,J-1}$, where:

$$Y_{ij} = \begin{cases} 1 & \text{if the response of case } i \text{ is category } j, \\ 0 & \text{otherwise.} \end{cases} \tag{A4}$$

Since only one category can be selected for response i,

$$\sum_{j=0}^{J-1} Y_{ij} = 1. \tag{A5}$$

Let $n_i$ denote the number of cases in the ith group and $Y_{ij}$ the number of responses from the ith group that fall in the jth category, with observed value $y_{ij}$. The probability distribution of the counts $Y_{ij}$ given the total $n_i$ is the multinomial distribution (the group subscript i is suppressed):

$$P(Y_0 = y_0, \ldots, Y_{J-1} = y_{J-1}) = \begin{cases} \dfrac{n!}{y_0! \cdots y_{J-1}!}\, \pi_0^{y_0} \cdots \pi_{J-1}^{y_{J-1}} & \text{when } \sum_{j=0}^{J-1} y_j = n, \\ 0 & \text{otherwise.} \end{cases} \tag{A6}$$
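A minimal sketch of the multinomial probability in (A6) is given below, assuming the standard form with each probability raised to the power of its count. The function name and the example counts are hypothetical; the total n is taken as the sum of the supplied counts, so the "otherwise" branch never arises.

```python
import math


def multinomial_pmf(counts, probs):
    """Probability of observing `counts` given category probabilities.

    counts: y_0, ..., y_{J-1}; probs: pi_0, ..., pi_{J-1} (summing to 1).
    """
    n = sum(counts)
    coefficient = math.factorial(n)
    for y in counts:
        coefficient //= math.factorial(y)   # stays an exact integer
    probability = float(coefficient)
    for y, pi in zip(counts, probs):
        probability *= pi ** y
    return probability


# Four responses spread over three categories (made-up numbers).
print(multinomial_pmf([2, 1, 1], [0.5, 0.25, 0.25]))
```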

Given a choice of J − 1 of these, the remaining one is redundant. A dependent variable with J categories therefore requires J − 1 equations, one for each category relative to the reference category, to describe the relationship between the dependent variable and the independent variables. The logit model pairs each response category with a baseline category, often the last or the most common one. Here the group coded Y = 0 serves as the reference outcome, and logits are formed comparing Y = 1, …, J − 1 to it. To develop the model, let us assume that we have p covariates and a constant term,


denoted by the vector of covariates, x, of length p + 1. If the first category, Y = 0, is the reference, the general formula for the logit function is:

$$g_j(x) = \ln\left[\frac{P(Y = j \mid x)}{P(Y = 0 \mid x)}\right] = \beta_{j0} + \beta_{j1} x_1 + \cdots + \beta_{jp} x_p = x'\beta_j. \tag{A7}$$

Hence, for each case, there will be J − 1 predicted log odds. The intercept parameter $\beta_{j0}$ is the logit when all covariates are zero, and each slope parameter $\beta_{jk}$ indicates how much the log odds change with a one-unit change in the predictor $x_k$ [28]. Let us consider baseline-category logits given as follows (see [12]):

$$\pi_j(x) = P(Y = j \mid x), \quad \text{for } j = 0, 1, \ldots, J - 1, \quad \text{with } \sum_{j=0}^{J-1} \pi_j(x) = 1. \tag{A8}$$


Each of these is a function of the vector of p + 1 parameters $\beta_j' = (\beta_{j0}, \beta_{j1}, \ldots, \beta_{jp})$. For each observation, the counts in the J categories of Y are treated as multinomial with probabilities $\{\pi_0(x), \ldots, \pi_{J-1}(x)\}$. A general expression for the conditional probability in the J-category model is:

$$\pi_j(x) = P(Y = j \mid x) = \frac{e^{g_j(x)}}{1 + \sum_{k=1}^{J-1} e^{g_k(x)}}, \tag{A9}$$

where the vector $\beta_0 = 0$ and hence $g_0(x) = 0$.
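The conditional probabilities in (A9) can be sketched as below. The function name, the coefficient values and the covariate vector are hypothetical; the covariate vector is assumed to carry a leading 1 for the constant term, and the reference category Y = 0 has its logit fixed at zero.

```python
import math


def category_probabilities(x, betas):
    """Return [pi_0(x), ..., pi_{J-1}(x)] as in Equation (A9).

    x:     covariate vector of length p + 1 with a leading 1.
    betas: J - 1 coefficient vectors, one per non-reference category.
    """
    logits = [sum(b * v for b, v in zip(beta_j, x)) for beta_j in betas]
    denominator = 1.0 + sum(math.exp(g) for g in logits)
    # The reference category contributes e^0 = 1 to the denominator.
    return [1.0 / denominator] + [math.exp(g) / denominator for g in logits]


# Hypothetical coefficients for J = 3 categories and one covariate.
example_betas = [[0.2, 0.7], [-0.5, 0.3]]
example_probs = category_probabilities([1.0, 2.0], example_betas)
print(example_probs, sum(example_probs))
```

Taking the log of the ratio of any non-reference probability to the reference probability recovers the corresponding logit, which is a quick consistency check against (A7).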

A.4

Maximum likelihood estimation

For n independent observations, the joint probability for the likelihood is given by:

$$\prod_{i=1}^{n} \left(\pi_j(x_i)\right)^{Y_{ij}}. \tag{A10}$$

The conditional likelihood function for a sample of n independent observations and J categories is:

$$l(\beta) = \prod_{i=1}^{n} \prod_{j=0}^{J-1} \left[P(Y = j \mid x_i)\right]^{Y_{ij}} = \prod_{i=1}^{n} \prod_{j=0}^{J-1} \pi_j(x_i)^{Y_{ij}}. \tag{A11}$$

Taking the log and using the fact that exactly one category is selected for response i, the log-likelihood function is:

$$L(\beta) = \sum_{i=1}^{n} \left( \sum_{j=1}^{J-1} Y_{ij}\, x_i'\beta_j - \log_e\left[1 + \sum_{j=1}^{J-1} e^{x_i'\beta_j}\right] \right). \tag{A12}$$

The likelihood equations are found by taking the first partial derivatives of L(β) with respect to each of the p + 1 unknown parameters in every vector $\beta_j$ [17]. For simplicity of notation let

$$\pi_{ij} = \pi_j(x_i). \tag{A13}$$

The general form of these equations is:

$$\frac{\partial L(\beta)}{\partial \beta_{jk}} = \sum_{i=1}^{n} x_{ki} (Y_{ij} - \pi_{ij}). \tag{A14}$$

Here j = 0, 1, …, J − 1 and k = 0, 1, …, p, and we recall that $x_{0i} = 1$ for each subject. The maximum likelihood estimator, $\hat\beta$, of these parameters is the value that maximises the log-likelihood (A12); it is obtained by setting the derivatives of the log-likelihood to zero and solving for β [12]. The J − 1 response functions may be obtained by substituting the maximum likelihood estimates of the J − 1 parameter vectors into the expression in Equation (A9). As shown by Kutner et al. [17], the estimator is expressed as:

$$\hat\pi_j(x) = \frac{e^{\hat g_j(x)}}{1 + \sum_{k=1}^{J-1} e^{\hat g_k(x)}} = \frac{e^{x'\hat\beta_j}}{1 + \sum_{k=1}^{J-1} e^{x'\hat\beta_k}}. \tag{A15}$$
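To make the estimation step concrete, the sketch below evaluates the log-likelihood and its partial derivatives for individual-level data and maximises the log-likelihood by plain gradient ascent. Gradient ascent is an assumption chosen for brevity; statistical packages typically use Newton-type iterations instead. All function names, the step size, the number of steps and the data are hypothetical.

```python
import math


def linear_predictors(x, betas):
    """Return g_1(x), ..., g_{J-1}(x); g_0(x) = 0 is implicit."""
    return [sum(b * v for b, v in zip(beta_j, x)) for beta_j in betas]


def log_likelihood(X, y, betas):
    """Log-likelihood for labels y in {0, ..., J-1}."""
    total = 0.0
    for x, label in zip(X, y):
        g = linear_predictors(x, betas)
        if label > 0:                       # Y_ij = 1 only for j = label
            total += g[label - 1]
        total -= math.log(1.0 + sum(math.exp(v) for v in g))
    return total


def gradient(X, y, betas):
    """Partial derivatives: sum over i of x_ki * (Y_ij - pi_ij)."""
    grad = [[0.0] * len(beta_j) for beta_j in betas]
    for x, label in zip(X, y):
        g = linear_predictors(x, betas)
        denominator = 1.0 + sum(math.exp(v) for v in g)
        for j, g_j in enumerate(g):
            pi_ij = math.exp(g_j) / denominator
            y_ij = 1.0 if label == j + 1 else 0.0
            for k, x_ki in enumerate(x):
                grad[j][k] += x_ki * (y_ij - pi_ij)
    return grad


def fit(X, y, n_categories, steps=2000, rate=0.05):
    """Gradient ascent from zero (an assumption, not the paper's solver)."""
    betas = [[0.0] * len(X[0]) for _ in range(n_categories - 1)]
    for _ in range(steps):
        grad = gradient(X, y, betas)
        for j, row in enumerate(grad):
            for k, derivative in enumerate(row):
                betas[j][k] += rate * derivative / len(X)
    return betas


# Made-up data: each row is (1, x_1); the leading 1 is the constant term.
X = [[1, 0], [1, 1], [1, 2], [1, 3], [1, 0], [1, 1], [1, 2], [1, 3]]
y = [0, 0, 1, 2, 1, 0, 2, 1]
beta_hat = fit(X, y, n_categories=3)
print(beta_hat, log_likelihood(X, y, beta_hat))
```

Only the J − 1 non-reference coefficient vectors are updated, since the reference vector is fixed at zero.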

The estimators of the variances and covariances are obtained by evaluating var(β) at $\hat\beta$ [12]. The standard errors of the estimated coefficients are:

$$SE(\hat\beta_j) = \left[\operatorname{var}(\hat\beta_j)\right]^{1/2}, \quad \text{where } j = 0, 1, \ldots, J - 1. \tag{A16}$$

Multicollinearity in the MLR solution is detected by examining the standard errors of the $\hat\beta$ coefficients. A standard error larger than 2.0 indicates numerical problems. The z-ratios are the ratios of the estimated coefficients to their estimated standard errors, and they are asymptotically distributed as N(0,1) under the null hypothesis that the associated coefficients are zero.

A.5

Interpreting and assessing the significance of the estimated coefficient

A.5.1

Odds ratios

To account for the outcomes being compared as well as the values of the covariate, the odds ratio is generalised to the multinomial outcome setting as follows. Assume that the outcome labelled Y = 0 is the reference outcome. The subscript on the odds ratio indicates which outcome is being compared with the reference outcome. That is, the odds ratio of outcome j versus outcome 0 for covariate values x = a versus x = b is:

$$OR_j(a, b) = \frac{P(Y = j \mid x = a) / P(Y = 0 \mid x = a)}{P(Y = j \mid x = b) / P(Y = 0 \mid x = b)}. \tag{A17}$$
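As an illustration, the sketch below computes the odds ratio of outcome j versus the reference outcome from model probabilities, taking the reference-category probability as the denominator of the odds at both covariate values. For a single covariate this should agree with the exponential of the slope times (a − b), which follows from the logit being linear in x. The coefficient values and function names are hypothetical.

```python
import math


def category_probabilities(x, betas):
    """Baseline-category probabilities; x has a leading 1."""
    logits = [sum(b * v for b, v in zip(beta_j, x)) for beta_j in betas]
    denominator = 1.0 + sum(math.exp(g) for g in logits)
    return [1.0 / denominator] + [math.exp(g) / denominator for g in logits]


def odds_ratio(probs_a, probs_b, j):
    """Odds of outcome j versus outcome 0 at x = a, relative to x = b."""
    odds_a = probs_a[j] / probs_a[0]
    odds_b = probs_b[j] / probs_b[0]
    return odds_a / odds_b


# Hypothetical coefficients (intercept, slope) for outcomes 1 and 2.
betas = [[0.2, 0.7], [-0.5, 0.3]]
probs_a = category_probabilities([1.0, 1.0], betas)   # x = a = 1
probs_b = category_probabilities([1.0, 0.0], betas)   # x = b = 0
print(odds_ratio(probs_a, probs_b, 1), math.exp(0.7 * (1.0 - 0.0)))
```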


A.5.2

Confidence interval for $\hat\beta$

The general large-sample formula for a 100(1 − α)% confidence interval for the comparison of outcome level j versus the reference category, for any levels of the independent variable, is:

$$\exp\left[\hat\beta_j \pm z_{1-\alpha/2}\, SE(\hat\beta_j)\right], \quad \text{where } j = 0, 1, \ldots, J - 1. \tag{A18}$$
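A minimal sketch of the interval in (A18) follows, using the standard normal quantile from Python's `statistics.NormalDist`. The function name, the coefficient and the standard error are hypothetical.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(beta_hat, se, alpha=0.05):
    """Large-sample 100(1 - alpha)% interval on the odds-ratio scale."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    lower = math.exp(beta_hat - z * se)
    upper = math.exp(beta_hat + z * se)
    return lower, upper


# Made-up estimate and standard error for one coefficient.
lower, upper = odds_ratio_ci(0.7, 0.2)
print(lower, math.exp(0.7), upper)
```

An interval that excludes 1 on this scale corresponds to a coefficient interval that excludes 0.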

One of the important suggestions made by Borooah [4] about measuring goodness-of-fit is that one should report the maximised value of the log-likelihood function. Since the hypothesis that all the slopes in the model are zero is often of interest, the results of comparing the full model with an intercept-only model should also be reported (see [4]).

A.5.3

Likelihood ratio comparison tests

The LRT is used to test hypotheses about the significance of the predictor variables (including interaction terms). For MLR, J − 1 parameter estimates are tested simultaneously for each independent variable. The effect of individual explanatory variables, or groups of them, on the response can be assessed by comparing the deviance statistics (−2LL) of two nested models. The resulting statistic is tested for significance using the chi-square distribution with J − 1 degrees of freedom, comparing the reduced model (without the variables) with the full model (with the variables). The hypothesis tested is $H_0: \beta_j = 0$, j = 0, 1, …, J − 1; under $H_0$ there is no difference between the full (fitted) model and the reduced (intercept-only) model. The test statistic is

$$-2LL_{\text{diff}} = -2LL_R - (-2LL_{R+F}) \sim \chi^2, \tag{A19}$$

where R is the reduced nested model and R + F is the full model. The degrees of freedom are equal to the number of slope coefficients estimated. If $H_0$ is rejected, the conclusion is that at least one of the J − 1 coefficients is significantly different from zero.
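The test can be sketched as follows. The chi-square upper-tail probability is computed with a small standard-library helper that is valid for positive integer degrees of freedom; it relies on the closed forms for 1 and 2 degrees of freedom and the usual recurrence for the incomplete gamma function. The helper, the function names and the log-likelihood values are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable, integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        tail, a = math.exp(-half), 1.0            # df = 2
    else:
        tail, a = math.erfc(math.sqrt(half)), 0.5  # df = 1
    # Step up two degrees of freedom at a time.
    while a < df / 2.0:
        tail += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return tail


def likelihood_ratio_test(ll_reduced, ll_full, df):
    """Return the statistic -2LL_R - (-2LL_{R+F}) and its p-value."""
    statistic = -2.0 * ll_reduced - (-2.0 * ll_full)
    return statistic, chi2_sf(statistic, df)


# Made-up log-likelihoods; df is the number of slopes being tested.
statistic, p_value = likelihood_ratio_test(-100.0, -95.0, df=2)
print(statistic, p_value)
```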

A.5.4

Wald tests for $\hat\beta$

The alternative to the LRT is the Wald test, which tests the statistical significance of individual coefficients (to determine which logits are significantly affected by X). For MLR there are J − 1 coefficients to be tested for each independent variable, and the set of coefficients must be retained or dropped together. The hypothesis tested is $H_0: \beta_j = 0$, j = 0, 1, …, J − 1, that is, the null hypothesis that a given X has no effect on the odds of Y = j versus Y = 0. The Wald statistic may be calculated as:

$$z^2 = \left(\frac{\hat\beta_j}{SE(\hat\beta_j)}\right)^2 \sim \chi^2, \quad \text{where } j = 0, 1, \ldots, J - 1. \tag{A20}$$

Equation (A20) arises from the fact that the square of a standard normal variable follows the $\chi^2_1$ distribution with 1 degree of freedom (the sum of two squared independent standard normal variables would follow a $\chi^2_2$ distribution with 2 degrees of freedom, and so on). Alternatively, the unsquared ratio follows the standard normal distribution. Thus:

$$z = \frac{\hat\beta_j}{SE(\hat\beta_j)}, \quad \text{where } j = 0, 1, \ldots, J - 1. \tag{A21}$$

Equation (A21) is parallel to the t-ratio for coefficients in linear regression. Therefore, the test has one degree of freedom (the number of restrictions).
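A short sketch of the Wald test for a single coefficient is given below. It returns the z-ratio, its square, and a two-sided p-value; because a squared standard normal variable is chi-square with 1 degree of freedom, the same p-value applies to both forms of the statistic. The function name and the inputs are hypothetical.

```python
import math


def wald_test(beta_hat, se):
    """Return (z, z squared, two-sided p-value) for one coefficient."""
    z = beta_hat / se
    # P(|Z| > |z|) for standard normal Z, equal to the chi-square(1)
    # upper tail evaluated at z squared.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, z * z, p_value


# Made-up coefficient estimate and standard error.
z, z_squared, p = wald_test(0.7, 0.2)
print(z, z_squared, p)
```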
