International Journal of Mathematics and Physical Sciences Research ISSN 2348-5736 (Online) Vol. 3, Issue 1, pp: (48-61), Month: April 2015 - September 2015, Available at: www.researchpublish.com

Multiple Logistic Regressions Modeling on Risk Factors of Diabetes: Case Study of GITWE Hospital (2011-2013)

1Niyikora Sylvere, 2Dr. Joseph K. Mung'atu, 3Dr. Marcel Ndengo

1 Student at Jomo Kenyatta University of Agriculture and Technology (JKUAT/Kigali Campus), Master of Science in Applied Statistics
2 Lecturer at Jomo Kenyatta University of Agriculture and Technology, Kenya
3 Lecturer at Jomo Kenyatta University of Agriculture and Technology, Kigali

Abstract: The number of people with diabetes is increasing all over the world. Diabetes is often mistakenly regarded as a disease of urban areas, although rural areas are also affected; this misconception motivates the study. In this paper, a multiple logistic regression model is used to fit the risk factors of diabetes, using data from Gitwe Hospital covering a three-year period (2011 to 2013). A test of independence between the dependent variable (diabetes) and each independent variable is performed. Older age, alcohol consumption, cholesterol level, occupation status and hypertension are found to be associated with having diabetes, whereas gender, smoking and family history of diabetes show negligible association. A multiple logistic regression model containing all the predictor variables is then fitted and the significance of its coefficients is tested. The Wald test shows that the significant predictors are older age, occupation status, alcohol consumption, cholesterol level and hypertension, while gender, smoking and family history of diabetes are not statistically significant. The odds ratios indicate that older persons, patients who consume alcohol, patients with a high cholesterol level and hypertensive persons are highly susceptible to diabetes. Finally, a multiple logistic regression model with only the significant predictors is fitted. Based on the Receiver Operating Characteristic (ROC) curves and the overall explanatory strength of the two models, the conclusion is that the reduced model fits the data better than the model with all predictor variables.

Keywords: Generalized model, Logistic regression, Diabetes, risk factors.

1. INTRODUCTION

1.1. Problem statement: The number of people with diabetes is increasing due to population growth, aging, urbanization, and the increasing prevalence of obesity and physical inactivity. Quantifying the prevalence of diabetes and the number of people affected by it, now and in the future, is important to allow rational planning and allocation of resources (Sarah et al., 2004). According to Shaw et al. (2010), the world prevalence of diabetes in 2010 among adults aged 20-79 years was estimated at 6.4%, affecting 285 million adults. Between 2010 and 2030, a 70% increase in the number of adults with diabetes is expected in developing countries and a 20% increase in developed countries. Each year more than 231,000 people in the United States and more than 3.96 million people worldwide die from diabetes and its complications (IDF, 2009), and this number is expected to increase by more than 50 percent over the next decade.

Page | 48 Research Publish Journals

Estimated global healthcare expenditure to treat and prevent diabetes and its complications was at least 376 billion US dollars (USD) in 2010. By 2030, this number is projected to exceed 490 billion USD. Environmental and lifestyle factors are the main causes of the dramatic increase in type 2 diabetes prevalence. Genetic factors probably identify those most vulnerable to these changes (IDF, 2010). Diabetes is now truly a pandemic, and its effects are particularly severe in low and middle income countries. The following table shows the situation of diabetes over the world in 2013 and the projected percentage increase in the number of people with diabetes by 2035:

Table 1.1: Number of people with diabetes by IDF regions, 2013 and projection in 2035

Region | Number of people with diabetes in 2013 (millions) | Predicted percentage increase by 2035
North America and Caribbean | 37 | 37.3%
South and Central America | 24 | 59.8%
Europe | 56 | 22.4%
Middle East and North Africa | 35 | 96.2%
Sub-Saharan Africa | 20 | 109.1%
South East Asia | 72 | 70.6%
Western Pacific | 138 | 46%

Source: IDF Atlas 2013, page 11 to page 12

Here are some basic facts about diabetes worldwide, according to IDF (2013):
1. Each year the number of people with diabetes in the world increases by 7 million.
2. By 2035, about 592 million people will have diabetes, up from 382 million in 2013.
3. More than 79,000 children developed type 1 diabetes.
4. During 2013, diabetes killed about 5.1 million adults worldwide.
5. Diabetes leads to complications and severe disabilities, including kidney disease, blindness, heart attack, stroke and neural damage leading to amputation and the need for chronic care.
6. The trend in 2013 revealed three new cases every 10 seconds.
7. More than 80% of spending on medical care for diabetes occurs in the world's richest countries, even though 80% of the people with diabetes live in low and middle income countries, where 76% of the burden lies.
8. The burden of illness caused by diabetes and the reduction in life expectancy in sub-Saharan Africa will hinder the region's economic growth.
9. Diabetes caused at least 548 billion USD in health expenditure in 2013 (11% of the total health spending on adults), and this amount is predicted to reach 627 billion USD in 2035.
10. More than 21 million live births were affected by diabetes during pregnancy in 2013.

Concerning Sub-Saharan Africa, Mbanya (2009) says: “Soon, four out of every five people with diabetes will live in developing countries. And the men and women most affected are of working age – the breadwinners of their families.” Diabetes was once considered a rare disease in sub-Saharan Africa. But in that part of the world, 12.1 million adults were estimated to have diabetes in 2010, and by 2030 it is estimated that 23.9 million adults in sub-Saharan Africa will have diabetes. Data from 2010 on the condition of people with diabetes in sub-Saharan Africa and the complications they suffer are very scarce. According to Ayesha Motala et al. (2010), it was estimated that at least:
1. 4.51 million people had eye complications.
2. 2.23 million people needed dialysis because of kidney damage.
3. 907,500 people had cardiovascular disease.


4. 423,500 people were blind because of diabetes.
5. 399,300 people had cerebrovascular disease.
6. 169,400 people had lost a foot because of amputation.

Concerning Rwanda, the number of deaths due to diabetes in 2013 was estimated at 5,464. In that same year, the prevalence in adults (20-79 years) was 4.38% and the total number of people living with diabetes was estimated at 234,000 (IDF, 2013).

1.2. Research objectives and hypothesis: This study has the following objectives:
1. To test for association between diabetes and the risk factors (older age, gender, smoking, occupation status, alcohol consumption, cholesterol level, hypertension, and family history of diabetes).
2. To fit a multiple logistic model on the incidence of diabetes given the risk factors (older age, gender, smoking, occupation status, alcohol consumption, cholesterol level, hypertension, and family history of diabetes).

The following hypotheses were formulated in order to achieve the above objectives:

$H_0$: There is no association between having diabetes and risk factors such as age, gender, smoking, occupation status, alcohol consumption, cholesterol level, hypertension, and family history of diabetes. To test this hypothesis, the chi-square test of independence is used.

And $H_0$: $\beta_j = 0$ (that is, the coefficient $\beta_j$ in the fitted multiple logistic regression is not statistically significant). To test this hypothesis, the Wald test is used.

2. METHODOLOGY

2.1. GENERALISED LINEAR MODELS (GLMs) AND LOGISTIC REGRESSION: The logistic regression model is an example of a broad class of models known as Generalized Linear Models (GLMs). GLMs also include, for example, linear regression, ANOVA and Poisson regression. There are three components to a Generalized Linear Model:

- Random Component: The random component of a Generalized Linear Model identifies the response variable $Y$ and selects a probability distribution for it. Denote the observations on $Y$ by $(Y_1, \ldots, Y_n)$. Standard GLMs treat $Y_1, \ldots, Y_n$ as independent.

- Systematic Component: The systematic component of a GLM specifies the explanatory variables. These enter linearly as predictors on the right-hand side of the model equation. That is, the systematic component specifies the variables that are the $x_j$ in the expression:

$$\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$

This linear combination of the explanatory variables is called the linear predictor.

- Link Function: Let us denote the expected value of $Y$, the mean of its probability distribution, by $\mu = E(Y)$. The third component of a GLM, the link function, specifies a function $g(\cdot)$ that relates $\mu$ to the linear predictor as:

$$g(\mu) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$

The link function $g(\cdot)$ connects the random and systematic components.

2.2. LOGISTIC REGRESSION MODEL: 2.2.1. Introduction: In general, the logistic regression model is used to model the outcomes of a categorical dependent variable.

Logistic regression determines the impact of multiple independent variables presented simultaneously to predict membership of one or other of the two dependent variable categories. Logistic regression is the most popular multivariable method used in health science (Tetrault et al., 2008).

2.2.2. Binary Logistic Regression with single independent variable: Many categorical response variables have only two categories. Denote a binary response variable by $Y$ and its two possible outcomes by 1 (“success”) and 0 (“failure”). The distribution of $Y$ is specified by the probabilities $P(Y=1) = \pi$ of success and $P(Y=0) = 1-\pi$ of failure. Its mean is $E(Y) = \pi$.

For $n$ independent observations, the number of successes has the binomial distribution specified by the index $n$ and parameter $\pi$. Although Generalized Linear Models can have multiple explanatory variables, let us start by introducing only one independent variable $x$. The value of $\pi$ can vary as the value of $x$ changes, and $\pi$ is replaced by $\pi(x)$ to describe that dependence on $x$.

Relationships between $\pi(x)$ and $x$ are usually nonlinear rather than linear. In the logistic regression model, the random component for the (success, failure) outcomes has a binomial distribution. The link function is the logit function $\log[\pi/(1-\pi)]$, which is defined as the log of the odds of success and symbolized by “logit(π)”. Logistic regression models are often called logit models. Whereas $\pi$ is restricted to the range [0,1], the logit can be any real number. The model is:

$$\operatorname{logit}[\pi(x)] = \log\left(\frac{\pi(x)}{1-\pi(x)}\right) = \beta_0 + \beta_1 x \qquad (2.1)$$

From equation (2.1), we deduce:

$$\pi(x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)} \qquad (2.2)$$
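As an illustration of how the logit link in equation (2.1) and its inverse in equation (2.2) behave, the following minimal Python sketch may help. The coefficient values `beta0` and `beta1` are hypothetical and chosen only for demonstration; they are not estimates from this study.

```python
import math

def logit(p):
    """Log of the odds of success, the left-hand side of equation (2.1)."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Probability of success for a given linear predictor, as in equation (2.2)."""
    return math.exp(eta) / (1.0 + math.exp(eta))

# Hypothetical coefficients, for illustration only.
beta0, beta1 = -2.0, 0.5

for x in (0, 2, 4, 6, 8):
    pi_x = inv_logit(beta0 + beta1 * x)
    # pi_x always lies strictly between 0 and 1, while logit(pi_x)
    # recovers the unbounded linear predictor beta0 + beta1 * x.
    print(x, round(pi_x, 4), round(logit(pi_x), 4))
```

The probability stays inside (0, 1) for every value of x, while the logit scale is unbounded, which is why the linear predictor is placed on the logit scale.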

2.2.3. Interpretation of regression coefficients: Consider the case in which the dependent variable $Y$ may take only the values 1 (for success) and 0 (for failure), with a single independent variable $x$. In this case, the logistic regression equation is:

$$\log\left(\frac{\pi(x)}{1-\pi(x)}\right) = \beta_0 + \beta_1 x$$

as given in equation (2.1).

Now, suppose we consider the impact of a unit increase in $x$. The logistic regression equation becomes:

$$\log\left(\frac{\pi(x+1)}{1-\pi(x+1)}\right) = \beta_0 + \beta_1 (x+1)$$

Subtracting equation (2.1) from this equation, we get:

$$\log\left(\frac{\pi(x+1)}{1-\pi(x+1)}\right) - \log\left(\frac{\pi(x)}{1-\pi(x)}\right) = \beta_1$$

to arrive at:

$$\beta_1 = \log\left(\frac{\pi(x+1)/[1-\pi(x+1)]}{\pi(x)/[1-\pi(x)]}\right)$$

That is, $\beta_1$ is the log of the ratio of the odds at $x+1$ and $x$.

Which may also be written as:

$$e^{\beta_1} = \frac{\pi(x+1)/[1-\pi(x+1)]}{\pi(x)/[1-\pi(x)]}$$

The regression coefficient $\beta_1$ is interpreted as the log of the odds ratio comparing the odds after a one unit increase in $x$ to the original odds.
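The interpretation above can be checked numerically. The sketch below, which uses hypothetical coefficients rather than values from this study, computes the odds at x and at x + 1 and shows that their ratio equals exp(beta1) whatever the starting value of x.

```python
import math

def odds(x, beta0, beta1):
    """Odds of success pi(x) / (1 - pi(x)) under the single-predictor logit model."""
    pi_x = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
    return pi_x / (1.0 - pi_x)

# Hypothetical coefficients, for illustration only.
beta0, beta1 = -1.0, 0.7

def odds_ratio(x):
    """Ratio of the odds after a one-unit increase in x to the original odds."""
    return odds(x + 1.0, beta0, beta1) / odds(x, beta0, beta1)

for x in (0.0, 1.0, 2.5):
    # Each line prints the same ratio, equal to exp(beta1).
    print(x, round(odds_ratio(x), 6), round(math.exp(beta1), 6))
```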

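2.2.4. Multiple Logistic Regression: 2.2.4.1. The model: Let us consider the general logistic regression model with multiple explanatory variables. Denote the $k$ predictors for a binary response $Y$ by $x_1, x_2, \ldots, x_k$, collected in the vector $x$. We use $\pi(x)$ to represent the probability that $Y = 1$, and $1 - \pi(x)$ to represent the probability that $Y = 0$.

These probabilities are written in the following form:

$$P(Y = 1 \mid x_1, \ldots, x_k) = \pi(x), \qquad P(Y = 0 \mid x_1, \ldots, x_k) = 1 - \pi(x)$$

The model for the log odds is:

$$\log\left(\frac{\pi(x)}{1-\pi(x)}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k = \beta_0 + \sum_{j=1}^{k} \beta_j x_j$$

which yields:

$$\pi(x) = \frac{\exp\left(\beta_0 + \sum_{j=1}^{k} \beta_j x_j\right)}{1 + \exp\left(\beta_0 + \sum_{j=1}^{k} \beta_j x_j\right)} \qquad (2.9)$$

The parameter $\beta_j$ refers to the effect of $x_j$ on the log odds that $Y = 1$, controlling for the other predictor variables. For example, $\exp(\beta_1)$ is the multiplicative effect on the odds of a one-unit increase in $x_1$, at fixed levels of the other predictor variables.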

Thus we have constructed a logistic regression model that bounds the conditional mean between 0 and 1.

2.2.4.2. The Parameters estimation: The goal of logistic regression is to estimate the $k+1$ unknown parameters $\beta_0, \beta_1, \ldots, \beta_k$ in equation (2.9). This is done with maximum likelihood estimation, which entails finding the set of parameters for which the probability of the observed data is greatest. The maximum likelihood equation is derived from the binomial distribution of the dependent variable. For an observation $(x_i, y_i)$ in the data, where $x_i = (x_{i1}, \ldots, x_{ik})$, the contribution to the likelihood function is $\pi(x_i)$ where $y_i = 1$, and $1 - \pi(x_i)$ where $y_i = 0$. The following equation results for the contribution (call it $\zeta(x_i)$) to the likelihood function for the observation $(x_i, y_i)$:

$$\zeta(x_i) = \pi(x_i)^{y_i}\,[1 - \pi(x_i)]^{1 - y_i} \qquad (2.10)$$

Equation (2.10) accounts for only one observation. The observations are assumed to be independent of each other, so we can multiply their likelihood contributions to obtain the complete likelihood function. The result is given in equation (2.11):

$$L(\beta) = \prod_{i=1}^{n} \zeta(x_i) = \prod_{i=1}^{n} \pi(x_i)^{y_i}\,[1 - \pi(x_i)]^{1 - y_i} \qquad (2.11)$$

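Note that the log-odds model and equation (2.9) give respectively:

$$\frac{\pi(x_i)}{1 - \pi(x_i)} = \exp\left(\beta_0 + \sum_{j=1}^{k} \beta_j x_{ij}\right) \quad \text{and} \quad 1 - \pi(x_i) = \frac{1}{1 + \exp\left(\beta_0 + \sum_{j=1}^{k} \beta_j x_{ij}\right)}$$

This leads us to write equation (2.11) as:

$$L(\beta) = \prod_{i=1}^{n} \left(\frac{\pi(x_i)}{1 - \pi(x_i)}\right)^{y_i} [1 - \pi(x_i)] = \prod_{i=1}^{n} \exp\left(y_i \Big(\beta_0 + \sum_{j=1}^{k} \beta_j x_{ij}\Big)\right) \left(1 + \exp\Big(\beta_0 + \sum_{j=1}^{k} \beta_j x_{ij}\Big)\right)^{-1} \qquad (2.12)$$

In equation (2.12), $\beta = (\beta_0, \beta_1, \ldots, \beta_k)$ is the collection of parameters and $L(\beta)$ is the likelihood function of $\beta$. The maximum likelihood estimates (MLEs) $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k$ can be obtained by calculating the $\beta$ which maximizes $L(\beta)$. However, to simplify the mathematics, let us take the logarithm of equation (2.12). As shown in equation (2.13), $\ell(\beta)$ denotes the log likelihood expression.

$$\ell(\beta) = \log L(\beta) = \sum_{i=1}^{n} \left[ y_i \Big(\beta_0 + \sum_{j=1}^{k} \beta_j x_{ij}\Big) - \log\left(1 + \exp\Big(\beta_0 + \sum_{j=1}^{k} \beta_j x_{ij}\Big)\right) \right] \qquad (2.13)$$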
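As a numerical check of the step from the product form of the likelihood in equation (2.11) to the log likelihood in equation (2.13), the following sketch evaluates both on a tiny data set. The observations and the parameter vector are hypothetical and serve only to illustrate that the logarithm of the product equals the sum form.

```python
import math

def linear_predictor(x, beta):
    """beta[0] + sum_j beta[j] * x[j-1] for one observation x."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))

def likelihood(xs, ys, beta):
    """Product of pi^y * (1 - pi)^(1 - y) over observations, as in (2.11)."""
    value = 1.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-linear_predictor(x, beta)))
        value *= (p ** y) * ((1.0 - p) ** (1 - y))
    return value

def log_likelihood(xs, ys, beta):
    """Sum of y * eta - log(1 + exp(eta)) over observations, as in (2.13)."""
    total = 0.0
    for x, y in zip(xs, ys):
        eta = linear_predictor(x, beta)
        total += y * eta - math.log(1.0 + math.exp(eta))
    return total

# Hypothetical data: two binary predictors and a binary response.
xs = [(0, 1), (1, 0), (1, 1), (0, 0)]
ys = [0, 1, 1, 0]
beta = (-0.5, 1.0, 0.3)  # hypothetical parameter values

print(math.log(likelihood(xs, ys, beta)), log_likelihood(xs, ys, beta))
```

Working with the sum form is also numerically safer in practice, since a product of many probabilities quickly underflows.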

The critical points of a function (maxima and minima) occur where the first derivative equals 0. If the second derivative evaluated at that point is less than zero, then the critical point is a maximum. Thus, finding the maximum likelihood estimates requires computing the first derivatives of the log likelihood function $\ell(\beta) = \log L(\beta)$. Differentiating equation (2.13) with respect to $\beta_0$, we get:

$$\frac{\partial \ell(\beta)}{\partial \beta_0} = \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \frac{\exp\left(\beta_0 + \sum_{m=1}^{k} \beta_m x_{im}\right)}{1 + \exp\left(\beta_0 + \sum_{m=1}^{k} \beta_m x_{im}\right)} = \sum_{i=1}^{n} [y_i - \pi(x_i)] \qquad (2.14)$$

Also, differentiating equation (2.13) with respect to $\beta_j$, $j = 1, \ldots, k$, we get:

$$\frac{\partial \ell(\beta)}{\partial \beta_j} = \sum_{i=1}^{n} x_{ij}\, y_i - \sum_{i=1}^{n} x_{ij} \frac{\exp\left(\beta_0 + \sum_{m=1}^{k} \beta_m x_{im}\right)}{1 + \exp\left(\beta_0 + \sum_{m=1}^{k} \beta_m x_{im}\right)} = \sum_{i=1}^{n} x_{ij}\,[y_i - \pi(x_i)] \qquad (2.15)$$

The maximum likelihood estimates $\hat\beta_0$ and $\hat\beta_j$ for $j = 1, \ldots, k$ can be found by setting equations (2.14) and (2.15) respectively equal to zero and solving for each $\beta_j$. That is, solving

$$\sum_{i=1}^{n} [y_i - \pi(x_i)] = 0$$

and

$$\sum_{i=1}^{n} x_{ij}\,[y_i - \pi(x_i)] = 0, \qquad j = 1, \ldots, k$$
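The likelihood equations are nonlinear in the parameters and are solved iteratively. The sketch below shows one common approach, Newton-Raphson, for a model with a single predictor; it is not the procedure used by the authors' software, and the data are hypothetical. With a binary predictor the exact solution is known (the fitted probabilities equal the observed group proportions), which makes the result easy to verify.

```python
import math

def fit_logistic(xs, ys, iterations=25):
    """Newton-Raphson solution of the likelihood equations for
    logit[pi(x)] = b0 + b1 * x with a single predictor."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0            # score: sum(y - pi) and sum(x * (y - pi))
        h00 = h01 = h11 = 0.0    # information matrix entries
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += x * (y - p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^{-1} times the score vector.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data: 1 success out of 4 when x = 0, 3 out of 4 when x = 1.
xs = [0, 0, 0, 0, 1, 1, 1, 1]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(xs, ys)
print(round(b0, 4), round(b1, 4))
```

For these data the estimates converge to b0 = log(1/3) and b1 = 2 log(3), the logit of the success proportion at x = 0 and the difference in logits between the two groups.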

These likelihood equations have no closed-form solution; solving them requires statistical software packages.

2.3. Sample size and sampling procedure: Gitwe Hospital and the three years 2011, 2012 and 2013 were purposively selected according to the objectives of the study. The target population of the study comprises a total of 311 patients from Gitwe Hospital (2011-2013), distributed across the following six sectors:

Table 2.1: Number of patients by sector

Sector | Number of patients
Ruhango | 18
Kabagali | 46
Mukingo | 43
Kinihira | 35
Bweramana | 166
Busoro | 3
TOTAL | 311

Source: Researcher, March 2015.

The sample size for patients is determined using the Yamane (1967) formula, which is:

$$n = \frac{N}{1 + N e^2} \qquad (2.18)$$

where $N$ is the population size and $e$ is the precision level. In our case study, the total number of patients' folders ($N$) is 311. Then, by equation (2.18), the sample size is $n = 175$.

Table 2.2: Calculation of sample size by Sector

Sector | Percentage for each sector | Sample size by sector
Ruhango:
Kabagali:
Mukingo:
Kinihira:
Bweramana:
Busoro:
Total sample size: n = 175

Source: Researcher, March 2015
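The sample size calculation can be sketched as follows. Two assumptions are made here that the text does not state explicitly: the precision level is taken as e = 0.05, which is the value consistent with the reported n = 175 for N = 311, and the sample is allocated to sectors in proportion to their share of the population, as the column headings of Table 2.2 suggest. The per-sector figures printed by this sketch are therefore illustrative, not the authors' reported values.

```python
def yamane(population_size, precision):
    """Yamane (1967) sample size: n = N / (1 + N * e^2)."""
    return population_size / (1.0 + population_size * precision ** 2)

# Patients per sector, from Table 2.1.
sectors = {
    "Ruhango": 18,
    "Kabagali": 46,
    "Mukingo": 43,
    "Kinihira": 35,
    "Bweramana": 166,
    "Busoro": 3,
}

N = sum(sectors.values())
n = round(yamane(N, 0.05))  # e = 0.05 is an assumption, see lead-in

# Assumed proportional allocation: each sector's share of N applied to n.
allocation = {name: round(n * count / N) for name, count in sectors.items()}

print(N, n)
for name, size in allocation.items():
    print(name, round(100.0 * sectors[name] / N, 1), size)
```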

Systematic random sampling was then used to select the patients' folders included in the sample for each Sector.

3. RESULTS PRESENTATION

3.1. Chi-square test of association between the dependent and independent variables:

Table 3.1: Chi-square test results

Factor | p-value | Conclusion
Older age | .000 | There is statistical evidence of an association between age and the outcome of diabetes.
Gender | .899 | There is no statistical evidence of an association between gender and the outcome of diabetes.
Occupation status | .001 | The outcome of diabetes is statistically associated with occupation status.
Smoking | .679 | The outcome of diabetes is not statistically associated with smoking.
Alcohol consumption | .027 | The outcome of diabetes is statistically associated with alcohol consumption.
Cholesterol level | .001 | There is statistical evidence of an association between the outcome of diabetes and the cholesterol level.
Hypertension | .000 | The outcome of diabetes is statistically associated with hypertension.
Family history of diabetes | .289 | The outcome of diabetes is not statistically associated with a family history of diabetes.

Source: Researcher, March 2015
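For readers who wish to see how a chi-square test of independence such as those summarized in Table 3.1 is computed, a minimal sketch for a 2x2 table follows. The cell counts are hypothetical, since the paper reports only the p-values, and the sketch uses the Pearson statistic without continuity correction, which may differ from the option used in the authors' software.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.
    Returns the statistic and its p-value (1 degree of freedom)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts: rows = risk factor (yes, no), columns = diabetes (yes, no).
table = [[30, 20], [10, 40]]
stat, p_value = chi_square_2x2(table)
print(round(stat, 3), p_value)
```

A p-value below 0.05 leads to rejecting the hypothesis of no association between the factor and the outcome.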

Table 3.1 reveals that the risk factors with a statistically significant association with the outcome of diabetes are older age, occupation status, alcohol consumption, cholesterol level and hypertension. The other factors (gender, smoking and family history of diabetes) have no statistically significant association with the outcome of the disease.

3.2. Multiple logistic regression model fitting: 3.2.1. The fitted model with all the predictor covariates:

Table 3.2: The estimated coefficients, their standard errors, and Wald test for the full model

Parameter | B | Std. Error | Wald | Sig.
Intercept | -47.549 | 13.853 | 11.781 | .001
Age of patient | 1.142 | .259 | 19.459 | .000
Gender of patient | .143 | .408 | .123 | .726
Occupation Status | -1.208 | .397 | 9.252 | .002
Smoking | -.640 | .678 | 0.891 | .345
Alcohol consumption | .818 | .387 | 4.466 | .035
Cholesterol level | .991 | .394 | 6.324 | .012
Hypertension | 1.028 | .391 | 6.914 | .009
Family history of diabetes | .482 | .458 | 1.106 | .293

Source: Researcher, March 2015
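The Wald statistic in Table 3.2 is the squared ratio of a coefficient to its standard error, compared against a chi-square distribution with one degree of freedom. The short sketch below reproduces the alcohol consumption row from the rounded B and Std. Error values in the table; small differences from the printed figures are due to that rounding.

```python
import math

def wald_test(coefficient, std_error):
    """Wald chi-square statistic (1 df) and its p-value for H0: beta = 0."""
    wald = (coefficient / std_error) ** 2
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return wald, p_value

# Alcohol consumption row of Table 3.2: B = .818, Std. Error = .387.
wald, p_value = wald_test(0.818, 0.387)
print(round(wald, 3), round(p_value, 3))
```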

Table 3.2 displays the parameter estimates in the B column, together with their standard errors and Wald tests. Using the estimates of the parameters in Table 3.2, we get the following model:

$$\log\left(\frac{\hat\pi}{1-\hat\pi}\right) = -47.549 + 1.142\,\text{Age} + 0.143\,\text{Gender} - 1.208\,\text{Occupation} - 0.640\,\text{Smoking} + 0.818\,\text{Alcohol} + 0.991\,\text{Cholesterol} + 1.028\,\text{Hypertension} + 0.482\,\text{FamilyHistory} \qquad (3.1)$$

3.2.2. Testing for the significance of the individual parameters in the model: To test the hypothesis $H_0: \beta_j = 0$ versus $H_1: \beta_j \neq 0$, consider the Wald and Sig. columns of Table 3.2. The table reveals that the significant predictors are: older age (p-value = 0.000 < 0.05), occupation status (p-value = 0.002 < 0.05), alcohol consumption (p-value = 0.035 < 0.05), cholesterol level (p-value = 0.012 < 0.05) and hypertension (p-value = 0.009 < 0.05). On the other hand, the predictors which are not statistically significant are: gender (p-value = 0.726 > 0.05), smoking (p-value = 0.345 > 0.05) and family history of diabetes (p-value = 0.293 > 0.05).

3.2.3. Signs of coefficients analysis: The sign of each coefficient of the estimated logistic function in Table 3.2 indicates the direction of the effect of the corresponding explanatory variable, as given in Table 3.3.

Table 3.3: The sign analysis

Covariate | Codes | Sign | Explanation
Older age | 1 = Old, 0 = Young | Positive | Older age increases the probability of having diabetes.
Gender | 1 = Male, 0 = Female | Positive | Being male increases the probability of having diabetes.
Occupation status | 1 = Employed, 0 = Unemployed | Negative | Being employed decreases the probability of having diabetes.
Smoking | 1 = No, 0 = Yes | Negative | Not smoking decreases the probability of having diabetes.
Alcohol consumption | 1 = Yes, 0 = No | Positive | Consumption of alcohol increases the probability of having diabetes.
Cholesterol level | 1 = High, 0 = Low | Positive | A high cholesterol level increases the probability of having diabetes.
Hypertension | 1 = Yes, 0 = No | Positive | Being hypertensive increases the probability of having diabetes.
Family history of diabetes | 1 = Yes, 0 = No | Positive | Having a family history of diabetes increases the probability of getting the disease.

Source: Researcher, March 2015

3.2.4. The odds ratio results: The Exp(B) column contains the exponentials of the parameter estimates. These values are the odds ratios for the corresponding predictor variables. In Table 3.4 below, the 95% Wald confidence limits give the confidence interval (CI) for each odds ratio.
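The relationship between a coefficient, its odds ratio and a 95% Wald confidence interval can be sketched as below. The example uses the rounded alcohol consumption estimate from Table 3.2 (B = .818, Std. Error = .387); the standard normal quantile 1.96 is the usual choice for a 95% interval, and the exact settings of the authors' software are not stated.

```python
import math

def odds_ratio_ci(coefficient, std_error, z=1.96):
    """Odds ratio exp(B) with Wald limits exp(B - z * SE) and exp(B + z * SE)."""
    estimate = math.exp(coefficient)
    lower = math.exp(coefficient - z * std_error)
    upper = math.exp(coefficient + z * std_error)
    return estimate, lower, upper

# Alcohol consumption row of Table 3.2.
estimate, lower, upper = odds_ratio_ci(0.818, 0.387)
print(round(estimate, 3), round(lower, 3), round(upper, 3))
```

An interval that excludes 1 corresponds to a coefficient that is significant at the 5% level under the Wald test.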

Table 3.4: Odds Ratios and 95% Confidence Intervals for Covariates

Variable | Exp(B) | 95% CI for Exp(B): Lower Bound | 95% CI for Exp(B): Upper Bound
Age of patient | 3.133 | 1.537 | 7.287
Gender | 1.154 | .519 | 2.566
Occupation status | 0.299 | .192 | .530
Smoking | 0.527 | .140 | 1.991
Alcohol consumption | 2.266 | 1.061 | 4.840
Cholesterol level | 2.694 | 1.244 | 5.832
Hypertension | 2.796 | 1.299 | 6.017
Have a Family History of diabetes | 1.619 | .660 | 3.976

Source: Researcher, March 2015

From Table 3.4, patients of older age, patients who consume alcohol, persons with a high cholesterol level and hypertensive persons are highly susceptible to diabetes: for each of these factors the odds ratio exceeds 1 and its 95% confidence interval excludes 1.

3.2.5. The full model assessment:

Table 3.5: Likelihood ratio test

Model | -2 Log Likelihood (model fitting criteria) | Likelihood ratio test: Chi-Square | df | Sig.
Intercept only | 204.670 | | |
Final | 144.209 | 60.461 | 8 | .000

Source: Researcher, March 2015
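The likelihood ratio statistic in Table 3.5 can be reproduced from the two -2 log likelihood values. The sketch below also computes the p-value using the closed-form survival function of the chi-square distribution, which is available when the degrees of freedom are even (here df = 8, one per predictor). The helper name `chi2_sf_even_df` is introduced only for this illustration.

```python
import math

def chi2_sf_even_df(x, df):
    """P(X > x) for a chi-square variable with an even number of degrees
    of freedom: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

# -2 log likelihood values from Table 3.5.
g_statistic = 204.670 - 144.209
p_value = chi2_sf_even_df(g_statistic, 8)
print(round(g_statistic, 3), p_value)
```

The resulting p-value is far below 0.05, in line with the .000 reported in the Sig. column.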

Table 3.5 displays the likelihood ratio test. The -2 log likelihood obtained by fitting the constant-only model is 204.670, and the -2 log likelihood for the full model is 144.209. Thus the value of the likelihood ratio test statistic is G = 204.670 - 144.209 = 60.461. The null hypothesis is:

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_8 = 0$$

against the alternative $H_1$ that at least one regression coefficient is not equal to zero. The small p-value (0.000 < 0.05) leads us to reject $H_0$ in favor of $H_1$, and we conclude that at least one, and perhaps all, of the beta coefficients differ from zero.

Table 3.6: Classification table for the model with all predictor variables.

Observed | Predicted No | Predicted Yes | Percent correct
No | 43 | 23 | 65.2%
Yes | 16 | 93 | 85.3%
Overall percentage | 33.7% | 66.3% | 77.7%

Source: Researcher, March 2015

From Table 3.6, we conclude that 65.2% of all patients who do not have diabetes are correctly classified and 34.8% are incorrectly classified, while 85.3% of all patients who have diabetes are correctly classified and 14.7% are incorrectly classified. The overall correct percentage is 77.7%, which reflects the model's overall explanatory strength.
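The percentages quoted from Table 3.6 follow directly from the four cell counts of the classification table. The sketch below recomputes them; the terms specificity and sensitivity are the standard names for the percent correct among non-diabetic and diabetic patients respectively.

```python
def classification_summary(tn, fp, fn, tp):
    """Specificity, sensitivity and overall accuracy from a 2x2 classification table."""
    total = tn + fp + fn + tp
    return {
        "specificity": tn / (tn + fp),   # non-diabetic patients correctly classified
        "sensitivity": tp / (tp + fn),   # diabetic patients correctly classified
        "accuracy": (tn + tp) / total,   # overall percentage correct
    }

# Counts from Table 3.6 (full model).
summary = classification_summary(tn=43, fp=23, fn=16, tp=93)
for name, value in summary.items():
    print(name, round(100.0 * value, 1))
```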


Source: Researcher, March 2015
Figure 3.1: Receiver Operating Characteristic (ROC) curve for the full model

The ROC curve in Figure 3.1 is used to assess classification accuracy. The area under the ROC curve, which ranges from 0 to 1, measures the model's ability to discriminate between subjects who experience the response of interest and those who do not. The area under the ROC curve for the full model is 0.825, which may be considered reasonable discrimination.

Table 3.7: Hosmer and Lemeshow Test

Step | Chi-square | df | Sig.
1 | 7.037 | 8 | .533

Source: Researcher, March 2015

Table 3.7 gives the Hosmer and Lemeshow test output from SPSS 15.0. The Hosmer-Lemeshow statistic has a significance of 0.533, so it is not statistically significant and we fail to reject the null hypothesis that there is no difference between observed and model-predicted values. This implies that the model's estimates fit the data at an acceptable level.

3.2.6. The model with significant parameters only: The next step is to fit the model with the statistically significant parameters only (age, occupation status, alcohol consumption, cholesterol level and hypertension). Results are summarized in Table 3.8.

Table 3.8: Summarized results for the reduced model

Parameter | B | Std. Error | Wald | Sig. | Exp(B) | 95% CI for Exp(B): Lower Bound | 95% CI for Exp(B): Upper Bound
Intercept | -44.679 | 9.170 | 23.742 | .000 | | |
Age of patient | 1.089 | .249 | 19.217 | .000 | 2.971 | 1.568 | 7.241
Occupation Status | -1.215 | .390 | 9.685 | .002 | 0.297 | .207 | .548
Alcohol consumption | .825 | .381 | 4.682 | .030 | 2.282 | 1.081 | 4.818
Cholesterol level | .962 | .386 | 6.223 | .013 | 2.616 | 1.229 | 5.570
Hypertension | 1.043 | .386 | 7.318 | .007 | 2.838 | 1.333 | 6.041

Source: Researcher, March 2015

From Table 3.8, the reduced model is written as follows:

$$\log\left(\frac{\hat\pi}{1-\hat\pi}\right) = -44.679 + 1.089\,\text{Age} - 1.215\,\text{Occupation} + 0.825\,\text{Alcohol} + 0.962\,\text{Cholesterol} + 1.043\,\text{Hypertension} \qquad (3.2)$$

The results above indicate that older patients are more susceptible to developing diabetes; an employed person is less susceptible; consuming alcohol increases the susceptibility; persons with a high cholesterol level are more susceptible than those with a low cholesterol level; and hypertensive patients are more likely to develop diabetes than those who are not hypertensive.

The exponent (Exp(B)) in Table 3.8 is the odds ratio. Thus, for example:
- The odds of developing diabetes for patients who consume alcohol are 2.282 times the odds for patients who do not.
- The odds of developing the illness for patients with a high cholesterol level are 2.616 times the odds for patients with a low cholesterol level.
- The odds of developing diabetes for a hypertensive person are 2.838 times the odds for a person who is not hypertensive.

Table 3.9: Classification table for the reduced model

Observed: the patient is diabetic | Predicted No | Predicted Yes | Percentage correct
No | 41 | 15 | 62.1%
Yes | 11 | 98 | 89.9%
Overall percentage | 29.7% | 64.6% | 79.4%

Source: Researcher, March 2015

Table 3.9 gives the classification table for the reduced model. The observations are classified as follows:
- 62.1% of all patients who do not have diabetes are correctly classified, and 37.9% are incorrectly classified.
- 89.9% of all patients who have diabetes are correctly classified, and 10.1% are incorrectly classified.
- The overall correct percentage is 79.4%, which reflects the model's overall explanatory strength.

The plot of sensitivity versus (1 - specificity) over all possible cut-points is shown in Figure 3.2 below. The area under the ROC curve for the reduced model is 0.843, which is considered reasonable discrimination.

Source: Researcher, March 2015
Figure 3.2: Receiver Operating Characteristic (ROC) curve for the reduced model

To compare the two models (the model with all predictor covariates and the reduced model), we use the area under the ROC curve, which has become a particularly important measure for evaluating model performance because it is the average sensitivity over all possible specificities. The larger the area, the better the model performs (Bradley, 1997).
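One way to see what the area under the ROC curve measures is through its equivalent pairwise form: it equals the proportion of (diabetic, non-diabetic) pairs in which the diabetic patient receives the higher predicted probability, with ties counted as one half. The sketch below computes it this way on hypothetical predicted probabilities; the patient-level predictions from the study are not available here.

```python
def area_under_roc(positive_scores, negative_scores):
    """AUC as the fraction of positive/negative pairs ranked correctly,
    counting ties as one half."""
    wins = 0.0
    for p in positive_scores:
        for q in negative_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical predicted probabilities, for illustration only.
diabetic = [0.9, 0.8, 0.4]
non_diabetic = [0.7, 0.3, 0.2]

auc = area_under_roc(diabetic, non_diabetic)
print(round(auc, 4))
```

A value of 0.5 corresponds to no discrimination and 1 to perfect discrimination, so a larger area indicates a model that ranks diabetic patients above non-diabetic ones more often.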

We conclude that the reduced model (area under the ROC curve of 0.843 and overall explanatory strength of 79.4%) fits the data better than the model with all predictor variables (area under the ROC curve of 0.825 and overall explanatory strength of 77.7%).

4. CONCLUSION AND RECOMMENDATIONS

In this study, the risk factors for developing diabetes were studied using a logistic regression model. The binary logistic regression model is used to estimate the probability of having diabetes. Firstly, the chi-square test of association between diabetes and each of the predictor variables showed that older age, occupational status, alcohol consumption, cholesterol level and hypertension are statistically significant. Secondly, significance testing of the logistic coefficients using the Wald test showed that older age, occupational status, alcohol consumption, cholesterol level and hypertension are significant predictors of diabetes. The fitted model showed that getting diabetes does not depend significantly on gender, family history of diabetes or smoking. Instead, there is an increased risk of getting diabetes as a person gets older. To assess the fit of the model, the likelihood ratio test and the Hosmer and Lemeshow test are used.

Based on the findings from this study, the following recommendations are formulated as a contribution to fighting a disabling disease such as diabetes:
- People in rural areas should be made aware of diabetes and know that it is no longer a disease of rich persons or of elders, but has become common in all social classes and at all ages.
- People should continue the good habit of physical activity; many researchers have shown that the risk of diabetes and associated insulin resistance can be reduced significantly by losing weight, especially for those who are severely obese (BMI > 35 kg/m2).
- Nutrition should also play a vital positive role in health. Eating sufficient fruits and vegetables gives access to several health benefits, due to an assumed complex interaction of the biologically active compounds they contain.
- People should go to healthcare centers for regular tests for the occurrence of diabetes.
- People should cease bad habits like smoking and alcohol consumption.
- More follow-up studies should be done to assess the benefits of different treatment modalities on the control of cardiovascular risk factors such as blood pressure and lipids in diabetes patients, in order to prevent serious complications in Rwanda. In particular, such studies should assess the effect of interventions based on a healthy lifestyle, such as increased physical activity, smoking cessation, weight loss and a healthy dietary pattern, with a focus on rural areas.

REFERENCES
[1] Afifi, A., Clark, V.A., & May, S. (2004). Computer-aided multivariate analysis. Fourth edition. Chapman and Hall/CRC.
[2] Alan, A. (2007). An introduction to categorical analysis. Second edition. Gainesville, Florida: Department of Statistics, University of Florida.
[3] Ayesha, M., & Kaushik, R. (2010). Diabetes, the hidden pandemic and its impact on sub-Saharan Africa. Johannesburg: Diabetes leadership forum.
[4] Belsley, D.A., Kuh, E., & Welsch, R.E. (1980). Regression diagnostics: identifying influential data and sources of collinearity. New York: Wiley.
[5] Bewick, V., Cheek, L., & Ball, J. (2005). Statistics review 14: Logistic regression. Critical Care. London, England. http://dx.doi.org/10.1186/cc3045.
[6] Bradley, A. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition.
[7] Chao-Ying, J. (2002). Logistic regression analysis and reporting. Department of counseling and educational psychology. Indiana University-Bloomington.

[8] Elisa, T.L., & John, W.W. (2003). Statistical method for survival data analysis. Third edition. Wiley Interscience.
[9] Federation, I. D. (2009). Diabetes Atlas. Brussels.
[10] Ferri, C., Flach, P., & Hernandez-Orallo, J. (2002). Learning decision trees using the area under the ROC curve. Morgan Kaufmann.
[11] Hair, J.F., Anderson, R.E., Tatham, R.L., & Black, W.C. (1995). Multivariate data analysis, 3rd edition. New York: Macmillan.
[12] Hosmer, D.W., & Lemeshow, S. (2000). Applied logistic regression. New York: John Wiley & Sons Inc.
[13] IDF. (2009). Diabetes Atlas. 4th edition. Brussels.
[14] IDF. (2013). Diabetes Atlas. Sixth edition. Brussels.
[15] James, E. (2013). Theory of statistics. Virginia: Fairfax county.
[16] Jennings, D. (1986). Judging inference adequacy in logistic regression. Journal of the American Statistical Association, 81, 471-476.
[17] John, A. (1995). Mathematical statistics and data analysis. Second edition. International Thomson publishing.
[18] Katz, M. (1999). Multivariable analysis: A practical guide for clinicians. Cambridge University Press.
[19] Lemeshow, S., & Hosmer, D.W. (1983). Estimation of odds ratios with categorical scaled covariates in multiple logistic regression analysis. American Journal of Epidemiology, 119, 147-151.
[20] Mbanya, J. (2009). Making a difference to global diabetes. Diabetes Voice, 54.
[21] McCullagh, P. (1986). The conditional distribution of goodness-of-fit statistics for discrete data. Journal of the American Statistical Association, 81, 104-107.
[22] Morris, J.A., & Gardner, M.J. (1988). Calculating confidence intervals for relative risks (odds ratios) and standardised ratios and rates. British Medical Journal (Clinical Research Ed.), 296(6632), 1313-1316.
[23] Neter, J., Wasserman, W., & Kutner, M.H. (1989). Applied linear regression models. Homewood, IL: Irwin.
[24] Ramachandran, K.M., & Chris, P.T. (2009). Mathematical statistics with applications. Elsevier Academic Press.
[25] Ronald, C. (1990). Log-linear models and logistic regression. Second edition. Springer.
[26] Sahoo, P. (2013). Probability and mathematical statistics. Department of Statistics, University of Louisville, KY 40292, USA.
[27] Sarah, W. (2004). Global prevalence of diabetes: estimates for the year 2000 and the projections for 2030.
[28] Scott, A. (2010). Maximum likelihood estimation of logistic regression models: Theory and implementation. http://czep.net/contact.html. Accessed on Sept 21, 2014.
[29] Shaw, J.E., Sicree, R.A., & Zimmet, P.Z. (2010). Global estimates of the prevalence of diabetes for 2010 and 2030. Diabetes Clin Pract 2010.
[30] Stella, A. (2012). Multiple regression analysis to determine risk factors for the clinical diagnosis of diabetes. Case study: Komfo Anokye teaching hospital (2008-2009).
[31] Tetrault, J.M., Sauler, M., Wells, C.K., & Concato, J. (2008). Reporting of multivariable methods in the medical literature. Journal of Investigative Medicine, 56.
[32] WHO. (2011). Global Atlas on cardiovascular disease prevention and control. Mendis, S., Puska, P., Norrving, B., editors (in collaboration with the World Heart Federation and World Stroke Organisation). Geneva.
[33] Yamane, T. (1967). Statistics: An introductory analysis, 2nd Ed. New York: Harper and Row.
