
A Comparison of Some Predictive Models for Modeling Abortion Rate in Russia

Sergey Soshnikov
Dept. of Medical and Social Problems, CPHRI, 11, Str. Dobrolyubova, Moscow, 127254, RU [email protected]

Vasiliy Vlassov
Department of Public Health and Preventive Medicine, I.M. Sechenov First Moscow State Medical University, RU

Carl Lee
Department of Mathematics, Central Michigan University, USA [email protected]

Maria Gaidar
Public Administration (MC/MPA), Harvard University, USA

Sergey Vladimirov
Independent Laboratory SQLab, RU

Abstract: Predictive modeling techniques are popular methods for building models to predict a target of interest. In many modeling problems, however, the focus is to identify possible factors that have a significant association with the target. For this type of problem, it is very easy to stretch the interpretation of an association relationship into a causation relationship. Practitioners must pay special attention to such misinterpretation when the data are observational. In addition, the processes of data collection and cleansing are critical for producing quality data for modeling. In this article, an observational study is conducted to illustrate the issues of data quality and model building in identifying potentially important factors associated with the abortion rate, using data collected in Russia from 2000 to 2009. Some pitfalls and cautions of applying predictive modeling techniques are discussed.

Keywords: Data Quality, Decision Tree, Ensemble, Gradient Boosting, LASSO, Neural Network, Partial Least Squares

I. INTRODUCTION

Predictive modeling techniques are popular methods for building models to predict a target of interest. Most of the data used for model building are observational and come from a variety of sources. Each part of the data was originally collected for a specific purpose. As a consequence, the process of merging and integrating different data sources often creates unexpected difficulties. The quality of the combined data is not easy to control, and validating its integrity and accuracy requires extra effort in order to obtain good-quality data for model building. Once an appropriate data set is prepared, the next step is exploratory data analysis and manipulation. Important tasks at this step may include exploring individual inputs, the correlation among inputs and the relationships between inputs and target; handling missing data and extreme cases; and conducting variable standardization, variable transformation and preliminary variable selection. At the model building stage, the immediate task is to determine the modeling techniques and to compare the best models in order to choose the final ‘best’ model. Model building is an iterative process that requires advanced knowledge of the modeling technique in order to determine criteria for building and selecting models. A ‘black box’ approach is not adequate for building useful predictive models.

During the entire process of data collection, manipulation, and model building, one must keep in mind the goal of the project. The process is iterative and needs to take the context of the project into consideration in order to obtain a ‘useful’ model. As Professor George Box said, ‘All models are wrong, some are useful’: the usefulness of a model relies upon solid data quality, valid modeling methodology, and a close connection with the purpose of the project. Interpretation of a model requires a good understanding of the context of the problem and a solid understanding of the methodology. For most analytics modeling, the focus is on prediction. Hence, the relationships between the target and the selected inputs are association relations, not causation relations. The inputs should therefore be interpreted with care, and the interpretation should not be stretched into cause-effect relations.

II. ISSUES OF DATA QUALITY

There are several major issues related to the process of data cleansing, which may be classified into issues arising during data production and issues arising during data manipulation. The process of data production in an analytic project refers to data source identification, data collection, data integration, and data extraction. This process often takes over 75% of the time for a project [1]. McKnight [2] summarized seven sources of poor data quality during this process: (1) entry quality, (2) process quality, (3) identification quality, (4) integration quality, (5) usage quality, (6) aging quality and (7) organizational quality. Radhakrishna et al. [3] considered the components of data quality and provided a checklist of eight components: (1) validity, (2) reliability, (3) objectivity, (4) integrity, (5) generalizability, (6) completeness, (7) relevance and (8) utility.
Once a data set is prepared for model building, there is a further process of exploratory data analysis and manipulation prior to model building. This process applies a variety of descriptive analyses and graphical displays to investigate individual variables, the relationships among inputs, and the relationships between inputs and target. It often also involves preliminary variable selection and transformation so that the selected inputs are statistically and practically meaningful and relevant to the target. Typically, the first task is to look for practically irrelevant inputs, erroneous data values and variables with a very high percentage of missing data, and to eliminate them. The next task is to apply some basic and computationally efficient methods to determine a subset of variables, and to perform proper variable transformation so that the selected inputs are statistically relevant to the target. The selection criteria are often set to be very conservative to ensure that possibly relevant inputs are kept.

Missing data imputation is another critical issue. For most modeling techniques, missing one data value of an input means losing the entire case. If missing data occur randomly among variables and cases, it is possible that only very few cases, or at worst no valid cases, remain for model building. Imputation of missing values is therefore needed prior to model building. Missing data imputation can be misleading if not done properly. For example, missing prices of a product should be handled differently from missing percentages of education level in a region. Some types of missing values cannot be imputed, and some should be treated as zero. Other values are missing because of the sensitivity of the question asked. Individuals who chose not to answer such questions should be grouped and indexed so that this group can be compared with the others.

III. CHOICE OF MODELING TECHNIQUES AND INTERPRETATIONS OF RESULTS

There are many different modeling techniques available. Each method has its strengths and weaknesses. In this article, we apply the following modeling techniques to model the abortion rate in Russia: (1) Decision Tree (DT), (2) Linear Regression (LR), (3) Adaptive Least Absolute Shrinkage and Selection Operator (ALASSO), (4) Gradient Boosting (GB), (5) Neural Network (NN), (6) Partial Least Squares (PLS), and (7) Ensemble of DT, LR, ALASSO and GB. A general strategy of predictive model building involves the steps of (1) data partitioning, (2) model building and (3) model optimization.
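The three-step strategy can be illustrated with a minimal sketch. It is not the authors' implementation: the 60/20/20 split, the synthetic data, and the two toy candidate models (`fit_mean`, `fit_line`) are assumptions made only for illustration, and the average squared error (ASE) on the validation part is used as the assessment criterion.

```python
import random

def partition(rows, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle rows and split them into training, validation and test sets.
    The 60/20/20 fractions are an assumption, not taken from the paper."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(fractions[0] * len(shuffled))
    n_valid = round(fractions[1] * len(shuffled))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

def ase(model, rows):
    """Average squared error of a model over (x, y) rows."""
    return sum((y - model(x)) ** 2 for x, y in rows) / len(rows)

def fit_mean(train):
    """Toy candidate 1: always predict the training mean of the target."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_line(train):
    """Toy candidate 2: least-squares line y = a + b * x."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x

def select_model(rows, fitters, seed=0):
    """Build every candidate on training data, pick the one with the
    lowest validation ASE, and report its ASE on the untouched test set."""
    train, valid, test = partition(rows, seed=seed)
    models = {name: fit(train) for name, fit in fitters.items()}
    valid_ase = {name: ase(model, valid) for name, model in models.items()}
    best = min(valid_ase, key=valid_ase.get)
    return best, valid_ase, ase(models[best], test)

# Synthetic (hypothetical) data: a linear signal plus noise.
rng = random.Random(1)
data = []
for _ in range(200):
    x = rng.uniform(0, 10)
    data.append((x, 100 + 5 * x + rng.gauss(0, 3)))

best, valid_ase, test_ase = select_model(data, {"mean": fit_mean, "line": fit_line})
print(best, valid_ase, test_ase)
```

The test set plays no part in choosing the model, so the final ASE is an estimate of performance that is independent of both building and selection.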
The data partitioning step divides the data into a Training set for building models, a Validation set for optimizing the final model, and a Test set for evaluating the model independently of the data used for modeling. The model building step aims to select important variables for predicting the target based on a predefined criterion. For an interval target, a common criterion is the average squared error (ASE). The model optimization step selects the optimal model from the list of models built in the model building step by applying each model to the Validation set and comparing the models on a given assessment criterion such as AIC, BIC, or validation error. The final model is the one with the optimal value of the assessment criterion. The main purpose is to prevent overfitting.

The seven modeling techniques applied in the case study have different strengths and weaknesses.

• DT: It is a rule-based modeling technique. It is easy to apply and interpret, it allows missing data (that is, no imputation is needed), and it automatically takes the interactions among inputs into account. The major weakness is that it discretizes an interval target, so the prediction is no longer on an interval scale. DT is also sensitive to the order in which inputs are chosen: it can give totally different rules when a different input is selected at an early stage of the splitting step.

• LR: It is a linear parametric modeling technique. It is parametric and thus easy to interpret. It is additive, and hence the parameter estimates can be interpreted as the pure contribution of the input to the target. One can perform inference on the parameter estimates. Some weaknesses are that (1) it is only good for capturing the general pattern of a linear relation between inputs and target, and (2) there is a tendency to obtain many superficially significant inputs, due to the nature of hypothesis testing when the sample size is large. The validation data and the model optimization step help to reduce the risk of selecting too many ‘significant inputs’.

• ALASSO [4]: The ALASSO regression model coefficients $\beta = (\beta_1, \beta_2, \ldots, \beta_p)$ are the solution to the constrained optimization problem

$$\text{minimize } \|y - X\beta\|^2 \quad \text{subject to } \sum_{i=1}^{p} |\beta_i| / |\hat{\beta}_i| \le t,$$

where $y$ is the target, the inputs in $X$ are standardized, and $\hat{\beta}_i$ is an estimate of the parameter in the true model. An initial $\hat{\beta}$ is required, and the weight at the $j$th iteration is the estimate from the $(j-1)$th iteration. The method implicitly serves as a model selection process, and the result is a regression model with estimated parameters that can be interpreted. However, its assumption that the full model (the model that includes all input variables) is the ‘true model’ could be troublesome. The typical confidence intervals and hypothesis tests do not apply. The final selected model may be overfitted.

• GB: This method is a boosting approach that resamples the data set several times to generate results that form a weighted average of the re-sampled data set. The GB approach we apply is a series of decision trees, each fitted to the residual of the prediction from the earlier tree in the series, which are combined by a series of weights [5]. Unlike a single DT, this method combines a series of DTs, so the final model is not as sensitive to the order of the inputs selected. For prediction purposes, it has a lower ASE than a single DT. However, the combination of trees may be overfitted.

• NN: This method can be considered as a two-stage nonlinear regression or classification model. The target $Y_{i\cdot}$ is modeled as a function of a linear combination of the hidden layers $H$, defined as $Y_{i\cdot} = f(W' H_{i\cdot}') + \varepsilon_{i\cdot}$, where $f$ is the activation function connecting the hidden layers with the target. Since the target is an interval variable, the activation function $f$ is taken to be the identity function. A hidden layer $H$ is an activation of a linear combination of the inputs: $H_{i\cdot} = g(Z' X_{i\cdot}')$, $i = 1, 2, \ldots, N$, where $g$ is the activation function and $Z$ is the weight matrix of the inputs. The hidden layer activation function $g$ is taken to be the hyperbolic tangent function. The NN modeling technique often results in too many inputs, and the estimated weights for the selected inputs cannot be interpreted, which is why it is known as a ‘Black Box’.

• PLS: This method applies the following modeling strategy: 1) The input variables $\{X_1, X_2, \ldots, X_m\}$ are first transformed and combined into a smaller number of orthogonal ‘principal components’. Each principal component is a linear combination of the inputs, $H_i = \sum_{j=1}^{m} w_{ij} X_j$, $i = 1, 2, \ldots, k$ with $k$ …, which explains a …

= 18 /10,000; Crimes consumption Registered /100,000; Murders & Attempted /10,000; Success Murders /100,000; Juveniles Crimes /10,000 related factors

VII. PROPERTIES OF TARGET: ABORTION RATE PER 10,000

The target is the abortion rate per 10,000 in each region. The abortion rate is approximately normally distributed (the Shapiro-Wilk test statistic for normality is 0.997, p-value .228). The summary statistics for each year are presented in Table 2. There is a clear decreasing trend from 2000 to 2009 (16.23% decrease). The standard deviation also shows a decreasing trend from 2003 to 2009 (41.38 to 30.43). Figure 1 is the district map of the Russian Federation. The summary statistics for each federal district are presented in Table 3. The highest average abortion rate is in the Far Eastern district (174.85) and the lowest is in North Caucasus (65.03).

TABLE 2. THE ABORTION RATES BY YEAR

Years       N     Average   Median    Std. Dev.
2000        75    151.42    152.97    37.125
2001        75    144.38    144.98    37.89
2002        75    140.48    141.26    37.06
2003        75    137.50    134.13    41.38
2004        75    132.03    127.63    38.37
2005        75    125.63    123.67    38.21
2006        76    119.67    115.70    35.55
2007        76    112.74    106.93    32.61
2008        77    106.45    103.08    31.38
2009        78    100.12    98.08     30.43
All Years   757   126.85    126.52    39.41
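The trend in Table 2 can be checked with a few lines of arithmetic. The sketch below copies the yearly averages from Table 2 and computes the relative change year by year and over the whole period; the helper name `percent_change` is an assumption introduced here. Computed this way, the 2000 to 2009 change in the regional average is about 33.9%, so the 16.23% quoted in the text presumably rests on a different definition or base that the paper does not spell out.

```python
# Yearly average abortion rates copied from Table 2.
table2_average = {
    2000: 151.42, 2001: 144.38, 2002: 140.48, 2003: 137.50, 2004: 132.03,
    2005: 125.63, 2006: 119.67, 2007: 112.74, 2008: 106.45, 2009: 100.12,
}

def percent_change(old, new):
    """Relative change from old to new, in percent (negative = decrease)."""
    return 100.0 * (new - old) / old

years = sorted(table2_average)

# Change from each year to the next.
yearly_change = {
    year: percent_change(table2_average[year - 1], table2_average[year])
    for year in years[1:]
}

# Change over the whole period covered by the table.
overall_change = percent_change(table2_average[years[0]], table2_average[years[-1]])

for year, change in yearly_change.items():
    print(f"{year - 1} -> {year}: {change:+.2f}%")
print(f"{years[0]} -> {years[-1]}: {overall_change:+.2f}%")
```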

Figure 1: Administrative map of Russia, with marked Federal Districts.

TABLE 3. ABORTION RATE BY DISTRICTS

Federal Districts     ID   N     Average   Median    SD
N. Caucasus (NCFD)    8    70    64.82     65.03     26.00
Southern (SFD)        2    50    107.91    108.34    21.61
Central (CFD)         1    153   16.00     3.957     16.01
Northwestern (NWD)    3    90    15.86     3.912     14.72
Volga (VFD)           6    144   130.24    125.91    32.22
Urals (UFD)           7    40    155.47    157.13    23.58
Siberian (SFD)        5    111   149.42    150.83    27.43
Far Eastern (FED)     4    72    174.85    171.33    30.87

VIII. CHARACTERISTICS OF INPUTS

A total of 52 inputs are considered for the modeling. These inputs vary greatly in terms of their characteristics. Several steps are conducted to finalize the list of inputs:

• The properties of each input are explored.
• Variable transformation is carried out.
• Strategy for handling missing data is determined.
• Preliminary analysis of the relationship between each input and the target is conducted.
• Preliminary input variables are selected.

Most of the input variables are skewed to the right and require some type of variable transformation. For interval-scale inputs, we determine the transformation that best fits a normal distribution (maximizing normality). For class inputs, we apply the rare group collapsing method, grouping categories with fewer than 0.1% of cases into a new group.

At the early stage of data collection, a great deal of time and effort was spent to ensure data quality and reliability. As a result, this data set has few missing values. The only missing-data issue concerns the variables “% of Social Income”, “% of Business Income”, “% of Property Income”, “% of Salary Income” and “% of Other Income”. We use the fact that these income components should sum to 100% and that the proportions of income follow a time sequence with a certain pattern from year to year. For the missing time series data, we employ the nearest neighbor technique, imputing each missing value with the average of the two nearest years of data from the same region.

The correlations between the target and the inputs are obtained. Table 4 gives the Pearson correlation coefficients between the abortion rate and the 24 inputs that are significant at the 5% level. Several inputs have a high positive correlation with the abortion rate (Crimes Due to Alcohol, Juveniles Crimes, and number of Hospital Beds). Prior to model building, we employ a simple forward selection procedure using regression to screen out inputs with extra R2 less than .05%. This was done to further reduce the number of inputs without losing potentially important inputs for the model building. A total of 36 inputs are finalized for the subsequent modeling.

IX. METHODOLOGY FOR MODELING THE ABORTION RATE IN RUSSIA

Due to the wide variety of measuring units of the inputs, all interval inputs are standardized using the range transformation. An indicator variable is created for each level of a class input.
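The preprocessing steps described in Sections VIII and IX (imputation from the two nearest years, range standardization, and indicator variables) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the function names and sample values are hypothetical, the range transformation is taken to be (x - min) / (max - min), and the additional constraint that the income shares sum to 100% is not modeled.

```python
def impute_nearest_years(series):
    """Fill each missing year (None) with the average of the two nearest
    observed years of the same region. `series` maps year -> value and is
    assumed to contain at least one observed value."""
    observed = {year: v for year, v in series.items() if v is not None}
    filled = dict(series)
    for year, value in series.items():
        if value is None:
            # Sort observed years by distance; ties go to the earlier year.
            nearest = sorted(observed, key=lambda y: (abs(y - year), y))[:2]
            filled[year] = sum(observed[y] for y in nearest) / len(nearest)
    return filled

def range_standardize(values):
    """Range transformation (assumed form): (x - min) / (max - min).
    Assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def indicator_columns(labels):
    """Create one 0/1 indicator variable per level of a class input."""
    levels = sorted(set(labels))
    return {level: [1 if lab == level else 0 for lab in labels] for level in levels}

# Hypothetical values for one region and one income-share input.
salary_share = {2000: 40.0, 2001: None, 2002: 44.0, 2003: 45.0}
filled = impute_nearest_years(salary_share)

scaled = range_standardize([10.0, 20.0, 30.0])
indicators = indicator_columns(["CFD", "SFD", "CFD"])
print(filled, scaled, indicators)
```

Range standardization puts inputs measured per 1,000, per 10,000 and in percent on a common 0 to 1 scale, which matters for methods such as NN and PLS whose fits depend on the scale of the inputs.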
Seven modeling techniques are applied to build the model. A brief description of each method is given in Section III. The best model from each technique is obtained, and the models are compared using the average squared error computed from the validation data. For details of the predictive modeling techniques, one may refer to [19]. Table 5 gives the error for the best model obtained by each modeling technique. As expected, the PLS model outperforms all other models, since it includes all inputs in the model (a total of 36 inputs). The ensemble model that combines DT, GB, ALASSO and LR is the second-best model; it does not identify important inputs and is strictly for prediction purposes. The GB model combines a series of DTs. More than 25 of the 36 inputs are identified as important variables by the GB model. The GB model has some advantage for prediction purposes, but it is not very useful for identifying important factors, since too many inputs are selected. Both ALASSO and LASSO perform better than LR. The important inputs identified by ALASSO, LASSO, LR and DT are given in Table 6.
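As an illustration of how an ensemble can be compared with its component models on validation ASE, the sketch below averages the predictions of four models case by case. The paper does not state how the ensemble combines its members, so simple averaging is an assumption, and all targets and predictions are hypothetical numbers chosen for the example, not results from the study.

```python
def ase(predictions, targets):
    """Average squared error between predictions and observed targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def ensemble_average(prediction_sets):
    """Combine several models by averaging their predictions case by case."""
    return [sum(preds) / len(preds) for preds in zip(*prediction_sets)]

# Hypothetical validation targets and model predictions (illustrative only).
targets = [120.0, 95.0, 150.0, 110.0]
predictions = {
    "DT":     [115.0, 100.0, 140.0, 118.0],
    "LR":     [126.0, 92.0, 155.0, 104.0],
    "ALASSO": [123.0, 97.0, 146.0, 108.0],
    "GB":     [118.0, 93.0, 153.0, 113.0],
}

# The ensemble is built from the four component models only.
predictions["Ensemble"] = ensemble_average(list(predictions.values()))

validation_ase = {name: ase(preds, targets) for name, preds in predictions.items()}
ranking = sorted(validation_ase, key=validation_ase.get)

for name in ranking:
    print(f"{name}: {validation_ase[name]:.3f}")
```

Averaging tends to cancel errors that the component models make in opposite directions, which is why an ensemble can predict well while offering no single set of important inputs to interpret.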

TABLE 4. PEARSON’S CORRELATION BETWEEN ABORTION RATE AND 24 INPUTS (p-value 60 years old Work pop/Non-working pop Growth urban pop/1000 Emissions air pollutants in ton/10 Divorces/1000 July temperature Alcohol consumption, Absolute alcohol liters /1 person (per year) Crimes Juveniles crimes /10,000 Crimes in alcohol intoxication /10000 Economic % Salary income Education % Real cash income to 1999 infrastructure % of Economical active pop % Social income Educational institutions /10,000 Non-food price index Unemployment rate

ALASSO 4* 15 13 17 21 8 22 18 6

LASSO 7 9 14 8 16 17 3 12

LR 7 9

14 7 6 2

1 7 3 2 12 23

14 9 16 5 10 19 11 20

DT 10 4

11 14 7 17 18 15 3

6 19 18 15 2 4 1 11 6 5 10 13

12 8 1 10 5 11 3 4 12 13

2 1 5 9

13 16 19

REFERENCES

[1] M. Berry and G. Linoff, Data Mining Techniques, 2nd Ed., Wiley Publishing, 2004.
[2] W. McKnight, “Information Management: 7 sources of poor data quality”, 2009. [Online]: www.informationmanagement.com.
[3] R. Radhakrishna, D. Tobin, M. Brennan and J. Thomson, “Ensuring data quality in extension research and evaluation studies”. Journal of Extension, Vol. 50(3), Article # 3TOT1, 2012. [Online]: http://www.joe.org/joe2012june/tt1p.shtml.
[4] H. Zou, “The adaptive Lasso and its oracle properties”. Journal of the American Statistical Association, Vol. 101(476), pp. 1418-1429, 2006.
[5] J. H. Friedman, “Stochastic gradient boosting”. Computational Statistics & Data Analysis, Vol. 38, pp. 367-378, 2002.
[6] Rosstat – Federal Service of National Statistics of Russian Federation. [Online]: http://www.gks.ru.
[7] P. Whelan, “Abortion rates and universal health care”. New England Journal of Medicine, Vol. 362(13), e45(1)-e45(3), DOI: 10.1056/NEJMp1002985, 2010.
[8] E. Oliveras, U. Larsen and P. H. David, “Client satisfaction with abortion care in three Russian cities”. Journal of Biosocial Science, Vol. 37(5), pp. 585-601, 2005.
[9] Regions of Russia. Social-Economics Indicators. 2011. [Online]: http://www.gks.ru/wps/wcm/connect/rosstat/rosstatsite/main/publishing/catalog/statisticCollections/doc_1138623506156.
[10] The Main Characteristics of regions of the Russian Federation. Federal Statistical Service of Russia. 2011. [Online]: http://www.gks.ru/wps/wcm/connect/rosstat/rosstatsite/main/publishing/catalog/statisticCollections/doc_1138625359016.
[11] Demographic Data. Federal Stat Service of Russia, 2011. [Online]: http://www.gks.ru/wps/wcm/connect/rosstat/rosstatsite/main/publishing/catalog/statisticCollections/doc_1137674209312.
[12] Social status and standards of living of the Russian population. Federal Statistical Service of Russia, 2011. [Online]: http://www.gks.ru/wps/wcm/connect/rosstat/rosstatsite/main/publishing/catalog/statisticCollections/doc_1138698314188.
[13] Economic activity of population in Russia, 2011. [Online]: http://www.gks.ru/wps/wcm/connect/rosstat/rosstatsite/main/publishing/catalog/statisticCollections/doc_1139918584312.
[14] Healthcare in Russia. 2011. [Online]: http://www.gks.ru/wps/wcm/connect/rosstat/rosstatsite/main/publishing/catalog/statisticCollections/doc_1139919134734.
[15] Russian State Statistical Agency. Healthcare in Russia 2011. [Online]: http://www.gks.ru/wps/wcm/connect/rosstat/rosstatsite/main/publishing/catalog/statisticCollections/doc_1139919134734.
[16] The Central Statistical Database (CSDB). 2011. [Online]: http://www.gks.ru/dbscripts/Cbsd/DBInet.cgi.
[17] Violence and Alcohol in the Russian Federation, WHO Report. [Online]: www.euro.who.int/document/e88757.pdf.
[18] Alcohol abuse in the Russian Federation: social economic consequences and measures of counteraction. Moscow: Public Council of the Central Federal District; 2009. Committee on social and demographic policy.
[19] J. Han, M. Kamber and J. Pei, Data Mining: Concepts and Techniques, 3rd Ed. The Morgan Kaufmann Publisher, NY, 2011.