Prioritization of Public Expenditure for a Better Return on Social Development: A Data Mining Approach
Hisham M. Abdelsalam1,*, Abdoulrahman Al-shaar1, Areej M. Zaki1, Nahla El-Sebai2, Mohamed Saleh1, and Miral H. khodeir1
1 Faculty of Computers and Information, Cairo University, Cairo, Egypt
2 Faculty of Economics and Political Science, Cairo University, Cairo, Egypt
Abstract. Public expenditure affects people both directly, through subsidies and transfers, and indirectly, through its effects on consumption and production activities. The effects of public expenditure depend not only on its absolute value but also on its composition and on the efficiency of spending. This paper uses data mining techniques to reach a model that maximizes social development through efficient allocation of public expenditure, and assesses the current state of Egypt with respect to the model reached. Of the five models tested, the decision tree was found to be the most appropriate given the research focus and the data available.
1 Introduction
Public expenditure policies, as a key component of fiscal policy, play an important role in the economy through their ability to allocate resources among economic sectors. Public expenditure is also important in pursuing economic growth objectives while ensuring that gains are widely distributed to promote broad-based increases in living standards. Governments’ relative fiscal positions, how much they spend, and the composition of that spending are likely to make a difference in achieving these objectives [1].
Governments that want to improve their citizens' well-being can spend their financial resources in different ways, and the effect of each type of expenditure differs. On the one hand, spending on areas such as research and development, education, and infrastructure may facilitate economic growth in the long term, but it may overlook those who do not share in the fruits of growth in the short term. On the other hand, spending on health and cash transfers to the poor meets the immediate needs of the poor but may neglect productive investments. Hence, when determining spending priorities, policymakers should consider the different types of government spending, the impact of each type on development, and the time horizon over which each type yields its returns [2]. As such, policy recommendations regarding the impact of each type of government spending must reflect the circumstances of each country and must be based on applied studies [3].
Reviewing published economic research, we came across many empirical studies that link government expenditure to long-term economic growth, e.g. [4][5][6]. Literature analyzing the effects of public expenditure on economic development is much scarcer, and to the best of our knowledge no previous research has attempted to prioritize different types of government expenditure according to their effects on human development. Hence, this study tries to fill an important gap in the available literature.
Given the inadequacy of public economics theory in providing the necessary guidance on expenditure allocation to policy-makers and development practitioners, it is important to think about how a government should allocate public expenditure across sectors to maximize the prospects of achieving its development objectives [3]. In the current research we use data mining tools to build a quantitative model that helps determine the best possible composition of public expenditure in order to maximize its benefits for society as a whole. The human development index (HDI) is used as an indicator of those benefits, and the effect of five good governance indicators will be tested.
* Corresponding author.
A.E. Hassanien et al. (Eds.): AMLTA 2014, CCIS 488, pp. 523–530, 2014. © Springer International Publishing Switzerland 2014
2 Methodology
The objective of this study is to determine the allocation of public expenditure that would lead to a higher HDI. To do so, data mining is used to reach a model based on data variables for a large number of countries (all countries and years for which data are available), without prior hypotheses about the nature of the relationship between them. This study follows a generic data mining process found in the literature (e.g. [7]) that consists of the following nine phases: (1) understanding the domain and goal of the application and collecting prior knowledge about the study; (2) determining a target data set and starting data gathering and selection; (3) data cleansing and preparation; (4) finding useful variables and reducing the data; (5) selecting suitable functions for data mining; (6) selecting the data mining algorithm(s); (7) the data mining process itself, searching for useful and meaningful patterns; (8) evaluating and understanding the patterns and presenting them in an understandable way; and (9) using the discovered hidden patterns and knowledge.
The issue under investigation was raised by our fellow economists. As such, for the first phase, they provided the needed knowledge and explanations regarding the problem formulation, and they worked with the data mining technical team throughout the following phases, which are illustrated schematically in Figure 1.
For the purpose of the study, the needed data was collected from several data sources and then aggregated into Microsoft Excel® sheets [8], in which data cleansing and preparation took place (phases 2 and 3) using Visual Basic for Applications (VBA®) and Macros [9] to ensure the accuracy of the data and avoid any duplication. Then, Toad® software (Tool for Oracle Application Developers) [10] was used to read and extract the cleansed data into a purpose-built database and then into the SAS environment. SAS® Enterprise Guide® [11] was then used to replace any missing data with dashes, so that the data mining software could deal with it, and to convert the Excel tables into SAS tables. The SAS tables were then introduced into SAS® Enterprise Miner [11] so that the modeling phases (6 and 7) could take place. The two final phases included reviewing the results and arriving at suitable recommendations.
Fig. 1. Implementation Phases – Schematic Illustration
2.1 Data Gathering
Data items of this study were divided into three main categories: the public expenditure allocation of various countries, the governance factors of these countries, and their Human Development Index (HDI). Table 1 lists these items, referred to hereinafter as ‘factors’. For detailed definitions, kindly refer to [12][13][14][15]. The data items were gathered from several sources, both to cover all available countries and to enhance accuracy. These sources include World Bank Data [12], the Human Development Report data set [13], the Worldwide Governance Indicators [14], and the Ministry of Finance of Egypt. Cross-sectional data (all countries) and time-series data (from 1990 to 2010) were compiled.
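One way to picture the compilation of cross-sectional and time-series data is to key every observation by country and year and to attach each factor to that key. The following Python sketch is only an illustration under that assumption; it is not the Excel and VBA workflow used in the study. The function name `build_panel`, the record layout, the factor labels, and the sample values are all hypothetical.

```python
from collections import defaultdict

def build_panel(sources, first_year=1990, last_year=2010):
    """Merge (country, year, factor, value) records from several sources
    into one panel keyed by (country, year)."""
    panel = defaultdict(dict)
    for records in sources:
        for country, year, factor, value in records:
            if not first_year <= year <= last_year:
                continue  # outside the study window
            # Normalize the country label so spelling variants collapse.
            key = (country.strip().title(), year)
            # Keep the first value seen; later sources only fill gaps.
            panel[key].setdefault(factor, value)
    return dict(panel)

# Hypothetical sample rows, invented purely for illustration.
expenditure = [("egypt", 2005, "health_pct_gdp", 2.0),
               ("Egypt ", 1985, "health_pct_gdp", 1.5)]
hdi = [("Egypt", 2005, "hdi", 0.6)]

panel = build_panel([expenditure, hdi])
```

In this toy run the 1985 row falls outside the 1990 to 2010 window and is dropped, while the two spellings of the country name collapse into a single key.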
Table 1. Factors Definition
No.   ITEM (Factor)
1     Public health expenditure as % of total public expenditure
2     Public health expenditure as % of GDP
3     Public expenditure on education as % of total public expenditure
4     Public expenditure on education as % of GDP
5     Military public expenditure as % of total public expenditure
6     Military expenditure as % of GDP
7     Public expenditures on research and development as % of total public expenditure
8     Public expenditures on research and development as % of GDP
9     Public expenditure on subsidies and other transfers
10    A statistical index used to measure a country's overall achievement in its social and economic dimensions
2.2 Data Cleansing and Preparation
A database was designed and constructed to hold all the data gathered and to provide the basis for the checking and preparation process. It includes the following tables: Country; Factors; User Modification; Users; and Country Factor Facts. A star schema was used to relate the tables to each other. The schema was designed to minimize the number of tables in the database and so ease the process for the user [16]. An Excel sheet was created containing all the country-related data, and VBA® macros in Excel were used to assign primary keys to the factors and countries so that each is uniquely identified, thus avoiding any duplication of country names. The same steps were repeated for all factor names, with different and unique keys. Several checks were made on the data to avoid inconsistencies, conflicts, and missing data.
Data cleansing, the process of detecting and correcting (or removing) corrupted or inaccurate records from a record set, table, or database, was conducted on a continuous basis. It identifies incomplete, incorrect, inaccurate, and irrelevant parts of the data and then replaces, modifies, or deletes them. The inconsistencies detected may originally have been caused by user entry errors or by corruption in transmission or storage; after data cleansing, all such inconsistencies had been removed.
The next step was to use Toad® to migrate the modified Excel sheets into the database developed. Toad® is a software application from Quest Software for developing and managing relational databases using SQL; it was also used to run some queries on the data. Here it served as a simple application for inserting the data from the Excel sheets into the database rapidly and efficiently: the Microsoft Excel® sheets were imported into Toad®, and the processes of classification and insertion into the database were then carried out.
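The role of the primary keys and of the central fact table can be illustrated with a small in-memory database. The sketch below uses Python's built-in sqlite3 module in place of the Oracle and Toad® setup of the study, and it covers only three of the five tables (User Modification and Users are omitted). All table names, column names, the helpers `get_id` and `add_fact`, and the inserted values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country (country_id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
    CREATE TABLE factor  (factor_id  INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
    -- Fact table at the center of the star: one value per country, factor, year.
    CREATE TABLE country_factor_fact (
        country_id INTEGER NOT NULL REFERENCES country(country_id),
        factor_id  INTEGER NOT NULL REFERENCES factor(factor_id),
        year       INTEGER NOT NULL,
        value      REAL,              -- NULL marks a missing observation
        PRIMARY KEY (country_id, factor_id, year)
    );
""")

def get_id(table, id_col, name):
    """Return the key for `name`, inserting the name only if it is new."""
    conn.execute(f"INSERT OR IGNORE INTO {table} (name) VALUES (?)", (name,))
    row = conn.execute(
        f"SELECT {id_col} FROM {table} WHERE name = ?", (name,)
    ).fetchone()
    return row[0]

def add_fact(country, factor, year, value):
    """Store one observation; a repeated key overwrites instead of duplicating."""
    conn.execute(
        "INSERT OR REPLACE INTO country_factor_fact VALUES (?, ?, ?, ?)",
        (get_id("country", "country_id", country),
         get_id("factor", "factor_id", factor), year, value),
    )

# Hypothetical values, invented for illustration only.
add_fact("Egypt", "Public health expenditure as % of GDP", 2005, 2.0)
add_fact("Egypt", "Public health expenditure as % of GDP", 2005, 2.0)  # duplicate row
add_fact("Egypt", "HDI", 2005, None)                                   # missing value
```

Because the country and factor names carry a uniqueness constraint and the fact table has a composite primary key, loading the same row twice leaves one country, two factors, and two facts in the database.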
To handle missing data, two algorithms were used with the different tested models, with SAS automatically selecting the one that leads to the best results. The two algorithms are the Most Correlated Branch algorithm and the Largest Branch algorithm.
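Of the two policies named above, the Largest Branch rule is the simpler to picture: on a common reading, a record whose split variable is missing is sent to the branch that holds the most training records. The Python sketch below illustrates that reading only; it is not the SAS Enterprise Miner implementation, and the Most Correlated Branch rule is not shown. The function `route` and its arguments are hypothetical.

```python
def route(value, threshold, n_left, n_right):
    """Pick a branch for one record at a split of the form value >= threshold.

    A record with a missing value (None) follows the branch that received
    more training records, i.e. the largest branch.
    """
    if value is None:
        return "left" if n_left >= n_right else "right"
    return "left" if value >= threshold else "right"

# Branch sizes and threshold of the kind reported for a tree split:
# 221 records at or above the threshold (left), 10 records below it (right).
branch_for_missing = route(None, 5.3588, n_left=221, n_right=10)
branch_for_low = route(4.0, 5.3588, n_left=221, n_right=10)
```

With these sizes, a record with a missing value follows the left branch, while a record with a known value of 4.0 goes right.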
3 Implementation
SAS® Enterprise Miner was deployed for the modeling part; it streamlines the data mining process for creating predictive and descriptive models based on large volumes of data. SAS is a recognized industry leader in business analytics software (including data mining). It was selected for the current study given its availability to the researchers and their past experience working with it. Figure 2 shows the model diagram. An Input Data node was added for the data source; in this node, the properties of each variable are identified.
Fig. 2. Model Diagram - snapshot
A Data Partition node was added to partition the data into training, validation, and test data sets. The training data set is used for preliminary model fitting; 60% of the data was used as training data. The validation data set is used to tune and monitor the model weights during estimation and to check that the built model fits a real and valid data set; 20% of the data was used as validation data. The test data set is used for model assessment and for the final comparisons; the remaining 20% of the data was used as test data.
Five different models were developed, each representing a specific data mining technique or algorithm. These were: (1) Decision Tree; (2) Neural Network; (3) AutoNeural; (4) Regression; and (5) Dmine Regression. Different configurations (parameter settings) of the Decision Tree algorithm were also tested, to check whether any of them represents the data better than the default configuration. These were: Default Decision Tree; CHAID-like Decision Tree; GINI Decision Tree; CART-like Class Probability Decision Tree; and CHAID-like and Valid Decision Tree. Detailed configurations are provided in Appendix B.
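The 60/20/20 partition described above can be sketched in a few lines of Python. The sketch assumes simple random sampling with a fixed seed; the sampling method actually configured in the partitioning node is not stated in the text, and the function name `partition` is hypothetical.

```python
import random

def partition(records, train=0.6, validate=0.2, seed=0):
    """Shuffle records and split them into training, validation and test sets.

    The test set receives whatever remains after the first two shares.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * validate)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

# 100 placeholder record ids stand in for the country-year records.
train_set, valid_set, test_set = partition(range(100))
```

Keeping the test set apart from both fitting and tuning is what allows the test error to serve as the selection criterion when the models are compared.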
4 Results
The different models were connected to the Model Comparison node, which compared all the models according to their accuracy. The results showed that the default Decision Tree model was the most accurate and represented the data in the most meaningful way. Table 2 compares the models according to the selection criterion, the average squared error on the test data set. The best model is the one with the lowest value of the selection criterion, which is the Default Decision Tree.
Table 2. Selection Criteria for all Models
Model Description          Target   Test: Average Squared Error
Default Decision Tree      HDI      0.01752
CHAID Decision Tree        HDI      0.01763
DT Most Correlated Alg.    HDI      0.01857
Dmine Regression           HDI      0.01878
Neural                     HDI      0.01985
GINI Decision Tree         HDI      0.02036
DT Class Prop Alg.         HDI      0.02036
CHAD Decision Tree         HDI      0.02183
DT Largest Branch Alg.     HDI      0.02231
Regression                 HDI      0.04840
AutoNeural                 HDI      0.05217
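The selection rule behind Table 2 can be restated compactly: compute the average squared error of each model on the test data and keep the model with the smallest value. The Python sketch below shows the error measure and applies the rule to the values reported in Table 2. The function name `average_squared_error` and the dictionary `table_2` are hypothetical names chosen for this illustration.

```python
def average_squared_error(actual, predicted):
    """Mean of the squared differences between observed and predicted targets."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Test average squared error per model, copied from Table 2.
table_2 = {
    "Default Decision Tree": 0.01752,
    "CHAID Decision Tree": 0.01763,
    "DT Most Correlated Alg.": 0.01857,
    "Dmine Regression": 0.01878,
    "Neural": 0.01985,
    "GINI Decision Tree": 0.02036,
    "DT Class Prop Alg.": 0.02036,
    "CHAD Decision Tree": 0.02183,
    "DT Largest Branch Alg.": 0.02231,
    "Regression": 0.04840,
    "AutoNeural": 0.05217,
}

# The selected model is the one with the smallest test error.
best_model = min(table_2, key=table_2.get)
```

Applied to the table, the rule selects the Default Decision Tree, in agreement with the text.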
As an example, Figure 3 is used to illustrate the main characteristics of the tree. The first node (Health Expenditure GDP) is the parent node; it shows about 231 training records, to which the prioritized factors are applied, with an average HDI of about 0.86. The tree is split into two branches. The first branch, on the left-hand side, contains the 221 records with Health Expenditure as % of GDP greater than or equal to 5.3588%, with an average HDI of 0.86. The second branch contains only 10 records, with Health Expenditure as % of GDP below 5.3588% and an average HDI of 0.78. The difference in box color indicates that the darker node is preferable to the lighter one. To conclude, this figure indicates that it is preferable for the government to spend at least 5.36% of GDP on health.
Table 3 shows the importance of each factor, ranked from the most important, the Subsidi Expenditure EXP factor, to the least important, the Education GDP factor. The best path, shown in Figure 4 and indicated by stars, leads to the countries with the best HDI values. The figure indicates that it is preferable to spend at least 29.5654% on subsidies, then at least 5.1234% of GDP on health, and then at least 1.0601% of GDP on research and development.
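A threshold such as the one on Health Expenditure as % of GDP is typically found by scanning candidate cut points and keeping the one that leaves the target most homogeneous within the two branches. The Python sketch below illustrates this with a squared-error criterion, as in CART-style regression trees; this criterion is an assumption, and the one applied by SAS Enterprise Miner in the study may differ. The function names `sse` and `best_split` and the toy data are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, error) for the split x >= threshold that
    minimizes the total within-branch squared error of y."""
    best = None
    # Skip the smallest x: using it as threshold leaves the lower branch empty.
    for threshold in sorted(set(xs))[1:]:
        upper = [y for x, y in zip(xs, ys) if x >= threshold]
        lower = [y for x, y in zip(xs, ys) if x < threshold]
        total = sse(upper) + sse(lower)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best

# Toy data, invented for illustration only.
health_pct_gdp = [3.0, 4.0, 6.0, 7.0]
hdi = [0.70, 0.72, 0.85, 0.87]
threshold, error = best_split(health_pct_gdp, hdi)
```

On the toy data the search separates the two low-HDI records from the two high-HDI ones. As a consistency check on the reported node, the branch averages weighted by branch size give (221 × 0.86 + 10 × 0.78) / 231 ≈ 0.857, which agrees with the parent average of about 0.86.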
Fig. 3. Sample section of the entire tree - snapshot
Table 3. Importance of factors
ITEM (Factor)                                                   Importance
Public expenditure on subsidies and other transfers             1.000
Public health expenditure as % of GDP                           0.758
Public expenditures on research and development as % of GDP     0.463
Military expenditure as % of GDP                                0.249
Public expenditure on education as % of GDP                     0.201
Fig. 4. The Best Path - snapshot
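The best path can be read as a sequence of minimum-spending conditions, which makes it straightforward to check a given allocation against it. The Python sketch below encodes the three thresholds reported in the text and lists the conditions an allocation fails. It is an illustration only: the factor keys, the function `unmet_conditions`, and the sample allocation are hypothetical, and in the tree itself each later condition is evaluated only on records that satisfy the earlier ones.

```python
# Thresholds of the best path as reported in the text (all in percent).
BEST_PATH = [
    ("subsidies", 29.5654),
    ("health_pct_gdp", 5.1234),
    ("rnd_pct_gdp", 1.0601),
]

def unmet_conditions(allocation):
    """List the best-path conditions that an allocation fails, in tree order.

    A factor absent from the allocation is treated as 0.0 and so counts as unmet.
    """
    return [name for name, minimum in BEST_PATH
            if allocation.get(name, 0.0) < minimum]

# Hypothetical allocation, invented for illustration only.
example = {"subsidies": 30.0, "health_pct_gdp": 4.0, "rnd_pct_gdp": 1.5}
gaps = unmet_conditions(example)
```

For the sample allocation only the health condition is unmet, which is the kind of gap such a check would surface when assessing a country against the model.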
5 Conclusions
The main objective of the current study was to identify the distribution of public expenditure across areas such as education, health, and research and development that maximizes the benefit to society (social development) for the same volume of spending. In other words, the aim was to determine the best share of total expenditure to spend on each area, as well as the best amount to spend on each area as a percentage of GDP. The paper used data mining, which relies on a large number of records for a large number of countries, to draw out a pattern in the distribution of public expenditure among countries that share a certain level of human development. For that purpose, the Human Development Index (HDI) was used as a proxy indicator for the return to society. SAS Enterprise Miner was used to test several data mining algorithms in order to reach the model that achieves the best results. These included advanced regression models, artificial neural network models, and decision tree models.
References
1. Dewan, S., Ettlinger, M.: Comparing public spending and priorities across OECD countries. Cent. Am. Prog (2009), http://www.BoellOrgdownloadsewanEittinglerComparingPublicSpending.Pdf
2. Ali, A.G.A., Fan, S.: Public policy and poverty reduction in the Arab region. Arab Planning Institute (2007)
3. Paternostro, S., Rajaram, A., Tiongson, E.R.: How does the composition of public spending matter? Oxf. Dev. Stud. 35(1), 47–82 (2007)
4. Barro, R.J.: Government spending in a simple model of endogenous growth. National Bureau of Economic Research Cambridge, Mass (1991)
5. Landau, D.: Government expenditure and economic growth: a cross-country study. South. Econ. J., 783–792 (1983)
6. Aschauer, D.: Is government spending productive? J. Monet. Econ. 23(2), 177–200 (1989)
7. Ranjan, J.: Applications of Data Mining techniques in Pharmaceutical Industry. J. Theor. Appl. Inf. Technol. 3(4), 61–65 (2007)
8. Microsoft: Excel, http://office.microsoft.com/en-001/excel
9. Visual Basic for Applications (VBA) macros, http://msdn.microsoft.com/en-us/office/ff688774.aspx
10. Toad World, https://www.toadworld.com
11. SAS, http://www.sas.com/en_us/home.html
12. The World Bank, http://data.worldbank.org
13. United Nations Development Programme (UNDP): Human Development Data API, http://hdr.undp.org/en/data/ap
14. The World Bank: Worldwide Governance Indicators, http://data.worldbank.org/data-catalog/worldwide-governance-indicators
15. Kaufmann, D., Kraay, A., Mastruzzi, M.: Governance matters VII: Aggregate and individual governance indicators 1996–2007. World Bank, Washington DC (2008)
16. Giovinazzo, W.A.: Object-oriented data warehouse design: Building a star schema. Prentice Hall PTR (2000)