Forecasting Model for the Students' Job Turnover in Thai Industries

Pirapat Chantron1, Prasong Praneetpolgrang2
1, 2 Master of Science Program in Information Technology, Graduate School, Sripatum University, Bangkok, Thailand 10900
Email: [email protected], [email protected]

Abstract- The purpose of this study is to build a forecasting model for students' job turnover in Thai industries. The proposed model is based on data mining techniques, and its best results are compared with the results of Multiple Regression Analysis. The data were collected from students who work in Thai industries. The results show that the forecasting model built with the Bayesian Network technique predicts job change with high accuracy and identifies the key variables that affect job change among students working in industry: major, position and salary.

Keywords- Bayesian Networks, Data Mining, Forecasting Model

I. INTRODUCTION
Many university students are unsure which field of study suits them and would lead to work that matches their interests. A number of students transfer or change their majors, drop out or resign from the university. After graduating and starting work, a number of them change jobs or resign because they cannot find work that fits their major or their interests. The reasons are that students lack experience and information about their majors. They do not know the individual disciplines well enough, and they find out afterwards that their studies, their majors or their work do not suit them, when it is too late to start again.

Educational institutes hold very large amounts of student data but cannot make use of them. Data mining, which is based on statistical analysis, has been used to find and describe structural patterns in data, for segmentation and for prediction. The technique has been applied extensively in many industries, including banking and finance, education, medical sciences and manufacturing.

Xenos [1] proposed Bayesian Networks for modeling student behavior in order to enable prediction, status assessment and decision-making, and reported that accurate and useful results can be obtained. Garcia et al. [2] evaluated the precision of Bayesian Networks for representing and detecting students' learning styles in a Web-based education system; the Bayesian Networks could categorize students into predefined dimensions with high precision. Yingkuachat et al. [3] proposed predicting educational accomplishment with a data mining technique, Bayesian Networks. Their results show that Bayesian Networks can determine the important variables for predicting educational accomplishment and achieve high prediction accuracy. Mukoolskunpibal and Kitisin [4] compared the efficiency of the C4.5, ADTree and Naïve Bayes algorithms for predicting concealed narcotics in international postal mail and packages. The performance comparison used hold-out and k-fold cross-validation, and ADTree achieved the best correct rate. Yamansabideen et al. [5] used data mining to develop Customer Relationship Management for students, using a Decision Tree. The results can be used as decision-support data to address the problems of students who are close to being dismissed. Sun and Shenoy [6] used Bayesian Networks for bankruptcy prediction, evaluated with 10-fold cross-validation, and reported that the model performed best. Hidekazu et al. [7] used a Decision Tree for estimating sentence types. They revised the representative Decision Tree algorithm C4.5 by changing the gain ratio criterion and replacing the hill-climbing method with a genetic algorithm; the result showed high accuracy. Waiyamai et al. [8] applied data mining techniques to analyze undergraduate students, with the aim of matching students' skills to their selected study program. Using association rules, both Decision Tree and Bayesian Networks appear to be good classifiers.

This paper presents a comparison of Decision Tree and Bayesian Networks for predicting the factors that affect employees' change of work in an organization.

The Sixth International Conference on eLearning for Knowledge-Based Society, 17-18 December 2009, Thailand

II. CLASSIFICATION ALGORITHMS
A. Bayesian Networks
A Bayesian belief network, also called a Bayes net, is a learning method that relaxes the limitation of Naive Bayes, which assumes that the characteristics of the data are independent. Simple Bayesian learning generally assumes that the characteristics are independent, but in practice some characteristics depend on one another, so these dependencies should be included in the model. A Bayes net describes the conditional independence between variables (the term "variable" is used here in place of "characteristic"). The model can combine (1) prior knowledge of the dependencies between variables with (2) training examples to make learning more efficient, by putting the prior knowledge into the Bayesian belief network in the form of a network structure and conditional probability tables.

A Bayesian Network is a specific type of graphical model: a directed acyclic graph. That is, all of the edges in the graph are directed and there are no cycles. A Bayesian Network can be used to compute the conditional probability of one node, given values assigned to the other nodes. It can also be used as a classifier that gives the posterior probability distribution of the class node given the values of the other attributes.

Fig. 1 Example of Bayesian Networks (node labels in the figure: X, Y, Z, A, B; conditional probabilities shown: P(a|x,z), P(z|y), P(b|a))

Fig. 1 illustrates a Bayesian Network. Its set of edges is E = {(B,A), (B,C)}. The edges in a Bayesian Network encode a particular factorization of the joint distribution. In this example, the joint distribution of all the variables, as factorized by this Bayesian Network, is

$$P(A, B, C) = P(A \mid B)\, P(B)\, P(C \mid B) \qquad (1)$$
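The factorization in equation (1) can be made concrete with a small sketch. It assumes binary variables and uses made-up conditional probability tables (the numbers and the names `P_B`, `P_A_GIVEN_B`, `P_C_GIVEN_B` are illustrative assumptions, not values from this study). It also shows how normalizing the joint distribution gives the posterior of one node given the others, which is how such a network can act as a classifier.

```python
from itertools import product

# Hypothetical conditional probability tables for binary variables A, B, C
# with edges B -> A and B -> C. The numbers are illustrative only.
P_B = {True: 0.3, False: 0.7}
P_A_GIVEN_B = {True: {True: 0.9, False: 0.1},   # P_A_GIVEN_B[b][a]
               False: {True: 0.2, False: 0.8}}
P_C_GIVEN_B = {True: {True: 0.6, False: 0.4},   # P_C_GIVEN_B[b][c]
               False: {True: 0.5, False: 0.5}}


def joint(a, b, c):
    """P(A=a, B=b, C=c) = P(a | b) * P(b) * P(c | b), as in equation (1)."""
    return P_A_GIVEN_B[b][a] * P_B[b] * P_C_GIVEN_B[b][c]


def posterior_b(a, c):
    """P(B | A=a, C=c), obtained by normalizing the joint over B."""
    unnormalized = {b: joint(a, b, c) for b in (True, False)}
    total = sum(unnormalized.values())
    return {b: value / total for b, value in unnormalized.items()}


if __name__ == "__main__":
    # The joint probabilities over all eight assignments sum to 1.
    total = sum(joint(a, b, c) for a, b, c in product((True, False), repeat=3))
    print(round(total, 10))
    print(posterior_b(a=True, c=True))
```

Treating B as the class node, `posterior_b` returns the class distribution for one observed combination of the other attributes.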

III. EVALUATION INDICES
To evaluate the classifiers used in this work, we applied the following evaluation indices:
Correct Percentage = the number of correct classifications divided by the total number of classifications;
Precision = the number of documents relevant and retrieved divided by the total number of documents retrieved;

Special Issue of the International Journal of the Computer, the Internet and Management, Vol. 17 No. SP3, December, 2009


Recall = the number of documents relevant and retrieved divided by the total number of documents that are relevant; and
F-Measure (F) = a performance measure based on precision (p) and recall (r), as in equation (2).

$$F_1 = \frac{2 \times p \times r}{p + r} \qquad (2)$$
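As a minimal sketch, the indices above can be computed from classification counts as follows. The function names and the count-based interface are assumptions made for illustration; the final line applies equation (2) to the Precision (0.93) and Recall (0.89) that the paper reports for the Bayes Net model.

```python
def f_measure(p, r):
    """Equation (2): harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def precision_recall_f1(true_positive, false_positive, false_negative):
    """Return (precision, recall, F1) from raw classification counts."""
    retrieved = true_positive + false_positive   # everything predicted positive
    relevant = true_positive + false_negative    # everything actually positive
    p = true_positive / retrieved if retrieved else 0.0
    r = true_positive / relevant if relevant else 0.0
    return p, r, f_measure(p, r)


if __name__ == "__main__":
    # Precision and Recall as reported for the Bayes Net model in this paper.
    print(round(f_measure(0.93, 0.89), 2))  # 0.91
```

The rounded result agrees with the F-measure of 0.91 reported in the results section.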

IV. DATA MINING TECHNIQUES
1) Data Selection specifies the source data to be used in data mining.
2) Data Preprocessing prepares the data, for example by separating out null, erroneous and duplicated data, or by eliminating data that do not fit the data selection.
3) Data Transformation converts irregular data into a form suitable for analysis by the data mining algorithm.
4) Data Mining applies a data mining technique, which can generally be divided into 2 types.
4.1 Predictive Data Mining forecasts or estimates values by using past data.
4.2 Descriptive Data Mining finds a model that explains some characteristics of the data, mostly by segmenting the data.
5) Interpretation and Evaluation interprets and evaluates the results to decide whether they are appropriate and meet the objective; the results are generally presented with simple visualization.

V. RESEARCH EXPERIMENT DESCRIPTION
Records with missing or incomplete data, such as students with no recorded grades or no evidence of registration, were deleted. Students were then classified into classes according to their grade point averages. The Waikato Environment for Knowledge Analysis (WEKA, version 3.5.6) was employed as the primary tool for analysis. The data were tested with the 10-fold cross-validation method, which divides them into a training set and a testing set. In this method, the data were separated equally into 10 groups, with one group used as the testing set and the others as the training set. The training and testing process was repeated, changing the testing group each time, until each of the ten groups had been used as the testing set. The ten accuracy results were then averaged [12]. The model was developed on the basis of the Bayesian Network algorithm and was used to predict the class for the turnover of employees in public and private organizations. The measurement indices were true positive rate, false positive rate, Precision, Recall and F-Measure. The analysis and conclusions from this experiment were used to construct the model for predicting the turnover of employees in public and private organizations. Fig. 2 shows the algorithm process in the research framework.

Fig. 2 Research framework (boxes: Student Database, Data Pre-processing, Classification Algorithm with Bayesian Network, Evaluation, Model)

Fig. 2 presents the model for the study plan in three steps:
1) Data Pre-processing: a process for cleaning the student data in order to put the data into the right format. The cleaned data are then used in the next step.
2) Data Mining: the Bayesian Network data mining technique is applied to identify and analyze important information gleaned from the pre-processed data. The output of this step is a model for managing the study plan, which represents the characteristics of, and relationships between, the student data.
3) Post-processing: the developed knowledge (model) is tested and evaluated for its validity.
In this section, we use Bayesian Networks with hold-out cross-validation. The performance measurements are Correct Percentage, Precision, Recall and F-Measure.
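The 10-fold cross-validation procedure described in this section can be sketched in a few lines. This is not WEKA's implementation: the helper names, the random shuffling, and the majority-class learner used as a stand-in classifier are all assumptions made so that the example is self-contained. With 2,536 records the ten groups cannot be exactly equal, so the sketch produces folds of 253 or 254 records.

```python
import random
from collections import Counter


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle the indices 0..n_items-1 and split them into k near-equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_majority(train_x, train_y):
    """Stand-in learner (assumption): remember the most frequent class."""
    return Counter(train_y).most_common(1)[0][0]


def predict_majority(model, record):
    """Stand-in predictor: always return the remembered class."""
    return model


def cross_validate(records, labels, train_fn, predict_fn, k=10, seed=0):
    """Use each fold once as the test set and average the k accuracies."""
    accuracies = []
    for test_idx in k_fold_indices(len(records), k, seed):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(records) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        model = train_fn(train_x, train_y)
        correct = sum(predict_fn(model, records[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


if __name__ == "__main__":
    folds = k_fold_indices(2536, k=10)
    print(sorted({len(fold) for fold in folds}))  # fold sizes
```

Replacing `train_majority` and `predict_majority` with a real classifier keeps the splitting and averaging logic unchanged.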

VI. DATASET
Data mining techniques were used in this research to create a model of the relationship between people's majors and their holding and changing jobs in public and private organizations, studied from academic performance, personal profiles and work background. The total sample comprised 2,536 persons.

1) Six universities, consisting of 3 public and 3 private universities, as shown in Table I.

TABLE I SIX SAMPLING UNIVERSITIES
| Universities | Count |
|---|---|
| Kasetsart University | 237 |
| Phranakhon Rajabhat University | 241 |
| Thepsatri Rajabhat University | 228 |
| Sripatum University | 245 |
| Dhurakij Pundit University | 242 |
| Saint John's University | 238 |

2) Six private organizations, as shown in Table II.

TABLE II SIX SAMPLING PRIVATE ORGANIZATIONS
| Company | Count |
|---|---|
| Charoen Pokphand | 270 |
| Dtac | 130 |
| Department of Land Transport | 190 |
| Thai Airways | 252 |
| Cooperative Auditing Department | 137 |
| Bangkhen District Office | 126 |

In this research, the data were obtained from 2,536 students, and 22 variables were defined, as shown in Table III. All variables were used to construct the model, and the most accurate results obtained were carried into the model.

TABLE III ATTRIBUTE OF DATASET
| Attribute Names | Description |
|---|---|
| Gender | Gender |
| Major | Field of Education |
| GpaLevel | Accumulate Grade point average at the last semester |
| TimeFindWork | Period of experience |
| Position | Position of the job |
| Company_Name | Company |
| Salary | Job salary rate |
| MatchEdu | Function match with the studying field |
| ApplyEdu | Knowledge application |
| Cause | Filling of working |
| BROTHER | No. of Children |
| BROTHER_AT | Rank order in the family |
| STATUS | Student status |
| LOCATION | Location |
| LOCATION_CHAR | Location char |
| DOMICILE | Home town |
| PARENT_STATUS | Parent status |
| OCC_FAT | Father occupation |
| OCC_MOT | Mother occupation |
| OCC_Type | Type of work |
| FAM_INCOME | Family income |
| WorkChange | Work Changing |
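The cleaning and classification steps described in Section V (deleting records with no grades or no evidence of registration, then grouping students by grade point average into the GpaLevel attribute of Table III) might look like the following sketch. The raw field names `gpa` and `registered`, the level names, and the GPA cut-offs are assumptions for illustration only; the paper does not state them.

```python
def gpa_level(gpa):
    """Map a numeric GPA to a class label (the cut-offs are hypothetical)."""
    if gpa >= 3.5:
        return "high"
    if gpa >= 2.5:
        return "medium"
    return "low"


def preprocess(records):
    """Drop incomplete records and replace the raw GPA with a GpaLevel class."""
    cleaned = []
    for record in records:
        # Skip students with no recorded grade or no evidence of registration.
        if record.get("gpa") is None or not record.get("registered", False):
            continue
        row = {key: value for key, value in record.items()
               if key not in ("gpa", "registered")}
        row["GpaLevel"] = gpa_level(record["gpa"])
        cleaned.append(row)
    return cleaned


if __name__ == "__main__":
    sample = [
        {"Major": "IT", "gpa": 3.6, "registered": True},
        {"Major": "Law", "gpa": None, "registered": True},
        {"Major": "Arts", "gpa": 2.1, "registered": False},
    ]
    print(preprocess(sample))
```

Only the first sample record survives cleaning, because the second has no grade and the third has no registration evidence.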

VII. RESEARCH RESULTS
This research led to the selection of a suitable data analysis algorithm. In the trial, the data were divided into groups and performance was measured by Percentage Correct, Precision, Recall and F-measure, as displayed in Fig. 3.

Fig. 3 The graph results of Bayesian Networks (bar chart for Bayes Net: Percentage Correct 97.26, Precision 0.93, Recall 0.89, F-measure 0.91)

The experimental results show that the Bayesian Networks algorithm achieved a prediction accuracy (Percentage Correct) of 97.26%, a Precision of 0.93, a Recall of 0.89 and an F-measure of 0.91, where the F-measure combines Precision and Recall. WEKA was used to construct the model. Fig. 4 shows the variables that affect work change. The model was tested using 10-fold cross-validation, a well-known method that makes it possible to estimate the error. Model testing indicated an accuracy of 97.26%; the WEKA output for the model is shown in Table IV.

Fig. 4 The result from the model (nodes: Work Change, Salary, Major, Position)

TABLE IV THE RESULT FROM THE MODEL
=== Run information ===
Scheme: weka.classifiers.bayes.BayesNet -D -Q weka.classifiers.bayes.net.search.local.K2 -- -P 3 -E weka.classifiers.bayes.net.estimate.SimpleEstimator -- -A 0.5
Instances: 2536
Attributes: 22 (Gender, Major, GpaLevel, TimeFindWork, Position, Company_Name, Salary, MatchEdu, ApplyEdu, Cause, BROTHER, BROTHER_AT, STATUS, LOCATION, LOCATION_CHAR, DOMICILE, PARENT_STATUS, OCC_FAT, OCC_MOT, OCC_Type, FAM_INCOME, WorkChange)
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Bayes Network Classifier not using ADTree
=== Summary ===
| Measure | Count / Value | Percentage |
|---|---|---|
| Correctly Classified Instances | 2280 | 97.2634 % |
| Incorrectly Classified Instances | 256 | 2.7366 % |
| Kappa statistic | 0.8633 | |
| Mean absolute error | 0.0742 | |
| Root mean squared error | 0.1872 | |
| Relative absolute error | | 25.1745 % |
| Root relative squared error | | 48.9402 % |
| Total Number of Instances | 2536 | |

The predictive model for work change was constructed in order to test the accuracy of the data mining technique using Bayesian Networks. The result indicated an accuracy of 97.26%. This study suggests that graduates consider the factors that affect their work, namely field of study (Major), Position and Salary. These variables are suitable for constructing a model to predict change of work.


Our future work is to apply data mining techniques for prediction. In order to increase the predictive power of classification, an alternative feature selection method such as a Genetic Algorithm might be applied to select important attributes before classification.

We then collected data from the questionnaires, arranged the data in order, and analyzed them using SPSS (Statistical Package for Social Science) for Windows.

This research also applied Multiple Regression Analysis. Table V shows that the coefficient of determination (R Square) is 0.616; that is, the predictor variables explain 61.60% of the variation in the dependent variable, WorkChange.

TABLE V MODEL SUMMARY
| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | | .616 | .612 | |
a. Predictors: (Constant), Major, Position, Salary
b. Dependent Variable: WorkChange

Table V shows that the coefficient of determination explains approximately 60% of the variation; the ANOVA results are given in Table VI.

TABLE VI ANOVA
| Model | Sum of Squares | df | Mean Square | F | Sig |
|---|---|---|---|---|---|
| 1 Regression | 47.40 | 2 | | | |
| Residual | 44.40 | 230 | | | |
| Total | 91.80 | 232 | | | |
a. Predictors: (Constant), Major, Position, Salary
b. Dependent Variable: WorkChange

The source also gives the values .62, .639 and .61 for Tables V and VI, but the cells they belong to cannot be determined.

Table VI shows a statistical significance (Sig.) value of zero (0.000), which indicates that the independent variables major, job position and job salary rate affect the change of work of graduates. Table VII presents the coefficients of the regression equation; the significance values of all variables are zero, so all independent variables are correlated with the dependent variable.

TABLE VII COEFFICIENTS
| Model | Unstandardized coefficients: B | Unstandardized coefficients: Std. Error | Beta | t | Sig |
|---|---|---|---|---|---|
| 1 (Constant) | .43 | .12 | | 3.72 | .00 |
| Major | .35 | .05 | .40 | 7.69 | .00 |
| Position | .42 | .05 | .44 | 8.38 | .00 |
| Salary | .46 | .06 | .47 | 9.05 | .00 |
a. Dependent Variable: WorkChange

TABLE VIII SHOWS THE RELATIONSHIP BETWEEN THE TYPE OF UNIVERSITY TO THE WORK
| U. Types | Match: count | Match: % | Not match: count | Not match: % | Total: count | Total: % |
|---|---|---|---|---|---|---|
| Public | 630 | 24.8 | 591 | 23.3 | 1221 | 48.1 |
| Private | 668 | 26.3 | 647 | 25.5 | 1315 | 51.9 |
| Total | 1298 | 51.2 | 1238 | 48.8 | 2536 | 100 |
a. U = University

Table VIII shows that 24.8% of the sample were employees who graduated from public universities and worked in jobs that match their field of study, and 23.3% were public university graduates whose work does not match their field of study. Employees who graduated from private universities and worked in their field of study made up 26.3%, and private university graduates who did not work in their field of study made up 25.5%.
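Read as a regression equation, the unstandardized coefficients (column B) of Table VII would give the fitted model below. This is a sketch that assumes the B values are listed in the row order (Constant), Major, Position, Salary and that the three predictors are numerically coded; the paper does not state the coding.

$$\widehat{\text{WorkChange}} = 0.43 + 0.35\,\text{Major} + 0.42\,\text{Position} + 0.46\,\text{Salary}$$

Under that reading, a one-unit increase in the coded Salary value is associated with a 0.46 increase in the predicted WorkChange score, holding Major and Position fixed.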


TABLE IX THE RELATIONSHIP BETWEEN THE WORK AND THE CHANGE OF WORK
| The match of work | Change: count | Change: % | No change: count | No change: % | Total: count | Total: % |
|---|---|---|---|---|---|---|
| Match | 309 | 12.2 | 989 | 39.0 | 1298 | 51.2 |
| Not match | 630 | 24.8 | 608 | 24.0 | 1238 | 48.8 |
| Total | 939 | 37.0 | 1597 | 63.0 | 2536 | 100 |

Table IX shows that 12.2% of the sample were employees whose work matches their field of study and who changed their work, and 39.0% were employees whose work matches their field of study and who did not change their work. Employees whose work does not match their field of study and who changed their work made up 24.8%, and those whose work does not match their field of study and who did not change their work made up 24.0%.

TABLE X THE EMPLOYEES WHO GRADUATED FROM PUBLIC UNIVERSITIES CHANGED THEIR WORK
| U. Types | Change: count | Change: % | No change: count | No change: % | Total: count | Total: % |
|---|---|---|---|---|---|---|
| Public | 388 | 15.3 | 833 | 32.8 | 1221 | 48.1 |
| Private | 551 | 21.7 | 764 | 30.1 | 1315 | 51.9 |
| Total | 939 | 37.0 | 1597 | 63.0 | 2536 | 100 |
a. U = University

Table X shows that 15.3% of the sample were employees who graduated from public universities and changed their work, and 32.8% were public university graduates who did not change their work. Employees who graduated from private universities and changed their work made up 21.7%, and private university graduates who did not change their work made up 30.1%.

VIII. CONCLUSION
In conclusion, the factors found to affect job change are major, job position and job salary rate. Moreover, the coefficient of determination (R Square) was 61.6% and statistically significant, with a significance value of zero (0.00), and all variables were found to be consistent with the results from the Bayesian Networks technique. The variables derived from Bayesian Networks modeling are therefore believed to be reliable enough for use in this model.

IX. REFERENCES
[1] M. Xenos, "Prediction and assessment of student behavior in open and distance education in computers using Bayesian networks," Computers & Education, Volume 43, Issue 4, pp. 345-359, 2004.
[2] P. Garcia, A. Amandi, S. Schiaffino and M. Campo, "Evaluating Bayesian networks' precision for detecting students' learning styles," Computers & Education, Volume 49, Issue 3, pp. 794-808, 2007.
[3] J. Yingkuachat et al., "A Prediction of Higher Education Students' Graduation with Bayesian Learning and Data Mining," Research and Innovations for Sustainable Development Conference, 2006.
[4] S. Mukoolskunpibal and S. Kitisin, "Performance comparison of C4.5, ADTree and Naïve Bayes classification algorithms on international postages with narcotics," National Joint Conference on Computer Science & Software Engineering, 2007.
[5] P. Yamansabideen et al., "An Application of Data Mining in Customer Relationship Management for Higher Education Students," Research and Innovations for Sustainable Development Conference, 2006.

[6] L. Sun and P. P. Shenoy, "Using Bayesian networks for bankruptcy prediction: Some methodological issues," European Journal of Operational Research, Volume 180, Issue 2, pp. 738-753, 2007.
[7] T. Hidekazu, A. El-Sayed, F. Masao, M. Kazuhiro, T. Kazuhiko and J. Aoe, "Estimating sentence types in computer related new product bulletins using a decision tree," Information Sciences, Volume 168, Issues 1-4, pp. 185-200, 2004.
[8] K. Waiyamai, T. Rakthanmanon and C. Songsiri, "Data Mining Techniques for Developing Education in Engineering Faculty," NECTEC Technical Journal, Volume III, No. 11, pp. 134-142, 2001.
[9] Jiawei Han and Micheline Kamber, Data Mining Concepts and Techniques, The Morgan Kaufmann Publishers, 2001.
[10] G. John Hendricks, "An Analysis of Student Graduation Trends in Texas State Technical Colleges Utilizing Data Mining and Other Statistical Techniques," Doctoral Thesis of Educational Administration, Baylor University, Texas, U.S.A., 2000.
[11] R. Remco Bouckaert, "Bayesian Networks Classifiers in Weka," Department of Computing Science, University of Waikato, New Zealand, 2005.
[12] Richard Kirkby and Eibe Frank, Weka Explorer User Guide, University of Waikato, New Zealand, 2005.
[13] J. Richard Roiger and W. Michael Geatz, Data Mining: A Tutorial-Based Primer, Addison Wesley Publishing Company, 2003.
[14] Y. Yang and G. I. Webb, "Proportional k-Interval Discretization for Naïve Bayes Classifiers," In L. de Raedt and P. Flach, editors, Proceedings of the Twelfth European Conference on Machine Learning, Freiburg, Germany, Berlin: Springer-Verlag, pp. 564-575, 2001.
[15] D. Heckerman, D. Geiger and D. M. Chickering, "Learning Bayesian Networks: The Combination of Knowledge and Statistical Data," Machine Learning, 20(3), pp. 197-243, 1995.
[16] D. Grossman and P. Domingos, "Learning Bayesian Networks Classifiers by Maximizing Conditional Likelihood," In R. Greiner and D. Schuurmans, editors, Proceedings of the Twenty-First International Conference on Machine Learning, Banff, Alberta, Canada. New York: ACM, pp. 361-368, 2004.
[17] D. J. Hand, H. Mannila and P. Smyth, Principles of Data Mining, Cambridge, MA: MIT Press, 2001.
[18] T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning, New York: Springer-Verlag, 2001.
[19] G. H. John, Enhancements to the Data Mining Process, Ph.D. Dissertation, Computer Science Department, Stanford University, Stanford, CA, 1997.
[20] Z. Zheng and G. Webb, "Lazy learning of Bayesian rules," Machine Learning, 41(1), pp. 53-84, 2000.
[21] J. Swets, "Measuring the accuracy of Diagnostic Systems," Science, 240, pp. 1285-1293, 1988.
[22] M. Sahami, S. Dumais, D. Heckerman and E. Horvitz, "A Bayesian Approach to Filtering Junk e-Mail," Proceedings of the AAAI-98 Workshop on Learning for Text Categorization, Madison, WI. Menlo Park, CA: AAAI Press, pp. 55-62, 1998.
[23] R. Kass and L. Wasserman, "A Reference Bayesian Test for Nested Hypotheses and Its Relationship to the Schwarz Criterion," Journal of the American Statistical Association, 90, pp. 928-934, 1995.
[24] M. V. Johns, "An Empirical Bayes Approach to Nonparametric Two-way Classification," In H. Solomon, editor, Studies in Item Analysis and Prediction, Palo Alto, CA: Stanford University Press, 1961.
[25] G. H. John, "Estimating Continuous Distributions in Bayesian Classifiers," In P. Besnard and S. Hanks, editors, Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, Montreal, Canada. San Francisco: Morgan Kaufmann, pp. 338-345, 1995.
[26] J. W. Kim, B. H. Lee, M. J. Shaw, H. Chang and M. Nelson, "Application of Decision-Tree Induction Techniques to Personalized Advertisements on Internet Storefronts," International Journal of Electronic Commerce, Volume 5, Issue 3, pp. 45-62, 2001.
[27] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," In J. Bocca, M. Jarke and C. Zaniolo, editors, Proceedings of the International Conference on Very Large Databases, Santiago, Chile. San Francisco: Morgan Kaufmann, pp. 478-499, 1994.
[28] R. Agrawal, T. Imielinski and A. Swami, "Database Mining: A Performance Perspective," IEEE Transactions on Knowledge and Data Engineering, 5(6), pp. 914-925, 1993.
[29] M. Danielson, L. Ekenberg and A. Larsson, "Distribution of expected utility in decision trees," International Journal of Approximate Reasoning, Volume 46, Issue 2, pp. 387-407, 2007.
[30] J. R. Quinlan, "Induction of Decision Trees," Machine Learning, 1(1), pp. 81-106, 1986.
[31] J. W. Kim, B. H. Lee, M. J. Shaw, H. Chang and M. Nelson, "Application of Decision-Tree Induction Techniques to Personalized Advertisements on Internet Storefronts," International Journal of Electronic Commerce, Volume 5, Issue 3, pp. 45-62, 2001.
[32] D. Kalles, A. Papagelis and E. Ntoutsi, "Induction of decision trees in numeric domains using set-valued attributes," Intelligent Data Analysis, Volume 4, Issues 3-4, pp. 323-347, 2000.
[33] R. Potharst and J. C. Bioch, "Decision trees for ordinal classification," Intelligent Data Analysis, Volume 4, Issue 2, pp. 97-111, 2000.