Expert Systems with Applications 37 (2010) 5895–5904


Predicting business failure using classification and regression tree: An empirical comparison with popular classical statistical methods and top classification mining methods

Hui Li*, Jie Sun, Jian Wu

School of Business Administration, Zhejiang Normal University, 91 Subbox in P.O. Box 62, YingBinDaDao 688, Jinhua City 321004, Zhejiang Province, PR China

Article info

Keywords: Business failure prediction (BFP); Data mining; Classification and regression tree (CART)

Abstract

Predicting business failure is a critical task for government officials, stockholders, managers, employees, investors, and researchers, especially in today's competitive economic environment. Several of the top 10 data mining methods have become popular alternatives in business failure prediction (BFP), e.g., support vector machine and k nearest neighbor. In comparison with other classification mining methods, the advantages of classification and regression tree (CART) include simplicity of results, easy implementation, nonlinear estimation, non-parametric modeling, accuracy, and stability. However, little research in the area of BFP has examined the applicability of CART, another method among the top 10 algorithms in data mining. The aim of this research is to explore the performance of BFP using the commonly discussed data mining technique of CART. To demonstrate the effectiveness of CART in BFP, business failure prediction tasks were performed on a data set collected from companies listed on the Shanghai Stock Exchange and Shenzhen Stock Exchange. Thirty times' hold-out method was employed as the assessment, and two commonly used methods among the top 10 data mining algorithms, i.e., support vector machine and k nearest neighbor, and two baseline benchmark methods from statistics, i.e., multiple discriminant analysis (MDA) and logistic regression, were employed as comparative methods. For the comparative methods, the stepwise method of MDA was employed to select the optimal feature subset. Empirical results indicate that the optimal algorithm of CART outperforms all the comparative methods in terms of predictive performance and significance tests in short-term BFP of Chinese listed companies. © 2010 Elsevier Ltd. All rights reserved.

1. Introduction

1.1. Data mining

Recently, the field of data mining has seen an explosion of interest from various areas. It refers to the process of discovering interesting knowledge from large amounts of data. Classification is the chief task that data mining techniques have been developed to solve for a variety of applications. Classification predicts categorical labels, which is the target of various areas, e.g., business failure prediction (BFP) (Ding, Song, & Zeng, 2008; Etemadi, Rostamy, & Dehkordi, 2009; Huang, Tsai, Yen, & Cheng, 2008; Hung & Chen, 2009; Lin et al., 2009; Min & Jeong, 2009; Min & Lee, 2008; Ravi & Pramodh, 2008; Sun & Li, 2008a, 2008b; Wu, 2010) and credit scoring (David, 2008; Martens, Baesens, Gestel, & Vanthienen, 2007; Min & Lee, 2005; Tsai & Wu, 2008; Yu, Wang, & Lai, 2008; Zhu, He, Starzyk, et al., 2007). To solve this type of task, previous experiences are first collected and represented;

* Corresponding author. Tel.: +86 158 8899 3616. E-mail addresses: [email protected] (H. Li), [email protected] (J. Sun).
doi:10.1016/j.eswa.2010.02.016

then, a predictive classifier is trained to describe the predetermined set of data concepts; when facing a new problem, corresponding solutions can be recommended by the classifier. Since the class label of each sample must be provided in the training process, data classification belongs to the category of supervised learning. When deciding whether or not a classifier is applicable in a specific area, an estimate of the predictive accuracy of the classifier is often used. There are many techniques that can be employed in the task of data mining. Wu, Kumar, Quinlan, et al. (2008) identified the top 10 data mining algorithms at the IEEE International Conference on Data Mining (ICDM) in December 2006. They first invited the IEEE ICDM Research Contribution Award and ACM KDD Innovation Award winners to each nominate up to 10 algorithms in data mining. Then, they verified each nomination for its citations on Google Scholar and kept those nominations with at least 50 citations. Finally, they took an open vote with all attendees of ICDM'06 on the top 10 algorithms. Among the top 10 algorithms in data mining, support vector machine (SVM), classification and regression tree (CART), C4.5, k nearest neighbors (kNN), and naïve Bayes are commonly used techniques for classification mining.


1.2. Business failure prediction (BFP)

BFP is a specific classification-type problem. It attempts to predict the class of a target company, a categorical dependent variable, from several continuous and/or categorical predictor features. These features are usually financial ratios reflecting the business status of companies. Predicting the business failure of companies is crucial for financial institutions, government, investing institutions, and academia. It is a useful tool for controlling risk when investing, governing, and jobbing. The task is all the more important nowadays, since the USA is experiencing a financial crisis that resulted from many business failures. BFP has therefore gained much interest from industry and academia. Research in BFP is a trial-and-error process of finding more accurate predictors and models, each of which has distinct assumptions and specific computational complexities. Over the last 40 years, BFP has become a major research area within corporate finance. The objective of BFP methods is to identify whether or not a target company will fall into distress within a specific time. Many useful techniques have been developed by industrial practitioners and academic researchers to solve this decision-making problem. The most popular methods in this area are the classic statistical methods, e.g., univariate analysis, risk index models, and, in particular, multivariate discriminant analysis (MDA) and logistic regression (Logit). MDA is by far the most dominant classic cross-sectional statistical method in this area, followed by Logit (Altman & Saunders, 1998). The shortcomings of MDA-based BFP are the assumptions of multivariate normally distributed independent variables, equal variance–covariance matrices across the failure and healthy classes, known prior probability of failure and misclassification costs, and the absence of multicollinearity (Balcaen & Ooghe, 2006). These assumptions are often not satisfied in real-world applications of BFP, which results in MDA being applied in an inappropriate way and the resulting MDA models not being suitable for generalization. Martin (1977) and Ohlson (1980) pioneered the use of Logit in BFP. The shortcomings of Logit are the assumption of variation homogeneity of the data (Lee, Chiu, Chou, & Lu, 2006) and sensitivity to multicollinearity (Doumpos & Zopoudinis, 1999), which have limited its application to the problem of BFP. Though some other techniques have indeed been applied to the problem, e.g., probit analysis, these other statistical methods are not popular in the area because they require more computation (Dimitras, Zanakis, & Zopoudinis, 1996) or are not as accurate. In addition to MDA and Logit, recently developed data mining techniques have taken a key role in BFP since the 1990s. Data mining techniques seldom require the assumptions that classical statistical methods are founded on. Among the five classification techniques in the top 10 data mining methods, kNN was first applied to predict business failure in the late 1990s. Jo, Han, and Lee (1997) pioneered research using case-based reasoning (CBR) with kNN at its heart in BFP. Their work has been followed by many studies using kNN to predict business failure, e.g., Elhadi (2000), Li and Sun (2008), Nanni and Lumini (2009), Park and Han (2002), Sun and Hui (2006), and Yip (2004). The algorithm of kNN belongs to instance-based reasoning, which is easily understandable for industrial users. Thus, the predictive results of kNN for BFP are readily interpreted by human beings. This characteristic underlies kNN's advantage over less interpretable methods and its popularity in this area. The shortcomings of kNN are as follows: (a) no model is constructed in the algorithm of kNN, which makes the predicting process very time-consuming; and (b) the number of nearest neighbors often has to be selected empirically. SVM, another of the top five algorithms for classification mining, provides an alternative to classical statistical methods and kNN, particularly in situations where the dependent and independent variables exhibit complex nonlinear

relationships. The solid foundation of SVM in statistical learning theory guarantees its high performance in various areas. Even though SVM has shown better predictive performance in BFP than some other methods (Hua, Wang, Xu, Zhang, & Liang, 2007; Hui & Sun, 2006; Min & Lee, 2005; Min, Lee, & Han, 2006; Shin, Lee, & Kim, 2005; Wu, Tzeng, Goo, & Fang, 2007), it is also criticized for the difficulty of selecting optimal kernel functions, the long training process of searching for the corresponding optimal parameters, and its inability to identify the relative importance of variables. Meanwhile, the predictive results of SVM are not easy to interpret for industrial users. Beyond the popularity of these two top-five data mining methods in BFP, operational guidance for constructing naïve Bayesian network models to predict business failure was presented very recently (Sun & Shenoy, 2007). In fact, an earlier attempt to use Bayesian models for early warning of bank failures (Sarkar & Sriram, 2001) was reported six years before. However, this method is still not popular in the area of BFP.

1.3. The issue addressed in this research

Among the top five classification algorithms in data mining, there are two algorithms that have not yet been employed to predict business failure, i.e., CART and C4.5. Both CART and C4.5 are specific implementations of the decision tree. The chief differences between CART and C4.5 include (Breiman, Friedman, Olshen, & Stone, 1984; Wu et al., 2008): (a) CART is designed to solve binary tasks, while C4.5 allows more outcomes; (b) CART employs the Gini index to rank tests, but C4.5 uses information gain; (c) CART prunes trees by a cost-complexity model, whereas C4.5 uses a single-pass algorithm. We focus on using CART to predict business failure in this research for the following reasons:

• The proposal of CART in 1984 represents a major milestone in the development of data mining, machine learning, and artificial intelligence.
• The basic algorithm of CART is provided by many code-development platforms, e.g., Matlab, which makes it easy to implement further and compare with other methods.
• CART citations can be found within almost any area, e.g., financial research and medical topics.
• CART is more popular in financial topics, to which the task of BFP belongs, than other tree methods, including C4.5.

Owing to the above-described drawbacks of MDA, Logit, kNN, and SVM, the aim of this research is to investigate the performance of predicting business failure using the commonly discussed CART, another top 10 algorithm in data mining. We focus on using CART to predict business failure not only because CART is one of the top five classification algorithms in data mining and yet is not popular in the area of BFP, but also because of the following advantages of CART over the two popular classical statistical methods and the two popular classification mining methods in predicting business failure:

• Ease of interpretation of predictive results by human beings, in which it is superior to MDA, Logit, and SVM.
• Capability of generating if-then rules for BFP, in which it is superior to MDA, Logit, kNN, and SVM.
• Invariance to monotonic transformations of the explanatory features used for BFP, in which it is superior to kNN and SVM.
• Capability of modeling complex relationships between independent and dependent features without strong model assumptions, in which it is superior to MDA and Logit.
• Ease and robustness of construction in BFP, without a long training process or a long testing process, in which it is superior to SVM and kNN.

• Capability of identifying significant independent features by itself in the process of BFP, in which it is superior to MDA, Logit, kNN, and SVM.
• No parameters to select and optimize in the training process of BFP, in which it is superior to kNN and SVM.

These advantages make CART very applicable to predicting business failure with real-world data. For comparison, we follow the principle that comparative methods should be very popular in both the data mining (or statistics) and BFP areas. Thus, we used the two most popular classical statistical methods, MDA and Logit, and the two top-five classification mining algorithms, kNN and SVM. They are all popular methods not only in the area of BFP but also in their original areas. The reason why we do not employ naïve Bayes in the comparison for BFP is as follows: the citation count of naïve Bayes on Google Scholar was just 51 based on the material provided by Wu et al. (2008), far smaller than the citation counts of all the other classification algorithms listed in the top 10 data mining methods; we checked the recent citations of the five classification mining methods, and the results show that naïve Bayes is still the least cited; meanwhile, the citation count of naïve Bayes was also the smallest among the 18 candidate algorithms. Since naïve Bayes is not as popular as the other four methods in the data mining area and is also not popular in the area of BFP, naïve Bayes is not considered in the comparison in this research. The specific problem of BFP is situated in predicting the business failure of Chinese listed companies. The data are collected from the Shenzhen Stock Exchange and Shanghai Stock Exchange. Since the two classical statistical methods of MDA and Logit and the two popular classification mining algorithms of kNN and SVM are sensitive to the choice of features, we employ the stepwise method of MDA to generate the optimal feature subset for the four methods. Li and Sun's (2008) research has verified that the stepwise method of MDA is an optimal filter for short-term BFP of Chinese listed companies. To assess the predictive performance of the five classification methods for the generalization purpose, we use 30 times' hold-out method (HOM) as the final assessment.

This research is organized as follows. Section 2 presents the research background, summarizing basic concepts of MDA, Logit, kNN, and SVM. Section 3 addresses the usage of CART in BFP with empirical data from China, followed by predictive results and analysis on Chinese listed companies' short-term BFP in Section 4. Section 5 draws conclusions and discusses further research.

2. Research background

2.1. The classical statistical methods popular in business failure prediction (BFP)

MDA and Logit were each popular in the area of BFP in their time, though there are many arguments about whether or not their assumptions are satisfied by real-world data. Nowadays, abundant research that attempts to construct novel intelligent methods for predicting business failure still employs MDA and Logit for comparison with the newly constructed method, e.g., Huang et al. (2008), Hui and Sun (2006), Jo et al. (1997), Lin and McClean (2001), Min and Lee (2005), Yang, Platt, and Platt (1999), and Yip (2004).

2.1.1. Multivariate discriminant analysis (MDA)

MDA has been the most commonly used data mining technique for solving classification tasks (Lee, Sung, & Chang, 1999).
The studies of Beaver (1967) and Altman (1968) used discriminant analysis to identify the best financial ratios for distinguishing failed

companies from healthy companies and to investigate the predictive performance of the model as the forecast year moves away from the year in which failure happened. In the beginning of the 1980s, linear MDA handed over its domination of applications for predicting business failure. Nowadays, it has become a standard baseline method for comparative research (Altman & Narayanan, 1997). MDA models consist of a linear combination of predictors. The discriminant function aims to provide the best distinction between the failed and healthy companies, and can be described as follows:

D_i = d_0 + d_1 X_{i1} + d_2 X_{i2} + \cdots + d_j X_{ij} + \cdots + d_n X_{in},   (1)

where D_i is the discriminant score of the ith company; X_{ij} denotes the value of the jth (j = 1, 2, ..., n) feature of the ith company; d_j is the corresponding linear discriminant coefficient of the jth feature; d_0 denotes the intercept; and n is the number of features. The score D_i indicates the status of the ith company: a low discriminant value indicates that the company is in failure. The value of D_i is compared with the cut-off point when generating a prediction. The assumptions of multivariate normally distributed independent variables and equal variance–covariance matrices across the failure and healthy classes are very important for the application of MDA. However, most studies on MDA-based BFP do not check whether or not the real-world data satisfy these assumptions. In fact, these two assumptions are usually violated in real-world applications (Balcaen & Ooghe, 2006). Meanwhile, the third assumption, concerning prior probabilities of failure, is often ignored in the definition of the optimal cut-off score of MDA, which may result in a misleading estimation of predictive performance.
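For illustration, a minimal sketch of the discriminant score in Eq. (1) follows, using scikit-learn's LinearDiscriminantAnalysis in Python as a stand-in for MDA. The synthetic data, variable names, and the zero cut-off on the fitted score are illustrative assumptions rather than the authors' setup.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# X: companies x features matrix of financial ratios; y: 1 = failed, 0 = healthy.
rng = np.random.default_rng(0)
X = rng.random((100, 4))                                    # placeholder ratios in [0, 1]
y = (X[:, 1] - X[:, 0] + 0.1 * rng.standard_normal(100) > 0).astype(int)

lda = LinearDiscriminantAnalysis().fit(X, y)

# Eq. (1): D_i = d_0 + sum_j d_j * X_ij, built from the fitted coefficients.
d0, d = lda.intercept_[0], lda.coef_[0]
D = d0 + X @ d

# Classify against a cut-off of 0 on the discriminant score (the fitted boundary).
pred = (D > 0).astype(int)
print("training accuracy:", (pred == y).mean())
```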

2.1.2. Logistic regression (Logit)

Logit is another classical statistical model that has been used to solve classification problems. The nonlinear model of Logit displaced MDA as the dominant method in the area of BFP in the 1980s. The assumption of Logit models is that the data follow a logistic distribution (Jones & Hensher, 2007). When constructing a Logit model, a nonlinear maximum likelihood estimation process is utilized to estimate the corresponding parameters. The model can be described as follows:

P_j(X_j) = 1 / [1 + \exp(-D_j)],   (2)

where P_j(X_j) denotes the probability of business failure calculated using the value of the jth feature, i.e., X_j. From the Logit function we can see that the probability of business failure, i.e., P_j, lies in the range [0, 1] and follows the logistic distribution. In the case where 1 indicates a high business failure probability, a company is classified into the failure class if the corresponding Logit score exceeds the cut-off value. Compared with MDA, Logit models require no assumptions of prior probabilities of business failure, multivariate normally distributed features, or equal dispersion matrices (Ohlson, 1980). All of these are very important assumptions that MDA models are founded on. Moreover, Logit models can fit nonlinear phenomena in data for BFP better than MDA models do. Nevertheless, there are still two basic assumptions, i.e., that the dependent variable is dichotomous and that the costs of type I and type II error rates are considered. Though some argue that the two assumptions are not serious problems in applications (Koh, 1992), Logit models are still sensitive to extreme non-normality (McLeay & Omar, 2000).
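A corresponding sketch of Eq. (2) with scikit-learn's LogisticRegression is given below; the synthetic labels and the 0.5 cut-off are illustrative assumptions, not the authors' specification.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.random((100, 4))                       # financial ratios scaled to [0, 1]
y = (X[:, 2] + X[:, 3] > 1.0).astype(int)      # 1 = failure class (placeholder labels)

logit = LogisticRegression().fit(X, y)

# Eq. (2): P = 1 / (1 + exp(-D)), where D is the model's linear score.
D = logit.intercept_[0] + X @ logit.coef_[0]
P = 1.0 / (1.0 + np.exp(-D))

# A company is assigned to the failure class when P exceeds the cut-off value.
pred = (P > 0.5).astype(int)
assert (pred == logit.predict(X)).all()        # matches the library's decision rule
```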


2.2. The hot data mining methods popular in business failure prediction (BFP)

SVM and kNN, among the top 10 data mining algorithms, are popular in the area of BFP nowadays. Abundant research, e.g., Hua et al. (2007) and Li and Sun (2008), attempts to enhance the application of the two methods in predicting business failure. The two data mining algorithms also belong to the family of statistical learning techniques.

2.2.1. k nearest neighbors (kNN)

The nearest neighbor algorithm is one of the most widely used and simplest methods in the task of data mining (Cover & Hart, 1968; Kuncheva, 2004). There are three core components of the kNN algorithm, i.e., the set of historical cases, the mechanism of similarity measurement, and the number of nearest neighbors. The algorithm of kNN belongs to non-parametric classification. It can be expressed elegantly and simply in theory. Assume that there are m points in R^n, denoted C = {c_1, c_2, ..., c_i, ..., c_m}. For every c_i ∈ C, the corresponding label is known, i.e., l(c_i) ∈ L. In order to identify the class of an input point c_0, the k nearest neighbors are retrieved together with their corresponding class labels. When retrieving nearest neighbors, the concept of similarity is introduced. Classically, the similarity measure between a pair of cases is based on Euclidean distance, which is expressed as follows:

\mathrm{Similarity}(c_0, c_i) = 1 / [1 + a \cdot \mathrm{Ed}(c_0, c_i)],   (3)

\mathrm{Ed}(c_0, c_i) = \left( \sum_{j=1}^{n} w_j \cdot d_j(c_0, c_i)^2 \right)^{1/2},   (4)

d_j(c_0, c_i) = |X_{0j} - X_{ij}|.   (5)

The class of the input point is determined by the consensus of the neighbors. The number of nearest neighbors is usually determined empirically in real-world applications. Once all nearest neighbors are retrieved, the consensus among neighbors can be calculated by the principle of pure majority voting. It is expressed as follows:

l(c_0) = \arg\max_z \sum_{(c_i, l(c_i)) \in D} L(z = l(c_i)),   (6)

where D is the set of nearest neighbors of the target case; z is a class label; c_i denotes the ith nearest neighbor in this definition; and L(·) is an indicator function that returns only 1 or 0. If each nearest neighbor is given a different weight in the process of voting for the class label of the target case, the scheme is called weighted majority voting. Besides the popularity of kNN in the domain of BFP (Elhadi, 2000; Jo et al., 1997; Li & Sun, 2008; Park & Han, 2002; Sun & Hui, 2006; Yip, 2004), it has already been applied in many other areas, e.g., cancer classification (Okun & Priisalu, 2009), heart disease detection (Polat, Şahan, & Güneş, 2007), and enzyme subfamily classification (Huang, Chen, Hwang, & Ho, 2007).
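The retrieval-and-voting scheme of Eqs. (3)–(6) can be rendered directly in a few lines of Python; the fragment below is a minimal sketch with placeholder data, not the authors' implementation.

```python
import numpy as np

def knn_predict(c0, C, labels, k=5, w=None, a=1.0):
    """Classify c0 by pure majority voting among its k most similar cases.

    C: (m, n) matrix of historical cases; labels: length-m class labels;
    w: optional feature weights for the weighted Euclidean distance of Eq. (4).
    """
    w = np.ones(C.shape[1]) if w is None else w
    ed = np.sqrt((w * (C - c0) ** 2).sum(axis=1))      # Eqs. (4) and (5)
    sim = 1.0 / (1.0 + a * ed)                         # Eq. (3)
    nearest = np.argsort(sim)[-k:]                     # indices of the k nearest
    classes, counts = np.unique(labels[nearest], return_counts=True)
    return classes[counts.argmax()]                    # Eq. (6): majority vote

rng = np.random.default_rng(2)
C = rng.random((50, 4))                                # placeholder historical cases
labels = (C[:, 0] > 0.5).astype(int)
print(knn_predict(rng.random(4), C, labels, k=5))
```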

2.2.2. Support vector machine (SVM)

In data mining applications nowadays, SVM is a must-try method because of its excellent robustness and accuracy among alternative algorithms (Vapnik, 1995; Wu et al., 2008). Consider the binary classification problem of predicting business failure with the training data set {c_i, l(c_i)}, where c_i ∈ R^n is the input vector of SVM and l(c_i) ∈ {−1, 1} is the corresponding label. The objective of SVM is to find a function f(c) = w · Φ(c) + b that minimizes the following formula (Li & Yu, 2008):

\sum_{i=1}^{n} [1 - l(c_i) \cdot (w \cdot \Phi(c_i) + b)]_+ + \lambda \|w\|^2,   (7)

where Φ(·) is a nonlinear function that maps input data to a high-dimensional feature space; b is a constant; w is the directional vector; the subscript + denotes the positive part of the argument; and λ is the tuning parameter that controls the trade-off between minimizing the loss [1 − l(c)f(c)]_+ and maximizing the margin. The kernel function can be expressed as follows:

K(c_i, c_0) = \Phi(c_i)^T \Phi(c_0).   (8)

The most commonly used and frequently recommended kernel function is the RBF kernel (Hua et al., 2007; Hui & Sun, 2006; Min & Lee, 2005; Min et al., 2006; Shin et al., 2005; Wu et al., 2007), which is expressed as follows:

K(c_i, c_0) = \exp(-m \|c_i - c_0\|^2).   (9)

SVM is a very effective classification and regression algorithm for data mining tasks. Besides its application and popularity in the area of BFP (Ding et al., 2008; Hua et al., 2007; Hui & Sun, 2006; Min & Lee, 2005; Min et al., 2006; Shin et al., 2005; Wu et al., 2007), SVM has also been applied in other areas, e.g., cancer classification (Hong & Cho, 2008).
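As a hedged illustration of an RBF-kernel SVM in the sense of Eqs. (7)–(9), the sketch below uses scikit-learn's SVC on synthetic data; note that scikit-learn names the kernel width gamma (standing in for m in Eq. (9)) and exposes the loss/margin trade-off as C rather than λ.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.random((120, 4))                          # ratios scaled to [0, 1]
y = np.where(X[:, 0] * X[:, 1] > 0.25, 1, -1)     # labels in {-1, +1}, as in the text

# RBF kernel of Eq. (9): K(c_i, c_0) = exp(-m * ||c_i - c_0||^2).
svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
print("training accuracy:", svm.score(X, y))

# Manual check of Eq. (9) against the library's parameterization for one pair.
m = 1.0
k_manual = np.exp(-m * np.sum((X[0] - X[1]) ** 2))
```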

3. Research methodology

3.1. The contribution of this research

This paper contributes a demonstration of the applicability of CART, another method among the top five classification mining algorithms (or popular statistical learning methods), in the area of BFP. Meanwhile, we show the greater applicability of CART in short-term BFP of Chinese listed companies compared with the two classical statistical methods of MDA and Logit and the two top-five classification mining methods of kNN and SVM. All four of these methods are already popular in both the data mining and BFP areas. To fulfill this purpose, we construct two different algorithms implementing CART. One constructs the tree entirely on the optimal feature set selected by the filter approach of MDA, the method used to generate optimal features for the other four comparative methods. The other constructs the tree with all available features and prunes the tree with cross validation to find the minimum-cost tree. To compare the predictive performance of the five methods in BFP of Chinese listed companies from the view of the generalization purpose, we repeat HOM thirty times to obtain a series of results, on the basis of which further statistical analysis is carried out on whether or not there are significant performance differences between each pair of methods.

3.2. Research design

The objective of this research is to demonstrate the applicability of the top-five classification mining method of CART in BFP, compared with the other four popular methods in this area, i.e., the two classical statistical methods of MDA and Logit and the two top-five classification mining methods of SVM and kNN. Since all four comparative methods are sensitive to the choice of features, we use a filter approach for feature selection to pick out significant features distinguishing failed companies from healthy ones. One algorithm implementing CART is constructed entirely on the same optimal features as the other four popular methods, so that it can be compared with them on the same input space without taking advantage of CART's merit of finding significant features by itself. Meanwhile, the algorithm of CART that seeks the minimum-cost tree by cross validation on the training data set is also produced for comparison; it fully takes advantage of the merits inside CART. The research design is illustrated in Fig. 1.

Fig. 1. Research design.

3.3. Initial sample cases and optimal features selected by stepwise method of MDA

The data set we collected is used to predict short-term business failure of Chinese listed companies. Sample companies are listed on the Shenzhen Stock Exchange and Shanghai Stock Exchange. If a listed company has been specially treated by the China Securities Supervision and Management Committee (CSSMC), we consider the company in business failure. If a company has never been specially treated by the CSSMC, we consider the company healthy. We selected 135 pairs of failed and healthy companies. The values of their financial ratios one year prior to failure are used to represent the companies. After data preprocessing to eliminate samples with missing values and outliers, we finally obtained 153 sample companies in distress and health. All the data values were scaled into the range [0, 1] by min–max normalization. Initial samples involve thirty financial ratios, including profitability ratios, activity ratios, liability ratios, growth ratios, structure ratios, per share items and yields. The filter approach of the stepwise method of MDA was used to select optimal features for MDA, Logit, SVM, kNN, and one implementation of CART. The selected features, which can be named the MDA feature set (MDAFS), are listed in Table 1.
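The preprocessing and assessment just described can be sketched as follows. The labels and ratios below are placeholders, and the test-set size of 45 companies is an inference from the reported accuracies (which are multiples of 1/45), not a figure stated in this section.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
X_raw = rng.standard_normal((153, 30))           # 153 companies, 30 financial ratios
y = rng.integers(0, 2, 153)                      # 1 = failure, 0 = health (placeholder)

# Min-max normalization into [0, 1], as described for the data set.
X = (X_raw - X_raw.min(axis=0)) / (X_raw.max(axis=0) - X_raw.min(axis=0))

# 30 times' hold-out method: repeat a random train/test split, record accuracy.
accs = []
for seed in range(30):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=45, random_state=seed)
    clf = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    accs.append(clf.score(X_te, y_te))
print(f"mean={np.mean(accs):.4f}, median={np.median(accs):.4f}, var={np.var(accs):.4f}")
```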

Table 1
Optimal features selected by stepwise method of MDA.

No.   Financial ratio             Category
1     Total asset turnover        Activity ratios
2     Asset-liability ratio       Liability ratios
3     Total asset growth rate     Growth ratios
4     Earning per share           Per share items and yields

3.4. The method of classification and regression tree (CART)

CART, a tree-structured analysis, was popularized by Breiman et al. (1984). In the area of BFP, the purpose of analysis via CART is to determine a set of if-then rules that permit accurate classification of a sample company on whether or not it will fall into business failure. In order to achieve the best possible predictive accuracy, CART is constructed to minimize the misclassification cost, which takes both the misclassification rate and variance into consideration. An important step is to select the splits on the features that are used to predict membership in the corresponding class of companies. The computational detail of CART lies in finding the best split rules to construct a simple yet accurate and informative tree. CART considers all variables as independent in the computation of splits with the training data set. The ith sample can be expressed as (X_{i1}, X_{i2}, ..., X_{ij}, ..., Y_i), where X_{ij} is the value of the ith sample company on the jth feature and Y_i is the label value of the sample.
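Since CART ranks candidate splits by the Gini index (Section 1.3), a minimal sketch of the split search is given below; it is a didactic rendering in Python, not the authors' Matlab implementation.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Exhaustively search for the binary split (feature, threshold) that
    minimizes the weighted Gini impurity of the two sub-leaves."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for c in np.unique(X[:, j]):
            left, right = y[X[:, j] <= c], y[X[:, j] > c]
            if len(left) == 0 or len(right) == 0:
                continue  # skip degenerate splits that leave one side empty
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[2]:
                best = (j, c, score)
    return best  # (feature index, threshold, weighted impurity)
```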


Since CART is based on binary recursive partitioning, which splits a leaf of the data into two sub-leaves, the values of Y_i are binary for the classification problem, e.g., −1 or 1. In the process of splitting, CART follows the rule that a sample goes to one sub-leaf or the other according to whether its feature value X_{ij} satisfies the split condition.
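The CART variant that seeks the minimum-cost tree by cross validation (Section 3.2) can be approximated with scikit-learn's cost-complexity pruning; the sketch below, including the 10-fold setting, is an assumed analogue of that procedure, not the authors' exact implementation.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def fit_min_cost_tree(X, y, cv=10):
    """Grow a full CART on all features, then keep the cost-complexity
    pruning level whose cross-validated accuracy is highest."""
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
    best_alpha, best_score = 0.0, -np.inf
    for alpha in np.clip(path.ccp_alphas, 0.0, None):   # guard tiny negatives
        tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
        score = cross_val_score(tree, X, y, cv=cv).mean()
        if score > best_score:
            best_alpha, best_score = alpha, score
    return DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
```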

[...]

Empirical results rank the six methods as CART > SVM > kNN > MDAFS-CART > MDA > Logit in short-term BFP of Chinese listed companies from the view of mean accuracy. However, CART produces the same predictive performance as SVM from the view of median accuracy. MDAFS-CART and MDA also produce the same median accuracy. MDA, kNN, and MDAFS-CART achieve the maximum value of minimum accuracy, i.e., 80%, among the 30 tests. The three methods of SVM, kNN, and CART all produce the maximum value of maximum accuracy, i.e., 97.78%, on the 30 testing data sets.

Table 2
Predictive accuracies (%) of the six methods under 30 times' HOM.

Data set   MDA     Logit   SVM     kNN     MDAFS-CART   CART
1          84.44   77.78   91.11   91.11   88.89        93.33
2          88.89   88.89   91.11   91.11   84.44        93.33
3          93.33   93.33   91.11   91.11   91.11        91.11
4          84.44   82.22   86.67   86.67   84.44        93.33
5          93.33   82.22   91.11   93.33   86.67        91.11
6          80.00   77.78   86.67   86.67   86.67        93.33
7          86.67   86.67   88.89   88.89   80.00        86.67
8          88.89   86.67   86.67   86.67   84.44        84.44
9          93.33   93.33   91.11   93.33   88.89        91.11
10         84.44   86.67   86.67   84.44   86.67        88.89
11         84.44   75.56   86.67   80.00   82.22        88.89
12         91.11   91.11   91.11   88.89   86.67        88.89
13         91.11   88.89   91.11   91.11   91.11        95.56
14         82.22   84.44   86.67   82.22   86.67        77.78
15         91.11   88.89   93.33   91.11   88.89        88.89
16         80.00   80.00   77.78   82.22   80.00        82.22
17         88.89   82.22   88.89   91.11   91.11        82.22
18         91.11   91.11   91.11   91.11   88.89        88.89
19         80.00   84.44   82.22   84.44   91.11        91.11
20         88.89   91.11   91.11   91.11   91.11        93.33
21         88.89   88.89   86.67   82.22   86.67        88.89
22         93.33   91.11   97.78   97.78   91.11        95.56
23         93.33   93.33   91.11   93.33   93.33        93.33
24         86.67   84.44   88.89   86.67   91.11        91.11
25         95.56   95.56   91.11   88.89   91.11        93.33
26         88.89   88.89   93.33   95.56   93.33        95.56
27         84.44   86.67   86.67   84.44   88.89        86.67
28         88.89   88.89   91.11   91.11   93.33        88.89
29         82.22   82.22   88.89   82.22   82.22        93.33
30         91.11   93.33   95.56   95.56   91.11        97.78

Min        80.00   75.56   77.78   80.00   80.00        77.78
Max        95.56   95.56   97.78   97.78   93.33        97.78
Mean       88.00   86.89   89.41   88.82   88.07        90.30
Median     88.90   87.78   91.11   90.00   88.90        91.11
Variance   19.96   27.02   14.54   20.94   14.47        19.92

Times achieving the best statistic: MDA 1, Logit 0, SVM 2, kNN 2, MDAFS-CART 2, CART 3.

Fig. 3. The map of max accuracy of the six methods.
Fig. 4. The map of mean accuracy of the six methods.

Only CART produces the maximum value of mean accuracy, i.e., 90.30%, in the experiment. Both CART and SVM perform best from the view of median accuracy, with a hit ratio of 91.11%. From the view of times achieving the best statistic, CART outperforms all the other five methods by achieving the best statistic three times, including both the maximum mean and the maximum median accuracy. Thus, the finding of this empirical research is that CART can generate performance superior to the two statistical methods of MDA and Logit and the two hot data mining methods of SVM and kNN in short-term BFP of Chinese listed companies.

When detailed analysis is made between CART and the two classical statistical methods of MDA and Logit and the two hot algorithms of kNN and SVM, we draw corresponding comparison tables from Tables 2 and 3 and Figs. 2–6 to summarize the predictive performance of CART against each of the other methods. See Tables 4–8.

Fig. 5. The map of median accuracy of the six methods.

Fig. 6. The map of variance of accuracy of the six methods.

Table 4
Mean accuracies and results of significance test of CART and MDA.

Methods   Mean accuracy (%)
CART      90.30
MDA       88.00
Significance test on difference: significant at the level of 5%. Hypothesis: reject H1.

Table 5
Mean accuracies and results of significance test of CART and Logit.

Methods   Mean accuracy (%)
CART      90.30
Logit     86.89
Significance test on difference: significant at the level of 1%. Hypothesis: reject H2.

Table 6
Mean accuracies and results of significance test of CART and SVM.

Methods   Mean accuracy (%)
CART      90.30
SVM       89.41
Significance test on difference: no significance. Hypothesis: accept H3.

Table 7
Mean accuracies and results of significance test of CART and kNN.

Methods   Mean accuracy (%)
CART      90.30
kNN       88.82
Significance test on difference: significant at the level of 10%. Hypothesis: reject H4.

Table 8
Mean accuracies and results of significance test of CART and MDAFS-CART.

Methods      Mean accuracy (%)
CART         90.30
MDAFS-CART   88.07
Significance test on difference: significant at the level of 5%. Hypothesis: reject H5.
4.2.2. Analysis between CART and the two popular classical statistical methods of MDA and Logit

From Table 4, we can see that CART outperforms MDA on mean accuracy by 2.30. The significance level for the hypothesis that there is no significant difference between the two methods of CART and MDA is 5%. Thus, hypothesis H1 is rejected by the result. This means that CART statistically significantly outperforms MDA when the two are applied to predicting short-term business failure of Chinese listed companies.

From Table 5, we can see that Logit is outperformed by CART on mean accuracy by 3.41. The hypothesis that there is no significant difference between the two methods of CART and Logit is rejected at the significance level of 1%. This means that CART significantly outperforms Logit statistically from the view of short-term BFP of Chinese listed companies.

4.2.3. Analysis between CART and the two hot data mining methods of SVM and kNN

We can see from Table 6 that CART outperforms SVM on mean accuracy by only 0.89. The significance test on the hypothesis that there is no significant difference between the two methods of CART and SVM shows that there is indeed no significant difference.
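The significance tests reported in Tables 3–8 compare two methods over the 30 shared hold-out runs. Assuming a paired t-test — the form consistent with the reported t statistics, though the surviving text does not name the exact test — a minimal sketch is:

```python
import numpy as np
from scipy import stats

# acc_cart, acc_svm: the 30 hold-out accuracies of two methods on the same splits.
rng = np.random.default_rng(5)
acc_cart = 0.90 + 0.03 * rng.standard_normal(30)   # placeholder run results
acc_svm = 0.89 + 0.03 * rng.standard_normal(30)

t_stat, p_value = stats.ttest_rel(acc_cart, acc_svm)   # paired t-test over 30 runs
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")          # compare p to 0.10/0.05/0.01
```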

Table 3
Results of significance test between each pair of methods on BFP. Each cell reports the t statistic, with the p value in parentheses.

Methods      MDA               Logit             SVM              kNN              MDAFS-CART
Logit        1.795 (0.083)*    –                 –                –                –
SVM          2.567 (0.016)**   3.138 (0.004)***  –                –                –
kNN          1.322 (0.197)     2.264 (0.031)**   1.246 (0.223)    –                –
MDAFS-CART   0.096 (0.924)     1.427 (0.164)     1.989 (0.056)*   1.069 (0.294)    –
CART         2.448 (0.021)**   3.235 (0.003)***  1.263 (0.216)    1.853 (0.074)*   2.649 (0.013)**

* Significant at the level of 10%. ** Significant at the level of 5%. *** Significant at the level of 1%.

This means that hypothesis H3 is accepted. Thus, CART only marginally outperforms SVM from the view of this empirical research on short-term BFP of Chinese listed companies. However, the training process of SVM occupies much more time than that of CART, since the parameters of the kernel function in SVM must be determined in the training process. The long training time of SVM does not guarantee superiority to CART in predictive accuracy. Taking this factor into consideration, CART outperforms SVM.

From Table 7, we can see that the algorithm of kNN is outperformed by CART on mean accuracy by 1.48. The hypothesis that there is no significant difference between the two methods of CART and kNN is rejected at the significance level of 10%. Thus, CART statistically significantly outperforms the algorithm of kNN when the two are applied to predicting short-term business failure of Chinese listed companies. Note that the number of nearest neighbors in the algorithm of kNN is predetermined by us empirically. In contrast, CART can find its best form automatically. Taking this factor into account, CART also outperforms kNN.

4.2.4. Analysis inside CARTs

The difference between CART and MDAFS-CART is that the former is constructed with all available features and pruned to the best tree with minimum cost, while the latter is constructed in the feature space produced by the stepwise method of MDA. Though the latter tree is also pruned, the algorithm of finding the minimum-cost tree by cross validation is not used. The predictive results indicate that CART produces accuracy superior to MDAFS-CART by 2.23. The significance test indicates that there is a significant difference in predictive performance between the two methods at the level of 5%. Thus, H5 is rejected by the evidence from short-term BFP of Chinese listed companies.

5. Conclusion and further research

From both the view of predictive accuracy and that of significance tests, we can conclude that CART is applicable in the area of BFP. On the empirical results of short-term BFP of Chinese listed companies, CART outperformed the two classical statistical methods of MDA and Logit significantly, at least at the level of 5%. It produced predictive performance superior to kNN at the significance level of 10%, and also marginally outperformed SVM in short-term BFP of Chinese listed companies. The employment of the stepwise method of MDA did not help CART produce higher accuracy, which indicates that it may be more suitable for CART to construct the predicting tree with all available features. Note that all the findings drawn should be considered under the specific experimental design.

In this research, we demonstrated the feasibility of using CART to predict business failure with data collected from China. Further research should be conducted to demonstrate the applicability of CART for BFP with data collected from other countries. Since CART is regarded as a milestone in the evolution of data mining, artificial intelligence, and statistical learning, we used CART to predict business failure. Whether or not the other famous implementation of the decision tree, i.e., C4.5 (now C5.0), is applicable in the area of BFP remains to be demonstrated, and the performance of the two famous decision tree methods should be compared.

Acknowledgements

This research is partially supported by the National Natural Science Foundation of China (No. 70801055) and the Zhejiang Provincial Natural Science Foundation of China (No. Y6090392).
The authors gratefully thank editors and anonymous referees for their comments and recommendations.


References

Altman, E. I. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. Journal of Finance, 23, 589–609.
Altman, E. I., & Narayanan, P. (1997). An international survey of business failure classification models. Institutions and Instruments, 6(2), 1–57.
Altman, E. I., & Saunders, A. (1998). Credit risk measurement: Developments over the last 20 years. Journal of Banking and Finance, 21(11–12), 1721–1742.
Balcaen, S., & Ooghe, H. (2006). Thirty-five years of studies on business failure: An overview of the classic statistical methodologies and their related problems. The British Accounting Review, 38, 63–93.
Beaver, W. (1967). Financial ratios as predictors of failure. Empirical research in accounting: Selected studies 1966. Journal of Accounting Research, 4, 71–111.
Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Belmont, CA: Wadsworth.
Cover, T. M., & Hart, P. E. (1968). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13, 21–27.
David, A. B. (2008). Rule effectiveness in rule-based systems: A credit scoring case study. Expert Systems with Applications, 34(4), 2783–2788.
Dimitras, A., Zanakis, S., & Zopoudinis, C. (1996). A survey of business failures with an emphasis on failure prediction methods and industrial applications. European Journal of Operational Research, 90(3), 487–513.
Ding, Y., Song, X., & Zeng, Y. (2008). Forecasting financial condition of Chinese listed companies based on support vector machine. Expert Systems with Applications, 34(4), 3081–3089.
Doumpos, M., & Zopoudinis, C. (1999). A multicriteria discrimination method for the prediction of financial distress: The case of Greece. Multinational Finance Journal, 3(2), 71–101.
Elhadi, M. T. (2000). Bankruptcy support system: Taking advantage of information retrieval and case-based reasoning. Expert Systems with Applications, 18(3), 215–219.
Etemadi, H., Rostamy, A., & Dehkordi, H. (2009). A genetic programming model for bankruptcy prediction: Empirical evidence from Iran. Expert Systems with Applications, 36(2), 3199–3207.
Hong, J., & Cho, S. (2008). A probabilistic multi-class strategy of one-vs.-rest support vector machines for cancer classification. Neurocomputing, 71(16–18), 3275–3281.
Hua, Z., Wang, Y., Xu, X., Zhang, B., & Liang, L. (2007). Predicting corporate financial distress based on integration of support vector machine and logistic regression. Expert Systems with Applications, 33(2), 434–440.
Huang, W.-L., Chen, H., Hwang, S., & Ho, S. (2007). Accurate prediction of enzyme subfamily class using an adaptive fuzzy k-nearest neighbor method. Biosystems, 90(2), 405–413.
Huang, S., Tsai, C.-F., Yen, D., & Cheng, Y. (2008). A hybrid financial analysis model for business failure prediction. Expert Systems with Applications, 35(3), 1034–1040.
Hui, X.-F., & Sun, J. (2006). An application of support vector machine to companies' financial distress prediction. In V. Torra, Y. Narukawa, & A. Valls, et al. (Eds.), Modeling decisions for artificial intelligence (pp. 274–282). Berlin: Springer Verlag.
Hung, C., & Chen, J. (2009). A selective ensemble based on expected probabilities for bankruptcy prediction. Expert Systems with Applications, 36(3), 5297–5303.
Jo, H., Han, I., & Lee, H. (1997). Bankruptcy prediction using case-based reasoning, neural networks, and discriminant analysis. Expert Systems with Applications, 13(2), 97–108.
Jones, S., & Hensher, D. A. (2007). Modelling corporate failure: A multinomial nested logit analysis for unordered outcomes. The British Accounting Review, 39, 89–107.
Koh, H. C. (1992). The sensitivity of optimal cutoff points to misclassification costs of Type I and Type II errors in the going-concern prediction context. Journal of Business Finance and Accounting, 19(2), 187–197.
Kuncheva, L. I. (2004). Combining pattern classifiers: Methods and algorithms. New Jersey: Wiley.
Lee, T.-S., Chiu, C.-C., Chou, Y.-C., & Lu, C.-J. (2006). Mining the customer credit using classification and regression tree and multivariate adaptive regression splines. Computational Statistics and Data Analysis, 50(4), 1113–1130.
Lee, G., Sung, T. K., & Chang, N. (1999). Dynamics of modeling in data mining: Interpretive approach to bankruptcy prediction. Journal of Management Information Systems, 16, 63–85.
Li, H., & Sun, J. (2008). Ranking-order case-based reasoning for financial distress prediction. Knowledge-Based Systems, 21(8).
Li, B., & Yu, Q. (2008). Classification of functional data: A segmentation approach. Computational Statistics and Data Analysis, 52(10), 4790–4800.
Lin, F.-Y., & McClean, S. (2001). A data mining approach to the prediction of corporate failure. Knowledge-Based Systems, 14, 189–195.
Lin, R., Wang, Y., Wu, C., et al. (2009). Developing a business failure prediction model via RST, GRA and CBR. Expert Systems with Applications, 36(2), 1593–1600.
Martens, D., Baesens, B., Gestel, T. V., & Vanthienen, J. (2007). Comprehensible credit scoring models using rule extraction from support vector machines. European Journal of Operational Research, 183(3), 1466–1476.
Martin, D. (1977). Early warning of bank failure: A logit regression approach. Journal of Banking and Finance, 1, 249–276.
McLeay, S., & Omar, A. (2000). The sensitivity of prediction models to the non-normality of bounded and unbounded financial ratios. British Accounting Review, 32, 213–230.
Min, J., & Jeong, C. (2009). A binary classification method for bankruptcy prediction. Expert Systems with Applications, 36(3), 5256–5263.
Min, J.-H., & Lee, Y.-C. (2005). Bankruptcy prediction using support vector machine with optimal choice of kernel function parameters. Expert Systems with Applications, 28, 603–614.
Min, J.-H., & Lee, Y.-C. (2008). A practical approach to credit scoring. Expert Systems with Applications, 35(4), 1762–1770.
Min, S.-H., Lee, J., & Han, I. (2006). Hybrid genetic algorithms and support vector machines for bankruptcy prediction. Expert Systems with Applications, 31(3), 652–660.
Nanni, L., & Lumini, A. (2009). An experimental comparison of ensemble of classifiers for bankruptcy prediction and credit scoring. Expert Systems with Applications, 36(2), 3028–3033.
Ohlson, J. A. (1980). Financial ratios and the probabilistic prediction of bankruptcy. Journal of Accounting Research, 18, 109–131.
Okun, O., & Priisalu, H. (2009). Dataset complexity in gene expression based cancer classification using ensembles of k-nearest neighbors. Artificial Intelligence in Medicine, 45(2–3), 151–162.
Park, C.-S., & Han, I. (2002). A case-based reasoning with the feature weights derived by analytic hierarchy process for bankruptcy prediction. Expert Systems with Applications, 23(3), 255–264.
Polat, K., Şahan, S., & Güneş, S. (2007). Automatic detection of heart disease using an artificial immune recognition system (AIRS) with fuzzy resource allocation mechanism and kNN (nearest neighbour) based weighting preprocessing. Expert Systems with Applications, 32(2), 625–631.
Ravi, V., & Pramodh, C. (2008). Threshold accepting trained principal component neural network and feature subset selection: Application to bankruptcy prediction in banks. Applied Soft Computing, 8(4), 1539–1548.
Sarkar, S., & Sriram, R. S. (2001). Bayesian models for early warning of bank failures. Management Science, 47(11), 1457–1475.
Shin, K.-S., Lee, T.-S., & Kim, H.-J. (2005). An application of support vector machines in bankruptcy prediction model. Expert Systems with Applications, 28(1), 127–135.
Sun, J., & Hui, X.-F. (2006). Financial distress prediction based on similarity weighted voting CBR. In X. Li, R. Zaiane, & Z. Li (Eds.), Advanced data mining and applications (pp. 947–958). Berlin: Springer Verlag.
Sun, J., & Li, H. (2008a). Listed companies' financial distress prediction based on weighted majority voting combination of multiple classifiers. Expert Systems with Applications, 35(3), 818–827.
Sun, J., & Li, H. (2008b). Data mining method for listed companies' financial distress prediction. Knowledge-Based Systems, 21(1), 1–5.
Sun, L., & Shenoy, P. P. (2007). Using Bayesian networks for bankruptcy prediction: Some methodological issues. European Journal of Operational Research, 180(2), 738–753.
Tsai, C.-F., & Wu, J.-W. (2008). Using neural network ensembles for bankruptcy prediction and credit scoring. Expert Systems with Applications, 34(4), 2639–2649.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer Verlag.
Wu, W.-W. (2010). Beyond business failure prediction. Expert Systems with Applications, 37(3), 2371–2376.
Wu, X., Kumar, V., Quinlan, J., et al. (2008). Top 10 algorithms in data mining. Knowledge and Information Systems, 14, 1–47.
Wu, C.-H., Tzeng, G.-H., Goo, Y.-J., & Fang, W.-C. (2007). A real-valued genetic algorithm to optimize the parameters of support vector machine for predicting bankruptcy. Expert Systems with Applications, 32(2), 397–408.
Yang, Z. R., Platt, M. B., & Platt, H. D. (1999). Probabilistic neural networks in bankruptcy prediction. Journal of Business Research, 44, 67–74.
Yip, A. Y. N. (2004). Predicting business failure with a case-based reasoning approach. In M. Negoita, R. Howlett, & L. Jain, et al. (Eds.), Knowledge-based intelligent information and engineering systems (pp. 665–671). Berlin: Springer Verlag.
Yu, L., Wang, S., & Lai, K. (2008). Credit risk assessment with a multistage neural network ensemble learning approach. Expert Systems with Applications, 34, 1434–1444.
Zhu, Z., He, H., Starzyk, J. A., et al. (2007). Self-organizing learning array and its application to economic and financial problems. Information Sciences, 177(5), 1180–1192.
