African Journal of Business Management Vol.5 (11), pp. 4107-4120, 4 June, 2011 Available online at http://www.academicjournals.org/AJBM ISSN 1993-8233 ©2011 Academic Journals
Full Length Research Paper
Mining business failure predictive knowledge using two-step clustering Hui Li* and Jie Sun School of Economics and Management, Zhejiang Normal University, P. O. Box 62, 688 YingBinDaDao, Jinhua, Zhejiang 321004, China. Accepted 25 August, 2010
Despite a growing body of research on business failure prediction using statistical and intelligent techniques, how to generate reasoning knowledge that can help enterprise managers, investors, employees and governmental officials intuitively distinguish companies in distress from healthy ones has been only cursorily studied. The objective of this research is to fill this gap by using the data mining technique of two-step clustering to outline relationships between the financial states of listed companies in China and their financial ratios. Reasoning knowledge that captures these relationships can later be used in an 'early warning' expert system. When a company's financial state is assessed three years ahead, companies whose values of seven financial ratios (net profit to fixed assets, account payable turnover, total assets turnover, the ratio of cash to current liability, the ratio of liability to market value of equity, the proportion of fixed assets and net assets per share) are around 0.2059, 11.9769, 0.5923, 0.1940, 174.4857, 0.3540 and 2.7490, respectively, tend to remain healthy for at least three years, while those whose values are around 0.1145, 8.3363, 0.4469, 0.0212, 258.6049, 0.2697 and 2.3027, respectively, are likely to fall into distress within three years. For listed companies in China, long-time liability, activity, short-time liability, per share items and yields, and structural ratios are important, in descending order, for remaining healthy, while activity, short-time liability, profitability and structural ratios are important, in descending order, for avoiding distress.

Key words: Business failure predictive knowledge, data mining, two-step clustering, expert system.

INTRODUCTION

Listed companies are key economic entities that contribute to the economic development of every nation. The failure of these companies is a highly significant problem that results in high social costs.
Attempts to develop methods and models of business failure prediction (BFP) began in the late 1960s and continue today. In general, two distinct types of techniques have been employed for BFP: (1) statistical ones, such as discriminant analysis (DA) (Altman, 1968; Beaver, 1966) and Logit (Martin, 1977; Ohlson, 1980), among others (Nam et al., 2008; Hwang et al., 2010) and (2) intelligent ones, such as decision tree (DT) (Frydman et al., 1985; Li et al., 2010), neural network (NN) (Serrano-Cinca, 1996; Wilson and Sharda, 1994; Ravi and Pramodh, 2008;
*Corresponding author. E-mail:
[email protected]. Tel: +86158-8899-3616.
Ravisankar et al., 2010; Youn and Gu, 2010), case-based reasoning (CBR) (Bryant, 1997; Jo et al., 1997; Ahn and Kim, 2009; Li and Ho, 2009; Lin et al., 2009; Li and Sun, 2010) and support vector machine (SVM) (Min and Lee, 2005; Shin et al., 2005; Hardle et al., 2010). After a broad review of BFP research over the last four decades, Kumar and Ravi (2007) concluded that an important trend in this area is to build hybrid intelligent systems for the problem. Understood more broadly, both statistical and intelligent techniques should be employed to build a hybrid system. There are two ways to hybridize: (1) using one technique as the main module in BFP and employing other techniques as assistants when necessary; (2) using various techniques on an equal footing to generate pre-classification results that are then combined by an appropriate mechanism to produce the final forecast. Both kinds of hybrid need to be developed for BFP.
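The second kind of hybrid can be pictured with a very small sketch. It assumes that simple majority voting is the combining mechanism, which is only one possible choice and is not prescribed by Kumar and Ravi (2007); the technique names and labels below are hypothetical.

```python
from collections import Counter


def combine_by_majority(predictions):
    """Return the label predicted by most base models.

    Ties are resolved in favour of the label that appears first,
    which is how Counter.most_common orders equal counts.
    """
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical pre-classification results for one company.
votes = {"logit": "distress", "neural_network": "healthy", "svm": "distress"}
final_forecast = combine_by_majority(list(votes.values()))
print(final_forecast)
```

Weighted voting or a second-level classifier would be alternative mechanisms; the sketch only shows where the combination step sits.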
Afr. J. Bus. Manage.
Data mining refers to mining knowledge from large amounts of data (Han and Kamber, 2001). Data mining methods aim to uncover interesting patterns hidden in data sets from various domains. The process consists of several iterative, sequential procedures: data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation and knowledge presentation. In each procedure, either statistical or intelligent techniques can be employed, in standalone or in hybrid mode. Hence, data mining as a whole can be viewed as a methodology implemented by a hybrid of statistical and intelligent techniques, and it is an appropriate means of carrying out Kumar and Ravi's (2007) recommendation. In fact, some systematic data mining approaches with hybrid models (Lin and McClean, 2001; Sun and Li, 2008) have already been used to find hidden patterns in financial data sets. However, despite the growing research on BFP using statistical, intelligent or even hybrid techniques over the last four decades, how to generate reasoning rules of classification that can serve as an early warning expert system and help enterprise managers, investors, employees and governmental officials intuitively distinguish companies in distress from healthy ones has been only cursorily studied. So far, DT has been one of the main techniques that produce 'If-Then' rules through supervised pattern learning. However, it has not been a focus of BFP research, because its predictive accuracy is lower than that of other intelligent techniques such as NN and SVM. Even when DT serves as the retrieval technique in CBR, as in Bryant (1997), attention is paid to predictive accuracy rather than to producing 'If-Then' rules.
Sun and Li (2008), however, used a data mining method combining attribute-oriented induction, information gain and DT to generate 'If-Then' rules for predicting business failure two years ahead. In fact, rules generated by DT can be viewed as a description of the boundary between companies' two financial states. For example, it can be inferred from Sun and Li (2008) that if the net profit growth rate of a listed company in China is lower than 1.5276, the company may run into business failure in less than two years. In practice, companies on the borderline are hard to classify as either healthy or in distress. Moreover, rules generated by DT do not describe the major characteristics of distressed and healthy companies. To make up for these shortcomings of DT-generated rules and to complement them, we use the data mining technique of two-step clustering to outline relationships between the financial states of listed companies in China and their financial ratios. By conducting a two-step clustering analysis to identify reasoning rules for BFP in China, our research contributes to this domain by:
(1) Considering both continuous and categorical features; (2) automatically choosing the optimal number of clusters; (3) generating rules that describe characteristics of the various financial states of companies. The rest of this paper is organized as follows. The next section explains why the data mining technique of two-step clustering is adopted. Section 3 summarizes the specific model of two-step clustering. Section 4 proposes the new method of mining hidden knowledge from data of listed companies in China by two-step clustering; the characteristics of the resulting clusters of listed companies, the generated reasoning rules for BFP knowledge and the financial ratios that are important for a company to stay away from business failure are also analyzed in detail in that section. Section 5 concludes.

WHY TWO-STEP CLUSTERING IS CHOSEN

Clustering is a widely used technique in data mining applications for uncovering hidden patterns in the underlying data. There are several popular clustering algorithms, such as K-means clustering and hierarchical clustering. Why is two-step clustering chosen rather than those traditional ones? The main reasons are as follows. Two-step clustering is an exploratory tool designed to reveal natural clusters within a data set that would otherwise not be apparent. Its algorithm has several desirable features that differentiate it from traditional clustering techniques. Firstly, it can effectively handle categorical and continuous variables at the same time. Most of the traditional clustering algorithms mentioned earlier are limited to data sets that contain either only continuous or only categorical features. Although a traditional clustering algorithm such as K-means can handle a categorical feature by translating it into a dummy feature that is treated as continuous, Guha et al.
(1999) showed that the Euclidean distance can be a poor measure of similarity in this situation. Two-step clustering, by contrast, handles data sets with mixed types of features effectively by introducing a model-based distance measure. In the area of mining BFP knowledge, although data sets of financial ratios mainly contain continuous features, some categorical features are also involved. For example, whether the stock market price of a company (Beaver, 1968), the net liquidation value of a company (Santomero and Vinso, 1977) or the net profit of a company (Sun and Li, 2008) is negative is often used as a criterion for inferring the financial state of the company, because no unified, well-specified theory of how and why corporations fail has been developed so far. Features such as whether or not the net profit of a company is negative are categorical with nominal
values. That is to say, the features of data sets for mining BFP knowledge are of mixed types, a situation that two-step clustering handles effectively. Secondly, only summary statistics are needed for distance calculation in two-step clustering, which removes the requirement that the data fit into the main memory of a computer. Two-step clustering thus allows large data sets to be analyzed by constructing a cluster features (CF) tree that summarizes the records, which can be more efficient. Thirdly, the optimal number of clusters can be determined automatically by comparing the values of a model-choice criterion across different clustering solutions. By contrast, this issue is not directly addressed by traditional clustering algorithms such as K-means clustering. For these reasons, two-step clustering rather than a traditional clustering algorithm is adopted in this paper to generate reasoning rules of BFP knowledge three years ahead in China.
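The second point, that summary statistics are sufficient, can be made concrete with a minimal sketch. It assumes a summary made of a record count, per-variable sums and sums of squares, and per-variable category counts, in the spirit of a BIRCH-style cluster feature; the class and method names are hypothetical and do not reproduce the SPSS data structure.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ClusterFeature:
    """Hypothetical summary of a group of records."""
    n: int = 0                                     # number of records
    total: list = field(default_factory=list)      # sum of each continuous variable
    total_sq: list = field(default_factory=list)   # sum of squares of each continuous variable
    counts: list = field(default_factory=list)     # one Counter of levels per categorical variable

    def add(self, continuous, categorical):
        """Absorb one record; the raw record is not stored."""
        if self.n == 0:
            self.total = [0.0] * len(continuous)
            self.total_sq = [0.0] * len(continuous)
            self.counts = [Counter() for _ in categorical]
        self.n += 1
        for k, x in enumerate(continuous):
            self.total[k] += x
            self.total_sq[k] += x * x
        for k, level in enumerate(categorical):
            self.counts[k][level] += 1

    def merge(self, other):
        """Combine two summaries by adding their statistics."""
        if self.n == 0:
            return other
        if other.n == 0:
            return self
        return ClusterFeature(
            n=self.n + other.n,
            total=[a + b for a, b in zip(self.total, other.total)],
            total_sq=[a + b for a, b in zip(self.total_sq, other.total_sq)],
            counts=[a + b for a, b in zip(self.counts, other.counts)],
        )

    def mean(self, k):
        return self.total[k] / self.n

    def variance(self, k):
        """Population variance recovered from the stored sums alone."""
        m = self.mean(k)
        return self.total_sq[k] / self.n - m * m
```

Because merging only adds counts and sums, the mean, variance and category frequencies of a combined cluster are available without revisiting the original records, which is what allows the data set itself to stay out of main memory.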
TWO-STEP CLUSTERING

In order to handle categorical and continuous variables simultaneously, the two-step clustering technique described by Chiu et al. (2001) uses a log-likelihood distance measure, which is a probability-based distance. The distance between two clusters is related to the decrease in log-likelihood when they are combined into one cluster. For two clusters j and s, the distance between them is defined as follows:

$$d(j,s) = \xi_j + \xi_s - \xi_{\langle j,s \rangle} \tag{1}$$

where

$$\xi_j = -N_j \left( \sum_{k=1}^{K^A} \frac{1}{2} \log\left(\sigma_k^2 + \sigma_{jk}^2\right) + \sum_{k=1}^{K^B} E_{jk} \right) \tag{2}$$

$$E_{jk} = -\sum_{l=1}^{L_k} \frac{N_{jkl}}{N_j} \log \frac{N_{jkl}}{N_j} \tag{3}$$

In these equations, $d(j,s)$ denotes the distance between clusters $j$ and $s$; $\langle j,s \rangle$ is the index of the cluster formed by combining clusters $j$ and $s$; $N_j$ is the number of observations in cluster $j$; $K^A$ is the total number of continuous variables; $K^B$ is the total number of categorical variables; $\sigma_k^2$ is the variance of the $k$-th continuous variable in the original data set; $\sigma_{jk}^2$ is the variance of the $k$-th continuous variable in cluster $j$; $L_k$ is the number of levels of the $k$-th categorical variable; and $N_{jkl}$ is the number of observations in cluster $j$ whose $k$-th categorical variable takes the $l$-th level.

The log-likelihood distance measure is based on three assumptions: the variables in the cluster model are independent; each categorical variable has a multinomial distribution; and each continuous variable has a normal distribution. Empirical internal testing has indicated that two-step clustering is fairly robust to violations of both the independence assumption and the distributional assumptions (Norusis, 2004), but it is still advisable to be aware of how well these assumptions are met in applied research. As its name indicates, two-step clustering involves two steps, pre-clustering and clustering, which are summarized as follows.

Step 1: Pre-clustering

A sequential clustering approach is used in the pre-clustering step. It begins with the construction of a cluster features (CF) tree. The CF tree consists of several levels of nodes, and each node contains a number of entries. Each successive case forms a new node or is added to an existing node according to its similarity to existing nodes, with the distance measure serving as the similarity criterion. If the record is within a distance threshold of the closest leaf entry, it is absorbed into that leaf entry and the CF of the entry is updated; otherwise, it starts its own leaf entry. For details of CF-tree construction, see BIRCH (Zhang et al., 1996). Finally, every entry in a leaf node denotes a final sub-cluster, which is characterized by its CF: the entry's number of records, the mean and variance of each continuous variable and the counts for each category of each categorical variable. A node that contains multiple records thus holds a summary of the variable information about those records. The CF tree therefore provides a compact summary of the data set that is much smaller and more likely to fit in main memory.

Step 2: Clustering
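Both steps of the procedure rely on the log-likelihood distance of Equations (1) to (3). As a minimal sketch, assuming the per-cluster variances and category counts are already available, the distance could be computed as follows; the function and argument names are hypothetical and are not taken from Chiu et al. (2001) or from SPSS, and the numbers in the example are made up.

```python
import math


def entropy_term(level_counts, n):
    """E_jk of Equation (3) for one categorical variable in a cluster of n records."""
    return -sum((c / n) * math.log(c / n) for c in level_counts if c > 0)


def xi(n, overall_var, cluster_var, cat_counts):
    """xi_j of Equation (2).

    overall_var[k] -- variance of continuous variable k in the whole data set
    cluster_var[k] -- variance of continuous variable k inside the cluster
    cat_counts[k]  -- list of level counts N_jkl for categorical variable k
    """
    continuous = sum(0.5 * math.log(s2 + sj2)
                     for s2, sj2 in zip(overall_var, cluster_var))
    categorical = sum(entropy_term(counts, n) for counts in cat_counts)
    return -n * (continuous + categorical)


def distance(xi_j, xi_s, xi_merged):
    """d(j, s) of Equation (1)."""
    return xi_j + xi_s - xi_merged


# Made-up example: one continuous variable (overall variance 1.0, zero variance
# inside every cluster, so its term is log(1) = 0) and one binary feature.
xi_j = xi(10, [1.0], [0.0], [[10, 0]])       # all records have the first level
xi_s = xi(10, [1.0], [0.0], [[0, 10]])       # all records have the second level
xi_merged = xi(20, [1.0], [0.0], [[10, 10]])
d = distance(xi_j, xi_s, xi_merged)          # equals 20 * log(2)
print(d)
```

In the example the two clusters differ only in the categorical feature, so merging them costs exactly the entropy of a 50/50 split over 20 records. Adding the overall variance inside the logarithm keeps the continuous term finite even when a cluster has zero variance, provided the overall variance is positive.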
The leaf nodes of the CF tree are then grouped using an agglomerative clustering algorithm, which can produce a range of solutions. With traditional clustering algorithms, determining the optimal number of clusters requires a sequence of partitions: hierarchical clustering produces the sequence of 1, 2, 3, … clusters in one run, whereas a K-means algorithm has to be run multiple times to generate it. In two-step clustering, the optimal number of clusters is determined automatically by a two-step procedure. In the first step, the Bayesian or Akaike information criterion (BIC or AIC) is calculated for each number of clusters within a specified range and used to find an initial estimate of the number of clusters. In the second step, the initial estimate is refined by finding the largest increase in distance between the two closest clusters in each hierarchical clustering stage. The BIC and AIC for J clusters are defined as follows:

$$BIC(J) = -2 \sum_{j=1}^{J} \xi_j + m_J \log(N)$$

$$AIC(J) = -2 \sum_{j=1}^{J} \xi_j + 2 m_J$$

where

$$m_J = J \left\{ 2K^A + \sum_{k=1}^{K^B} (L_k - 1) \right\}$$

and $N$ is the total number of data records. The clustering step takes the sub-clusters resulting from the pre-clustering step as input and groups them into the desired number of clusters. Since the number of sub-clusters is much smaller than the number of original records, traditional clustering methods can be applied effectively.

DATA MINING USING TWO-STEP CLUSTERING FOR BFP

There are four steps in mining Chinese listed companies' BFP knowledge with two-step clustering: data collection, data preprocessing, mining knowledge by two-step clustering and knowledge presentation.

Data collection

In China, if a listed company has negative net profit in two consecutive years, the company will be specially treated (ST). These companies are regarded as being in business failure. Both the financial ratio values and whether or not the net profit of a company is negative can be obtained from listed companies' public information. Financial ratios constitute the continuous features, while whether or not the net profit of a company is negative in two consecutive years constitutes the categorical feature. Thus, a data set with mixed types of features for mining BFP knowledge is formed. The initial continuous features consist of 30 financial ratios covering profitability, activity, liability, growth and structural ratios (Li and Sun, 2010). The mixed types of features are listed in Table 1. We selected 133 pairs of Chinese companies listed on the Shanghai Stock Exchange and the Shenzhen Stock Exchange as initial sample companies. For companies that were specially treated because of negative net profits in two consecutive years, the values of financial ratios three years before the special treatment were collected, together with the values of the corresponding matched-pair companies.

Traditionally, BFP knowledge mining has been studied as a supervised learning problem for the last four decades. Because no mature theory of how and why corporations fall into business failure has been developed so far, whether or not the net profit of a company is negative in two consecutive years, or one of the other two rules mentioned earlier, is usually employed as the decision criterion for the company's financial state. Mining BFP knowledge about economic entities is quite different from research on physical entities, such as the clustering analysis of mobile Internet adopters (Okazaki, 2006). For example, mobile Internet adopters can easily be classified into two groups, mobile e-mail users and non-users: people who use mobile e-mail are users, and people who do not are non-users. For the BFP of a company, which is an economic entity, we cannot simply conclude that the company is in business failure if it falls into business failure. Although this proposition is certainly correct, it contributes nothing to the development of this research area. In China, negative net profit in two consecutive years is usually considered a major characteristic of a company in business failure. Therefore, this characteristic is taken as a substitute for the label of whether or not a company is in business failure in traditional research.

Negative net profit in two consecutive years is just one of the characteristics of a company in business failure, even if it is the most important one; it is not the actual label of a company. This feature is therefore very important to the clustering result, and we add it as a categorical feature to the data set of financial ratios to form mixed types of features in this research.

Table 1. Mixed types of features and variables of initial financial ratios.

| Category | Variable | Feature |
|---|---|---|
| Profitability | V1 | Gross income to sales |
| Profitability | V2 | Net income to sales |
| Profitability | V3 | Earning before interest and tax to total asset |
| Profitability | V4 | Net profit to total assets |
| Profitability | V5 | Net profit to current assets |
| Profitability | V6 | Net profit to fixed assets |
| Profitability | V7 | Profit margin |
| Profitability | V8 | Net profit to equity |
| Activity | V9 | Account receivables turnover |
| Activity | V10 | Inventory turnover |
| Activity | V11 | Account payable turnover |
| Activity | V12 | Total assets turnover |
| Activity | V13 | Current assets turnover |
| Activity | V14 | Fixed assets turnover |
| Short-time liability | V15 | Current ratio |
| Short-time liability | V16 | The ratio of cash to current liability |
| Long-time liability | V17 | Asset-liability ratios |
| Long-time liability | V18 | Equity to debt ratio |
| Long-time liability | V19 | Ratio of liability to tangible net asset |
| Long-time liability | V20 | Ratio of liability to market value of equity |
| Long-time liability | V21 | Interest coverage ratio |
| Growth | V22 | Growth rate of primary business |
| Growth | V23 | Growth rate of total assets |
| Structural ratios | V24 | The proportion of current assets |
| Structural ratios | V25 | The proportion of fixed assets |
| Structural ratios | V26 | The proportion of equity to fixed assets |
| Structural ratios | V27 | The proportion of current liability |
| Per share items and yields | V28 | Earning per share |
| Per share items and yields | V29 | Net assets per share |
| Per share items and yields | V30 | Cash flow per share |
| Categorical feature | V31 | Negative net profit in continuous two years |

Data preprocessing

Initial data collected for mining BFP knowledge tend to be incomplete and inconsistent. Hence, data preprocessing is employed to improve the quality of the collected data. There are four commonly used techniques of data preprocessing: data cleaning, data integration, data reduction and data transformation. In this research, all the techniques employed in data preprocessing are aimed at the assumptions of two-step clustering. Missing values must be handled because two-step clustering does not allow missing values. Although two-step clustering is robust to violations of both the independence assumption and the distributional assumptions, we had better be aware of how well these assumptions are met. Hence, three sub-procedures, namely data cleaning, data reduction and data verification, are employed in this research.

Data cleaning

Data cleaning routines work to clean the data by filling in missing values. The missing value analysis of the collected data set is listed in Table 2. There are mainly five
effective methods of filling in the missing values, which are described as follows:

1. Fill in the missing value manually. This is time-consuming and may not be feasible for a large data set with many missing values. Moreover, a feature value may be missing precisely because it is hard to obtain. As a result, it may be difficult to fill in the exact value manually.
2. Fill in the missing value with a global constant. This means replacing all missing feature values with the same constant, whether nominal or numeric. If missing values are replaced by a nominal constant such as 'Unknown', they are in effect still missing for continuous features. If they are replaced by a numeric constant, choosing the value of the constant is intractable, and a single value may not be reasonable because records may belong to different groups. Note that Chinese listed companies can be classified into two groups by whether or not the company's net profit is negative for two consecutive years. Hence, this method is not used in this research, although it is simple.
3. Fill in the missing value with the feature mean. This method provides a way to choose a numeric value by computing the mean of a feature. However, it is still not reasonable to replace the missing feature values of records belonging to different categories with the same feature mean.
4. Fill in the missing value with the most probable value. In this method, inference-based tools using Bayesian formalism or decision tree induction are needed to determine the most probable value. Although this method sounds attractive, it is more complicated to apply.
5. Fill in the missing value with the feature mean of all samples belonging to the same class. This method is both reasonable and simple, so it is employed in this research.

Because the number of companies in business failure is relatively small compared with the number of healthy ones, all the collected companies are used in this research. Therefore, all missing values are filled in by the chosen method.

Table 2. Missing value analysis.

| Variable | Negative net profit in continuous two years: N | Mean | S.D. | Missing count | Missing percent | Positive net profit in continuous two years: N | Mean | S.D. | Missing count | Missing percent |
|---|---|---|---|---|---|---|---|---|---|---|
| V1 | 132 | 0.1938 | 0.3401 | 1 | 0.8 | 132 | 0.2349 | 0.1650 | 1 | 0.8 |
| V2 | 132 | 0.1096 | 0.9380 | 1 | 0.8 | 132 | 0.1181 | 0.3335 | 1 | 0.8 |
| V3 | 119 | 0.0575 | 0.0867 | 14 | 10.5 | 116 | 0.0613 | 0.0873 | 14 | 10.5 |
| V4 | 133 | 0.0266 | 0.0738 | 0 | 0.0 | 133 | 0.0411 | 0.0805 | 0 | 0.0 |
| V5 | 133 | 0.0597 | 0.2075 | 0 | 0.0 | 133 | 0.0882 | 0.2134 | 0 | 0.0 |
| V6 | 133 | 0.1725 | 0.7123 | 0 | 0.0 | 133 | 0.3012 | 1.2299 | 0 | 0.0 |
| V7 | 132 | 0.8877 | 1.1960 | 1 | 1.5 | 132 | 0.8549 | 0.3165 | 1 | 1.5 |
| V8 | 132 | 0.0029 | 0.4661 | 1 | 0.8 | 133 | 0.0767 | 0.1388 | 1 | 0.8 |
| V9 | 132 | 70.6070 | 694.1598 | 1 | 0.8 | 132 | 67.0649 | 457.5714 | 1 | 0.8 |
| V10 | 132 | 28.6048 | 269.7253 | 1 | 0.8 | 132 | 7.4448 | 17.8258 | 1 | 0.8 |
| V11 | 132 | 17.7618 | 60.6035 | 1 | 0.8 | 132 | 16.4544 | 28.4537 | 1 | 0.8 |
| V12 | 132 | 0.4761 | 0.4134 | 1 | 0.8 | 132 | 0.6770 | 0.5574 | 1 | 0.8 |
| V13 | 132 | 0.8707 | 0.9427 | 1 | 0.8 | 132 | 1.3729 | 1.1541 | 1 | 0.8 |
| V14 | 132 | 3.1400 | 4.6015 | 1 | 0.8 | 132 | 5.1361 | 17.2123 | 1 | 0.8 |
| V15 | 125 | 1.5427 | 0.8019 | 8 | 6.0 | 125 | 1.8229 | 0.9875 | 8 | 6.0 |
| V16 | 125 | 0.0317 | 0.2232 | 8 | 6.0 | 125 | 0.1970 | 0.3378 | 8 | 6.0 |
| V17 | 133 | 0.5432 | 0.5068 | 0 | 0.0 | 133 | 0.3828 | 0.1578 | 0 | 0.0 |
| V18 | 133 | 1.3786 | 1.3229 | 0 | 0.0 | 133 | 2.2650 | 2.3517 | 0 | 0.0 |
| V19 | 131 | 2.4181 | 9.0402 | 2 | 1.5 | 133 | 0.9359 | 1.0370 | 2 | 1.5 |
| V20 | 133 | 279.6873 | 226.2069 | 0 | 0.0 | 133 | 186.9719 | 151.5320 | 0 | 0.0 |
| V21 | 119 | 4.4787 | 7.0805 | 14 | 10.5 | 116 | 12.9774 | 43.7058 | 14 | 10.5 |
| V22 | 101 | 0.1101 | 1.2023 | 32 | 24.1 | 102 | 0.3242 | 1.3079 | 32 | 24.1 |
| V23 | 103 | 0.1726 | 0.3208 | 30 | 22.6 | 103 | 0.1622 | 0.4168 | 30 | 22.6 |
| V24 | 133 | 0.5959 | 0.1574 | 0 | 0.0 | 133 | 0.5279 | 0.1853 | 0 | 0.0 |
| V25 | 133 | 0.2759 | 0.1475 | 0 | 0.0 | 133 | 0.3482 | 0.1802 | 0 | 0.0 |
| V26 | 133 | 3.4869 | 10.2189 | 0 | 0.0 | 133 | 4.2668 | 11.4368 | 0 | 0.0 |
| V27 | 133 | 0.9051 | 0.1194 | 0 | 0.0 | 133 | 0.8871 | 0.1587 | 0 | 0.0 |
| V28 | 125 | 0.0832 | 0.2184 | 8 | 6.0 | 125 | 0.1785 | 0.1932 | 8 | 6.0 |
| V29 | 125 | 2.2372 | 1.2620 | 8 | 6.0 | 125 | 2.7412 | 1.0114 | 8 | 6.0 |
| V30 | 125 | 0.0459 | 0.4194 | 8 | 6.0 | 125 | 0.2018 | 0.3909 | 8 | 6.0 |
| V31 | 133 | | | 0 | 0.0 | 133 | | | 0 | 0.0 |
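The fifth method, filling a missing value with the feature mean of the records in the same class, can be sketched as follows. This is only an illustration under stated assumptions: records are represented as dictionaries, a missing value is represented by `None`, and the function name and the example values are hypothetical rather than taken from the data set of this research.

```python
from collections import defaultdict


def impute_by_class_mean(records, label_key, feature_keys):
    """Return copies of the records in which each missing (None) feature value
    is replaced by the mean of that feature within the record's own class."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        for key in feature_keys:
            if rec[key] is not None:
                sums[(rec[label_key], key)] += rec[key]
                counts[(rec[label_key], key)] += 1

    filled = []
    for rec in records:
        new_rec = dict(rec)  # leave the input records untouched
        for key in feature_keys:
            if new_rec[key] is None:
                group = (rec[label_key], key)
                if counts[group] == 0:
                    raise ValueError(f"no observed value of {key} in class {rec[label_key]}")
                new_rec[key] = sums[group] / counts[group]
        filled.append(new_rec)
    return filled


# Hypothetical records: V31 is the class label, V16 a financial ratio.
records = [
    {"V31": True, "V16": 0.02},
    {"V31": True, "V16": None},
    {"V31": True, "V16": 0.04},
    {"V31": False, "V16": 0.20},
    {"V31": False, "V16": None},
]
filled = impute_by_class_mean(records, "V31", ["V16"])
```

In the example, the missing V16 of the second record is replaced by the mean of the other two records with V31 equal to True, and the missing V16 of the last record by the single observed value in its class.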
Data reduction

This technique is applied to obtain a reduced representation of the data set for BFP knowledge that is much smaller in volume yet closely maintains the integrity of the original data. The strategy of dimension reduction, realized by removing irrelevant, weakly relevant or redundant features through stepwise discriminant analysis, is utilized. In this research, the forward stepwise method of Logit is employed to obtain the reduced representation of the data set, which is listed in Table 3. Note that we have to use the criterion of whether or not a company has negative net profit in two consecutive years as a substitute for the company's actual financial state in this step, because in theory it cannot be told whether a company is actually in distress. This is a quandary for researchers in the area of business failure mining: the research area is very valuable for many people in society, such as stakeholders, auditors, investors, lenders, employees and managers, but no unified, well-specified theory of how and why corporations fail has been developed.

Table 3. Reduced representation of the data set by forward stepwise Logit.

| Category | Variable | Feature |
|---|---|---|
| Profitability | V6 | Net profit to fixed assets |
| Activity | V11 | Account payable turnover |
| Activity | V12 | Total assets turnover |
| Short-time liability | V16 | The ratio of cash to current liability |
| Long-time liability | V20 | Ratio of liability to market value of equity |
| Structural ratios | V25 | The proportion of fixed assets |
| Per share items and yields | V29 | Net assets per share |
| Categorical feature | V31 | Negative net profit in continuous two years |

Table 4. Statistical tests on multi-collinearity among variables.

| Variable | t | Sig. | Tolerance (collinearity statistics) | VIF (collinearity statistics) | Conclusion |
|---|---|---|---|---|---|
| V6 | -2.6693 | 0.008 | 0.8933 | 1.1193 | Reject H0 |
| V11 | 3.6897 | 0.000 | 0.8477 | 1.1795 | Reject H0 |
| V12 | -4.0363 | 0.000 | 0.9448 | 1.0584 | Reject H0 |
| V16 | -4.3195 | 0.000 | 0.8325 | 1.2011 | Reject H0 |
| V20 | 3.7952 | 0.000 | 0.9215 | 1.0850 | Reject H0 |
| V25 | -3.7542 | 0.000 | 0.8749 | 1.1429 | Reject H0 |
| V29 | -2.4170 | 0.016 | 0.9271 | 1.0785 | Reject H0 |

Data verification

After the step of data reduction, the final data set had
better be verified as to whether the variables in it are independent, even though data reduction has removed irrelevant, weakly relevant or redundant features. The distributional assumptions of the variables also need to be verified, although two-step clustering is robust to violations of both types of assumptions. Tests of the level of multi-collinearity among the variables used are shown in Table 4. Multi-collinearity among features is commonly assumed when tolerance < 0.1 or VIF > 10. From Table 4, we can find that the tolerances all lie within an acceptable range (0.83 to 0.95) and the VIF values lie between 1 and 1.3, indicating low collinearity among the variables. Variable independence is tested by an independent-samples t-test, and the result is shown in Table 5. From Table 5, we can find that all the variables are independent except for 'net profit to fixed assets' and 'account payable turnover', which we believe is acceptable for the application of two-step clustering.

Table 5. Results of independent-samples t-test (t-test for equality of means).

| Variable | t-value | Sig. | Conclusion |
|---|---|---|---|
| V6 | 1.0437 | 0.2977 | Reject H0 |
| V11 | -0.2260 | 0.8213 | Reject H0 |
| V12 | 3.3498 | 0.0009*** | Accept H0 |
| V16 | 4.8569 | 0.0000*** | Accept H0 |
| V20 | -3.9271 | 0.0000*** | Accept H0 |
| V25 | 3.5802 | 0.0000*** | Accept H0 |
| V29 | 3.7073 | 0.0000*** | Accept H0 |

*** Significance at the level of 1%.

Some other studies (Bird and McHugh, 1977; Bougen and Drury, 1980; Deakin, 1976) have found that the assumption of normal distribution, which is employed by various statistical techniques such as Logit and MDA, is not well met by financial variables in cross-sectional data sets. At the same time, there are also comments and demonstrations that financial ratios are normally distributed if the study concentrates on a single industry (Ricketts and Stover, 1978) or if extreme outliers are removed (Frecka and Hopwood, 1983), as well as a theoretical argument that most companies attempt to hit an optimal ratio level and miss the target randomly (Horrigan, 1983). Two-step clustering is applicable to mining BFP knowledge whether or not the assumption is met, because it has been shown to be robust to violations of both the independence assumption and the distributional assumptions, as mentioned earlier. Thus, whether or not the data are normally distributed is not taken into consideration in this research.

Mining knowledge by using two-step clustering

Clustering results and discussion

Table 6. Auto-clustering process of two-step clustering.

| Number of clusters | BIC | BIC change (a) | Ratio of BIC change (b) | Ratio of distance measures (c) |
|---|---|---|---|---|
| 1 | 1739.6 | | | |
| 2 | 1449.2 | -290.4 | 1.000 | 1.316 |
| 3 | 1248.7 | -200.4 | 0.690 | 2.925 |
| 4 | 1235.3 | -13.4 | 0.046 | 1.589 |
| 5 | 1257.9 | 22.6 | -0.077 | 1.105 |
| 6 | 1286.4 | 28.4 | -0.097 | 1.386 |
| 7 | 1330.3 | 43.8 | -0.151 | 1.024 |
| 8 | 1375.1 | 44.8 | -0.154 | 1.047 |
| 9 | 1421.6 | 46.5 | -0.160 | 1.384 |
| 10 | 1478.5 | 56.8 | -0.195 | 1.073 |

(a) The changes are from the previous number of clusters in the table; (b) the ratios of changes are relative to the change for the two-cluster solution; (c) the ratios of distance measures are based on the current number of clusters against the previous number of clusters.

SPSS is employed to carry out the data mining by two-step clustering. BIC is used in this research as the
selection criteria of optimal number of clusters. Financial variables such as V6, V11, V12, V16, V20, V25 and V29 are inputted as continuous variables. The important characteristic to distinguish companies in business failure from those in health is inputted as categorical variable, which has only two values (True and False). Results of auto-clustering are shown in Table 6. From Table 6, we can find that the three and four cluster solutions are the two best models for clustering Chinese listed companies because they minimize the BIC value. There is the max value of ratio of distance measure in the three cluster solution. Hence, the three cluster solution is selected as the optimal model. Cluster distribution is shown in Table 7, from which we can find that the resulting clusters 1, 2 and 3 contain 121, 26 and 119 records, which correspond to 45.5, 9.8 and 44.7%, respectively. Because we carried out a procedure to handle missing values in the data set, numbers of combined records and total records in data set are equal. The records involved are clustered into three clusters, two of which respectively contain nearly 45% of the total data and the third of which contains only more than 10%
Table 7. Cluster distribution.
Cluster               1      2      3      Combined   Total
N                     121    26     119    266        266
Percent of combined   45.5   9.8    44.7   100
Percent of total      45.5   9.8    44.7   100        100
Table 8. Frequency distribution of V31.
            Positive net profit in           Negative net profit in
            continuous two years             continuous two years
Cluster     Frequency   Percent              Frequency   Percent
1           121         91.0                 0           0.0
2           12          9.0                  14          10.5
3           0           0.0                  119         89.5
Combined    133         100.0                133         100.0

of the total data. The conclusion can be drawn that the data set we collected falls mainly into two clusters. Although the data set has been grouped into three clusters, we are more interested in the frequency distribution of V31 across the three clusters, because this variable is a very important decision criterion for business failure. We wonder whether the companies with different values of V31 are separated into the two main clusters (clusters 1 and 3). Frequency distributions of V31 in the three clusters are given in Table 8, from which we can find that cluster 1 contains 121 records whose categorical value of V31 is ‘False’, corresponding to 91% of such records; cluster 2 contains 12 records whose value of V31 is ‘False’ and 14 records whose value is ‘True’, corresponding to 9 and 10.5%, respectively; and cluster 3 contains 119 records whose value of V31 is ‘True’, corresponding to 89.5%. Means of the three clusters on the continuous variables are illustrated in Figure 1, from which we can find that the means of the continuous variables in the two main clusters, that is, clusters 1 and 3, are not only visibly different but, for most of the variables, also separated by the reference line, which is the overall mean of all the records. As for the third cluster (cluster 2), although the means of its variables differ from those of the other two clusters, its common characteristics are hard to summarize. That is to say, records in this cluster may be the ones most often misclassified in binary business failure prediction, or they had better be handled separately. One-way
ANOVA is used to test whether there are significant differences among these three clusters. In the results, all the F-values fall within the range from 6.2 to 33.4, so we can conclude that the three clusters are significantly different at the 1% level on all the continuous variables. Centroids of the continuous variables in the three clusters are shown in Table 9. Contributions of the continuous variables to the clustering result are shown in Figure 2. The dashed vertical lines mark the critical values for determining the significance of each variable. For a variable to be considered significant, its t statistic must exceed the dashed line in either the positive or the negative direction. A negative t statistic indicates that the variable generally takes smaller than average values within the cluster, while a positive t statistic indicates that the variable takes larger than average values. Considering cluster 1, V20 and V11 exceed the critical value in the negative direction, which means that the ratio of liability to market value of equity and account payable turnover take smaller than average values; V16, V29 and V25 exceed the critical value in the positive direction, which means that the ratio of cash to current liability, net assets per share and the proportion of fixed assets take larger than average values. Since the importance measures of the last two variables (V6 and V12) do not exceed the critical value, we can conclude that they are not important to the formation of the first cluster. Considering cluster 2, none of the continuous variables exceeds the critical value, which may mean that this cluster is a by-product collecting the records that cannot be grouped into the other clusters. Considering cluster 3, all the continuous variables that exceed the critical value, that is, V11, V16, V6, V12 and V25, do so in the negative direction, which means that they take smaller than average values
[Figure 1: seven panels, one per continuous variable (V6, V11, V12, V16, V20, V25 and V29), each plotting simultaneous 95% confidence intervals for means against cluster (1, 2, 3), with a reference line at the overall mean: V6 = 0.2369, V11 = 17.10816, V12 = 0.576595, V16 = 0.11, V20 = 233.329624, V25 = 0.312077, V29 = 2.4892.]
Figure 1. Means plots of continuous variables in the three clusters.
Table 9. Centroids of the three clusters.
Variable   Statistic   Cluster 1   Cluster 2   Cluster 3   Combined
V6         Mean        0.2059      0.9413      0.1145      0.2368
           S.D.        0.3310      3.0383      0.2766      1.0052
V11        Mean        11.9769     81.1357     8.3363      17.1081
           S.D.        12.1443     132.3089    10.5119     47.0771
V12        Mean        0.5923      1.0965      0.4469      0.5765
           S.D.        0.3617      1.0685      0.3251      0.4982
V16        Mean        0.1940      0.1700      0.0212      0.1144
           S.D.        0.2957      0.5422      0.1434      0.2891
V20        Mean        174.4857    391.4965    258.6049    233.3296
           S.D.        125.3552    390.9405    172.3476    197.6943
V25        Mean        0.3540      0.3104      0.2697      0.3120
           S.D.        0.1712      0.2482      0.1307      0.1683
V29        Mean        2.7490      2.1338      2.3027      2.4892
           S.D.        1.0062      1.5277      1.1115      1.1347
and in this way they contribute to the formation of the cluster.

Analysis

Because negative net profit in two consecutive years is considered the major characteristic of companies in business failure, the cluster containing the majority of records with this characteristic can be viewed as the label for companies in distress. On this basis, cluster 3 is the group of companies in business failure, which contains 89.5% of those records. Likewise, the cluster containing the majority of records without this characteristic can be viewed as the label for healthy companies. On this basis, cluster 1 is the group of healthy companies, which contains 91% of those records. Cluster 2, lying between the former two clusters and containing 10.5% of the companies with the characteristic and 9% of the companies without it, can be viewed as a group of companies on the borderline of business failure, assuming that no outliers are involved. This three-way classification is intuitively more appropriate than simply dividing the state of business failure into two groups, that is, distress and health. From the statistical descriptions of the two major clusters (clusters 1 and 3), we can outline the relationships between listed companies’ financial state and their financial ratios by the following two rules:
1. If the value of a company’s net profit to fixed assets is around 0.2059, that of account payable turnover is around 11.9769, that of total assets turnover is around 0.5923, that of the ratio of cash to current liability is around 0.1940, that of the ratio of liability to market value of equity is around 174.4857, that of the proportion of fixed assets is around 0.3540 and that of net assets per share is around 2.7490, then the company may remain healthy for at least three years.
2. On the contrary, if the value of a company’s net profit to fixed assets is around 0.1145, that of account payable turnover is around 8.3363, that of total assets turnover is around 0.4469, that of the ratio of cash to current liability is around 0.0212, that of the ratio of liability to market value of equity is around 258.6049, that of the proportion of fixed assets is around 0.2697 and that of net assets per share is around 2.3027, then the company may fall into business failure in three years.

These reasoning rules are easily understandable for enterprise managers, investors, employees, governmental officials, etc., and can be used as an early warning expert system. They cover profitability, activity, short-time liability, long-time liability, structure ratios, and per share items and yields of a company. As a matter of fact, these two rules can also be viewed as two vectors, that is, (0.2059, 11.9769, 0.5923, 0.1940, 174.4857, 0.3540, 2.7490) and (0.1145, 8.3363, 0.4469, 0.0212, 258.6049, 0.2697, 2.3027), describing the feature ratio vector (V6, V11, V12, V16, V20, V25, V29) of a company. When
[Figure 2: three panels, one per two-step cluster number (1, 2 and 3), each plotting the test statistic (Student's t) of every continuous variable against dashed critical-value lines, with Bonferroni adjustment applied. Variables from top to bottom: cluster 1: V20, V11, V16, V29, V25, V6, V12; cluster 2: V12, V11, V20, V29, V6, V16, V25; cluster 3: V11, V16, V6, V12, V25, V29, V20.]
Figure 2. Continuous variable-wise importance.
carrying out the task of BFP three years in advance, we can determine which financial state a company is in by computing the similarity between the company’s vector and these two vectors. The label of the vector to which the company is more similar is used to predict the company’s financial state in three years.
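
The similarity-based use of the two rule vectors described above can be sketched as follows. This is a minimal illustration rather than the procedure used in this research: the text does not fix a similarity measure, so the sketch assumes Euclidean distance after dividing each ratio by its combined standard deviation from Table 9. The function names and the sample company are hypothetical.

```python
import math

# Order of the ratios in every vector below.
RATIOS = ["V6", "V11", "V12", "V16", "V20", "V25", "V29"]

# Centroids of cluster 1 (healthy) and cluster 3 (distress), from Table 9.
HEALTHY = [0.2059, 11.9769, 0.5923, 0.1940, 174.4857, 0.3540, 2.7490]
DISTRESS = [0.1145, 8.3363, 0.4469, 0.0212, 258.6049, 0.2697, 2.3027]

# Combined standard deviations from Table 9, used here as scaling factors.
# Scaling by these values is an assumption of this sketch.
COMBINED_SD = [1.0052, 47.0771, 0.4982, 0.2891, 197.6943, 0.1683, 1.1347]


def scaled_distance(company, centroid, scale=COMBINED_SD):
    """Euclidean distance between two ratio vectors after per-ratio scaling."""
    return math.sqrt(
        sum(((x - c) / s) ** 2 for x, c, s in zip(company, centroid, scale))
    )


def predict_state(company):
    """Return the label of the closer centroid (hypothetical helper)."""
    d_healthy = scaled_distance(company, HEALTHY)
    d_distress = scaled_distance(company, DISTRESS)
    return "healthy" if d_healthy <= d_distress else "distress"


# Hypothetical company, not taken from the data set of this research.
sample_company = [0.15, 9.5, 0.50, 0.05, 240.0, 0.29, 2.4]
sample_label = predict_state(sample_company)
```

Scaling matters in this sketch because V20 is measured in the hundreds while V16 stays below one; without it, the comparison would be driven almost entirely by V20.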
The importance of each financial feature in predicting the two distinct financial states is also provided. By analyzing the contributions of the continuous variables to the clustering result, we can draw the following conclusions. When the first rule is used to predict financial health, the financial features, that is, ratio of liability to market value of equity, account payable turnover, the ratio of cash to current liability, net assets per share, the proportion of fixed assets, net profit to fixed assets and total assets turnover, are important in descending sequence, which indicates that long-time liability, activity, short-time liability, per share items and yields and structure ratios are important in descending sequence for keeping companies healthy, while profitability may not be important. When the second rule is used to predict business failure, the financial features, that is, account payable turnover, the ratio of cash to current liability, net profit to fixed assets, total assets turnover, the proportion of fixed assets, net assets per share and ratio of liability to market value of equity, are important in descending sequence, which indicates that activity, short-time liability, profitability and structure ratios are important in descending sequence for keeping companies out of business failure, while per share items and yields and long-time liability may not be important. Finally, the two reasoning rules are integrated to generate the final forecast. At the same time, these two rules, which describe the characteristics of the two distinct financial states, and the rules generated by DT, which describe the boundary between the two distinct financial states, supplement each other.
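
The importance ordering above and the F-value range reported for the one-way ANOVA can be cross-checked from the summary statistics in Tables 7 and 9 alone. The sketch below is a rough check, not the SPSS computation itself. It assumes that the importance of a continuous variable within a cluster is the t statistic formed as the cluster mean minus the overall mean, divided by the cluster S.D. over the square root of the cluster size, and it rebuilds the ANOVA sums of squares from the cluster means, standard deviations and sizes. All function and variable names are hypothetical.

```python
import math

# Cluster sizes from Table 7 (clusters 1, 2, 3).
SIZES = [121, 26, 119]

# Table 9: (cluster means, cluster S.D.s, combined mean) per variable.
TABLE9 = {
    "V6": ([0.2059, 0.9413, 0.1145], [0.3310, 3.0383, 0.2766], 0.2368),
    "V11": ([11.9769, 81.1357, 8.3363], [12.1443, 132.3089, 10.5119], 17.1081),
    "V12": ([0.5923, 1.0965, 0.4469], [0.3617, 1.0685, 0.3251], 0.5765),
    "V16": ([0.1940, 0.1700, 0.0212], [0.2957, 0.5422, 0.1434], 0.1144),
    "V20": ([174.4857, 391.4965, 258.6049], [125.3552, 390.9405, 172.3476], 233.3296),
    "V25": ([0.3540, 0.3104, 0.2697], [0.1712, 0.2482, 0.1307], 0.3120),
    "V29": ([2.7490, 2.1338, 2.3027], [1.0062, 1.5277, 1.1115], 2.4892),
}


def t_statistic(var, cluster):
    """Assumed importance measure; cluster is a 0-based index."""
    means, sds, overall = TABLE9[var]
    standard_error = sds[cluster] / math.sqrt(SIZES[cluster])
    return (means[cluster] - overall) / standard_error


def importance_order(cluster):
    """Variables sorted by |t| in descending order for one cluster."""
    return sorted(TABLE9, key=lambda v: abs(t_statistic(v, cluster)), reverse=True)


def f_value(var):
    """One-way ANOVA F statistic rebuilt from summary statistics."""
    means, sds, overall = TABLE9[var]
    ss_between = sum(n * (m - overall) ** 2 for n, m in zip(SIZES, means))
    ss_within = sum((n - 1) * s ** 2 for n, s in zip(SIZES, sds))
    df_between = len(SIZES) - 1
    df_within = sum(SIZES) - len(SIZES)
    return (ss_between / df_between) / (ss_within / df_within)


order_cluster1 = importance_order(0)
order_cluster3 = importance_order(2)
f_values = {var: f_value(var) for var in TABLE9}
```

Under these assumptions, the orderings for clusters 1 and 3 agree with the two descending sequences listed above, and the rebuilt F-values span roughly the reported range. Cluster 2 is left out of the check because several of its statistics are nearly tied, so rounding in Table 9 could change their order.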
CONCLUSIONS AND REMARKS

The conclusion of this study is that the new method of mining BFP knowledge by two-step clustering can provide clear descriptions of which kinds of companies are healthy and which kinds are likely to run into business failure within several years. Our research also identifies the types of financial ratios that are important for a company to remain healthy and for a company to avoid running into distress. Such easily understood reasoning rules have not been provided by previous research in this area. By combining the knowledge generated in this research with the rules generated by DT, which describe the boundary between business failure and health, a general principle for distinguishing companies in business failure from healthy ones is obtained. Note that the objective of this research is to mine knowledge from financial data by two-step clustering. Constructing predictive models from the knowledge obtained is beyond the scope of this research. Because of this limitation, no predictive models are produced, and this remains a valuable topic for further research. In addition, this paper makes three other contributions, which
might guide further research in this area. For binary business failure prediction, the middle cluster, which accounts for a small percentage of the total data, can be handled separately to improve the predictive accuracy of binary classification. The findings also suggest that there exist three clusters, comprising companies in distress, healthy companies and those on the borderline. This opens the possibility of carrying out multi-class classification in the area of BFP. Finally, the financial ratios that are significant for distinguishing distressed companies from healthy ones could also form an optimal feature set to be used as the input of other classifiers. This means that the method presented in this research could be viewed as a new method of feature selection, and further research could be carried out to test its performance in that role.

ACKNOWLEDGEMENTS

This research is partially supported by the National Natural Science Foundation of China (No. 70801055) and the Zhejiang Provincial Natural Science Foundation of China (No. Y6090392). The authors gratefully thank anonymous referees and editors for their recommendations and useful comments.

REFERENCES

Ahn H, Kim KJ (2009). Bankruptcy prediction modeling with hybrid case-based reasoning and genetic algorithms approach. Appl. Soft Comput., 9(2): 599-607.
Altman EI (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. J. Finan., 23: 589-609.
Beaver W (1966). Financial ratios as predictors of failures. J. Account. Res., 4 (supplement): 71-111.
Beaver W (1968). Market prices, financial ratios and the prediction of failure. J. Account. Res., 6: 179-192.
Bird RG, McHugh AG (1977). Financial ratios: An empirical study. J. Bus. Finan. Account., 4(1): 29-45.
Bougen P, Drury JC (1980). U.K. statistical distributions of financial ratios. J. Bus. Finan. Account., 7(1): 39-47.
Bryant SM (1997). A case-based reasoning approach to bankruptcy prediction modeling. Intelligent Systems in Accounting, Finan. Manage., 6: 195-214.
Chiu T, Fang D, Chen J (2001). A robust and scalable clustering algorithm for mixed type attributes in large database environment. In: Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 263-268.
Deakin EB (1976). Distributions of financial accounting ratios: Some empirical evidence. Account. Rev., 51(1): 90-96.
Frecka TJ, Hopwood WS (1983). The effects of outliers on the cross-sectional distributional properties of financial ratios. Account. Rev., 58(1): 115-128.
Frydman H, Altman EI, Kao D (1985). Introducing recursive partitioning for financial classification: The case of financial distress. J. Finan., 40(1): 269-291.
Guha S, Rastogi R, Shim K (1999). ROCK: A robust clustering algorithm for categorical attributes. In: Proceedings of 1999 International Conference on Data Engineering, pp. 73-84.
Han JW, Kamber M (2001). Data Mining Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Mateo.
Hardle W, Lee Y, Schafer D (2009). Variable selection and oversampling in the use of smooth support vector machines for predicting the default risk of companies. J. Forecast., 28(6): 512-534.
Horrigan JO (1983). Methodological implications of non-normally distributed financial ratios: A comment. J. Bus. Finan. Account., 10(4): 683-689.
Hwang R, Cheng K, Lee J (2010). A semi-parametric method for predicting bankruptcy. J. Forecast., 26(5): 317-342.
Jo H, Han I, Lee H (1997). Bankruptcy prediction using case-based reasoning, neural network and discriminant analysis for bankruptcy prediction. Expert Syst. Appl., 13(2): 97-108.
Kumar PR, Ravi V (2007). Bankruptcy prediction in banks and firms via statistical and intelligent techniques – a review. Eur. J. Oper. Res., 180: 1-28.
Li H, Sun J, Wu J (2010). Predicting business failure using classification and regression tree. Expert Syst. Appl., 37: 5895-5904.
Li H, Sun J (2010). Forecasting business failure in China using case-based reasoning with hybrid case representation. J. Forecast., 29: 486-501.
Li ST, Ho HF (2009). Predicting financial activity with evolutionary fuzzy case-based reasoning. Expert Syst. Appl., 36(1): 411-422.
Lin FY, McClean S (2001). A data mining approach to the prediction of corporate failure. Knowledge-Based Syst., 14(3-4): 189-195.
Lin R, Wang Y, Wu C (2009). Developing a business failure prediction model via RST, GRA and CBR. Expert Syst. Appl., 36(2): 1593-1600.
Martin D (1977). Early warning of bank failure: A logit regression approach. J. Bank. Finan., 1: 249-276.
Min JH, Lee YC (2005). Bankruptcy prediction using support vector machine with optimal choice of kernel function parameters. Expert Syst. Appl., 28: 603-614.
Nam C, Kim T, Park N (2008). Bankruptcy prediction using a discrete-time duration model incorporating temporal and macroeconomic dependencies. J. Forecast., 27(6): 493-506.
Norusis MJ (2004). SPSS 13.0 Statistical Procedures Companion. Prentice-Hall, Upper Saddle River, NJ.
Ohlson JA (1980). Financial ratios and the probabilistic prediction of bankruptcy. J. Account. Res., 18: 109-131.
Okazaki S (2006). What do we know about mobile Internet adopters? A cluster analysis. Info. Manage., 43: 127-141.
Ravi V, Pramodh C (2008). Threshold accepting trained principal component neural network and feature subset selection: Application to bankruptcy prediction in banks. Appl. Soft Comput., 8(4): 1539-1548.
Ravisankar P, Ravi V, Bose I (2010). Failure prediction of dotcom companies using neural network-genetic programming hybrids. Info. Sci., 180(8): 1257-1267.
Ricketts D, Stover R (1978). An examination of commercial bank financial ratios. J. Bank Res., 9(2): 121-124.
Santomero AM, Vinso JD (1977). Estimating the probability of failure for commercial banks and the banking system. J. Bank. Finan., 1(2): 185-205.
Serrano-Cinca C (1996). Self organizing neural networks for financial diagnosis. Decis. Support Syst., 17(3): 227-238.
Shin KS, Lee TS, Kim HJ (2005). An application of support vector machines in bankruptcy prediction model. Expert Syst. Appl., 28: 127-135.
Sun J, Li H (2008). Data mining method for listed companies’ financial distress prediction. Knowledge-Based Syst., 21(1): 1-5.
Wilson RL, Sharda R (1994). Bankruptcy prediction using neural networks. Decis. Support Syst., 11: 545-557.
Youn H, Gu Z (2010). Predicting Korean lodging firm failures: An artificial neural network model along with a logistic regression model. Int. J. Hospitality Manage., 29: 120-127.
Zhang T, Ramakrishnan R, Livny M (1996). BIRCH: An efficient data clustering method for very large databases. In: Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 103-114.