Sparse Bayesian Models: Bankruptcy-Predictors of Choice?

Bernardete Ribeiro, Member, IEEE, Armando Vieira, and João Carvalho das Neves

Abstract— Making inferences and choosing appropriate responses based on incomplete, uncertain and noisy data is challenging in financial settings, particularly in bankruptcy detection. In an increasingly globalized economy, bankruptcy results both in huge economic losses and tremendous social impact. Since early prediction of bankruptcy, if done appropriately, is of great importance to banks, insurance firms, creditors, and investors, the need for substantially more accurate prediction models becomes crucial. This problem has been approached by various methods ranging from statistics to machine learning; however, these find a class decision estimate rather than a probabilistic confidence of the class distribution. In this paper we show that sparse Bayesian models, also known as Relevance Vector Machines (RVMs), are superior to state-of-the-art machine learning algorithms such as Support Vector Machines (SVMs), therefore leading to predictors of choice. The advantage of the RVM approach is that the classifier can yield a decision function that is much sparser than the SVM while maintaining its detection accuracy. This can lead to a significant reduction in the computational complexity of the decision function, thereby making it more suitable for real-time applications. Preliminary experiments on the Coface data set (from a French credit risk provider) show that RVM classifiers outperform SVMs and lead to sparser and more accurate prediction models.
Bernardete Ribeiro is with the Department of Informatics Engineering, Center for Informatics and Systems, University of Coimbra, Polo II, P-3030-290 Coimbra, Portugal (phone: +351 239 790 087; fax: +351 239 701 266; email: [email protected]). Armando Vieira is with the Computational Physics Centre, University of Coimbra, P-3000 Coimbra, Portugal. João Carvalho das Neves is with ISEG - School of Economics, Rua Miguel Lupi 20, 1249-078 Lisboa, Portugal.

I. INTRODUCTION

Bankruptcy prediction is the problem of discriminating between healthy and distressed companies based on the record of several financial indicators from previous years [7]. Since early prediction of bankruptcy, if done appropriately, is of great importance to banks, insurance firms, creditors, and investors, the need for substantially more accurate prediction models becomes crucial [2]. Several machine learning algorithms such as neural networks, multiple discriminant analysis and logistic regression have been applied to bankruptcy detection [3], [14], [4]. While recent approaches such as Support Vector Machines and Genetic Algorithms have demonstrated their superiority in signaling a distressed company, traditional Linear Discriminant Analysis (LDA) is still widely used. The success of LDA can be explained by its simplicity and the fact that, in most cases, the differences in performance are not substantial. Another reason for analysts' reluctance to adopt new approaches concerns the quality of the databases used to benchmark the predictions. The most common weaknesses are the small size of the database, limited historical records and the use of unbalanced samples, containing a
much larger number of healthy companies than financially distressed ones.

The problem is stated as follows: given a set of parameters (mainly of a financial nature) that describes the situation of a company over a given period, predict the probability that the company will go bankrupt during the following year. It is clear that probabilistic models are better suited for class distribution prediction. Probabilistic models and the corresponding algorithms have a different nature from other machine learning algorithms, although they inherently share the same goal of function approximation, density estimation or class decision finding. The estimation problem requires a powerful learning algorithm capable of good generalization on unseen data. The Vapnik machine [11] and the Tipping machine [10] both rest on sound mathematical foundations [6]. The former relies on the Vapnik-Chervonenkis (VC) dimension to control complexity through maximal margin separation, whereas the latter, exploiting a Bayesian learning framework, introduces priors over the weight parameters. Both use a kernel representation which leads to a sparse representation of the problem. While good generalization is ensured by the SVM, sparseness is more fully exploited by the RVM, which yields few relevant vectors and probabilistic outputs while avoiding the restriction to Mercer kernels.

We apply the RVM and SVM methods to the problem of bankruptcy prediction of private companies. Financial data obtained from Diana, a database containing 780,000 financial statements of French companies, are used to perform the experiments. We compare the sparse vector machine approaches with neural networks based on results from our previous work on this problem [13]. The classifiers' predictive capabilities are examined on the basis of several performance measures, such as ROC curves.

The paper is organized as follows. Section II briefly describes Support Vector Machines. Section III introduces Relevance Vector Machines, reviewing the supervised learning problem, sparse Bayesian learning, prior model specification, sparsity and relevance vector classification. Section IV describes the data set for the bankruptcy problem. In Section V results are presented and discussed. Finally, Section VI draws conclusions and provides possible directions for future research.

II. SUPPORT VECTOR MACHINES

Support Vector Machines (SVMs) combine two strong ideas: maximum margin classifiers with low capacity, and implicit feature spaces defined by kernel functions [11]. In other words, they combine the following properties: low Vapnik-Chervonenkis (VC) dimension solutions through
maximization of the margin, and kernel nonlinearity. These properties bring good generalization to the Vapnik learning machine. Given a training data set consisting of input-output pairs $\{x_n, t_n\}_{n=1}^N$, SVMs use the kernel trick to build, in input space, nonlinear decision functions of the form:

$$f(x) = \sum_{n=1}^{N} w_n K(x, x_n) + w_0 \qquad (1)$$

Here $K$ represents the kernel (or mapping) function, whose Gram matrix must be positive semi-definite; a common choice is the Gaussian kernel

$$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j) = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$$

where $\phi$ is the mapping from input space to feature space. The weights $w_n$ are given by the Lagrange multipliers of a convex optimization problem; the training examples with non-zero multipliers are called support vectors (SVs). The problem is solved by a quadratic programming procedure with complexity $O(N^2)$.
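As a concrete illustration (ours, not part of the original experiments), a Gaussian-kernel SVM of the form (1) could be fitted with scikit-learn; the data arrays below are random stand-ins for the financial ratios:

```python
# Minimal sketch: Gaussian-kernel SVM for two-class bankruptcy detection.
# 'ratios' and 'labels' are illustrative placeholders for the Coface data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
ratios = rng.standard_normal((200, 21))   # 21 normalized financial ratios
labels = rng.integers(0, 2, size=200)     # 1 = distressed, 0 = healthy

sigma = 1.0
clf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma**2), C=1.0)  # exp(-||xi-xj||^2 / 2*sigma^2)
clf.fit(ratios, labels)
print("support vectors:", clf.support_vectors_.shape[0])    # examples with non-zero multipliers
```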
III. RELEVANCE VECTOR MACHINES

A. The Supervised Learning Problem

We adopt a discriminative approach where the focus is on learning the function $y(x; w)$ directly from the training data set. In regression it is generally assumed that the targets are samples from a model with additive noise, $t_n = y(x_n; w) + \epsilon_n$, where $\epsilon_n$ is zero-mean Gaussian noise with variance $\sigma^2$:

$$p(t_n|x) = \mathcal{N}(t_n \mid y(x_n; w), \sigma^2) \qquad (2)$$

The learning function $y(x; w)$ (similar to the SVM formulation) is defined by:

$$y(x; w) = \sum_{n=1}^{N} w_n K(x, x_n) + w_0 \qquad (3)$$

where $K(x, x_n)$ is a kernel function which defines a basis function $\phi_n(x)$ for each example in the training data set. The likelihood of the complete data set, assuming the $t_n$ are independent, can be written as:

$$p(t|w, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\left\{-\frac{1}{2\sigma^2}\|t - \Phi w\|^2\right\} \qquad (4)$$

where $\Phi = [\phi(x_1), \phi(x_2), \ldots, \phi(x_N)]^T$ is the design matrix with dimension $N \times (N+1)$, $\phi(x_n) = [1, K(x_n, x_1), K(x_n, x_2), \ldots, K(x_n, x_N)]^T$, $t = [t_1 \ldots t_N]^T$ and $w = [w_0, w_1, \ldots, w_N]^T$. Estimating $w$ and $\sigma^2$ by likelihood maximization would lead to overfitting. Therefore the Bayesian learning framework described in the next section is followed.

B. Sparse Bayesian Learning

A Bayesian approach to controlling complexity in discriminative learning consists of placing a prior probability distribution $p(w|\alpha)$ over the weight parameters. Defining an explicit prior avoids the overfitting that is prevented in SVMs by the margin term. In the ARD (Automatic Relevance Determination) model [9], independent Gaussian priors are assigned to the feature weights. The technique promotes simplicity and smoothness, allowing tractability and computability:

$$p(w|\alpha) = \prod_{n=0}^{N} \mathcal{N}(w_n \mid 0, \alpha_n^{-1}) \qquad (5)$$

In equation (5) the vector of hyperparameters $\alpha$, of dimension $N+1$, specifies the importance of each weight. This vector is obtained from a training procedure that maximizes the evidence $p(t|\alpha)$. In this optimization many values of $\alpha_n$ go to infinity, leaving only a few non-zero $w_n$; irrelevant features are thus pruned out of the problem.

C. Prior Specification Model

To complete the prior model, suitable hyperpriors over $\alpha$, as well as over the remaining parameter, the noise variance $\sigma^2$, should be specified [5]. They are given by Gamma distributions:

$$p(\alpha) = \prod_{i=0}^{N} \text{Gamma}(\alpha_i \mid a, b)$$

$$p(\beta) = \text{Gamma}(\beta \mid c, d), \quad \text{with } \beta = \sigma^{-2}$$

and where:

$$\text{Gamma}(\alpha \mid a, b) = \Gamma(a)^{-1}\, b^a\, \alpha^{a-1}\, e^{-b\alpha} \qquad (6)$$

in which $\Gamma(a) = \int_0^\infty t^{a-1} e^{-t}\, dt$ is the Gamma function. For a Gamma prior over the hyperparameters it is possible to integrate out $\alpha$ independently for each weight to obtain the prior:

$$p(w_i) = \int p(w_i|\alpha_i)\, p(\alpha_i)\, d\alpha_i = \frac{b^a\, \Gamma\!\left(a + \frac{1}{2}\right)}{\sqrt{2\pi}\, \Gamma(a)} \left(b + \frac{w_i^2}{2}\right)^{-(a+\frac{1}{2})} \qquad (7)$$

Fig. 1. Prior $p(w)$ obtained from (7) using Maple.
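To make the shape of the marginal prior (7) concrete, the sketch below (ours; the hyperparameter values $a = b = 10^{-2}$ are illustrative, not the paper's settings) evaluates this Student-t-like density numerically. Its sharp peak at zero and heavy tails are what drive the sparsity discussed next:

```python
# Sketch: evaluate the marginal weight prior of Eq. (7).
# Hyperparameter values a, b are illustrative assumptions.
import numpy as np
from scipy.special import gammaln

def marginal_prior(w, a=1e-2, b=1e-2):
    # log p(w) = a*log(b) + logGamma(a+1/2) - logGamma(a)
    #            - (1/2)log(2*pi) - (a+1/2)*log(b + w^2/2)
    log_p = (a * np.log(b) + gammaln(a + 0.5) - gammaln(a)
             - 0.5 * np.log(2 * np.pi) - (a + 0.5) * np.log(b + w**2 / 2))
    return np.exp(log_p)

w = np.linspace(-3, 3, 7)
print(marginal_prior(w))   # sharply peaked at w = 0, heavy tails
```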
D. Sparsity

The assignment of an individual hyperparameter to each weight, or basis function, is crucial in the Bayesian learning framework and is responsible for the sparsity properties. In a sparse estimate of the parameters, irrelevant or redundant components are exactly zero. Sparsity is achieved because the posterior distributions of many of the weights are sharply peaked around zero [5]. From the above discussion it is clear that sparsity is desirable for the following reasons:
• structural model simplification
• performing feature/variable selection
• minimizing computational cost
• improvement of generalization

E. Relevance Vector Classification

For an input vector $x_n$, sparse Bayesian classification models the probability distribution of its class label $t_n \in \{0, 1\}$ using the logistic sigmoid function $\sigma$:

$$p(t_n = 1|x_n) = \frac{1}{1 + \exp(-y(x_n; w))} \qquad (8)$$

where the classifier $y$ is given by (3). Adopting a Bernoulli distribution for $p(t|w)$, the likelihood can be written as:

$$p(t|w) = \prod_{n=1}^{N} \sigma(y(x_n; w))^{t_n} \left[1 - \sigma(y(x_n; w))\right]^{1-t_n} \qquad (9)$$

Unlike the regression case, the weights cannot be integrated out analytically, so closed forms for the posterior $p(w|t, \alpha)$ and the marginal likelihood $p(t|\alpha)$ cannot be obtained. We therefore use the Laplace approximation scheme developed in [8] to approximate the posterior distribution. Specifically, Laplace's method approximates the posterior by a Gaussian distribution around the maximum a posteriori value of $w$, $w_{MP}$. It follows that:

$$p(w|t, \alpha) \approx \mathcal{N}(w \mid w_{MP}, \Sigma)$$

where $\Sigma$ is the covariance matrix of the posterior probability over the weights, centered at $w_{MP}$:

$$\Sigma = \left(\Phi^T B \Phi + A\right)^{-1} \qquad (10)$$

$$w_{MP} = \Sigma \Phi^T B \hat{t} \qquad (11)$$

Here $A = \text{diag}(\alpha)$ is the diagonal matrix of hyperparameters, and $B = \text{diag}(\beta_1, \ldots, \beta_N)$ with $\beta_n = \sigma(y(x_n; w))\left[1 - \sigma(y(x_n; w))\right]$. The iterative approximation finds, for fixed $\alpha$, the locally most probable weights $w_{MP}$; this step typically involves a gradient-based optimization over the parameters. Using the new $w_{MP}$, the new target $\hat{t}$ is then obtained through:

$$\hat{t} = \Phi w_{MP} + B^{-1}\left(t - \sigma(y(x; w))\right) \qquad (12)$$

Using $\Sigma$ and $w_{MP}$, the $\alpha_i$ parameters can be updated by:

$$\alpha_i = \frac{\gamma_i}{w_{MP,i}^2} \qquad (13)$$

$$\gamma_i = 1 - \alpha_i \Sigma_{ii} \qquad (14)$$

The RVM training algorithm is computationally costly, with complexity $O(N^3)$.
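The update cycle (10)-(14) can be rendered as a short training loop. The sketch below is our minimal reading of it; the initialization, iteration counts and pruning cap are illustrative assumptions rather than the authors' settings:

```python
# Sketch of RVM classification training: an inner Laplace/IRLS step for w_MP
# (Eqs. 10-12) followed by the alpha re-estimation (Eqs. 13-14).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rvm_classify_fit(Phi, t, n_outer=50, alpha_max=1e9):
    N, M = Phi.shape
    alpha = np.ones(M)
    w = np.zeros(M)
    for _ in range(n_outer):
        # Inner Newton/IRLS loop toward the most probable weights w_MP
        for _ in range(25):
            y = sigmoid(Phi @ w)
            B = y * (1 - y)                       # beta_n entries of diag(B)
            A = np.diag(alpha)
            H = Phi.T @ (B[:, None] * Phi) + A    # inverse posterior covariance (Eq. 10)
            g = Phi.T @ (t - y) - alpha * w       # gradient of the log posterior
            w = w + np.linalg.solve(H, g)         # Newton step
        Sigma = np.linalg.inv(H)
        gamma = 1 - alpha * np.diag(Sigma)        # Eq. (14)
        alpha = gamma / np.maximum(w**2, 1e-12)   # Eq. (13)
        alpha = np.minimum(alpha, alpha_max)      # huge alpha => weight pruned to zero
    return w, alpha
```

Basis functions whose $\alpha_i$ grow without bound have weights pinned at zero; the surviving columns of $\Phi$ correspond to the relevance vectors.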
F. Neural Networks

We tested several Multilayer Perceptron (MLP) neural networks with between 5 and 20 hidden nodes. A hidden layer of 15 neurons with a learning rate of 0.1 and a momentum term of 0.25 was found to perform best. We also used Hidden Layer Learning Vector Quantization (HLVQ), a recent method for classifying high-dimensional data [12], [1]. HLVQ is implemented in three steps. First, a multilayer perceptron is trained with some learning algorithm. Then, supervised Learning Vector Quantization is applied to the outputs of the last hidden layer $\vec{h}$ to obtain a set of code-vectors $\vec{\omega}_i$, one for each class $i$. Each example $\vec{x}$ is assigned to the class $k$ whose code-vector has the smallest Euclidean distance,

$$\min_k \|\vec{\omega}_k - \vec{h}(\vec{x})\|$$

as in usual learning vector quantization. In the third step the perceptron is retrained with two differences: first, the error correction is applied not to the output layer but directly to the last hidden layer, and second, the back-propagated error correction is proportional to the distance to each class code-vector. After HLVQ is applied to the MLP, only a small fraction of the hidden nodes remains relevant for the code-vectors. HLVQ therefore simplifies the network, reducing the risk of overfitting.

IV. DATA DESCRIPTION

We used a sample obtained from Diana, a database containing about 780,000 financial statements of French companies. The initial sample consisted of financial ratios of 2,800 industrial French companies, for the years 1998, 1999 and 2000, each with at least 35 employees. Of these companies, 311 were declared bankrupt in 2000 and 272 presented a restructuring plan ("Plan de redressement") to the court for approval by the creditors. We decided not to distinguish between these two categories, as both signal companies in financial distress. The sample used for this study contains 583 financially distressed firms, most of them small to medium size with between 35 and 400 employees, described by data from 1999 – thus we are making bankruptcy predictions one year ahead. The dataset includes companies from a wide range of industrial sectors. From the initial 30 financial ratios defined by COFACE¹ and included in the Diana database, we selected the 21 most relevant based on their sensitivity. All ratios were normalized to zero mean and unit variance.

¹Coface is a French credit risk provider.
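A minimal sketch of this preprocessing and of the balanced train/test split used in Section V (the array names, the splitting routine and the random seeds are our assumptions, not the paper's procedure):

```python
# Sketch: standardize the 21 selected ratios to zero mean, unit variance,
# and split the balanced sample into training and test sets.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

ratios = np.random.default_rng(1).standard_normal((1166, 21))  # 583 healthy + 583 distressed
labels = np.array([0] * 583 + [1] * 583)

X_train, X_test, y_train, y_test = train_test_split(
    ratios, labels, test_size=350, stratify=labels, random_state=0)  # 816 train / 350 test

scaler = StandardScaler().fit(X_train)   # mean and variance estimated on training data only
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
```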
TABLE I
BANKRUPTCY DETECTION: ALGORITHMS COMPARISON (ERROR RATES IN %)

Learning Machine | EVs | Training Type I | Training Type II | Test Type I | Test Type II
SVM              | 328 | 10.37           | 4.15             | 19.41       | 13.40
RVM              |  26 | 10.12           | 7.06             | 13.53       |  8.89
MLP              |  54 | 21.20           | 5.26             | 35.95       | 12.83
HLVQ             |  -  | 23.15           | 8.23             | 33.84       | 10.75

TABLE II
BANKRUPTCY DETECTION: PERFORMANCE RESULTS (%)

Measure   | SVM (AUC=87.23) Training | SVM Test | RVM (AUC=92.67) Training | RVM Test
Recall    | 89.63                    | 80.58    | 89.88                    | 86.47
Precision | 95.53                    | 85.09    | 92.62                    | 90.18
Accuracy  | 92.77                    | 83.67    | 91.42                    | 91.42

TABLE III
BANKRUPTCY DETECTION: KERNELS COMPARISON (RVM, AUC IN %)

Kernel type       | AUC
Gauss             | 92.97
Thin Plate Spline | 92.22
Linear            | 83.30
Cubic             | 89.61
Spline            | 90.23
Cauchy            | 90.83
A. Performance Metrics

The performance of all the learning machines tested is examined in terms of recall, precision and accuracy:

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (15)$$

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (16)$$

In the above formulas TP, FP, TN and FN denote, respectively, the number of True Positives, False Positives, True Negatives and False Negatives. The two types of errors commonly used in two-class classification problems have also been compared. In this problem, the type I error is the number of cases classified as healthy when they are observed to be bankrupt ($N_{10}$), divided by the number of bankrupt companies ($N_1$):

$$e_I = \frac{N_{10}}{N_1} \qquad (17)$$

The type II error is the number of companies classified as bankrupt when in reality they are healthy ($N_{01}$), divided by the total number of healthy companies ($N_0$):

$$e_{II} = \frac{N_{01}}{N_0} \qquad (18)$$

The total error is:

$$e_{Total} = \frac{N_{10} + N_{01}}{N_0 + N_1} \qquad (19)$$

which is the average of the type I and type II errors for a balanced dataset, i.e., $N_0 = N_1$. The global accuracy is defined as $1 - e_{Total}$. Most companies on the verge of bankruptcy have more heterogeneous patterns, and therefore the type I error is in general higher than the type II error. Since the cost associated with this type of error is in general also higher, the overall classification error may not be the best indicator of a learning machine's performance.

No less important are ROC (Receiver Operating Characteristic) curves, which are widely used to compare classifiers [15]. The ROC curve is generated by plotting the rate at which true positives accumulate against the rate at which false positives accumulate, corresponding respectively to the vertical and horizontal axes in Figure 2. The former is the probability of detecting a bankruptcy given that a truly bankrupt company is present; the latter is the probability that an alarm is raised given that only noise is present. ROC analysis provides powerful features for displaying a classifier's characteristics; seen in this light, it can give a view of the predictor's capability in similar environments.

V. RESULTS AND DISCUSSION
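The error measures (15)-(19) translate directly into code. A sketch (ours), with class 1 taken as bankrupt, the positive class:

```python
# Sketch: the metrics of Eqs. (15)-(19) from predicted and true labels.
import numpy as np

def bankruptcy_metrics(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    n1, n0 = tp + fn, tn + fp                  # bankrupt / healthy totals
    return {
        "recall": tp / (tp + fn),              # Eq. (15)
        "precision": tp / (tp + fp),           # Eq. (16)
        "type_I": fn / n1,                     # Eq. (17): missed bankruptcies
        "type_II": fp / n0,                    # Eq. (18): false alarms
        "total_error": (fn + fp) / (n0 + n1),  # Eq. (19); accuracy = 1 - total_error
    }
```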
We used a balanced dataset containing 583 healthy and 583 distressed companies. The training set has 816 samples and the test set 350 samples. We compare the accuracy of the classifiers in terms of the type I error, the type II error and the overall misclassification.
Fig. 2. ROC curves for the tested classifiers: true positive rate $TP/(TP+FN)$ versus false positive rate $FP/(TN+FP)$.
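ROC curves and AUC values such as those in Figure 2 and Table III can be computed, for example, with scikit-learn; a minimal sketch with illustrative labels and scores (the scores would in practice be the classifier's probabilistic output, e.g. Eq. (8) for the RVM):

```python
# Sketch: ROC curve and AUC from probabilistic classifier scores.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90])  # illustrative probabilities

fpr, tpr, thresholds = roc_curve(y_true, scores)
print("AUC =", roc_auc_score(y_true, scores))  # 0.5 corresponds to a worthless classifier
```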
Type I error is the percentage of undetected bankruptcies, while type II error is the percentage of healthy companies predicted as bankrupt. Tables I and II summarize the results with data from 1999, one year prior to the announcement of bankruptcy, including the results obtained by the several algorithms. It is clear that RVM outperforms all the others both in overall accuracy and, more importantly, in type I error, which has a much higher cost for banks and insurance companies than type II error. The type I error is always higher than the type II error, since distressed companies present a more heterogeneous pattern and are therefore harder to classify. Table I compares the two sparse machines, SVM and RVM, in terms of the number of selected kernels or basis functions; the latter is far more parsimonious than the former. Table II compares the results for SVM (AUC=87.23) and RVM (AUC=92.67), demonstrating the superiority of the latter sparse classifier on this problem, particularly in terms of performance on the test data. Table III compares RVM classifiers by the AUC measure for several kernels; the best performance is obtained with the Gaussian kernel. The results shown in Figure 2 for all the tested classifiers are very good, with areas under the curve (AUC) of around 90%, whereas the diagonal represents a worthless classifier with an AUC of 50%. Although the best results (see Tables I and II) are obtained with RVMs, the SVM and MLP results are largely comparable. The strength of RVMs is that they yield higher sparsity in the final model (see the number of basis functions, or expansion vectors (EVs), in Table I), making them more suitable for real-time applications.

VI. CONCLUSION

SVM, MLP and RVM classifiers were applied to a problem of bankruptcy detection in medium-size private companies, where the accuracy of model prediction is crucial for final decision making in financial institutions. While SVMs are among the state-of-the-art methods providing good performance, the recently developed RVM sparse kernel classifiers attain better accuracy, sparsity and efficiency. These classifiers incorporate weighted sums of basis functions with priors promoting sparsity; training drives the weight estimates to be either significantly large or exactly zero. The overall outcome is that the complexity of the learned classifier is controlled by minimizing the number of kernels in the final model, resulting in better generalization. Since some relevant financial ratios can show large annual variation, as future work we plan to include records of these ratios over a longer period of years prior to bankruptcy.
ACKNOWLEDGMENT

CISUC - Center for Informatics and Systems of the University of Coimbra - is gratefully acknowledged for partially supporting this research.

REFERENCES

[1] A. Vieira, P. A. Castillo, and J. J. Merelo. Comparison of HLVQ and GProp in the problem of bankruptcy prediction. In J. Mira, editor, IWANN'03 - International Work-Conference on Artificial Neural Networks, LNCS 2687, pages 665-662. Springer-Verlag, 2003.
[2] E. I. Altman. Corporate Financial Distress and Bankruptcy: A Complete Guide to Predicting and Avoiding Distress and Profiting from Bankruptcy. John Wiley & Sons, New York, 2nd edition, 1993.
[3] A. F. Atiya. Bankruptcy prediction for credit risk using neural networks: A survey and new results. IEEE Transactions on Neural Networks, 12(4), 2001.
[4] W. Banzhaf, P. Nordin, R. E. Keller, and F. D. Francone. Genetic Programming: An Introduction on the Automatic Evolution of Computer Programs and Its Applications. Morgan Kaufmann Publishers, Inc., 1998.
[5] C. M. Bishop and M. E. Tipping. Bayesian regression and classification. In J. Suykens, G. Horvath, S. Basu, C. Micchelli, and J. Vandewalle, editors, Advances in Learning Theory: Methods, Models and Applications, volume 190 of NATO Science Series III: Computer and Systems Sciences, pages 267-285. IOS Press, 2003.
[6] F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1):1-49, 2001.
[7] J. S. Grice and M. T. Dugan. The limitations of bankruptcy prediction models: Some cautions for the researcher. Review of Quantitative Finance and Accounting, 17(2):151, 2001.
[8] D. J. C. MacKay. Bayesian Methods for Adaptive Models. PhD thesis, California Institute of Technology, 1992.
[9] D. J. C. MacKay. The evidence framework applied to classification networks. Neural Computation, 4:720-736, 1992.
[10] M. E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211-244, June 2001.
[11] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
[12] A. Vieira and N. P. Barradas. A training algorithm for classification of high dimensional data. Neurocomputing, 50C:461-472, 2003.
[13] A. Vieira, B. Ribeiro, S. Mukkamala, and A. Sung. On the performance of learning machines for bankruptcy prediction. In IEEE International Conference on Computational Cybernetics (ICCC), pages 323-327, Vienna, August 2004. IEEE.
[14] G. Zhang, M. Y. Hu, B. E. Patuwo, and D. C. Indro. Artificial neural networks in bankruptcy prediction: General framework and cross-validation analysis. European Journal of Operational Research, 116:16, 1999.
[15] K. H. Zou, W. J. Hall, and D. Shapiro. Smooth non-parametric ROC curves for continuous diagnostic tests. Statistics in Medicine, 16:2143-2156, 1997.