Choosing Data-Mining Methods for Multiple Classification: Representational and Performance Measurement Implications for Decision Support

WILLIAM E. SPANGLER, JERROLD H. MAY, AND LUIS G. VARGAS

WILLIAM E. SPANGLER is an Assistant Professor in the College of Business and Economics at West Virginia University. After several years in private industry, he earned his Ph.D. in 1995 from the Katz Graduate School of Business, University of Pittsburgh, specializing in artificial intelligence. His current research interests focus on data mining and computational modeling for decision support. His work has been published in various journals, including Information and Management, Interfaces, Expert Systems with Applications, and IEEE Transactions on Knowledge and Data Engineering.
JERROLD H. MAY is a Professor of Artificial Intelligence at the Katz Graduate School of Business, University of Pittsburgh, and is also the Director of the Artificial Intelligence in Management (AIM) Laboratory there. He has more than sixty refereed publications in a variety of outlets, ranging from management journals such as Operations Research and Information Systems Research to medical ones such as Anesthesiology and Journal of the American Medical Informatics Association. Professor May's current work focuses on modeling, planning, and control problems, the solutions to which combine management science, statistical analysis, and artificial intelligence, particularly for operational tasks in health-related applications.
LUIS G. VARGAS is a Professor of Decision Sciences and Artificial Intelligence at the Katz Graduate School of Business, University of Pittsburgh, and Co-Director of the AIM Laboratory. He has published over forty papers in refereed journals such as Management Science, Operations Research, Anesthesiology, and Journal of the American Medical Informatics Association, and three books on applications of the Analytic Hierarchy Process with Thomas L. Saaty. Professor Vargas's current work focuses on the use of operations research and artificial intelligence methods in health care environments.

ABSTRACT: Data-mining techniques are designed for classification problems in which each observation is a member of one and only one category. We formulate ten data representations that could be used to extend those methods to problems in which observations may be full members of multiple categories. We propose an audit matrix methodology for evaluating the performance of three popular data-mining techniques (linear discriminant analysis, neural networks, and decision tree induction) using the representations that each technique can accommodate. We then empirically test our approach on an actual surgical data set. Tree induction gives the lowest rate of false positive predictions, and a version of discriminant analysis yields the lowest rate of false negatives for multiple category problems, but neural networks give the best overall results for the largest multiple classification cases. There is substantial room for improvement in overall performance for all techniques.

KEY WORDS AND PHRASES: data mining, decision support systems, decision tree induction, neural networks, statistical classification.

Journal of Management Information Systems / Summer 1999, Vol. 16, No. 1, pp. 31-62. © 1999 M.E. Sharpe, Inc.
DATA MINING IS THE SEARCH THROUGH REAL-WORLD DATA for general patterns that are useful in classifying individual observations and in making reasoned predictions about outcomes [11]. That generally entails the use of statistical methods to link a series of independent variables that collectively describe a case or observation to the dependent variable(s) that classifies the case. The set of classification problems typically includes patterns containing a single, well-defined dependent variable or category; that is, an observation is assigned to one and only one category. This research explores the less-tractable problems of multiple classification in data mining, wherein a single observation may be classified into more than one category. Because multiple classification is a significant aspect of numerous managerial tasks, including diagnosis, auditing, and scheduling, understanding the effectiveness of data-mining methods in multiple classification situations has important implications for the use of information systems for knowledge-based decision support.

By multiple classification, we mean that the categories are well defined and mutually exclusive, but the observations themselves transcend categorical boundaries. This contrasts with fuzzy clustering, in which the categories themselves are not necessarily either well defined or mutually exclusive. Consider, for example, a universe of categories that includes men, researchers, and jazz singers. In this situation, a single person can belong to any combination of these categories simultaneously, in effect having multiple membership across categories. Classifying someone in such a context requires recognizing the potential for multiple membership, while also identifying the correct categories themselves. Figures 1 and 2 show two alternative ways of pictorially representing multiple classification problems. In Figure 1, the categories are shown as distinct and mutually exclusive, with individual cases or observations transcending categories. Figure 2 shows the categories in a type of Venn diagram, with multiply classified cases appearing in the intersections between categories. In contrast to the situation we consider, when the categories themselves are poorly defined or understood, classification of observations within those categories may be correspondingly uncertain. Fuzzy clustering assigns observations with likelihoods to multiple categories, where the likelihoods sum to one.
Multiple Classification Problems: An Example
DATA MINING IN A DIAGNOSTIC SETTING IS THE SEARCH for patterns linking state descriptions with associated outcomes, with the objective of predicting an outcome
Figure 1. Observations Classified into Multiple Categories: Observation O1 is in Category C1, O2 is in C1 and C2, while O3 is in C1, C2, and C3
Figure 2. A Multiple Classification Domain Pictured as a Venn Diagram, with Multiply Classified Observations Appearing in the Intersections
given new data showing a similar pattern. It can be characterized as an attempt to extract knowledge from data. In the medical domain, numerical codes are used to describe both what is wrong with a patient and what was done to treat the patient. The dominant patient state taxonomy is the International Classification of Diseases (ICD-9) coding system, which indicates a patient's disease or condition within a hierarchical, numerical classification scheme. The corresponding procedural taxonomy is the Current Procedural Terminology (CPT) system, which indicates the procedure(s) performed on a patient, partly in response to the ICD-9 code(s) assigned to the patient. Our empirical results are derived from 59,864 patient records, each of which includes the patient's diagnoses (ICD-9s), the procedures performed on the patient (CPTs), and patient demographic and case-specific information.
The data-mining task is to find patterns linking patient ICD-9 and demographic information to CPT outcomes. The task is important for surgical scheduling because patient diagnoses and demographics are known before the patient enters the operating room, but the procedures that will be performed there often are not known with certainty. The surgery performed is a function of the information discovered by the physicians during the course of the operation. Data mining in this situation is a multiple-classification task because several procedures may be performed on a patient during a single operation. The patient records to which we had access contain as many as three CPTs each. The ICD-9(s) and patient demographics are the independent variables in the analysis. The surgical procedures performed on the patient (CPTs) are the dependent variables. Each set of independent variables is linked to one, two, or three CPTs. The identification of the procedures to be performed on a patient is important for a number of managerial tasks in this domain, including medical auditing, operating room scheduling, and materials management. If a surgical scheduler knew in advance the most likely sets of procedures and had models for estimating times for sets of procedures, the scheduler could plan the operating room schedule more effectively. The problem for a data-mining method in this multiple classification domain is twofold: First, the method must be able to identify the proper number of CPTs associated with a specific pattern. That is, if a set of patient factors is normally associated with two CPTs, the method should construct a pattern linking the factors to two and only two CPTs. Second, the method should identify the specific CPTs. Scoring the performance of a data-mining tool requires consideration of both of these aspects of the problem.
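To make the setup concrete, the following minimal sketch (field names and values invented for illustration, not drawn from the actual data set) shows how one such record might be structured for mining, with the ICD-9 codes and demographics as independent variables and the one-to-three CPT codes as the multi-valued dependent variable:

```python
# Hypothetical surgical record; field names and values are illustrative only.
record = {
    "icd9": ["180.9"],             # diagnoses (one to three ICD-9 codes)
    "age": 54,
    "sex": "F",
    "anesthesia": "general",       # general / monitor / regional
    "asa": 2,                      # six-value ASA ordinal condition code
    "emergency": False,
    "inpatient": True,
    "surgeon_id": 17,
    "cpt": ["58210", "77762"],     # dependent variable: one to three CPTs
}

# The independent variables (X) and the multi-valued dependent variable (y):
X = {k: v for k, v in record.items() if k != "cpt"}
y = set(record["cpt"])

# A prediction is judged on two things: the right *number* of CPTs
# (here, two) and the right *codes* (here, 58210 and 77762).
```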
Comparison of Data-Mining Methods
THE EXAMPLE ABOVE HINTS AT THE CHARACTERISTICS of the multiple classification problem that make it interesting as well as difficult. Our goal is to find and to propose solutions to the following associated problems of multiple classification:

Problem representation: How should a decision maker structure a model both to recognize and to identify multiple classes? Should the dependent variables be treated individually, as a series of yes/no questions related to the presence or absence of each variable, even when they occur as a group in a single record? Alternatively, should a group of dependent variables be treated as a separate entity, distinct from the individual variables that comprise the group?

Performance measurement: In multiple classification, there is potential for both false positives (i.e., assigning an observation to an incorrect class) and false negatives (i.e., not assigning an observation to a correct class). A decision maker requires a strategy for scoring the performance of various data-mining methods, based on the relative number of both types of errors and their associated costs.

Those research issues could be investigated using mathematical arguments or numerically. While a mathematical comparison would be the most definitive one, we are not aware of a methodology that would permit it to be done. Numerical research
Table 1. Comparison of Data-Mining Methods

                          Decision tree induction        Neural networks                  Discriminant analysis
Type of method            Logic-based                    Math-based                       Math-based
Learning approach         Supervised                     Supervised                       Supervised
Linearity                 Linear                         Nonlinear                        Linear
Representational scheme   Set of decision nodes and      Functional relationship          Functional relationship
                          branches; production system    between attributes and classes   between attributes and classes
can be done with artificially generated data or with real data. Conclusions based on empirical research are useful if the characteristics of the samples on which they are based are sufficiently similar to those of problem instances others are likely to encounter in practice. We preferred using a real data set to the use of generated data because it is representative of an important class of managerial problems, and because the noise in the data set provides us with information on the sensitivity of the approaches to "dirt" in a data set. Artificial data would have allowed us to carefully control the population parameters from which the data are drawn, but would have required that we first define and estimate values for all such critical population parameters. Using our real data set, we empirically compare the performance of tree and rule induction (TRI), artificial neural networks (ANN), and linear discriminant analysis (LDA) in modeling the multiple classification patterns in our data set. We chose the three methods because each is a popular data-mining method sharing a number of common characteristics while also exhibiting some notable differences (see Table 1). Weiss and Indurkhya divide data-mining algorithms into three groups: math-based methods, distance-based methods, and logic-based methods [29]. LDA is the most common math-based method, as well as the most common classification technique in general, while TRI is the most common logic-based method. Neural networks are an increasingly popular nonlinear math-based method. Tools employing these methods are commonly available in commercial computer-based applications. All three methods are supervised learning techniques. That is, they induce rules for assigning observations to predefined classes from a set of examples, as opposed to unsupervised techniques, which both define classes and determine classification rules [20, 25]. Supervised learning techniques are appropriate for our decision problems because the classes (CPT codes) are defined exogenously and cannot be modified by the decision maker. Each of the methods we compare engages in discrete classification through a process of selection and combination of case attributes, and each employs similar validation techniques, described below. Cluster analysis and knowledge discovery methods are examples of unsupervised learning algorithms. The methods also differ, particularly in the way they model the relationships among
attributes and classes. The classification structures of LDA and ANN are expressed mathematically as a functional relationship between weighted attributes and resulting classes. TRI represents relationships as a set of decision nodes and branches, which, in turn, can be represented as a production system, or set of rules. LDA and TRI are linear approaches; NN is nonlinear. The effectiveness of the representation generally depends on the orientation and mathematical sophistication of the user. Liang argues that, because the choice of a learning method for an application is an important problem, research into the comparison of alternative (or perhaps even complementary) methods is likewise important [19]. That is especially true for data mining, where the costs and potential benefits involved strongly motivate the proper choice of tool and method as well as the proper analysis of the results.
Tree and Rule Induction

TRI is attractive because its explicit representation of classification as a series of binary splits makes the induced knowledge structure easy to understand and validate. TRI constructs a tree, but the tree can be translated into an equivalent set of rules. We used Quinlan's See5 package, the most recent version of his ID3 algorithm [22]. ID3 induces a decision tree from a table of individual cases, each of which describes identified attributes as well as the class to which the case belongs. At each node, the algorithm builds the tree by assessing the conditional probabilities linking attributes and outcomes, and divides the subset of cases under consideration into two further subsets so as to minimize entropy, a measure of the information content of the data. The user specifies parameters that control the stopping behavior of the method. If the training set contains no contradictory cases, that is, cases with identical attributes that are members of different classes, a fully grown tree will produce an error rate of zero on the training set. Weiss and Kulikowski [31] show that as a tree becomes more complex, measured by the number of decision nodes it contains, the danger of overfitting the data increases, and the predictive power of the tree declines commensurately. That is, the true predictive error rate, measured by the performance of the tree on test cases, becomes much higher than the apparent error rate reflected in the performance of the tree against the training cases alone. To minimize the true error rate, See5 first grows the tree completely, and then prunes it based on a prespecified certainty factor at each node.

Performance evaluation of classification methodologies is discussed in the statistical literature (for example, see [15], ch. 11). The two most common approaches are dividing the data set into training and holdout subsets before estimating the model, and jackknifing. The former avoids the bias of using the same information for both creating and judging the model. However, it requires large data sets, and there is no simple way to provide a definitive rule for determining either the size or composition of the two subsets. Worse, separating a holdout sample results in the creation of a model that is not the desired one, because the removal of the holdout sample reduces the information content of the training set and may exclude cases that are critical to the estimation of an accurate or robust model. The alternative to separating the data into two groups is
jackknifing, a one-at-a-time holdout procedure due to Lachenbruch [18]. Jackknifing temporarily ignores the first observation, estimates the model using observations two through n, classifies the first, held-out observation, and notes whether it was correctly or incorrectly classified. It then puts the first observation back into the data set, ignores the second, and repeats the process. Repeating the procedure n times, jackknifing creates n different models and tallies up overall modeling performance based on the behavior of each of those models on a single omitted observation. Omitting only a single observation minimizes the loss of information to which the modeling process is exposed, but it can require a lot of computer time and does not produce a single model as its result. How do you combine n potentially very different models, and how do you interpret the evaluation when each holdout sample was tested on a different model? The TRI software package See5 includes k-fold cross-validation, an approach less extreme than either a fixed holdout sample or jackknifing. K-fold cross-validation divides the data set into k equal-sized partitions, ignores them one at a time, estimates a model, and computes its error rate on the ignored partition.
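As a rough illustration of the k-fold procedure, the following sketch uses scikit-learn's DecisionTreeClassifier as a stand-in for See5 (which is commercial software) and random stand-in data in place of the real attribute table:

```python
# A minimal sketch of k-fold cross-validation; the classifier and data
# are stand-ins, not the See5 models or surgical records from the paper.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(659, 8))   # 659 cases, 8 coded attributes
y = rng.integers(0, 32, size=659)       # single-class labels over 32 CPTs

k = 10
error_rates = []
for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    # Fit on k-1 partitions, then score on the one ignored partition.
    tree = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    error_rates.append(1.0 - tree.score(X[test_idx], y[test_idx]))

# The cross-validated error estimate is the mean over the k held-out folds.
print(f"estimated true error rate: {np.mean(error_rates):.3f}")
```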
Neural Networks

Artificial neural networks simulate human cognition by modeling the inherent parallelism of neural circuits found in the brain using mathematical models of how the circuits function. The models typically are composed of a layer of input nodes (independent variables), one or more layers of intermediate (or hidden) nodes, and a layer of output nodes (dependent variables). Nodes in a layer are each connected by one-way arcs to nodes in subsequent layers, and signals are sent over those arcs. "Behavior" propagates from values set in the input nodes, sent over arcs through the hidden layer(s), and results in the establishment of values in the output layer. The value of a node is a nonlinear, usually logistic, function of the weighted sum of the values sent to it by nodes that are connected to it. A node forwards a signal to a subsequent node only if it exceeds a threshold value. An ANN model is specified by defining the number of layers it has, the number of nodes in each layer, the way in which the nodes are connected, and the nonlinear function used to compute node values. Estimation of the specified model involves determining the best set of weights for the arcs and threshold values for the nodes. An ANN is trained, that is, its parameters are estimated, using nonlinear optimization. In the backpropagation algorithm, the first-order gradient descent method used in the BrainMaker software we used, the network propagates inputs through the network, derives a set of output values, compares the computed output to the provided (corresponding) output, and calculates the difference between the two numbers (i.e., the error). If a difference exists, the algorithm proceeds backward through the hidden layer(s) to the input layer, adjusting the weights between connections based on their gradients to reduce the sum of squared output errors. The algorithm stops when the total error is acceptably small. Neural networks are frequently used in data mining because, in adjusting the number of layers, nodes, and connections, the user can make an ANN model almost any smooth
mathematical function. While inputs to an ANN might be integer or discrete, the weighted nonlinear transformations of the inputs as part of their being fed forward through the network result in continuous output level values. Continuous output levels result in a more tractable error measure for the backpropagation algorithm to optimize and also permit the interpretation of outputs as partial group membership. Partial group membership means an ANN is capable of representing inexact matching, if that is the way to find a best fit for some set of input data. It also can model classification tasks that are inherently "fuzzy," that is, tasks that generally are simple for humans but traditionally difficult for computers. Because of their flexibility, ANNs may be difficult to specify. Adding too much structure to an ANN makes it prone to overfitting, but too little structure may prevent it from capturing the patterns in the data set. Those patterns are represented in the arc (connection) weights and the node thresholds, a form that is not transparent to humans. Computationally, if the training set is large, backpropagation and related algorithms may require a lot of time. ANNs may be good classifiers and predictors as compared with linear methods, but the mathematical representations of the various nodes, and the relative importance of the independent variables, tend to be somewhat less accessible to the end user than induced decision trees, rules, and even classification functions. Neural networks often are treated as a black box, with only the inputs and outputs visible to the decision maker. The classification chosen by the ANN is easily visible to the user, but the decision process that led to that classification is not.
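The following sketch shows this style of network in code, under stated assumptions: scikit-learn's MLPClassifier stands in for BrainMaker, the data are random placeholders, and the 0.5 membership threshold anticipates the reading of output activations used later in the paper:

```python
# A minimal multi-label ANN sketch: one hidden layer, logistic activations,
# and 32 sigmoid output nodes read as degrees of class membership.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.random((659, 25))                      # 25 input attributes
Y = (rng.random((659, 32)) > 0.9).astype(int)  # 32 binary CPT indicators

# A 2-D indicator target makes scikit-learn treat this as a multi-label
# problem with logistic output units, mirroring the binary representation.
net = MLPClassifier(hidden_layer_sizes=(57,), activation="logistic",
                    max_iter=500).fit(X, Y)

membership = net.predict_proba(X[:1])  # activations in [0, 1] per class
predicted = membership > 0.5           # member of every class above 0.5
print(predicted.sum(), "classes predicted for the first case")
```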
Linear Discriminant Analysis

Linear discriminant analysis (LDA) is the most common classification method in use, and also one of the oldest [13], having been developed by Fisher in the 1930s. Because of its popularity and long history, we provide only a brief overview of the method here. Like TRI and ANN, LDA partitions a data set into two or more classes, within which new observations or cases can then be assigned. Because it uses linear functions of the independent variables to define those partitions, LDA is similar to multiple regression. The primary distinction between LDA and multiple regression lies in the form of the dependent variable. Multiple regression uses a continuous dependent variable. The dependent variable in LDA is ordinal or nominal. For a data set of cases, each with m attributes and n categories, LDA constructs classification functions of the form

c1a1 + c2a2 + ... + cmam + c0,
where ci is the coefficient for the case attribute ai and c0 is a constant, for each of the n categories. An observation is assigned to the class for which it has the highest classification function value.
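A toy numerical illustration of that assignment rule (all coefficients and attribute values invented) is:

```python
# Each class has coefficients c_1..c_m and a constant c_0; a case is
# assigned to the class with the highest classification function value.
import numpy as np

coef = np.array([[0.8, -0.2, 1.1],    # class 1: c_1, c_2, c_3
                 [0.1,  0.9, 0.4],    # class 2
                 [0.5,  0.5, 0.5]])   # class 3
const = np.array([-1.0, 0.3, -0.2])   # c_0 for each class

a = np.array([1.2, 0.7, 0.9])         # attribute values of one case
scores = coef @ a + const             # c_1*a_1 + ... + c_m*a_m + c_0
print("assigned class:", int(np.argmax(scores)) + 1)
```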
Table 2. Comparative Data-Mining Method Studies Across Domains

Domain                    Neural networks     Tree/rule induction   Regression
Asset writedowns
Bankruptcy                [2, 6, 8, 14, 19]   [5, 19, 21, 26]
Bank failure
Inventory accounting
Lending/credit risk
Corporate acquisitions
Corporate earnings
Management fraud
Mortgage choice
Studies of Supervised Learning Approaches

PREVIOUS RESEARCH HAS INVESTIGATED AND COMPARED SUPERVISED, inductive learning techniques in a number of domains (see Table 2), with mixed results. Some comparative studies suggest the superiority of neural networks over other techniques. For example, in bankruptcy prediction, Tam and Kiang [27] found that neural networks performed better than discriminant analysis, logit analysis, k-nearest neighbor, and tree induction (ID3). Fanning and Cogger [8] were somewhat more tentative in comparing neural network models with both logistic regression and existing bankruptcy models. Although they found no particular technique to be superior across all comparisons, they argued that neural nets were "competitive with, and often superior to" the logit and bankruptcy model results. In a subsequent study of fraudulent financial statements, Fanning and Cogger [9] reported that a neural network was better able than the traditional statistical methods to identify management fraud. By contrast, previous comparative studies had shown decision tree/rule induction to be superior to other methods. Messier and Hansen [21], for example, compared decision tree/rule induction with discriminant analysis, as well as with individual and group judgments. On the basis of the attributes selected and the percentage of correct predictions, they concluded that the induction technique outperformed the other approaches in the prediction of bankruptcies. Weiss and Kapouleas [30] compared statistical pattern recognition (linear and quadratic discriminant analysis, nearest neighbor, and Bayesian classification), neural networks, and machine learning methods (rule/decision tree induction methods: ID3/C4.5 and Predictive Value Maximization). They concluded that the rule induction methods were superior to the other methods with respect to accuracy of classification, training time, and compatibility with human reasoning. Other multiple method studies have been less conclusive and suggest that performance is dependent on other factors such as the type of task and the nature of the data
set. Chung and Tam [6], for example, compared three inductive-learning models across five managerial tasks (in construction project assessment and bankruptcy prediction). They concluded that model performance generally was task-dependent, although neural networks tended to produce relatively consistent predictions across task domains. In assessing LIFO/FIFO classification methods, Liang et al. [20] reported that neural networks tended to perform best overall in holdout tests, and when the data contained dominant nominal variables. However, when nominal variables were not dominant, probit provided better performance. Sen and Gibbs [24] studied corporate takeover models, comparing six neural network models and logistic regression. They found little difference in predictive performance among them, indicating that they all performed poorly. Boritz et al. [2] tested the performance of neural networks with several regression techniques, as well as with well-known bankruptcy models. No approach was clearly superior, and the ability of an induced model to distinguish between bankrupt and nonbankrupt firms was dependent on the number of bankrupt firms in the training set.
Bases for Judging Performance

JUDGING THE PERFORMANCE OF ONE DATA-MINING METHOD over another requires consideration of several modeling objectives.
Predictive Accuracy

Most of the comparative studies we cited above measured the predictive accuracy and error rate of each method. Messier and Hansen [21], for example, compared the percentage of correct classifications produced by their induced rule system to the percentage drawn from discriminant analysis, as well as individual and group judgments. As suggested by the review above, it is difficult to make general claims about the relative predictive accuracy of the various methods. Performance is highly dependent on the domain and setting, the size and nature of the data set, the presence of noise and outliers in the data, and the validation technique(s) used. Predictive accuracy tends to be an important and prevalent indication of a method's performance, but others also are important.
Comprehensibility

Henery [13] uses this term to indicate the need for a classification method to provide clearly understood and justifiable decision support to a human manager or operator. TRI systems, because they explicitly structure the reasoning underlying the classification process, tend to have an inherent advantage over both traditional statistical classification models and ANN. Tessmer et al. [28] argue that, while the traditional statistical methods provide efficient predictive accuracy, "they do not provide an explicit description of the classification process." Weiss and Kulikowski [31] suggest that any explanation resident in mathematical inferencing techniques is buried in
computations that are inaccessible to the "mathematically uninclined." The results of such techniques might be misunderstood and misused. Rules and decision trees, on the other hand, are more compatible with human reasoning and explanations.
Speed of Training and Classification

Speed can be an important consideration in some situations [31]. Henery [13] suggests that a number of real-time applications, for example, must sacrifice some accuracy in order to classify and process items in a timely fashion. Again, because of situational dependencies, it is difficult to make generalizations about the computational expense of each method. ANNs estimated using backpropagation may require an unacceptably large amount of time [31].
Modeling and Simulation of Human Decision Behavior

Using case descriptions and human judgments as input, data-mining methods also can be used for the automated modeling and acquisition of expert knowledge. Kim et al. [16] determined that the performance of a particular method in constructing an inductive model of a human decision strategy is dependent, in part, on conformance of the model with the strategy. Linear models tend to simulate linear (or compensatory) decision strategies more accurately, while nonlinear models are more appropriate for nonlinear (or noncompensatory) strategies. Kim et al. found ANN to be superior to decision tree induction (ID3), logistic regression, and discriminant analysis, even in simulations of linear decision processes. They note that the flexibility of neural networks in forming both linear and nonlinear decision models contributes to their superior performance relative to the other methods.
Selection of Attributes

The attributes selected for consideration and their relative influence on the outcome are an indication of the performance of a method. The concept of diagnostic validity of induction methods was proposed by Currim et al. [7], and was used by Messier and Hansen [21] to compare the attributes selected by each of their induction methods.
Data Set and Tools

OUR DATA SET IS A CENSUS OF 59,864 CASES OF SURGERY PERFORMED between 1989 and 1995 at a large university teaching hospital (see [1] for a description of the computerized collection system). Each case is represented as a single record containing twenty-three attributes describing patient demographic and case-specific information. Of the twenty-three factors available, the following were chosen for analysis: diagnoses (one, two, or three ICD-9 codes), procedures (one, two, or three CPT codes), type of anesthesia (general, monitor, regional), patient demographic information (age, sex), the patient's overall condition (based on the six-value ASA ordinal coding
scheme), emergency/nonemergency status, inpatient/outpatient status, and the surgeon who performed the procedure (identified by number). The remaining fields are the time durations of surgical events. We chose 819 records dealing with ICD-9 code 180.9, "malignant neoplasm of the cervix, unspecified," because there is a fairly large fanout from it to the associated CPTs across the records, presenting a challenge to any classification method. ICD-9 180.9 is associated with 139 different CPTs, although 107 of the CPTs appear in four or fewer records. Because the presence of outliers impedes the detection of general patterns, we followed the standard data-mining approach of removing them. Of the 819 records containing ICD-9 180.9, 160 records contained one of the 107 CPTs. Those records were judged to be outliers and were removed, leaving 659 records linked to a total of 32 CPTs remaining in the data set. Table 3 provides a detailed description of each of the 32 remaining CPTs. We used commercial software instead of programming the methods ourselves, to eliminate possible bias caused by our own computer skills. We used Statgraphics version 3.1 for LDA, BrainMaker version 3.1 for ANN, and See5 version 1.05 for TRI.
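In code, this outlier-removal step amounts to a frequency filter; the sketch below uses pandas, with a hypothetical file name and cpt1/cpt2/cpt3 column names that are not from the actual data set:

```python
# A sketch of the rare-CPT outlier filter described above.
import pandas as pd

df = pd.read_csv("icd9_1809_cases.csv")       # one row per surgical case
cpt_cols = ["cpt1", "cpt2", "cpt3"]

counts = df[cpt_cols].stack().value_counts()  # frequency of every CPT code
rare = set(counts[counts <= 4].index)         # the 107 CPTs in <= 4 records

# Drop any record containing at least one rare CPT; 659 records remain.
keep = df[cpt_cols].apply(lambda row: not (set(row.dropna()) & rare), axis=1)
df = df[keep].reset_index(drop=True)
```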
Methodology

WE IDENTIFIED TEN DISTINCT WAYS OF REPRESENTING THE MULTIPLE CLASSIFICATION problem. As shown in Table 4, not all methods are capable of estimating parameters for each of the representations. Our strategy was to evaluate each method from a decision support perspective. That is, how does a method fundamentally constrain the types of representation that can (and should) be employed by a person using the method?
Discriminant Analysis

Six LDA models were constructed: three basic representations with two variations on each. The three basic representations (multiple, replicated, and binary) differ in their treatment of the dependent variables; recall that each case can be a member of one, two, or three classes. For each basic representation, two variations on the treatment of prior probabilities were included: (1) prior probabilities for each group are assumed to be equal, and (2) prior probabilities for each group are assumed to be proportional to the number of observations in each group. The basic representations and variations are described below.

Dependent Variables Represented as Multiple Values (LDAMult)

The dependent variable is a string with all CPTs, space delimited. For example, the dependent variable in a record containing only CPT 58210 is represented as "58210." A record containing CPTs 58210, 77760, and 77770 has "58210 77760 77770" as its dependent variable. Because one dependent variable is used for all CPT codes present, a single linear discriminant analysis could be performed for each of the two variations:
Table 3. Top 32 CPTs and Their Descriptions

CPT     Frequency   Description
36489   5           Placement of central venous catheter subclavian, jugular, or other vein, e.g., for central venous pressure, hyperalimentation, hemodialysis, or chemotherapy; percutaneous, over age 2
38500   8           Biopsy or excision of lymph nodes, superficial separate procedure
38562   17          Limited lymphadenectomy for staging separate procedure, pelvic and para-aortic
38564   8           Limited lymphadenectomy for staging separate procedure, retroperitoneal aortic and/or splenic
38780   8           Retroperitoneal transabdominal lymphadenectomy, extensive, including pelvic, aortic, and renal nodes separate procedure
44120   6           Enterectomy, resection of small intestine, single resection and anastomosis
45300   16          Proctosigmoidoscopy, rigid, diagnostic, with or without collection of specimens by brushing or washing separate procedure
47600   7           Cholecystectomy
49000   34          Exploratory laparotomy, exploratory celiotomy with or without biopsy(s) separate procedure
49010   7           Exploration, retroperitoneal area with or without biopsy(s) separate procedure
52000   34          Cystourethroscopy separate procedure
52204   9           Cystourethroscopy, with biopsy
52332   8           Cystourethroscopy, with insertion of indwelling ureteral stent, e.g., Gibbons or double-J type
56311   5           Laparoscopy, surgical, with retroperitoneal lymph node sampling biopsy, single or multiple
57100   13          Biopsy of vaginal mucosa, simple separate procedure
57410   27          Pelvic examination under anesthesia
57500   34          Biopsy, single or multiple, or local excision of lesion, with or without fulguration separate procedure
57513   5           Cauterization of cervix, laser ablation
57520   32          Conization of cervix, with or without fulguration, with or without dilation and curettage, with or without repair, cold knife or laser
58150   41          Total abdominal hysterectomy corpus and cervix, with or without removal of tubes, with or without removal of ovaries
58200   47          Total abdominal hysterectomy, including partial vaginectomy, with para-aortic and pelvic lymph node sampling, with or without removal of tubes, with or without removal of ovaries
equal prior probabilities (LDAMultE) and prior probabilities proportional to the sample (LDAMultP). The advantage of LDAMult is that each observation is represented exactly once. The disadvantage is its inability to represent class intersections. An observation that is a member of both categories a and b (i.e., dependent variable = "a b") is considered to be completely separate from observations that are members of only either a or b.
Table 3. Continued

CPT     Frequency   Description
58210   168         Radical abdominal hysterectomy, with bilateral total pelvic lymphadenectomy and para-aortic lymph node sampling biopsy, with or without removal of tubes, with or without removal of ovaries
58240   10          Pelvic exenteration for gynecologic malignancy, with total abdominal hysterectomy or cervicectomy, with or without removal of tubes, with or without removal of ovaries, with removal of bladder and ureteral transplantations, and/or abdominoperineal resection of rectum and colon and colostomy, or any combination thereof
58260   6           Vaginal hysterectomy
58720   6           Salpingo-oophorectomy, complete or partial, unilateral or bilateral separate procedure
58960   6           Laparotomy, for staging or restaging of ovarian malignancy second look, with or without omentectomy, peritoneal washing, biopsy of abdominal and pelvic peritoneum, diaphragmatic assessment with pelvic and limited para-aortic lymphadenectomy
58999   12          Unlisted procedure, female genital system nonobstetrical
77761   46          Intracavitary radioelement application, simple
77762   106         Intracavitary radioelement application, intermediate
77763   12          Intracavitary radioelement application, complex
77777   6           Interstitial radioelement application, intermediate
77778   11          Interstitial radioelement application, complex
Dependent Variables Represented as Single Values, Multiple Values Replicated (LDARep)

A single record containing multiple values for the dependent variable is decomposed into multiple records, one for each value of the dependent variable. A record containing CPTs 58210, 77760, and 77770 is represented three times in the data set, once with "58210" as the dependent variable, once with "77760," and the third with "77770"; all three records have the same independent variable values (see Figure 3). In this representation, because only the relative sizes of the classification function values are meaningful, a single-step estimation process provides a rank order for membership in each of the categories but does not provide any insight regarding the number of categories into which the observation is to be classified (unlike neural nets or logistic regression, for which 0.5 is a commonly accepted threshold). Therefore, a two-step process is required: First, use an LDA model to estimate the number of categories to which the observation belongs, and then use a separate LDA to determine what those categories are. LDARepE is that two-stage process with equal prior probabilities for both parts of the process, and LDARepP is the corresponding technique using proportional probabilities. The advantage of LDARep is that it recognizes an observation that is a member of
Table 4. Model Representations Across Methods

Representation   Neural networks                               Tree/rule induction   Discriminant analysis
Multiple                                                       Yes                   Equal; Proportional
Replicated                                                                           Equal; Proportional
Binary           No hidden layer; one hidden layer, 57 nodes;                        Equal; Proportional
                 one hidden layer, 114 nodes

Figure 3. Dependent Variables Represented as Single Values, Multiple Values Replicated (LDARep)

Figure 4. Dependent Variables Represented as Single Binary Values (LDABin)
more than one class as being in each of those classes individually. The disadvantage is that the representation does not differentiate a single observation that is simultaneously in multiple classes, in which the replication of its independent variable values is a representational necessity, from multiple observations with identical independent variables that are, however, members of different classes. That is, a set intersection and a contradiction have the same representation.

Dependent Variables Represented as Single Binary Values (LDABin)

The dependent variable is represented as a series of binary values, one for each possible value of the dependent variable (see Figure 4). An observation is considered a member of a category if its classification function value for assigning membership is larger than its classification function value for not assigning membership. This representation requires a separate LDA model for each class, thirty-two in the case of our data set. LDABinE is this approach with equal probabilities, and LDABinP has proportional probabilities.
The advantages of this approach are that subset relationships are preserved and that each observation occurs only once in the data set, so that intersections are represented differently from contradictions. Its disadvantage is that an individual observation might be a member of no classes or too many classes (for our data, more than three).
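To summarize the three representations side by side, the following sketch (toy data structures, not the actual modeling code) builds each form of the dependent variable for one record:

```python
# Contrasting the multiple, replicated, and binary representations for a
# record with CPTs 58210, 77760, and 77770; class universe is illustrative.
features = {"icd9": "180.9", "age": 54}             # independent variables
cpts = ["58210", "77760", "77770"]
all_classes = ["58210", "77760", "77770", "58150"]  # toy class universe

# LDAMult / See5: one combined label naming the whole CPT set.
mult = (features, " ".join(cpts))                   # -> "58210 77760 77770"

# LDARep: one replicated record per CPT, identical independent variables.
rep = [(features, c) for c in cpts]                 # three records

# LDABin / ANN: a 0/1 indicator for every class in the universe.
binary = (features, [1 if c in cpts else 0 for c in all_classes])

print(mult[1], len(rep), binary[1])  # 58210 77760 77770  3  [1, 1, 1, 0]
```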
Neural Networks

The ANN representation of the dependent variable is the same as in LDABin, because of the way values are propagated through the network to the output nodes. The variations in the model are functions of the structure of the hidden layer(s). One hidden layer is all that needs to be considered, if structure between the input and output layers is desired, but the number of nodes in it is a matter of choice [31]. With a hidden layer and a sigmoid function for combining activations, the ANN performs logistic regression. Deferring to commercial software, we allowed BrainMaker to suggest the size of a hidden layer. With 25 input nodes and 32 output nodes, it recommended 57, their sum. We also considered a network with twice that many nodes in the hidden layer, a structure that might tend to overfit the data. The modeling alternatives are: (1) a neural network with no hidden layer (NN0), (2) a network with 57 nodes in the hidden layer (NN57), and (3) a network with 114 nodes in the hidden layer (NN114). Activation at an output node, interpreted as degree of membership, ranges from zero to one. An observation is considered a member of any group for which it generates an activation value above 0.5, and is considered not a member of any group for which it generates an activation level below 0.5. All three ANN models have the same advantages and disadvantages as LDABin. The ANN model representation is preferable to that of LDABin because LDABin requires, in our case, thirty-two separate binary models, whereas ANN simultaneously models all thirty-two binary alternatives in a single model.
Decision Tree/Rule Induction

We used a single TRI representation, analogous to LDAMult. Multiple dependent variables in each record were represented collectively within a single string ("a b c"). A representation similar to that of LDARep is not possible because See5 trees do not rank-order the classification alternatives. See5 does allow for differential misclassification costs, but those costs are not capable of representing equal and proportional prior probabilities in a way equivalent to that in LDA. The advantages and disadvantages of LDAMult apply to our See5 representation. The model has the advantage of constraining the possible number of categories to between 1 and 3, inclusive, and the disadvantage of not recognizing the intersections of classes.
Results and Analysis

COMPARISON OF THE METHODS REQUIRES THE CONSIDERATION of a number of issues related to the measurement of performance in multiple classification problems. For
CHOOSNG DATA-MINIh’G METHODS
53
example, consider a single test case having two values for the dependent variable ("1 2"). If the method predicts the value "1 2," there is little doubt that the method has performed without error. If the method predicts the value "3," it is incorrect for three reasons. First, it failed to recognize the case as having multiple classes. Second, it failed to include either "1" or "2" in its predicted value for the dependent variable. Third, it included an incorrect value (i.e., "3") in its prediction. If the method predicts "1 3," it has identified the correct number of classes (2), while also correctly identifying one of the values but incorrectly identifying the other. If the method predicts "1 2 3," it has identified both of the correct classes, but it also has predicted the wrong number of classes, and in doing so has included a class that is incorrect. If it predicts "1," it has predicted the wrong number of classes. However, the class it has predicted is correct, and it refrained from predicting any incorrect classes.

The above list of error alternatives argues for what we call an audit matrix within which multiple classification results can be judged. We use the term "audit" to denote a situation assessment, performed by an auditor or decision maker, which attempts to reconcile the observed characteristics of a situation with a priori expectations of that situation (i.e., "actual" versus "predicted" characteristics; see Figure 5). That is, an auditor, when initially encountering a situation, will expect to observe certain characteristics while also expecting not to observe others. Subsequently, if expected characteristics are observed and unexpected characteristics are not observed, the situation matches expectations and the classification is judged correct. However, predictions can vary from observations in two important ways: (1) the auditor might expect, or predict, a situation characteristic that is not actually present (i.e., a false positive), or (2) the auditor might observe a characteristic that had not been predicted (i.e., a false negative). An audit matrix has four cells, two of which indicate correct behavior of a method and two of which indicate incorrect behavior. The two correct cells are the number of classes predicted to be present and actually observed and the number predicted to be absent and actually absent. The incorrect cells are the number of classes predicted to be present but actually absent, and the number of classes predicted to be absent that actually were present. For example, Figure 6 shows an audit matrix for an observation that was predicted to be a member of class 3, but that actually is a member of classes 1 and 2. Reduction of an audit matrix to a single number could be done with a weighted linear function of the cell values, such as

classification score = WmM - (WfpFP + WfnFN),
where Wm, Wfp, and Wfn are weights assigned to the number of matches, false positives, and false negatives, respectively, and M is the number of matches, FP is the number of false positives, and FN is the number of false negatives. The weights would be application-specific and would relate to the relative costs of the two types of misclassification. Table 5 illustrates our approach using the results from the ten models for a single observation that was a member of CPT classes 38500, 38562, and 58720 (shown as value = 1 in the column labeled "actual"). The columns labeled 1 through 10 under the header "representation" contain the output of each model given the values of the independent variables of the observation (1 = CPT is predicted; 0 = CPT is not predicted).
                      Predicted
                      Present           Not present
Actual
Present               Match             False negative
Not present           False positive    Match

Figure 5. An Audit Matrix for Structuring and Evaluating Multiple Classification Results
                      Predicted
                      Present   Not present
Actual
Present               0         2
Not present           1         29

Figure 6. An Audit Matrix Example for Actual = "1 2" and Predicted = "3"
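A small sketch of the audit matrix tally and the weighted score just defined, applied to the Figure 6 example over the 32 CPT classes (the weights here are illustrative defaults, not values used in the paper):

```python
# Audit matrix cells and weighted classification score for one observation.
def audit_matrix(actual, predicted, n_classes=32):
    actual, predicted = set(actual), set(predicted)
    m_pp = len(actual & predicted)       # match: predicted present, present
    fn = len(actual - predicted)         # false negatives
    fp = len(predicted - actual)         # false positives
    m_aa = n_classes - m_pp - fn - fp    # match: predicted absent, absent
    return m_pp, fn, fp, m_aa

def classification_score(actual, predicted, w_m=1.0, w_fp=1.0, w_fn=1.0):
    # Score = Wm*M - (Wfp*FP + Wfn*FN), with M the present/present matches.
    m_pp, fn, fp, _ = audit_matrix(actual, predicted)
    return w_m * m_pp - (w_fp * fp + w_fn * fn)

print(audit_matrix({1, 2}, {3}))          # (0, 2, 1, 29), as in Figure 6
print(classification_score({1, 2}, {3}))  # -3.0 with unit weights
```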
For example, LDABinE predicts that seven CPTs would be associated with this particular observation, but is incorrect on all seven (7 false positives, 3 false negatives, and 0 matches). NN57 predicts 5 CPTs and is correct on two (3 false positives; 1 false negative; 2 matches). Table 6 summarizes the results for all 659 cases in the data set. In the absence of a value function for relative misclassification costs, the performance of each of the representations can be measured by (1) the number of correct predictions relative to the number of misclassifications and (2) the relative number of false positives and false negatives. Table 7 compares the correct predictions (represented as a percentage: "Observed & Predicted" / Σ [misclassifications] * 100) across each of the representations, for each type of case (i.e., those cases containing 1, 2, and 3 CPTs) and for all cases. As shown, 5 of the 10 representations are able to classify accurately over 50 percent of the single CPT cases. Although NN57 is the most accurate model overall (92.41 percent), all of the neural network models as a group outperform the other representations. However, the performance of all representations deteriorates dramatically when classifying the multiple dependent variable cases. In the 2-CPT cases, LDARepP has the highest accuracy, but is still exceptionally poor (6.25 percent). In the 3-CPT cases, LDABinP is highest with 7.62 percent. Notably, the neural network models were among the poorest performers. The relative performance of the representations is also reflected in the number and nature of the classification errors. Those include the misclassification rate (the number of misclassifications divided by the number of observations in each group), and the proportion of false
Table 5. Classification and Misclassification Results for an Observation Containing Three Dependent Variables*

[The table lists, for an observation that is a member of CPT classes 38500, 38562, and 58720 (value = 1 in the column labeled "actual"), a 0/1 prediction for each of the 32 CPTs under each of the ten representations, together with the total number of CPTs predicted and the total predicted correctly by each representation.]

* Each column corresponds to one of the ten representations: 1 = LDABinE; 2 = LDABinP; 3 = LDAMultE; 4 = LDAMultP; 5 = LDARepE; 6 = LDARepP; 7 = See5; 8 = NN0; 9 = NN57; 10 = NN114.
negatives and false positives, which are important in considering the relative costs of misclassifying observations. Figures 7 through 10 graphically show the relative number of false negatives and positives for all cases, and for cases with 3 CPTs, 2 CPTs, and 1 CPT, respectively. Figure 11 shows results for all multiple CPT cases, combined.
Table 6. Audit Matrix of Results from Each Method and Representation

              Matches                        Misclassifications
              Observed &    Not observed &   Observed &       Not observed &
              predicted     not predicted    not predicted    predicted

All CPTs
LDARepE       102           15803            248              4935
LDARepP       286           19971            438              393
LDABinE       13            18532            280              2263
LDABinP       189           20122            486              291
LDAMultE      3             18377            297              2411
LDAMultP      0             18662            522              1904
NN0           278           20128            407              275
NN57          355           20116            319              298
NN114         339           20135            318              296
See5          221           20248            533              86

3 CPTs
LDARepE       0             20653            54               381
LDARepP       1             20950            81               56
LDABinE       2             20875            45               166
LDABinP       8             20975            70               35
LDAMultE      0             20878            32               178
LDAMultP      0             20873            77               138
NN0           0             21026            52               10
NN57          0             21019            44               25
NN114         0             21030            42               16
See5          0             20985            96               7

2 CPTs
LDARepE       1             20244            61               782
LDARepP       12            20884            111              81
LDABinE       1             20657            58               372
LDABinP       6             20913            108              61
LDAMultE      1             20678            49               360
LDAMultP      0             20718            94               276
NN0           1             20927            113              47
NN57          2             20920            102              64
NN114         2             20929            97               60
See5          8             20940            128              12

1 CPT
LDARepE       101           17082            133              3772
LDARepP       273           20313            246              256
LDABinE       10            19176            177              1725
LDABinP       175           20410            308              195
LDAMultE      2             18997            216              1873
LDAMultP      0             19247            351              1490
NN0           276           20352            242              218
NN57          353           20353            173              209
NN114         337           20352            179              220
See5          213           20499            309              67
Table 7. Correct Classifications as a Percentage of Total Errors

              All CPTs   3 CPTs   2 CPTs   1 CPT
LDARepE       1.97       0.00     0.12     2.59
LDARepP       34.42      0.73     6.25     54.38
LDABinE       0.51       0.95     0.23     0.53
LDABinP       24.32      7.62     3.55     34.79
LDAMultE      0.11       0.00     0.24     0.10
LDAMultP      0.00       0.00     0.00     0.00
NN0           40.76      0.00     0.63     60.00
NN57          57.54      0.00     1.20     92.41
NN114         55.21      0.00     1.27     84.46
See5          35.70      0.00     5.71     56.65
Figure 7. Misclassification Results for All Observations
Overall, NN57, NN114, and See5 have the fewest total misclassifications, but the neural net models more evenly balance false positives and false negatives than does the See5 classification, which appears to give more weight to false positives than it does to false negatives. The same holds true for observations in multiple categories, although the See5 classifier does marginally better than the neural network models for observations in exactly two categories, and somewhat worse for observations in exactly three categories. The See5 classifier is consistently more prone to generating false negatives than false positives. All three neural net models show a similar tendency, which is more pronounced for the three-category observations than for the two-category observations. For the LDA approaches, the classifiers that used prior probabilities proportional to sample frequencies consistently give a smaller number of total errors and more closely balance false positives against false negatives than did those with equal prior probabilities. The classifiers derived with equal prior probabilities do give a smaller number
Figure 8. Misclassification Results for Observations with Three Dependent Variables
Figure 9. Misclassification Results for Observations with Two Dependent Variables
of false negatives than do those with probabilities proportional to sample frequencies. Finally, as expected, the discriminant analysis-based methods and See5 all show an increase in the rate of false positives, false negatives, and total misclassifications (with the exception of LDABinE, for false positives) as the number of categories increases from one, to two, and to three (see Table 8). The behavior of the neural nets was not expected, however. Although all of the neural net models likewise increase their misclassification rates when the number of categories increases from one to two, the rates actually decrease when the number of categories increases from two to three. In one case, the false positives for NN0, the rate for three categories is actually lower than the rate for a single category. It is surprising to see neural nets do better, in that sense, as the problem becomes more difficult.
Figure 10. Misclassification Results for Observations with One Dependent Variable
Figure 11. Misclassification Results for Observations with Multiple (2 or 3) Dependent Variables
Conclusions

OUR NUMERICAL RESULTS INDICATE THAT CURRENT DATA-MINING TECHNIQUES do not adequately model data sets in which observations may simultaneously be members of multiple classifications. Such data sets commonly appear in medical and other applications, and extraction of knowledge from data sets such as ours is important for the design of intelligent decision support systems for auditing, planning, and control tasks. Three primary conclusions can be drawn from our experimentation with alternative representations and solution approaches.

First, the performance of the various models is indicated by the number and characteristics of the classification errors they produce. The data-mining methods showed clear differences in the rate and type of classification errors in both single and
Table 8. Misclassification Rates for 3, 2, and 1 Categories Across Representations

            Observations in 3 categories    Observations in 2 categories    Observations in 1 category
Method      False neg.  False pos.  Total   False neg.  False pos.  Total   False neg.  False pos.  Total
LDARepE        1.227       8.659    9.886      0.656       8.409    9.065      0.255       7.226    7.481
LDARepP        1.841       1.273    3.114      1.194       0.871    2.065      0.471       0.490    0.962
LDABinE        1.023       3.773    4.795      0.624       4.000    4.624      0.339       3.305    3.644
LDABinP        1.591       0.795    2.386      1.161       0.656    1.817      0.590       0.374    0.964
LDAMultE       0.727       4.045    4.773      0.527       3.871    4.398      0.414       3.588    4.002
LDAMultP       1.750       3.136    4.886      1.011       2.968    3.978      0.672       2.854    3.527
NN0            1.182       0.227    1.409      1.215       0.505    1.720      0.464       0.418    0.881
NN57           1.000       0.568    1.568      1.097       0.688    1.785      0.331       0.400    0.732
NN114          0.955       0.364    1.318      1.043       0.645    1.688      0.343       0.421    0.764
See5           2.182       0.159    2.341      1.376       0.129    1.505      0.592       0.128    0.720
The neural net and See5 models, for example, all tend to show a higher proportion of false negatives in comparison to the LDA models, but an overall lower misclassification rate. This suggests that the magnitude of Type I and Type II costs incurred by an organization, and the relative differences between those costs, can affect a decision maker's choice of method and representation.

Second, the performance of a data-mining method in multiple classification problems is measured by the extent to which it allows a representation that is appropriate to the problem at hand. Neural networks, despite their poor performance in identifying multiple classes, arguably allow the most natural modeling, a simultaneous binary representation of dependent variables (see the illustrative sketch below). Formulations that can be accommodated by the other methods, particularly LDA, are somewhat clumsy approximations of what neural networks can model naturally. The choice of representation can have a significant bearing on the time and cost involved in the analysis of multiple category data, and a decision maker needs to consider the tradeoff of a better mathematical representation and better output performance from an ANN against the difficulty in interpreting the model's reasoning process.

Third, the choice of representation also depends on the inherent capabilities of the available methods, and on the compatibility of the theory underlying each method with the data and objectives of the decision maker. For example, classical discriminant analysis assumes that all the independent variables are continuous; therefore, qualitative independent variables, which are present in this study, might cause problems. Krzanowski [17] notes that discriminant analysis can perform poorly or satisfactorily in the presence of qualitative variables depending on the correlation between the continuous and the qualitative variables. Also, for discriminant functions, the rules based on minimizing the expected cost of misclassification are a function of prior probabilities, misclassification costs, and density functions of the data; incorporating prior probabilities within See5, thereby constructing a new representation similar to the LDA approaches, may therefore not be replicable by See5's use of misclassification costs alone [15].
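For reference, the minimum expected cost of misclassification rule discussed in Johnson and Wichern [15] can be written as follows; this is a standard textbook statement, not a reproduction of our experimental setup. With g categories, prior probabilities p_i, density functions f_i(x), and c(j|i) the cost of assigning to category j an observation that belongs to category i:

\[
\text{allocate } x \text{ to } \pi_k, \quad
k \;=\; \operatorname*{arg\,min}_{j \in \{1,\dots,g\}} \; \sum_{i \neq j} p_i \, f_i(x) \, c(j \mid i).
\]

Because the priors p_i and the costs c(j|i) enter the rule at different points, adjusting misclassification costs alone, which is what See5 permits, need not reproduce the effect of adjusting prior probabilities.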
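Returning to the second conclusion, the simultaneous binary representation can be made concrete with a short sketch. This is not the authors' code: scikit-learn's MLPClassifier serves as a modern stand-in for the networks used in the study, the data are synthetic, and reading NN57 as a 57-unit hidden layer is only our assumption.

    # A minimal sketch of the simultaneous binary representation: one network,
    # one 0/1 output per category, so a single case may be predicted into
    # several categories at once. Synthetic data; illustrative only.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    X = rng.random((200, 10))                      # case descriptors (synthetic)
    Y = (rng.random((200, 3)) > 0.7).astype(int)   # 3 categories, multiple membership

    # hidden_layer_sizes=(57,) is only our guess at what the NN57 label denotes
    net = MLPClassifier(hidden_layer_sizes=(57,), max_iter=2000, random_state=0)
    net.fit(X, Y)                                  # Y as a binary indicator matrix
    print(net.predict(X[:5]))                      # each row may contain several 1s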
The CART approach [3] includes a statistical alternative to the procedure used by See5. Although CART was not included in this study, it is a possible item for future research.

The poor predictive performance shown by each of the representations in identifying multiple classes suggests continued exploration of this area. Further study is required to determine the root causes of the limitations, whether they can be circumvented, and whether other classification methods might be more appropriate. For example, the limitations shown here could be due in part to the nature of the domain and the data set. Because only 137 of the 659 observations in the data set contain more than one category (44 with 3 categories, 93 with 2 categories), some of the difficulty in identifying multiple classes might be attributed to an insufficient number of multiple-class observations. As noted, five of the ten representations performed satisfactorily (i.e., over 50 percent) on the 80 percent of the observations (i.e., 522/659) that contained a single category. Additional exploration of the prediction task in this environment might also reveal other factors affecting the performance of the decision models. For example, modeling the decision strategies of human experts, such as schedulers and surgeons, could provide an indication of the degree of linearity of this task and the associated impact of that linearity on the performance of the decision models [16]. Further research also is required to determine how and whether additional case information, in the form of additional independent variables, might improve predictive accuracy. Beyond that, the context-independent classification methods we used in this paper could be supplemented by domain-specific knowledge of case factors and their relationships, as well as causal and/or heuristic knowledge of the task environment. More knowledge-rich models of diagnosis and situation assessment are possible enhancements to the traditional induction approaches described in this paper.

REFERENCES

1. Bashein, G., and Barna, C. A comprehensive computer system for anesthetic record retrieval. Anesthesia and Analgesia, 64, 4 (1985), 425-431.
2. Boritz, J.E.; Kennedy, D.B.; and Albuquerque, A. Predicting corporate failure using a neural network approach. Intelligent Systems in Accounting, Finance and Management, 4, 2 (1995), 95-111.
3. Breiman, L.; Friedman, J.H.; Olshen, R.A.; and Stone, C.J. Classification and Regression Trees. Monterey, CA: Wadsworth and Brooks, 1984.
4. Carter, C., and Catlett, J. Assessing credit card applications using machine learning. IEEE Expert, 2, 3 (1987), 71-79.
5. Charitou, A., and Charalambous, C. The prediction of earnings using financial statement information: empirical evidence with logit models and artificial neural networks. Intelligent Systems in Accounting, Finance and Management, 5, 4 (1996), 199-215.
6. Chung, H.M., and Tam, K.Y. A comparative analysis of inductive learning algorithms. Intelligent Systems in Accounting, Finance and Management, 2, 1 (1993), 3-18.
7. Currim, I.S.; Meyer, R.J.; and Le, N. A concept-learning system for the inference of production models of consumer choice. UCLA Working Paper, 1986.
8. Fanning, K., and Cogger, K.O. A comparative analysis of artificial neural networks using financial distress prediction. Intelligent Systems in Accounting, Finance and Management, 3, 4 (1994), 241-252.
9. Fanning, K.M., and Cogger, K.O. Neural network detection of management fraud using published financial data. Intelligent Systems in Accounting, Finance and Management, 7 (1998), 21-41.
10. Fanning, K.; Cogger, K.O.; and Srivastava, R. Detection of management fraud: a neural network approach. Intelligent Systems in Accounting, Finance and Management, 4, 2 (1995), 113-126.
11. Fayyad, U.; Piatetsky-Shapiro, G.; and Smyth, P. From data mining to knowledge discovery in databases. AI Magazine, 17, 3 (1996), 37-54.
12. Grudnitski, G.; Do, A.Q.; and Shilling, J.D. A neural network analysis of mortgage choice. Intelligent Systems in Accounting, Finance and Management, 4, 2 (1995), 127-135.
13. Henery, R.J. Classification. In D. Michie, D.J. Spiegelhalter, and C.C. Taylor (eds.), Machine Learning, Neural and Statistical Classification. New York: Ellis Horwood, 1994, pp. 6-16.
14. Jo, H.; Han, I.; and Lee, H. Bankruptcy prediction using case-based reasoning, neural networks, and discriminant analysis. Expert Systems with Applications, 13, 2 (1997), 97-108.
15. Johnson, R.W., and Wichern, D.W. Applied Multivariate Statistical Analysis. Upper Saddle River, NJ: Prentice-Hall, 1998, pp. 629-725.
16. Kim, C.N.; Chung, H.M.; and Paradice, D.B. Inductive modeling of expert decision making in loan evaluation: a decision strategy perspective. Decision Support Systems, 21, 2 (1997), 83-98.
17. Krzanowski, W.J. The performance of Fisher's linear discriminant function under non-optimal conditions. Technometrics, 19, 2 (1977), 191-200.
18. Lachenbruch, P.A., and Mickey, M.R. Estimation of error rates in discriminant analysis. Technometrics, 10, 1 (1968), 1-11.
19. Liang, T.P. A composite approach to inducing knowledge for expert system design. Management Science, 38, 1 (1992), 1-17.
20. Liang, T.P.; Chandler, J.S.; Han, I.; and Roan, J. An empirical investigation of some data effects on the classification accuracy of probit, ID3, and neural networks. Contemporary Accounting Research, 9, 1 (1992).
21. Messier, W.F., and Hansen, J.V. Inducing rules for expert system development: an example using default and bankruptcy rules. Management Science, 34, 12 (1988), 1403-1415.
22. Quinlan, J.R. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
23. Ragothaman, S., and Naik, B. Using rule induction for expert system development: the case of asset writedowns. Intelligent Systems in Accounting, Finance and Management, 3, 3 (1994), 187-203.
24. Sen, T.K., and Gibbs, A.M. An evaluation of the corporate takeover model using neural networks. Intelligent Systems in Accounting, Finance and Management, 3, 4 (1994), 279-292.
25. Shavlik, J.W., and Dietterich, T.G. General aspects of machine learning. In J.W. Shavlik and T.G. Dietterich (eds.), Readings in Machine Learning. San Mateo, CA: Morgan Kaufmann, 1990, pp. 1-10.
26. Shaw, M.J., and Gentry, J.A. Using an expert system with inductive learning to evaluate business loans. Financial Management, 17, 3 (1988), 45-56.
27. Tam, K., and Kiang, M. Managerial applications of neural networks: the case of bank failure prediction. Management Science, 38, 7 (1992), 926-947.
28. Tessmer, A.C.; Shaw, M.J.; and Gentry, J.A. Inductive learning for international financial analysis: a layered approach. Journal of Management Information Systems, 9, 4 (1993), 17-36.
29. Weiss, S.M., and Indurkhya, N. Predictive Data Mining. San Francisco: Morgan Kaufmann, 1998.
30. Weiss, S.M., and Kapouleas, I. An empirical comparison of pattern recognition, neural nets, and machine learning classification methods. Proceedings of the Eleventh International Joint Conference on Artificial Intelligence. San Francisco: Morgan Kaufmann, 1989, pp. 781-787.
31. Weiss, S.M., and Kulikowski, C.A. Computer Systems That Learn. San Francisco: Morgan Kaufmann, 1991.