IMPROVING THE PERFORMANCE STABILITY OF INDUCTIVE EXPERT SYSTEMS UNDER INPUT NOISE
Vijay S. Mookerjee, Michael V. Mannino, and Robert Gilson Department of Management Science, Box 353200 University of Washington, Seattle, WA 98195-3200
[email protected],
[email protected],
[email protected]
This paper appears in Information Systems Research 6, 4 (December 1995), 328-356.
Abstract

Inductive expert systems typically operate with imperfect or noisy input attributes. We study design differences in inductive expert systems arising from implicit versus explicit handling of input noise. Most previous approaches use an implicit approach wherein inductive expert systems are constructed using input data of quality comparable to problems the system will be called upon to solve. We develop an explicit algorithm (ID3ecp) that uses a clean (without input errors) training set and an explicit measure of the input noise level and compare it to a traditional implicit algorithm, ID3p (the ID3 algorithm with the pessimistic pruning procedure). The novel feature of the explicit algorithm is that it injects noise in a controlled rather than random manner in order to reduce the performance variance due to noise. We show analytically that the implicit algorithm has the same expected partitioning behavior as the explicit algorithm. In contrast, however, the partitioning behavior of the explicit algorithm is shown to be more stable (i.e., lower variance) than that of the implicit algorithm. To extend the analysis to the predictive performance of the algorithms, a set of simulation experiments is described in which the average performance and coefficient of variation of performance of both algorithms are studied on real and artificial data sets. The experimental results confirm the analytical results and demonstrate substantial differences in stability of performance between the algorithms, especially as the noise level increases.
1. Introduction

Inductive expert systems have become an important decision support tool, as evidenced by considerable attention in the academic literature and business press and a number of commercial products to develop such systems. Inductive expert systems are typically developed to support classification tasks, i.e., tasks in which the system attempts to classify an object as one of n categories [Quinlan 1986a]. Examples of classification in business decision making include fault diagnosis in semiconductor manufacturing [Irani et al. 1993], bank failure prediction [Tam and Kiang 1990], and industry and occupation code prediction [Creecy et al. 1992]. The primary goal of an inductive expert system is to perform at the same level as human experts. Such systems can provide many benefits to an organization [Holsapple and Whinston 1987], such as reducing decision making time, improving the consistency of decisions, and reducing dependence on scarce human experts.

An inductive expert system is constructed using a learning algorithm and a data set. A learning algorithm develops classification rules that can be used to determine the class of an object from its description, i.e., from the object's attributes. The classification rules developed by these algorithms can be depicted as a decision tree in which the non-leaf nodes prescribe inputs that must be observed and the arcs represent states that the input variables can take. Leaf nodes in the tree indicate how an object is to be classified. Induction algorithms build such a tree from a set of pre-classified cases referred to as the training set. Another part of the data set, known as the test set, is used to study the performance of a decision tree on novel cases.

Inductive expert systems are typically developed to maximize solution accuracy, i.e., to maximize the number of cases in which the output (decision, recommendation) provided by the system is similar to that provided by human experts. Economic considerations, for example, costs of observing inputs,
benefits from system outputs, and other factors that may contribute to system value are rarely factored into system design [Mookerjee and Dos Santos 1993].

The subject of this paper is the design and performance evaluation of inductive expert systems using noisy input attributes. The presence of input noise can have a significant impact on the performance of an inductive expert system. We only consider noise that affects the input values used by the system to make classification decisions, not other forms of noise. This definition includes errors from such causes as incorrectly measuring an input, wrongly reporting the state of an input, relying on stale values, and using imprecise measurement devices. Input errors in a training set can cause a learning algorithm to form a rule with an incorrect state for an input, while input errors in cases to be classified can cause the wrong rule to be used.

Footnote 1: It must be noted, however, that researchers have discussed other dimensions of noise such as conflicts in the classification, classification errors, and missing input data. In this paper, we use the term noise in a restricted sense, only with reference to input measurement accuracy: noise rises if input measurement accuracy falls, and noise falls if input measurement accuracy rises.

The specific issue studied here is how to account for the level of input noise: (i) implicitly, through a training set with a representative level of noise, or (ii) explicitly, through a noise parameter and a clean training set. Figure 1 depicts the explicit and implicit approaches. In common practice, the implicit approach is used because it is cost effective and has been carefully studied. However, high variance of performance is a key disadvantage of the implicit approach that has not been widely discussed or studied. Our most important finding is that an induction algorithm using an explicit noise parameter can have more stable performance than a comparable implicit algorithm.

Variation in performance is an outcome variable of interest in a wide variety of systems. For example, the performance of a manufacturing process is judged in terms of its mean behavior
as well as by the variation in its behavior. Reducing variance is routinely used as a performance objective in survey research, where errors may be introduced by interviewers, respondents, questionnaires, processing of forms, and so on. In the context of inductive expert systems, however, past research has largely ignored this important design objective. Variation in performance can be an extremely important aspect of an inductive expert system. For example, if a loan granting inductive expert system were to make very good decisions on one set of cases but extremely poor ones on another, it might cause the bank to fail during the period of poor performance. Thus, managers are likely to prefer a more stable system to a highly variable one even if the mean performance of the two systems is the same.
Figure 1: Explicit and Implicit Noise Handling (the explicit algorithm builds a decision tree from a clean training set plus a noise parameter; the implicit algorithm builds a decision tree from a noisy training set)
In addition to stable performance, explicit noise handling is interesting to study because there are a number of situations in which an explicit approach is more practical than an implicit
approach. One such situation is when the training set is obtained from experts rather than from historical data. Here, an implicit approach would require that the input states in the training set be deliberately corrupted. In addition, if there are multiple ways to measure an input or if the level of noise in historical data is not representative of current practices, an explicit approach may be preferable.

Because explicit algorithms have not been carefully studied, the major topics presented are the design and performance evaluation of explicit algorithms. We design an explicit algorithm that injects noise according to a specified noise parameter as it partitions a data set. The novel aspect of the algorithm is that it injects noise in a controlled rather than random manner in order to reduce the variance due to noise. We show analytically that the expected partitioning behavior of the implicit and explicit algorithms is the same. However, we demonstrate analytically that the explicit algorithm produces more stable partitions than the implicit algorithm. To extend the analytical results to the classification accuracy of the algorithms, we conduct a set of simulation experiments on several real and artificial data sets with a range of values for the number of classes and the skewness of the class distributions. Our experimental results reveal that the expected accuracy and variance of accuracy of the algorithms are consistent with our analytical results.

The rest of this paper is organized as follows. In Section 2, we review related work on noise handling approaches used in inductive expert systems. In Section 3, we provide background on decision tree induction and analytically evaluate the behavior of the implicit and explicit algorithms, ID3p and ID3ecp. In Section 4, we describe the hypotheses investigated, the experimental designs used, and the results obtained from a set of simulation experiments. A summary and conclusions are provided in Section 5.
2. Related Work

In this section, we summarize a theoretical study of input noise, various pruning procedures, and models to cope with measurement errors in surveys. Laird [Laird 1988] studied the Bernoulli Noise Process (BNP) as an extension of the theory of Probably Approximately Correct (PAC) learning. The goal of PAC theory is to derive an upper bound on the number of examples needed to approximately learn a concept within a given error bound with a specified confidence level. A BNP is characterized by independent parameters for the classification error rate and the input error rate. Laird's basic result is that the input error rate alone is not sufficient to determine the maximum number of examples needed; there must be an additional parameter that indicates the sensitivity of the true concept definition to input errors. This theoretical result about the relationship between the importance of an attribute and the impact of noise has been empirically demonstrated in other studies.

In more applied studies, researchers have developed pruning procedures (post-construction techniques) to refine the rule set generated by a learning algorithm. Learning algorithms generally find a perfect set of rules for a training set, but the rules are usually too specialized, leading to poor performance on unseen cases. Pruning techniques reduce specialization by eliminating rules in whole or in part. Pruning techniques have also been found useful for handling noise because noise in a training set can lead to extra rules and highly specialized rules. For example, Quinlan [Quinlan 1987] demonstrated that four pruning methods significantly reduced the complexity of induced decision trees without adversely affecting accuracy. In a study of the chi-square pruning technique [Quinlan 1986b], he also found that when the cases to be classified are noisy, a learning algorithm performs better with a comparably noisy training set than with a perfect (noise-free) one. He further found that the increase in predictive performance from noise reduction depends on the importance of an input.
Two important themes in the development of pruning procedures are the use of an extra test set and parameters to control the amount of pruning. A number of techniques use an extra test set to choose among multiple collections of rules (critical value pruning [Mingers 1989] and error complexity pruning [Breiman et al. 1984]) or to reduce the complexity of the rules (reduced error pruning [Quinlan 1987]). These techniques require more data than techniques that prune using the training set alone (pessimistic pruning [Quinlan 1987], minimum error pruning [Niblett and Bratko 1986], and Laplace pruning [Christie 1993]). Mingers [Mingers 1989] found that pruning techniques using an extra test set achieved higher accuracy than techniques not using an extra test set. However, Quinlan [Quinlan 1987], in an earlier study, did not find higher accuracy as a result of extra test cases. Several techniques have been developed that use a single parameter to control the amount of pruning (error complexity pruning [Breiman et al. 1984] and m-probability-estimate pruning [Cestnik and Bratko 1991]). The parameters are rather coarse, applying to the entire data set rather than to individual inputs. In addition, there are no guidelines for setting parameter values except that high values should be used when there is a large amount of noise. Moulet [Moulet 1991] developed input noise parameters for the ABACUS discovery system and demonstrated their effectiveness in learning simple laws of physics. However, he did not apply his technique to classification problems.

Unlike the work on pruning techniques, the area of measurement errors in surveys [Groves 1991] includes a rich stream of research on techniques to measure and reduce the level of input noise and models to compensate for the effect of input noise on prediction tasks. This area of research has developed a detailed classification of input noise, beginning with systematic errors that introduce bias and random errors that cause variance. Beyond this division, the source of errors (interviewer, respondent, process, questionnaire) and the cause of errors (e.g., memory loss and non-response) are often identified. When the level of input noise is not known, it can be estimated using a reinterview technique [Hill 1991, Rao and Thomas 1991] where error-prone measurements are made on a sample and then more expensive and relatively error-free
measurements are made on a subsample. The reinterview technique is similar to the idea of using a clean training set with an explicit estimate of the noise level. Many models have been developed to compensate for the effect of input noise on regression [Fuller 1991], analysis of variance [Biemer and Stokes 1991], and estimation of survey statistics of categorical data [Biemer and Stokes 1991]. However, there is no reported research on classification tasks.
3. Algorithm Design and Analysis

In this section, we present the explicit noise algorithm used in our simulation experiments and analyze the mean and variance of its partitioning behavior. Before presenting the explicit noise algorithm, we review the induction process underlying the baseline algorithm, ID3p (the ID3 algorithm [Quinlan 1986a] with the pessimistic pruning procedure).

3.1. Decision Tree Induction

Induction algorithms develop a decision tree by recursively creating non-leaf nodes and leaf nodes. Non-leaf nodes are labeled by input names. The input chosen to label a non-leaf node is determined using an input selection criterion and a set of cases. Traditionally, inputs are selected by their information content, measured by the reduction in information entropy achieved as a result of observing the input. After an input has been chosen to label a non-leaf node, each of the q outgoing arcs is labeled by a possible state of the selected input, where q is the number of possible states. The set of cases used to compute the label of a node (for the root node this is the entire training set) is then partitioned into q subsets such that the state of the input used to label the node is the same within each subset. The tree can grow along each outgoing arc using the subset of cases corresponding to the state of the input used to label the arc. Creation of non-leaf nodes continues along each path of the tree until a stopping condition is reached, at which stage a leaf node is created. Leaf nodes are labeled using a classification function.
An important factor that must be considered in the design of an induction algorithm is the input selection criterion. The input selection criterion determines how, from a set of candidate inputs, an input is chosen to label a non-leaf node. We describe the input selection criterion for the ID3 algorithm in the remainder of this subsection. Let

D_N                  be a randomly drawn training set of size N,
X_1, X_2, ..., X_p   be the p observable input variables that may be used to classify an object,
x_k1, x_k2, ..., x_kq   be the q possible states for input X_k,
ψ                    be the random variable for the class of an object,
c_1, c_2, ..., c_m   be the m possible states for the class variable ψ,
Z                    be a partition of the training set (Z ⊆ D_N),
π                    be an input state conjunction, e.g., X_1 = x_14 ∧ X_2 = x_23,
L(Z, π)              be the partition of Z such that π is true,
k(Z)                 be a classification function that determines how a leaf node is labeled,
g(X_k | Z)           be the expected gain for input X_k given Z.
The input selection criterion in the ID3 algorithm chooses the input with the maximum information content (gain), measured by the reduction in information entropy [Shannon and Weaver 1949] as a result of observing the input. Formally, the gain of input X_k is defined as:

g(X_k | Z) = ΔEN(X_k | Z) = EN(Z) − EN(X_k | Z)    (1)

where

EN(Z) = − ∑_{r=1}^{m} P[ψ = c_r | Z] log2 P[ψ = c_r | Z]    (2)

is the initial expected entropy,

EN(X_k | Z) = ∑_{j=1}^{q} P[X_k = x_kj | Z] EN(L(Z, X_k = x_kj))    (3)

is the expected entropy after observing X_k, and

EN(L(Z, X_k = x_kj)) = − ∑_{r=1}^{m} P[ψ = c_r | L(Z, X_k = x_kj)] log2 P[ψ = c_r | L(Z, X_k = x_kj)]    (4)

is the expected entropy in the partition L(Z, X_k = x_kj). ID3 estimates the probabilities in equations (2) through (4) by sample proportions obtained from the training data.

To demonstrate the above equations, consider a sample of 100 cases (Z) with 2 classes (c1 = 66, c2 = 34). Let input X_k with states x_k1, x_k2, x_k3 be used to partition the 100 cases. From equation (2), the initial entropy is:

EN(Z) = −(0.66 log2 0.66 + 0.34 log2 0.34) = 0.9248

For X_k = x_k1, let c1 = 48 and c2 = 12 be the subpartition of cases obtained. The entropy within this subpartition is:

EN(L(Z, X_k = x_k1)) = −(0.8 log2 0.8 + 0.2 log2 0.2) = 0.7219

For X_k = x_k2, let c1 = 10 and c2 = 20 be the subpartition, with EN(L(Z, X_k = x_k2)) = 0.9149 computed the same way. For X_k = x_k3, let c1 = 8 and c2 = 2 be the subpartition, with EN(L(Z, X_k = x_k3)) = 0.7219. From equation (3), the entropy after observing X_k is:

EN(X_k | Z) = 0.6 * 0.7219 + 0.3 * 0.9149 + 0.1 * 0.7219 = 0.7798

Thus, from equation (1), the gain from observing input X_k is:

g(X_k | Z) = ΔEN(X_k | Z) = 0.9248 − 0.7798 = 0.1449
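To make the gain computation concrete, the following Python fragment (our illustrative reimplementation of equations (1) through (4), not code from the paper) reproduces the example above; the small difference from the 0.1449 reported above arises only because the worked example rounds intermediate proportions.

```python
import math

def entropy(class_counts):
    """Expected entropy EN(Z) of a partition from its class frequency counts (equation (2))."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

def gain(parent_counts, child_counts_by_state):
    """Information gain of an input (equation (1)): parent entropy minus the
    size-weighted entropy of the child partitions (equation (3))."""
    n = sum(parent_counts)
    after = sum((sum(child) / n) * entropy(child) for child in child_counts_by_state)
    return entropy(parent_counts) - after

# The 100-case example: the parent has class counts (66, 34) and X_k splits it
# into subpartitions with class counts (48, 12), (10, 20), and (8, 2).
print(round(entropy([66, 34]), 4))                              # 0.9248
print(round(gain([66, 34], [[48, 12], [10, 20], [8, 2]]), 4))   # about 0.144 with exact proportions
```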
To cope with noise and overfitting, we augment ID3 with the pessimistic pruning procedure [Quinlan 1987]. We call the augmented ID3 algorithm ID3p. The pessimistic pruning procedure uses a statistical measure to determine whether replacing a sub-tree by its best leaf (i.e., the classification that maximizes accuracy for a given set of cases) is likely to increase accuracy. If so, the branch is replaced by its best leaf; otherwise the branch is retained. Once all branches have been examined, the process terminates. A more detailed description can be found in Appendix B. We use the pessimistic pruning procedure because it is easy to implement, effective on noisy data [Mingers 1989], and does not require an extra test set.

3.2. Model of Noise

Inductive expert systems typically operate under conditions where inputs are subject to noise. Noise occurs when the true input state is perturbed by a measurement process. We assume that measurement errors are independent of the time of measurement and of the true state of an input. If we ignore differences among wrong states (e.g., measuring a high value as medium or low), noise can be characterized as a binomial process with mean C, the probability of correct measurement. The value of C can be estimated from empirical data as follows. First, for each noisy input, collect a representative sample of input values. Second, correct the noisy values through various techniques such as repeated measurement [Hill 1991, Rao and Thomas 1991] and/or improved measurement, that is, using more expensive and relatively error-free devices. Third, estimate the population error rate from the sample as the number of incorrect values divided by the sample size. One minus the sample error rate is an unbiased estimate of C. We define the parameter W as the probability of wrong or incorrect measurement, a measure of the likelihood of disruptions in the measurement process that lead to a particular
incorrect state being recorded. We assume that wrong states are equally likely and that all inputs have the same level of noise. Thus, the relationship C + (q − 1)W = 1 holds for any input since there are q − 1 incorrect states. For example, if there is a 5-state input and the probability of correct measurement is C = 0.8, then any wrong state has probability W = (1 − 0.8)/(5 − 1) = 0.05.
Footnote 2: The noise parameters (C and W) are output rather than process measures. In some studies, the noise parameter is the probability that the measurement process has been perturbed. A perturbation in the measurement process may or may not lead to an incorrect measurement. We use an output measure here because it is more convenient, and a process measure can be mapped to an output measure.
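As a concrete illustration of this noise model (our sketch, not part of the paper), the following Python fragment derives W from C and q and simulates the measurement of a vector of true input states; the state names are hypothetical.

```python
import random

def wrong_state_probability(C, q):
    """W derived from the identity C + (q - 1) * W = 1."""
    return (1.0 - C) / (q - 1)

def observe(true_states, states, C, rng=random):
    """Return noisy observations: keep the true state with probability C,
    otherwise pick one of the q - 1 wrong states uniformly at random."""
    noisy = []
    for t in true_states:
        if rng.random() <= C:
            noisy.append(t)
        else:
            noisy.append(rng.choice([s for s in states if s != t]))
    return noisy

states = ["low", "medium", "high", "very_high", "extreme"]   # a hypothetical 5-state input
print(wrong_state_probability(0.8, len(states)))             # 0.05, as in the example above
print(observe(["low", "high", "medium"], states, C=0.8))
```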
Although the assumptions of constant C across inputs and constant W within an input's states can be easily relaxed, we use them to simplify our analysis. Without these assumptions, many more noise parameters would have to be estimated. More precisely, C and W are defined as:

C = P[X_ko = x_kj | X_kt = x_kj]   for all k, j
W = P[X_ko = x_kj | X_kt = x_ki]   for all k and all j ≠ i

where X_ko and X_kt are the random variables for the observed (noisy) and true states of input X_k.

To analytically describe the impact of noise on the input gain, we state Proposition 1.

Proposition 1:
∂g(X_ko | Z_C)/∂C > 0 for C ∈ (1/q, 1]    (6.1)
∂g(X_ko | Z_C)/∂C = 0 for C = 1/q         (6.2)
∂g(X_ko | Z_C)/∂C < 0 for C ∈ [0, 1/q)    (6.3)

where Z_C is a noisy partition with noise level C. With no noise (C = 1), the gain is at its highest. As noise increases (the value of C decreases from 1 toward 1/q), the gain decreases until C reaches 1/q (6.1). When C equals 1/q, there is no benefit from observing the input because it could be observed in any of its states with equal probability (6.2). As the value of C decreases below 1/q, the noise becomes so high that the observed states become predictably incorrect; hence, the observed states of the input again begin to provide some information (6.3). Thus the relationship between the noise level and the gain is convex, validating that we are dealing with a reasonable model of noise. In Appendix A, we prove Proposition 1 for a special case. In addition, our simulations strongly suggest that the gain monotonically decreases (i.e., less uncertainty reduction) as the correct measurement probability increases from 0 to 1/q and then monotonically increases until the correct measurement probability is 1.
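The convex relationship asserted by Proposition 1 can be checked numerically. The Python sketch below (ours; it uses the class proportions of the 100-case example in Section 3.1 as a hypothetical clean population) computes the gain of the noisy input for several values of C; the gain is smallest at C = 1/q and rises toward both endpoints.

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def noisy_gain(state_probs, class_given_state, C):
    """Gain of observing the noisy input, for a clean distribution given by
    state_probs (P[X_kt = x_ki]) and class_given_state (rows of P[class | X_kt = x_ki])."""
    q = len(state_probs)
    W = (1.0 - C) / (q - 1)
    m = len(class_given_state[0])
    class_marginal = [sum(state_probs[i] * class_given_state[i][r] for i in range(q)) for r in range(m)]
    after = 0.0
    for j in range(q):                                   # observed (noisy) state x_kj
        p_obs = sum(state_probs[i] * (C if i == j else W) for i in range(q))
        joint = [sum(state_probs[i] * class_given_state[i][r] * (C if i == j else W)
                     for i in range(q)) for r in range(m)]
        after += p_obs * entropy([x / p_obs for x in joint])
    return entropy(class_marginal) - after

gamma = [0.6, 0.3, 0.1]                       # hypothetical clean state probabilities
rho = [[0.8, 0.2], [0.33, 0.67], [0.8, 0.2]]  # hypothetical P[class | true state]
for C in (0.0, 1/3, 0.65, 0.8, 1.0):
    print(round(C, 2), round(noisy_gain(gamma, rho, C), 4))   # gain bottoms out at C = 1/q
```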
3.3. Impact of Noise on Expected Behavior

In this subsection, we demonstrate how the information content of a noisy input is computed in an implicit version of ID3 (ID3p), where a noisy training set is used, and in an explicit version of ID3 (ID3ep), where a clean training set and noise parameters are used.

3.3.1. Implicit Handling of Noise

In an implicit algorithm such as ID3p, the level of noise acts implicitly on the gain of an input. Specifically, equation (3) uses two probability estimates: (i) the probability of a class given a new input and the set of previously observed inputs, and (ii) the probability of a new input given the set of previously observed inputs. With noise, we need to estimate the same two probabilities, except that it is the noisy observed states rather than the true states of the input that are estimated. Equation (7) shows the expected entropy of X_ko after observing a set of noisy inputs (Π_o). ID3 estimates the probabilities in equation (7) by sample proportions obtained from the noisy training data.

EN(X_ko | Π_o) = − ∑_{j=1}^{q} P[X_ko = x_kj | Π_o] ∑_{r=1}^{m} ζ_r log2 ζ_r    (7)

where ζ_r = P[ψ = c_r | X_ko = x_kj ∧ Π_o] and Π_o = Π_1o ∧ ... ∧ Π_do is the set of previous observations on the path.
3.3.2. Explicit Handling of Noise

Explicit algorithms estimate the probabilities in equation (7) using a clean training set and noise parameters rather than a noisy training set. Both implicit and explicit algorithms are subject to noise processes having the same mean noise parameters (C and W in our case). The difference is that for implicit algorithms the noise process has already occurred, whereas for explicit algorithms the noise process occurs as part of the algorithm. Thus, implicit and explicit algorithms compute the same expected decision tree because the expected gain calculations are the same. Even though explicit algorithms have better information (both the noise level and clean data), their expected predictive ability is the same as that of implicit algorithms. However, in subsection 3.4, we demonstrate that the additional information can be used to reduce the variance in the predictive performance of explicit algorithms.

For explicit algorithms, one way to estimate the information content of a noisy input is to use a recursive Bayesian updating scheme [Pearl 1988]. In the Bayesian approach, we rewrite the expression for ζ_r from equation (7) as equation (8) using the chain propagation rule ([Pearl 1988], p. 154). Because we start with a clean training set, the information content of an input in its true state can be estimated. Note that we use the property of conditional independence wherein the true state separates the class and the observed state ([Pearl 1988], p. 154).

ζ_r = ∑_{i=1}^{q} P[X_kt = x_ki | X_ko = x_kj ∧ Π_o] P[ψ = c_r | X_kt = x_ki ∧ Π_o]    (8)

In equation (8), the probability of a class given the current true state and the history of observed states can be written as:

P[ψ = c_r | X_kt = x_ki ∧ Π_o] = ∑_{Π_t ∈ π_t} P[ψ = c_r | X_kt = x_ki ∧ Π_t] P[Π_t | X_kt = x_ki ∧ Π_o]    (9)

where Π_t is the random vector for the combination of previously observed variables (without noise) and π_t is the set of true state vectors.
Footnote 3: The chain propagation rule provides the relation P(A | B) = ∑_j P(A | B, C_j) P(C_j | B).

Equation (9) demonstrates the difficulty of a Bayesian approach. The first term on the right hand side of (9) must be computed O((p − d)q^d) times, where d is the node or input level (the root node is level 0) in a decision tree, p is the number of inputs, and q is the average number of input states. The cardinality of π_t is q^d because π_t contains all possible true states of all inputs on the path. The joint probability calculations must be repeated for all remaining inputs (p − d). Since the
depth of a tree is partially dependent on the number of inputs, a recursive Bayesian updating approach is impractical for computational reasons.

As an alternative to a Bayesian updating approach, noise effects can be propagated during tree construction by introducing noise into partitions, randomly scrambling cases using the parameters C and W. This amounts to computing the class probabilities from a partition instead of calculating the probabilities in (9). A formal description of the random scrambling procedure follows.

Procedure Random Scrambling
Input:
  Z: a partition
  X: branching input with q states
  C: noise level
Output:
  S: a 'scrambled' set of partitions of input X, S = {S_k | S_k is a partition of S}, k = 1, 2, ..., q
Procedure:
1. Initialize each S_k ∈ S to the empty set.
2. Let T be the set of true partitions resulting from splitting Z into q partitions on input X.
3. For each partition T_k ∈ T do
   3.1. For each case τ ∈ T_k do
        3.1.1. Let r be a random number in [0, 1].
        3.1.2. If r ≤ C, then set S_k := S_k ∪ {τ}; else randomly choose S_j, j ≠ k, and set S_j := S_j ∪ {τ}.

In contrast to the Bayesian updating approach, the complexity of injecting noise through random scrambling is O((p − d)q) because q scrambling operations are necessary for each input and only the current input needs to be scrambled. The effect of noise from all but the current input on the path is already included in the starting noisy partition.

Figure 2 illustrates the random scrambling procedure. The root box depicts a partition of size 100 with 66 cases expected of class c1 and 34 cases expected of class c2. Input X_k is selected and partitioned by its three states, where the expected size and class frequencies of the partitions are
shown in the boxes without parentheses. Noise is then introduced by scrambling all the partitions of X_k according to the correct measurement probability C = 0.8. For each case in a partition, a random number is drawn to determine whether the case should be moved to a partition associated with another state of the input. If the number is less than or equal to C, the case remains in its original partition. Otherwise, the case is randomly moved to a partition associated with another state of the input. After scrambling, the size and class frequencies of each partition are typically more uniform because of the impact of noise. In Figure 2, the scrambled partitions are beneath the clean partitions. The lines indicate that cases can be moved from a clean partition to any noisy partition.
Figure 2: Example of the Random Scrambling Process

Root partition: 100 cases (c1: 66, c2: 34), split on input X_k.

Initial (clean) partitions:
  x_k1: 60 cases   c1: 48 (24.96)   c2: 12 (10.56)
  x_k2: 30 cases   c1: 10 (8.92)    c2: 20 (16.06)
  x_k3: 10 cases   c1: 8 (7.36)     c2: 2 (1.96)

Scrambled partitions (C = 0.8):
  x_k1: 52 cases   c1: 40.2 (25.43) {16.14}   c2: 11.8 (10.85) {6.94}
  x_k2: 31 cases   c1: 13.6 (12.66) {6.03}    c2: 17.4 (14.88) {10.4}
  x_k3: 17 cases   c1: 12.2 (11.54) {4.73}    c2: 4.8 (5.05) {1.52}
Footnote 4: Numbers inside round and curly brackets are the variances explained in subsection 3.4.
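A short Python sketch of the random scrambling procedure described above (ours, not code from the paper; partitions are represented simply as lists of cases keyed by the true state of the branching input):

```python
import random

def random_scramble(true_partitions, C, rng=random):
    """Randomly scramble cases among the q partitions of a branching input.
    true_partitions: dict mapping each input state to its list of cases.
    Each case stays in its own partition with probability C and is otherwise
    moved to one of the other q - 1 partitions chosen uniformly at random."""
    states = list(true_partitions)
    scrambled = {s: [] for s in states}                  # step 1: initialize each S_k to the empty set
    for k, cases in true_partitions.items():             # step 3: loop over the true partitions T_k
        for case in cases:                                # step 3.1: loop over cases
            if rng.random() <= C:                         # step 3.1.2: keep with probability C
                scrambled[k].append(case)
            else:                                         # otherwise move to a random other partition
                j = rng.choice([s for s in states if s != k])
                scrambled[j].append(case)
    return scrambled
```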
The explicit algorithm ID3ep uses the random scrambling procedure to introduce noise into partitions. First, the usual partitioning process of the ID3 algorithm is performed. Second, the partitions created in the first step are scrambled using the random scrambling procedure. After all partitions of an input are scrambled, the normal formulas for the input selection, stopping rule, and classification function are used. Note that we retain the pessimistic pruning procedure in ID3ep. Thus, the only difference between ID3p and ID3ep is the way that noise is treated. In ID3p, the treatment of noise is indirect, through a sample of the data collection process. In ID3ep, the treatment of noise is direct, through random injection of noise into a clean training set using the parameters C and W. Despite the differences in handling noise, the expected behavior of both algorithms is governed by equation (7). This observation is similar to the results of previous research comparing input selection criteria [Mantaras 1991].

3.4. Impact of Noise on Variance

As discussed in subsection 3.3, explicit algorithms have no advantage over implicit algorithms in terms of expected predictive ability. The advantage of explicit noise handling lies in the potential to make performance more stable. In this section, we study the impact of noise on the variance in class frequencies under a binomial noise process and under a constant noise process. The binomial noise process represents ID3p and ID3ep, in which noise is randomly introduced either in the training set or through the random scrambling process. The constant noise process represents a controlled variation of ID3ep (ID3ecp) in which class frequencies in noisy partitions are set to their estimated expected values. As the analysis elucidates, the variance in size due to noise is much less in the constant noise process than in the binomial noise process.

We begin with the sampling variance common to both noise processes, limiting our focus to a single input, state, and class. Let P[X_kt = x_kj] = γ_j and P[ψ = c_r | X_kt = x_kj] = ρ_rj be population parameters, and let s(L(Z_n, X_kt = x_kj)) be the size of the partition of Z_n where X_kt = x_kj.
The expected value and variance of the size of a partition of size n where X_kt = x_kj are defined in equations (10) and (11), respectively. For each case, the sampling process selects X_kt = x_kj with probability γ_j. Thus, the size of a partition is binomially distributed with mean and variance as shown in equations (10) and (11).

E(s(L(Z_n, X_kt = x_kj))) = n γ_j              (10)
V(s(L(Z_n, X_kt = x_kj))) = n γ_j (1 − γ_j)    (11)

To analyze the mean and variance of the subpartition size where ψ = c_r, we apply Wald's theorem [Ross 1970, page 37]. Wald's theorem gives the expected value and variance of the sum of K random variables where K is itself a random variable. The expected value of the sum of K independent, identically distributed random variables X_i is E(K)E(X_i). The variance of the sum is E(K)V(X_i) + V(K)(E(X_i))². Here, the expected size of the beginning partition is n γ_j as defined in (10). Within this partition, the probability of ψ = c_r is ρ_rj. Applying Wald's theorem, the mean and variance of the subpartition size are defined by (12) and (13). In (13), we assume that the class random variables of the individual cases within the partition are independent and identically distributed.

E(s(L(Z_n, ψ = c_r ∧ X_kt = x_kj))) = n γ_j ρ_rj = μ_rj                                          (12)
V(s(L(Z_n, ψ = c_r ∧ X_kt = x_kj))) = n γ_j ρ_rj (1 − ρ_rj) + n γ_j (1 − γ_j) ρ²_rj = σ²_rj      (13)
Now consider the effect of the binomial noise process common to ID3p and ID3ep. This process introduces more variance in the class frequency because cases are assigned at random to partitions based on the parameter C. The expected value and variance of the size of the above partition after noise are given in equations (14) and (15), where the noise is a binomial process with mean C and variance C(1 − C). Note that these formulations involve another application of Wald's theorem because once again the size of the beginning partitions is uncertain, as defined in (12) and (13). In (14) and (15), n γ_j ρ_rj replaces E(K) for the correct input state and n γ_i ρ_ri replaces E(K) for each of the q − 1 incorrect input states. For the correct input state, C replaces E(X_i), and W replaces E(X_i) for the q − 1 incorrect states. In (15), σ²_rj from (13) replaces V(K) for the correct input state and σ²_ri replaces V(K) for the q − 1 incorrect states. For the correct input state, C(1 − C) replaces V(X_i), and W(1 − W) replaces V(X_i) for the q − 1 incorrect states.

E(s(L(Z_n, ψ = c_r ∧ X_ko = x_kj))) = n γ_j ρ_rj C + ∑_{i≠j} n γ_i ρ_ri W = C μ_rj + W ∑_{i≠j} μ_ri    (14)

V(s(L(Z_n, ψ = c_r ∧ X_ko = x_kj))) = C(1 − C) μ_rj + C² σ²_rj + W(1 − W) ∑_{i≠j} μ_ri + W² ∑_{i≠j} σ²_ri    (15)
Now consider the effect of the constant noise process associated with the controlled scrambling algorithm ID3ecp. Here, the class frequency is set to its expected value; therefore, the variance due to noise disappears. However, the variance of the sampling process remains. Equation (16) for the variance of the partition size under a constant noise process is derived using the variance of a constant times a function of a random variable. Alternatively, the variance can be derived by dropping the first and third terms in (15) because the constant noise process has zero variance. Note that the expected value remains as defined in equation (14).

V(s(L(Z_n, ψ = c_r ∧ X_ko = x_kj))) = C² σ²_rj + W² ∑_{i≠j} σ²_ri    (16)
To further depict the difference between the binomial noise process and the constant noise process, consider a numerical example based on Figure 2, where the population statistics are:

P[X_kt = x_k1] = 0.6,   P[X_kt = x_k2] = 0.3,   P[X_kt = x_k3] = 0.1
P[ψ = c1 | X_kt = x_k1] = 0.8,   P[ψ = c1 | X_kt = x_k2] = 0.33,   P[ψ = c1 | X_kt = x_k3] = 0.8
P[ψ = c2 | X_kt = x_k1] = 0.2,   P[ψ = c2 | X_kt = x_k2] = 0.67,   P[ψ = c2 | X_kt = x_k3] = 0.2

The labeling of the boxes in Figure 2 shows the expected values for the input state and class in the clean and noisy partitions. The variances for the binomial (equation (15)) and constant noise (equation (16)) processes are shown in round and curly brackets, respectively, next to the expected values (equation (14)) in the noisy partitions. Note the reductions for the constant noise process.
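The bracketed variances in Figure 2 can be verified directly from equations (13), (15), and (16). The following Python fragment (our numerical check, not code from the paper) reproduces the values 24.96, 25.43, and 16.14 for class c1 and observed state x_k1.

```python
n, C, q = 100, 0.8, 3
W = (1 - C) / (q - 1)                       # 0.1
gamma = [0.6, 0.3, 0.1]                     # P[X_kt = x_kj]
rho_c1 = [0.8, 0.33, 0.8]                   # P[psi = c1 | X_kt = x_kj]

# Clean subpartition means and variances, equations (12) and (13)
mu = [n * g * r for g, r in zip(gamma, rho_c1)]
var = [n * g * r * (1 - r) + n * g * (1 - g) * r**2 for g, r in zip(gamma, rho_c1)]

j = 0                                       # observed state x_k1
others = [i for i in range(q) if i != j]

# Binomial noise process, equation (15)
v_binomial = (C * (1 - C) * mu[j] + C**2 * var[j]
              + W * (1 - W) * sum(mu[i] for i in others)
              + W**2 * sum(var[i] for i in others))

# Constant noise process, equation (16): the noise-variance terms drop out
v_constant = C**2 * var[j] + W**2 * sum(var[i] for i in others)

print(round(var[j], 2), round(v_binomial, 2), round(v_constant, 2))   # 24.96 25.43 16.14
```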
Figure 3 shows a graphical view of the difference between the constant and binomial noise processes for a particular class. The two curves correspond to partitions at level 4 in a decision tree. The curve for the binomial noise process is based on (15), while the curve for the constant noise process is based on (16). For the binomial noise process, Figure 3 demonstrates that the coefficient of variation of partition size increases as the noise level increases, with the peak at about C = 1/q. Although not shown here, the difference between the coefficients of variation of the two processes increases as the depth of the tree increases because observing more noisy inputs introduces additional uncertainty.
Figure 3: Graphical Comparison of Noise Processes of a Node at Level 4 (CV of partition size plotted against C from 0 to 1, with one curve for the binomial noise process and one for the constant noise process)
The goal of reducing the variance of the sub-partition sizes is to make the input selection process more stable and ultimately to reduce the variance in performance on a set of unseen cases. Analytically, it is rather difficult to measure the variance in the gain because distributional assumptions are necessary and the mathematics is complicated. Since any distributional assumptions would rarely be met, it seems better to make a strong statement about the sub-partition sizes than a weak statement about the gain. Even with a strong statement about the variance of the gain, simulation experiments would still be necessary to link the theoretical behavior with performance on unseen cases.

3.5. Controlled Scrambling Procedure

The controlled scrambling procedure is designed to behave as closely as possible to the constant noise process and thereby to reduce the variance in the gain and, ultimately, in the performance on unseen cases. The differences between the controlled scrambling procedure and the constant noise process are due only to rounding of fractional values and conserving the number of cases in the various partitions of the training set. The controlled scrambling procedure first computes the class and true input state frequencies as close as possible to their estimated values and then randomly assigns cases to match the computed frequencies. Control of class frequencies follows from the discussion in Section 3.4. The reason for controlling the true state frequency is a little subtle, however. The true state distribution shows the fraction of each true state within a noisy partition. For example, in Figure 2, the true state frequencies in the noisy partition x_k1 are 48 (60*0.8) for true state x_k1, 3 (30*0.1) for true state x_k2, and 1 (10*0.1) for true state x_k3. Controlling the true state frequency does not affect the variance of the class frequency in the current node; rather, it potentially impacts the class frequencies in descendant nodes. The true state frequency is important to control when the next input to select in the decision tree is conditionally dependent on the current input state. When there is a strong dependence, reducing the variance in true state frequencies will reduce the variance in the class frequencies of the next input. For example, the constant noise curve in Figure 3 was generated with a strong dependence between the current input state and the next input state. Thus, controlling the true state frequencies is a matter of reducing variance in descendant nodes rather than in the current node.

Computation of the class and true state frequencies is accomplished by solving two optimization models. First, the controlled scrambling procedure assigns class frequencies such that the assignment minimizes the distance between the assigned class frequencies and the estimated class frequencies in a set of partitions (all the partitions of an input) subject to integer, non-negativity, and case conservation constraints. The latter constraints ensure that the total number of cases is the same before and after allocation by the controlled scrambling procedure. Second, the controlled scrambling procedure assigns true state frequencies such that the assignment minimizes the distance between the assigned true state frequencies and the estimated true state frequencies in a subset of a partition (all cases of the same class) subject to integer, non-negativity, and case conservation constraints. The latter constraints are based on the assignment made by the class frequency optimization model. Let

d(c_r, j) be the assigned frequency of class c_r in the noisy partition where X_ko = x_kj,
s(c_r, L(Z, X_kt = x_kj)) be the frequency of class c_r in the partition of Z where X_kt = x_kj,
e(c_r, j) be the estimated frequency of class c_r in the noisy partition where X_ko = x_kj,
  e(c_r, j) = C * s(c_r, L(Z, X_kt = x_kj)) + W * ∑_{i≠j} s(c_r, L(Z, X_kt = x_ki)),
d(x_ki, r, j) be the assigned frequency of true state x_ki in the noisy partition where X_ko = x_kj and the class is c_r,
e(x_ki, r, j) be the estimated frequency of true state x_ki in the noisy partition where X_ko = x_kj and the class is c_r,
  e(x_ki, r, j) = C * s(c_r, L(Z, X_kt = x_ki)) if i = j,
  e(x_ki, r, j) = W * s(c_r, L(Z, X_kt = x_ki)) if i ≠ j.
The class frequency optimization model solves m optimization problems CF_r, r = 1, ..., m. In each problem, there are q decision variables, d(c_r, j), j = 1, ..., q.

Problem CF_r:
  Min ∑_{j=1}^{q} (d(c_r, j) − e(c_r, j))²                                          (17)
  s.t. d(c_r, j) ≥ 0 and integer, for all j = 1, ..., q
       ∑_{j=1}^{q} d(c_r, j) = ∑_{j=1}^{q} s(c_r, L(Z, X_kt = x_kj))                -- case conservation constraint

The true state frequency optimization model solves mq optimization problems TSF_rj, r = 1, ..., m, j = 1, ..., q (one for each combination of class and observed state). In each problem, there are q decision variables, d(x_ki, r, j), i = 1, ..., q.

Problem TSF_rj:
  Min ∑_{i=1}^{q} (d(x_ki, r, j) − e(x_ki, r, j))²                                  (18)
  s.t. d(x_ki, r, j) ≥ 0 and integer, for all i = 1, ..., q
       ∑_{i=1}^{q} d(x_ki, r, j) = d*(c_r, j)                                        -- case conservation constraint

where d*(c_r, j) is the optimal value of d(c_r, j) in CF_r.
The controlled scrambling procedure initially assigns class frequencies d(c_r, j) by rounding the estimated class frequencies to the nearest integer value. If the case conservation constraint is not satisfied for a set of assigned class frequencies, the assigned class frequencies are adjusted, in the order that minimizes the distance from the estimated class frequencies, until the constraint is satisfied. A similar procedure is used for the assigned true state frequencies. More precisely, the algorithm to compute the assigned class frequencies is shown in procedure Assign Class Frequencies.
Procedure Assign Class Frequencies
Input:
  Z: a partition
  X_k: branching input with q states
  C: noise level
Output:
  D: a set of assigned class frequencies for Z and X_k, D = {d(c_r, j) | d(c_r, j) is an assigned class frequency, r = 1, 2, ..., m; j = 1, 2, ..., q}
Procedure:
1. For each class c_r and each state x_kj of X_k, set d(c_r, j) = rounded(e(c_r, j)).
2. For each set of assigned frequencies of a class ({d(c_r, j) | j = 1, ..., q}), adjust the frequencies so that the case conservation constraint is satisfied. Stop when the case conservation constraints are satisfied for all sets of assigned class frequencies.
   2.1. If ∑_{j=1}^{q} d(c_r, j) > ∑_{j=1}^{q} s(c_r, L(Z, X_kt = x_kj)), then adjust by taking away from some d(c_r, j). Sort the d(c_r, j) in descending order of (d(c_r, j) − e(c_r, j)). Starting from the d(c_r, j) with the largest difference, set new d(c_r, j) = old d(c_r, j) − 1 until the case conservation constraint is satisfied.
   2.2. If ∑_{j=1}^{q} d(c_r, j) < ∑_{j=1}^{q} s(c_r, L(Z, X_kt = x_kj)), then adjust by adding to some d(c_r, j). Sort the d(c_r, j) in ascending order of (d(c_r, j) − e(c_r, j)). Starting from the d(c_r, j) with the smallest difference, set new d(c_r, j) = old d(c_r, j) + 1 until the case conservation constraint is satisfied.
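A compact Python rendering of the class frequency assignment (our sketch, not code from the paper; instead of a single sorted pass it re-selects the largest or smallest surplus at each step, which yields the same minimal-deviation assignment):

```python
def assign_class_frequencies(estimated, total):
    """estimated: estimated frequencies e(c_r, j) for one class c_r, j = 1..q.
    total: the true number of class-c_r cases, sum over j of s(c_r, L(Z, X_kt = x_kj)).
    Returns integer assigned frequencies d(c_r, j) that sum to total."""
    assigned = [max(0, round(e)) for e in estimated]          # step 1: round the estimates
    while sum(assigned) > total:                              # step 2.1: too many cases assigned
        # take one away from the assignment with the largest surplus (d - e)
        j = max(range(len(assigned)), key=lambda i: assigned[i] - estimated[i])
        assigned[j] -= 1
    while sum(assigned) < total:                              # step 2.2: too few cases assigned
        # add one to the assignment with the smallest surplus (d - e)
        j = min(range(len(assigned)), key=lambda i: assigned[i] - estimated[i])
        assigned[j] += 1
    return assigned

# For class c1 in Figure 2 (estimates 40.2, 13.6, 12.2; 66 true c1 cases):
print(assign_class_frequencies([40.2, 13.6, 12.2], 66))   # [40, 14, 12]
```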
The procedures to assign the class and true state frequencies are optimal and polynomial in the number of states and classes. The procedures start with the best assignment (the rounded estimated values) and adjust it until a feasible assignment is obtained. The adjustments are always the smallest deviations from the best assignment. Because there is no interaction among the decision variables, the resulting assignment minimizes the sum of squared deviations subject to the case conservation constraints. The worst case complexity of the class frequency assignment algorithm is O(mq log(q)) for each input because there are m lists of decision variables and each list must be sorted (q log(q)). The operations of computing the rounded estimates and adjusting the estimates can be performed in time linear in the number of states. Similarly, the worst case complexity of the true state frequency assignment algorithm is O(mq² log(q)) because there are mq decision variables. The total worst case complexity is the sum of the above worst case complexities because the two procedures are performed independently.

The controlled scrambling procedure uses these algorithms to assign class and true state frequencies. After the class and true state frequency assignments are made, the controlled scrambling procedure randomly selects sets of cases from the true partitions and assigns them to scrambled partitions to satisfy the class and true state frequencies. In contrast, the random scrambling procedure assigns individual cases to scrambled partitions without the constraints of the class and true state frequencies. Formally, the algorithm to introduce noise in a controlled manner is presented in procedure Controlled Scrambling.

Procedure Controlled Scrambling
Input:
  Z: a partition
  X_k: branching input with q states
Output:
  S: a 'scrambled' set of partitions of input X_k, S = {S_j | S_j is a partition of S}, j = 1, 2, ..., q
Procedure:
1. Perform procedure Assign Class Frequencies.
2. Perform procedure Assign True State Frequencies.
3. Let T be the set of true partitions resulting from splitting Z into q partitions on input X_k, T = {T_i | T_i is a partition of T where X_k = x_ki}.
4. For each j = 1, 2, ..., q do
   4.1. For each i = 1, 2, ..., q do
        4.1.1. For each r = 1, 2, ..., m do
               4.1.1.1. Define T_ir such that T_ir is the partition of T_i where the class is c_r.
               4.1.1.2. Let S_jr be d(x_ki, r, j) cases randomly selected without replacement from T_ir; set S_j := S_j ∪ S_jr.

The explicit algorithm ID3ecp uses the controlled scrambling procedure to introduce noise into partitions. First, the usual partitioning process of the ID3 algorithm is performed. Second, the partitions created in the first step are scrambled using the controlled scrambling procedure.
After all partitions of an input are scrambled, the normal formulas for the input selection, stopping rule, and classification function are used. Note that we retain the pessimistic pruning procedure in ID3ecp.
4. Comparison of Algorithm Performance

In this section, we describe simulation experiments that investigate the performance of the implicit (ID3p) and explicit (ID3ecp) algorithms in terms of average accuracy and coefficient of variation (CV) of accuracy. We describe the hypotheses, performance measures, data sets, methodology, and results.

4.1. Hypotheses

The hypotheses extend the analytical results of Section 3 to the performance of the implicit and explicit algorithms. Section 3 established several relationships between the implicit and explicit noise handling approaches: (i) the same expected behavior of the gain, (ii) lower variance in the class frequency of a partition for a constant noise process than for a binomial noise process, and (iii) a difference in variance that increases as the noise level increases. Because it is difficult to analytically link the performance of the algorithms to the noise handling approach, simulation experiments are necessary. We expect these three results to extend to the performance of the implicit and explicit algorithms as stated below.

Hypothesis 1: There is no difference in expected performance between the implicit algorithm ID3p and the explicit algorithm ID3ecp.

Hypothesis 2: The explicit algorithm (ID3ecp) has a smaller coefficient of variation in performance than the implicit algorithm (ID3p).

Hypothesis 3: The difference in the coefficient of variation in performance of the explicit algorithm (ID3ecp) minus the implicit algorithm (ID3p) increases as the noise level increases over a range of reasonable noise levels.
4.2. Performance Measurement

We use two measures of performance: classification accuracy, a traditional measure, and relative information score, a more refined measure. Classification accuracy (the ratio of correctly classified cases to total cases) does not account for the effects of the number of classes and the prior probabilities of each class. The Relative Information Score (RIS) [Kononenko and Bratko 1991] measures the percentage of the uncertainty of the data set that is explained by the learning algorithm. Thus, a high RIS value is preferred to a low value. In equation (19), RIS is computed as the amount (i.e., the number of bits) of uncertainty removed by the classification process (I_a) divided by the amount of uncertainty in the data set before classification (the entropy of the class distribution, E).

RIS = (I_a / E) * 100%    (19)

where I_a is the Average Information Score computed as

I_a = (1/T) ∑_{j=1}^{T} I_j

where T is the size of the test set and I_j is the information score of case j:

I_j = −log2 P(C)        if a correct classification is made
I_j = log2 (1 − P(C))   if an incorrect classification is made

where P(C) is the prior probability of class C (determined from the data set).
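The following Python fragment (our illustrative reimplementation of equation (19), not code from the paper; the example priors are hypothetical) computes RIS from a list of (true class, predicted class) pairs.

```python
import math

def relative_information_score(results, priors):
    """results: list of (true_class, predicted_class) pairs for the test set.
    priors: dict of prior probabilities P(C) estimated from the data set.
    Returns RIS as a percentage (equation (19))."""
    def info(true_class, predicted_class):
        p = priors[true_class]
        if predicted_class == true_class:
            return -math.log2(p)               # correct classification
        return math.log2(1.0 - p)              # incorrect classification
    i_a = sum(info(t, pred) for t, pred in results) / len(results)    # average information score I_a
    e = -sum(p * math.log2(p) for p in priors.values() if p > 0)      # entropy E of the class distribution
    return 100.0 * i_a / e

# Hypothetical two-class example with priors 0.66 / 0.34:
print(relative_information_score([("c1", "c1"), ("c2", "c1"), ("c2", "c2")],
                                 {"c1": 0.66, "c2": 0.34}))
```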
4.3. Data Sets

To provide a range of case distributions, we executed the experiments with five data sets, two real and three artificially generated. The Bankruptcy data set [Liang 1992] contains 50 cases, 8 inputs, and 2 equally distributed classes (bankrupt or healthy). The Lymphography data set [Murphy and Aha 1991] was developed through data collection at the University Medical Centre, Institute of Oncology in Ljubljana, Yugoslavia. To reduce the effects of spurious noise, we removed cases with missing values, removed redundant cases, and removed all but one among conflicting cases. After this cleansing activity, the Lymphography data set contains 148 cases distributed among 4 classes, where two classes are very sparse (2 and 4 cases, respectively). The Lymphography data set contains 18 inputs with an average of 3.3 states per input, where the number of states ranges from 2 to 8. The artificial data sets were generated by a program based on the specifications described in [Bisson 1991]. The data set generator can control the number of cases, classes, inputs, states per input, the distribution of cases among classes, and the complexity of the true rule sets for each class. Data set 1 contains 4 equally distributed classes with 10 inputs. Data set 2 contains 8 moderately skewed classes and 15 inputs. Data set 3 contains 12 highly skewed classes with 20 inputs. In data set 2, two classes have 50% of the cases and the remainder of the cases are uniformly distributed among the other 6 classes. In data set 3, 80% of the cases are distributed to 3 classes and the remainder are uniformly distributed to the other 9 classes. In the artificial data sets, the number of cases was 200, the average number of states per input was 3, and the average size of the true rule sets was 2 rules per class and 3 conjunctive terms per rule.

4.4. Experimental Design
Table 1 shows a 2 x 3 factorial design for the algorithm and noise level factors. (The algorithms and simulation experiments were implemented using Microsoft C on a 486 personal computer.) The numbers in the cells show the observations for each treatment. As shown in Table 1, we choose 3 levels of noise: high (C = 0.65), moderate (C = 0.80), and low (C = 0.95). The low level of noise is close to perfect measurement. The moderate level of noise causes a significant decrease in predictive performance. The high noise level causes a further significant decrease in predictive performance; further decline in predictive performance from higher levels of noise is not as significant. In separate experiments, the entire range of noise levels is studied to graphically depict the functional relationship between noise and the mean and variance of performance.

There are two experiments, corresponding to the dependent variables average RIS and coefficient of variation (CV) of RIS. The method used to estimate performance is the standard test-sample method [Breiman et al. 1984]. Each observation is the average performance over the same 30 splits of a data set, where the data set is divided roughly 70% for training and 30% for testing. Each cell of both experiments uses the same set of 100 noise perturbations. In a perturbation, the data set is randomly changed using the given C and W values. ID3p is given a perturbed training set while ID3ecp is given a clean training set. Both algorithms use the same test set and the same perturbed training set in the pessimistic pruning procedure. In the average experiment, all observations are used. In the CV experiment, each cell contains the same 30 random samples of size 30 from the 100 observations; a sample is a random selection of 30 perturbations, and each cell of the CV experiment contains the same 30 perturbations. An observation in the CV experiment is the CV computed from the given sample.
Table 1: Experimental Design

                         Noise Level (C)
Algorithm    L (.95)               M (.80)               H (.65)
ID3p         100 (AVG), 30 (CV)    100 (AVG), 30 (CV)    100 (AVG), 30 (CV)
ID3ecp       100 (AVG), 30 (CV)    100 (AVG), 30 (CV)    100 (AVG), 30 (CV)
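For reference, the observations used in the two experiments can be computed as in the sketch below (ours; the RIS values shown are hypothetical): an observation in the average experiment is the mean RIS over perturbations, and an observation in the CV experiment is the coefficient of variation of RIS within a sample of perturbations.

```python
import statistics

def average_and_cv(ris_scores):
    """ris_scores: RIS values obtained from repeated noise perturbations of one treatment cell."""
    mean = statistics.mean(ris_scores)
    cv = statistics.stdev(ris_scores) / mean    # coefficient of variation
    return mean, cv

# Hypothetical RIS values (in percent) for one treatment cell:
print(average_and_cv([41.2, 38.7, 44.1, 40.3, 39.5]))
```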
4.5. Results

Analysis of variance was used to determine whether there are performance differences between the algorithms across the 5 data sets. Tables 2 through 5 report the analysis of variance results for the Lymphography data set and data set 1 (GD1). The ANOVA tables show that the simulation results are consistent with the theoretical analysis in Section 3. The noise level (NL) affects both dependent variables (AVG RIS and CV RIS), but the algorithm (ALG) and the interaction term (NL*ALG) are not significant for AVG RIS (Tables 2 and 4). However, ALG and the interaction term are significant for CV RIS (Tables 3 and 5). Thus, the ANOVA results confirm Hypothesis 1 and support Hypotheses 2 and 3. Similar results were obtained for the other data sets (see Table 6). Tables 7 and 8 list mean CV differences between the algorithms (first order differences) at each noise level and mean differences between the algorithms at adjacent noise levels (second order differences). Because both the first order and second order differences are significant, Hypotheses 2 and 3 are confirmed.
Table 2: AVG Lymphography ANOVA Results
SOURCE    SS            df    MS            F             P-value
NL        73785.8149    2     36892.9074    1265.98614    6.368E-215
ALG       39.7113688    1     39.7113688    1.3627021     0.24353811
NL*ALG    31.9080444    2     15.9540222    0.54746488    0.57870587
WITHIN    17310.1319    594   29.1416361
TOTAL     91167.5661    599

Table 3: CV Lymphography ANOVA Results
SOURCE    SS            df    MS            F             P-value
NL        1.72273569    2     0.86136785    675.928939    9.1635E-83
ALG       0.52293785    1     0.52293785    410.357584    1.212E-47
NL*ALG    0.16348104    2     0.08174052    64.1430741    1.3533E-21
WITHIN    0.22173633    174   0.00127435
TOTAL     2.63089091    179

Table 4: AVG GD1 ANOVA Results
SOURCE    SS            df    MS            F             P-value
NL        230805.371    2     115402.685    7855.76366    0
ALG       0.74741892    1     0.74741892    0.05087877    0.82161873
NL*ALG    4.5854718     2     2.2927359     0.15607255    0.85553219
WITHIN    8725.97472    594   14.6901931
TOTAL     239536.678    599
Table 5: CV GD1 ANOVA Results
SOURCE    SS            df    MS            F             P-value
NL        0.26582049    2     0.13291025    520.574572    3.6729E-74
ALG       0.0891291     1     0.0891291     349.095311    1.9006E-43
NL*ALG    0.01144605    2     0.00572303    22.4155911    2.1791E-09
WITHIN    0.04442473    174   0.00025531
TOTAL     0.41082037    179
Table 6: P-Values for Other Data Sets
          Bankruptcy               GD2                       GD3
          AVG          CV          AVG       CV              AVG       CV
NL        1.5053E-62   5.7346E-56  0.0000    2.3041E-95      0.0000    4.6106E-76
ALG       0.6006       2.2906E-68  0.8216    9.0516E-64      0.4669    2.5253E-39
NL*ALG    0.9231       1.7721E-33  0.8555    2.3986E-22      0.0225    3.0017E-18
Table 7: Lymphography CV Mean Differences
            ID3p             ID3ecp           t-value          p-value
H           0.429047734      0.251359557      11.53771308      1.16936E-12
M           0.268555164      0.153436962      15.87069539      3.88396E-16
L           0.11611841       0.08552477       6.540714774      1.8278E-07
M-H         -0.160492571     -0.097922595     -3.638112396     0.000529124
L-M         -0.152436757     -0.067912194     -10.34340997     1.52555E-11

Table 8: GD1 CV Mean Differences
            ID3p             ID3ecp           t-value          p-value
H           0.1812388        0.1180832        11.717007        8.071E-13
M           0.1197556        0.0735932        16.073779        2.786E-16
L           0.0678872        0.043692         13.31519161      3.47781E-14
M-H         -0.061483251     -0.044490058     -2.464027269     0.009954073
L-M         -0.051868367     -0.029901399     -6.473220833     2.19337E-07
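Analyses of the kind reported in Tables 2 through 8 could be reproduced along the following lines. This is a minimal sketch assuming the per-cell observations sit in a pandas data frame with hypothetical columns cv_ris, noise, and alg; statsmodels and scipy are used purely for illustration and are not the tools of the original study, and because the paper does not state whether the mean-difference tests were paired, an independent-samples t-test is shown only as an example.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

def anova_table(df: pd.DataFrame) -> pd.DataFrame:
    """Two-way ANOVA of CV of RIS on noise level and algorithm, including the
    NL*ALG interaction, mirroring the layout of Tables 3 and 5."""
    model = smf.ols("cv_ris ~ C(noise) * C(alg)", data=df).fit()
    return sm.stats.anova_lm(model, typ=2)

def first_order_difference(df: pd.DataFrame, level: str):
    """Compare the two algorithms' CV values at one noise level, as in the
    H, M, and L rows of Tables 7 and 8."""
    id3p = df[(df.noise == level) & (df.alg == "ID3p")].cv_ris
    id3ecp = df[(df.noise == level) & (df.alg == "ID3ecp")].cv_ris
    return stats.ttest_ind(id3p, id3ecp)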
4.6. Discussion
To depict the magnitudes of the differences in performance, we generated simulation data to compare the performances graphically. In a simulation run, each observation was computed from the same 20 splits and 20 perturbations. In Figures 4 and 5, the average performance graphs for GD1 almost coincide, as expected. In Figures 6 and 7, the average performance graphs for the Lymphography data set are more erratic, as the graphs cross numerous times. This slightly erratic behavior is probably due to the increased level of residual variation in the Lymphography data set: its maximum RIS is slightly less than 50%, compared to more than 80% for GD1. The shape of the average performance graphs provides evidence to support Proposition 1 in Section 3.2. Note that average RIS and accuracy are minimized near C = 1/q and that the curves have a bowl-like shape. For GD1, the average number of states is just below 3. For the Lymphography data set, the average number of states is 3.3 and there is larger variation in the number of states (2 of the inputs have 8 states). Because ID3 favors inputs with many states, the inputs with 8 states probably appear on most paths in the decision tree. This explains why the Lymphography graphs reach their minimum below the point implied by the simple average number of states in the data set.
As for the CV of performance, Figures 4 through 7 show a strong separation between the implicit and explicit algorithms. In addition, the CV of performance increases as noise increases, and the difference between the explicit and implicit algorithms increases as the noise level increases. The shape of the curves and their extreme points are consistent with the theoretical graphs shown in Figure 3. There is larger separation between the explicit and implicit graphs for the Lymphography data set (Figures 6 and 7) than for GD1 (Figures 4 and 5). The difference at low levels of noise is obscured in Figures 6 and 7 because the scale is wider; however, the difference between the plotted points is larger for the Lymphography data set than for GD1 even at low levels of noise. Note also that the shapes of the CV graphs in Figures 6 and 7 differ because the CV RIS graph (Figure 6) is affected by some negative RIS values.
To probe the sensitivity of ID3ecp, we generated additional simulation data for GD1. Here, the true noise level (C) used in the test set and in the training set of ID3p differed from the false noise parameter (C') given to ID3ecp. In the constant error sensitivity runs, the false noise parameter C' was misstated by 0.10 (either all high or all low in a run) except near the end points (C = 0 or 1). In the varying error sensitivity runs, the mean of the false noise level C' was equal to the true noise level C, but variation within a range of plus or minus 0.10 or 0.05 of the true C value was introduced. Somewhat surprisingly, the average performance of the algorithms in each sensitivity run differs little from that in Figure 4. In Figure 8, the CV of performance continues to demonstrate a large advantage for ID3ecp for both constant error runs (high and low). In Figure 9, the CV advantage of ID3ecp has disappeared at high values of C in the ±0.10 case; in the ±0.05 case, however, there is still a marked advantage for ID3ecp. Figure 9 demonstrates that the CV of performance of ID3ecp is sensitive to the accuracy of the noise level assessment, relative to an implicit algorithm using a training set drawn from the same noise process as unseen cases. Thus, the stable behavior of explicit algorithms may deteriorate because of the variance caused by incorrect specification of the noise level. When the variance of the incorrect specification is high, the noise variance acting on implicit algorithms may be balanced by the specification variance acting on explicit algorithms.
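For concreteness, the misstated noise parameter C' used in the sensitivity runs could be generated roughly as follows. This is only an illustrative sketch: the paper does not state the exact distribution of the varying misstatement (a uniform draw is assumed here), and the treatment of the end points is approximated by clamping.

import random

def misstated_noise(C, mode, rng):
    """Return a false noise level C' for a sensitivity run (illustrative only).
    The true level C is still used for the test set and for ID3p's training set."""
    if mode == "constant_high":
        c_prime = C + 0.10
    elif mode == "constant_low":
        c_prime = C - 0.10
    elif mode == "varying_0.10":
        c_prime = C + rng.uniform(-0.10, 0.10)   # mean of C' equals the true C
    elif mode == "varying_0.05":
        c_prime = C + rng.uniform(-0.05, 0.05)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return min(1.0, max(0.0, c_prime))           # crude handling of the end points

rng = random.Random(42)
print([round(misstated_noise(0.8, "varying_0.05", rng), 3) for _ in range(5)])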
5. Summary and Conclusions
We compared decision tree induction algorithms under conditions of noisy input data where the level of noise was either implicitly known through a sample of the noise process or explicitly known through an external parameter. Explicit measurement of noise is cost effective when training data are provided directly by experts or when the cost of directly estimating the level of noise is less than the cost of sampling a representative noisy process. In addition, the appeal of explicit noise measurement broadens when there is an associated performance benefit. Here, we demonstrated that explicit noise measurement can be accompanied by more stable performance on unseen cases.
Our primary contributions were to develop an explicit noise handling algorithm (ID3ecp) and to demonstrate its advantages over a standard implicit noise algorithm (ID3 with the pessimistic pruning procedure). ID3ecp injects random noise into partitions but keeps the class and true state frequencies in a partition as close as possible to their estimated values. The aim of this controlled scrambling procedure is to reduce the variance in partitioning behavior. We demonstrated that the implicit and explicit algorithms have the same expected behavior and performance, but that the explicit algorithm has more stable behavior and performance. The behavioral results were demonstrated by an analytical comparison of a constant noise process (the best case for a controlled partitioning process) with a binomial noise process (implicit noise measurement). The performance results were demonstrated by simulation experiments comparing the average performance and the coefficient of variation of performance of the two algorithms.
This research is part of our long term interest in the economics of expert systems. Two direct extensions of this work are treating the noise level as a decision variable rather than a constraint and developing induction algorithms that combine the mean and variance of performance. In the former, explicit measurement of noise is required to trade off the cost of removing noise against the benefit of improved decision making. Other topics not directly related to this work are optimizing expert system performance over a multi-period horizon and developing cost-benefit objectives for other approaches such as Bayesian reasoning networks.
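To make the variance-reduction idea concrete, the toy comparison below contrasts the number of corrupted values produced by random (implicit) corruption, which is binomially distributed, with the essentially fixed count produced by a controlled (explicit) injection. It illustrates only the binomial-versus-constant comparison discussed above; it is not the actual ID3ecp scrambling procedure, which operates on class and true-state frequencies within a partition.

import random
import statistics

def implicit_corrupted_count(n, C, rng):
    """Random (implicit) corruption: each of n values is corrupted independently
    with probability 1 - C, so the count is Binomial(n, 1 - C)."""
    return sum(rng.random() > C for _ in range(n))

def explicit_corrupted_count(n, C):
    """Controlled (explicit) corruption: corrupt a fixed number of values so that
    observed frequencies stay as close as possible to their expected values."""
    return round(n * (1 - C))

rng = random.Random(1)
counts = [implicit_corrupted_count(200, 0.8, rng) for _ in range(1000)]
# Both approaches corrupt about n * (1 - C) = 40 values on average, but only the
# implicit counts vary (variance roughly n * C * (1 - C) = 32 here).
print(statistics.mean(counts), statistics.variance(counts), explicit_corrupted_count(200, 0.8))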
Appendix A: Convexity of Gain with Respect to Noise Level

Proposition 1

\[
\frac{\partial g(\tilde{X}_k^o \mid \tilde{Z}_C)}{\partial C}
\;\begin{cases}
\; > 0, & C \in (1/q,\, 1] & \text{(6.1)} \\
\; = 0, & C = 1/q & \text{(6.2)} \\
\; < 0, & C \in [0,\, 1/q) & \text{(6.3)}
\end{cases}
\]

We prove (6.2) and demonstrate the truth of (6.1) and (6.3) for a special case. Let us begin by showing the second condition, namely, that the slope is zero at C = 1/q. With some algebra, the first derivative of the gain function with respect to C can be derived as (A1):

\[
\frac{\partial g(X_k^o \mid Z_C)}{\partial C}
= \frac{1}{q-1} \sum_{r=1}^{m} \sum_{j=1}^{q} (q p_j p_{rj} - p_r)\,
\log_2 \frac{p_r + \alpha (q p_j p_{rj} - p_r)}{1 + \alpha (q p_j - 1)}
\tag{A1}
\]

where p_j = P(X_k^t = x_{kj} | Z_C), p_r = P(\psi = c_r | Z_C), p_{rj} = P(\psi = c_r | Z_C \wedge X_k^t = x_{kj}), and \alpha = 1 - Wq.

For C = 1/q, \alpha = 0. Substituting \alpha = 0 in (A1) we obtain

\[
\frac{\partial g(X_k^o \mid Z_C)}{\partial C}
= \frac{1}{q-1} \sum_{r=1}^{m} \log_2(p_r) \sum_{j=1}^{q} (q p_j p_{rj} - p_r) = 0,
\qquad \text{since} \quad \sum_{j=1}^{q} (q p_j p_{rj} - p_r) = q p_r - q p_r = 0.
\]

The second condition in the proposition is proved.
For (6.1) and (6.3), we show that the second derivative is greater than zero in a special case. Because W = (1 - C)/(q - 1) is linear in C, the second derivative of the gain with respect to W has the same sign as the second derivative with respect to C. With some algebra, the second derivative of the gain with respect to W can be derived as (A2). We must show that this second derivative is greater than 0, as stated in (A3).

\[
\frac{\partial^2 g(X_k^o \mid Z_C)}{\partial W^2}
= - \sum_{j=1}^{q} \sum_{r=1}^{m}
\frac{\delta_{rj}\,(p_r + q\delta_{rj})}{W p_r + (1 - Wq)\,\delta_{rj}}
+ \sum_{j=1}^{q}
\frac{\delta_j\,(1 - q\delta_j)}{W + (1 - Wq)\,\delta_j}
\tag{A2}
\]

where W = (1 - C)/(q - 1), \delta_{rj} = q p_j p_{rj} - p_r, and \delta_j = q p_j - 1.

\[
\sum_{j=1}^{q} \frac{\delta_j\,(1 - q\delta_j)}{W + (1 - Wq)\,\delta_j}
\;>\;
\sum_{j=1}^{q} \sum_{r=1}^{m}
\frac{\delta_{rj}\,(p_r + q\delta_{rj})}{W p_r + (1 - Wq)\,\delta_{rj}}
\tag{A3}
\]

Because \sum_r \delta_{rj} = \delta_j, we assume that one of the \delta_{rj} terms equals \delta_j and that the remaining m - 1 terms can be divided into pairs such that the members of each pair have the same absolute value and sum to zero. In addition, we assume that p_r (= \mu) is constant for all r. Using these assumptions and some algebra, the right hand side of (A3) can be reduced to a quantity less than (A4).

\[
\sum_{j=1}^{q} \frac{\delta_j\,(\mu - q\delta_j)}{W\mu + (1 - Wq)\,\delta_j}
\tag{A4}
\]

Substituting (A4) into (A3) and using some further algebra, (A3) can be reduced to (A5).

\[
\sum_{j=1}^{q} \frac{\delta_j^2}{D}
\;>\;
\sum_{j=1}^{q} \frac{\delta_j^2\,\mu}{D},
\qquad \text{where } D = \bigl(W\mu + (1 - Wq)\,\delta_j\bigr)\bigl(W + (1 - Wq)\,\delta_j\bigr)
\tag{A5}
\]

(A5) is true because \delta_j^2 > 0, \mu < 1, and D > 0. Since the slope is zero at C = 1/q and the gain is convex in C in this case, the slope is negative on [0, 1/q) and positive on (1/q, 1], which proves (6.1) and (6.3) for the special case stated above.
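The bowl shape asserted by Proposition 1 can also be checked numerically. The sketch below computes the information gain of a noisy attribute about the class for a small, arbitrary joint distribution and shows that the gain bottoms out at C = 1/q. The noise model (correct state with probability C, each incorrect state with probability W = (1 - C)/(q - 1)) follows the definitions above, while the particular joint distribution, the use of numpy, and the function name noisy_gain are illustrative assumptions only.

import numpy as np

def noisy_gain(joint, C):
    """Information gain of the observed (noisy) attribute about the class.
    joint[l, r] = P(true state x_l and class c_r).  Each true state is reported
    correctly with probability C and as each of the other q - 1 states with
    probability W = (1 - C) / (q - 1).  Illustrative check of Proposition 1 only."""
    q = joint.shape[0]
    W = (1.0 - C) / (q - 1)
    channel = np.full((q, q), W) + (C - W) * np.eye(q)   # P(observed = j | true = l)
    obs_joint = channel.T @ joint                        # P(observed = j and class = c_r)

    def entropy(p):
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    prior = entropy(joint.sum(axis=0))                   # H(class)
    conditional = sum(row.sum() * entropy(row / row.sum())
                      for row in obs_joint if row.sum() > 0)
    return prior - conditional                           # gain of the observed attribute

# The gain is smallest at C = 1/q and rises toward both C = 0 and C = 1.
rng = np.random.default_rng(0)
joint = rng.random((3, 2))
joint /= joint.sum()                                     # q = 3 states, m = 2 classes
for C in (0.0, 1 / 3, 0.65, 0.80, 0.95, 1.0):
    print(f"C = {C:.2f}  gain = {noisy_gain(joint, C):.4f}")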
Appendix B: Pessimistic Pruning Procedure

Appendix B describes the pruning procedure used in the algorithms ID3p and ID3ecp. This description has been adapted from [Quinlan 1987]. For any given tree T, generated using a training set of N cases, let some leaf in the tree account for K of these cases with J of them misclassified. The ratio J/K does not provide a reasonable estimate of the error rate when classifying unseen cases [Quinlan 1987]. A more reasonable estimate is obtained using the continuity correction factor for the binomial distribution, wherein J is replaced by J + 0.5 [Snedecor and Cochran 1980]. Let S be a subtree of T with L_S leaves, and let J_S and K_S be the corresponding sums of errors and cases classified over S. Using the continuity correction factor, the expected number of cases M_S misclassified by S out of K_S unseen cases should be

\[
M_S = J_S + 0.5\,L_S .
\]

The standard error of M_S, Se(M_S), is given by

\[
Se(M_S) = \sqrt{\frac{M_S\,(K_S - M_S)}{K_S}} .
\]

Let E be the number of cases misclassified out of K_S if the subtree S is replaced by its best leaf. The pessimistic pruning procedure replaces S by its best leaf if

\[
E + 0.5 \le M_S + Se(M_S) .
\]

In pessimistic pruning, all non-leaf subtrees are examined only once, and subtrees of pruned subtrees need not be examined at all.
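The pruning test itself is simple enough to state in a few lines of code. The following sketch implements only the decision rule given above (not the tree traversal); the function and argument names are illustrative rather than taken from any implementation of ID3p or ID3ecp.

import math

def should_prune(subtree_errors, subtree_cases, subtree_leaves, best_leaf_errors):
    """Pessimistic pruning test from Appendix B.
    subtree_errors = J_S, subtree_cases = K_S, subtree_leaves = L_S, and
    best_leaf_errors = E, the errors if subtree S is collapsed to its best leaf."""
    m_s = subtree_errors + 0.5 * subtree_leaves                  # M_S = J_S + 0.5 L_S
    se = math.sqrt(m_s * (subtree_cases - m_s) / subtree_cases)  # Se(M_S)
    return best_leaf_errors + 0.5 <= m_s + se                    # prune if E + 0.5 <= M_S + Se(M_S)

# Example: a subtree with 3 leaves covering 40 cases and 4 errors, whose best
# leaf alone would misclassify 6 cases, is pruned (6.5 <= 5.5 + 2.18).
print(should_prune(subtree_errors=4, subtree_cases=40, subtree_leaves=3, best_leaf_errors=6))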
References

Biemer, P. and Stokes, L. "Approaches to the Modeling of Measurement Error," in Measurement Errors in Surveys, Chapter 24, Biemer, P., Groves, R., Lyberg, L., Mathiowetz, N., and Sudman, S. (eds.), John Wiley & Sons, New York, 1991, pp. 487-516.

Bisson, H. "Evaluation of Learning Systems: An Artificial Data-Based Approach," in Proceedings of the European Working Session on Machine Learning, Y. Kodratoff (ed.), Springer-Verlag, Berlin, F.R.G., 1991.

Breiman, L., Friedman, J., Olshen, R., and Stone, C. Classification and Regression Trees, Wadsworth Publishing, Belmont, CA, 1984.

Cestnik, B. and Bratko, I. "On Estimating Probabilities in Tree Pruning," in Proceedings of the European Working Session on Machine Learning, Porto, Portugal, Springer-Verlag, March 1991, pp. 138-150.

Christie, A. "Induction of Decision Trees from Noisy Examples," AI Expert, May 1993, 16-21.

Creecy, R., Masand, B., Smith, S., and Waltz, D. "Trading MIPS and Memory for Knowledge Engineering," Communications of the ACM 35, 8 (August 1992), 48-64.

Fuller, W. "Regression Estimation in the Presence of Measurement Error," in Measurement Errors in Surveys, Chapter 30, Biemer, P., Groves, R., Lyberg, L., Mathiowetz, N., and Sudman, S. (eds.), John Wiley & Sons, New York, 1991, pp. 617-636.

Groves, R. "Measurement Error Across Disciplines," in Measurement Errors in Surveys, Chapter 1, Biemer, P., Groves, R., Lyberg, L., Mathiowetz, N., and Sudman, S. (eds.), John Wiley & Sons, New York, 1991, pp. 1-28.

Hill, D. "Interviewer, Respondent, and Regional Office Effects," in Measurement Errors in Surveys, Chapter 23, Biemer, P., Groves, R., Lyberg, L., Mathiowetz, N., and Sudman, S. (eds.), John Wiley & Sons, New York, 1991, pp. 463-486.

Holsapple, C. and Whinston, A. Business Expert Systems, Irwin, Homewood, IL, 1987.

Irani, K., Cheng, J., Fayyad, U., and Qian, Z. "Applying Machine Learning to Semiconductor Manufacturing," IEEE Expert 8, 1 (February 1993), 41-47.

Kononenko, I. and Bratko, I. "Information-Based Evaluation Criterion for Classifier's Performance," Machine Learning 6, 1991, 67-80.

Laird, P. Learning from Good and Bad Data, Kluwer Academic Publishers, Norwell, MA, 1988.

Liang, T. "A Composite Approach to Inducing Knowledge for Expert System Design," Management Science 38, 1 (1992), 1-17.

Mantaras, R. "A Distance-Based Attribute Selection Measure for Decision Tree Induction," Machine Learning 6, 1991, 81-92.

Mingers, J. "An Empirical Comparison of Pruning Methods for Decision Tree Induction," Machine Learning 4, 2, 1989, 227-243.

Mookerjee, V. and Dos Santos, B. "Inductive Expert System Design: Maximizing System Value," Information Systems Research (in press), 1993.

Moulet, M. "Using Accuracy in Scientific Discovery," in Proceedings of the European Working Session on Machine Learning, Porto, Portugal, Springer-Verlag, March 1991, pp. 118-136.

Murphy, P. and Aha, D. UCI Repository of Machine Learning Databases, University of California, Irvine, Department of Information and Computer Science, 1991.

Niblett, T. and Bratko, I. "Learning Decision Rules in Noisy Domains," in Research and Development in Expert Systems (Proceedings of the Sixth Technical Conference of the BCS Specialist Group on Expert Systems), Brighton, U.K., 1986.

Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann Publishers, San Mateo, CA, 1988.

Quinlan, J. "Induction of Decision Trees," Machine Learning 1, 1986, 81-106.

Quinlan, J. "The Effect of Noise on Concept Learning," in Machine Learning, Vol. 2, R. Michalski, J. Carbonell, and T. Mitchell (eds.), Tioga Press, Palo Alto, CA, 1986, pp. 149-166.

Quinlan, J. "Simplifying Decision Trees," International Journal of Man-Machine Studies 27, 1987, 221-234.

Rao, J. and Thomas, R. "Chi-Squared Tests with Complex Survey Data Subject to Misclassification Error," in Measurement Errors in Surveys, Chapter 31, Biemer, P., Groves, R., Lyberg, L., Mathiowetz, N., and Sudman, S. (eds.), John Wiley & Sons, New York, 1991, pp. 637-664.

Ross, S. Applied Probability Models, Holden-Day, San Francisco, CA, 1970.

Shannon, C. and Weaver, W. The Mathematical Theory of Communication, University of Illinois Press, 1949 (republished in 1964).

Snedecor, G. and Cochran, W. Statistical Methods, 7th edition, Iowa State University Press, 1980.

Tam, K. and Kiang, M. "Predicting Bank Failures: A Neural Network Approach," Applied Artificial Intelligence 4, 1990, 265-282.
[Figure 4: RIS Performance Graphs for Generated Data Set 1. Two panels, "Average Performance (RIS)" and "Variance of Performance (RIS)," plot Avg RIS and CV RIS against the noise level C (0 to 1) for ID3p and ID3ecp.]

[Figure 5: Accuracy Performance Graphs for Generated Data Set 1. Two panels, "Average Performance (Accuracy)" and "Variance of Performance (Accuracy)," plot Avg ACC and CV ACC against C for ID3p and ID3ecp.]

[Figure 6: RIS Performance Graphs for the Lymphography Data Set. Two panels, "Average Performance (RIS)" and "Variance of Performance (RIS)," plot Avg RIS and CV RIS against C for ID3p and ID3ecp.]

[Figure 7: Performance Graphs (Accuracy) for the Lymphography Data Set. Two panels, "Average Performance (Accuracy)" and "Variance of Performance (Accuracy)," plot Avg ACC and CV ACC against C for ID3p and ID3ecp.]

[Figure 8: Constant Sensitivity Graphs for Generated Data Set 1. Two panels plot CV RIS against C for ID3p and ID3ecp when C' is misstated by a constant -0.1 (Constant Low) and +0.1 (Constant High).]

[Figure 9: Varying Sensitivity Graphs for Generated Data Set 1. Two panels plot CV RIS against C for ID3p and ID3ecp when C' varies within [-0.1, +0.1] and within [-0.05, +0.05] of the true C.]