OPTIMIZING EXPERT SYSTEMS: HEURISTICS FOR EFFICIENTLY GENERATING LOW COST INFORMATION ACQUISITION STRATEGIES
Michael V. Mannino Graduate School of Business Administration, Campus Box 165, P. O. Box 173364 University of Colorado at Denver, Denver, CO 80217
[email protected]
Vijay S. Mookerjee Department of Management Science, Box 353200 University of Washington, Seattle, WA 98195
[email protected]
5 January, 2000
Abstract

We study the sequential information acquisition problem for rule-based expert systems: find the information acquisition strategy that minimizes the expected cost to operate the system while maintaining the same output decisions. This problem arises for rule-based expert systems when the cost or time to collect inputs is significant and the inputs are not known until the system operates. We develop several "optimistic" heuristics to generate information acquisition strategies and study their properties. The heuristics provide choices concerning precision (i.e., how optimistic) versus computational effort. The heuristics are embedded into an informed search algorithm (based on AO*) that produces an optimal strategy and a greedy search algorithm. The search strategies are designed for situations in which rules can overlap and the cost of collecting an input may depend on the set of inputs previously collected. We study the properties of these approaches and simulate their performance on a variety of synthetic expert systems. Our results indicate that the heuristics are very precise, leading to near optimal results for greedy search and moderate search effort for optimal search.
Rule-based expert systems have become an important decision-making tool in many organizations. Some of the benefits attributed to rule-based expert systems include increased quality, reduced decision-making time, and reduced downtime. Examples of successful rule-based expert systems are reported in many areas such as ticket auditing (see Smith and Scott[25]), risk analysis (see Newquist[18]), computer system design (see Barker and O'Connor[2]), and building construction (see Tiong and Koo[26]). Development of these systems has benefited from a large amount of research on technologies that can reduce the cost to construct and maintain the systems. As these technologies mature and lead to widespread deployment of expert systems, obtaining maximum value from the operation of expert systems becomes more important.

0.1 Motivation

Many expert systems may fail to deliver maximum value because designers lack tools to operate these systems in the most cost-efficient manner. Our research falls under a category of expert system design problems known as "cost minimization." The value of an expert system depends on the quality of decisions and the cost of reaching those decisions. In cost minimization, the objective is to minimize cost while holding decision quality constant. In contrast, "value maximization" problems manipulate both decision quality and cost. Value maximization problems are less common in the literature (see Mannino and Mookerjee[12]) because (i) it may be unacceptable to change the expert's decisions and (ii) it can be difficult to estimate the value of correct and incorrect decisions.

Cost minimization is an important objective in many expert systems. For example, in troubleshooting applications, the goal is to identify a fault and perform a repair to fix the fault. Because it is usually assumed that available tests can identify the fault, the objective is to identify
the fault with the least costly information (see Heckerman and Breese[10] and Pattipati and Alexandridis[20]). In many bureaucratic systems involving taxes, immigration, and insurance (see Schwayder, Kenney, and Ainslie[24]), decision rules are fixed. The objective is to provide the same recommendation using the least costly information. Even in situations where the decision rules are not fixed, input cost reduction applies when all decisions have the same value. (When all decisions have the same value and the value is known, however, the decision to stop collecting inputs still depends on that value.) Thus, reducing input costs is an important objective for a variety of expert systems.

0.2 Research Outline

Figure 1 depicts our approach to redesigning expert systems. Our approach starts with a rule-based expert system in which the rules provide a finite input-output mapping: (i) there are a finite number of joint input states, and (ii) for a given joint input state, it is clear whether or not a conclusion can be reached. The first step is to validate the rules for completeness, consistency, and redundancy. It is particularly important that rules are not overspecified. The second step involves elimination of intermediate outputs, conversion of the rule base into disjunctive normal form (a disjunction of conjunctions of literals), and construction of a decision table with the transformed rules. Each disjunction is transformed into a rule in a limited entry decision table. In the third step, the decision table is used as a set of constraints for generating an information acquisition strategy in the form of a decision tree. This last step is the subject of this research.

The main contribution of this research is the development of two heuristics that can be used to produce either an optimal decision tree or a good decision tree with low search effort. These heuristics provide tradeoffs between precision and computational effort. The Constrained
Expected Rule function is more precise (i.e., closer to the optimal expected cost) but difficult to compute, while the Expected Rule function is less precise but easier to compute. These heuristics support input cost dependencies and overlapping rules. Two rules overlap if their conclusions are identical and there exists a state that matches both rules. Dependencies or economies in input costs represent common fixed costs such as extracting blood from a patient, opening an engine cover, and executing a diagnostics program. Overlapping rules are common, especially after a rule base is analyzed for consistency and redundancy. We demonstrate that the heuristics are optimistic (underestimate the optimal solution cost) and rank them by their precision and computational effort.
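The overlap test just described can be sketched in code. This is a minimal illustration under an assumed rule encoding (each rule as a left-hand-side dict of input values plus a conclusion); the rule names and input names are hypothetical, not from the paper.

```python
# Sketch of the overlap test described above: two rules overlap if their
# conclusions are identical and some joint input state satisfies both
# left-hand sides, i.e., the sides assign no conflicting values.
# (Rule encoding and example rules are hypothetical.)
def overlap(rule1, rule2):
    lhs1, concl1 = rule1
    lhs2, concl2 = rule2
    if concl1 != concl2:
        return False
    shared = set(lhs1) & set(lhs2)
    return all(lhs1[i] == lhs2[i] for i in shared)

rule_a = ({'income': 'H'}, 'grant')
rule_b = ({'employed': 'Y', 'references': 'G'}, 'grant')
rule_c = ({'income': 'L', 'references': 'B'}, 'refuse')

print(overlap(rule_a, rule_b))  # True: a state with income H, employed, good refs matches both
print(overlap(rule_a, rule_c))  # False: different conclusions
```

Because the left-hand sides constrain disjoint or agreeing inputs, a joint state satisfying both exists exactly when no shared input is assigned conflicting values.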
[Figure 1: Steps of Expert System Redesign — Rule Base → Validate and Simplify → Convert to Decision Table → Generate Best Decision Tree → Decision Tree]

To demonstrate the performance of the heuristics, we empirically characterize the quality of greedy solutions and the size of problems for which an optimal strategy can be efficiently determined. The heuristics are embedded into an optimal search algorithm based on AO* and a greedy search algorithm. We extend these search algorithms with the "always necessary" pruning
technique that reduces the search space. Simulations using a variety of rule bases are conducted to study the effects of the heuristics, input cost variance, number of inputs, fixed to variable cost ratio, and probability assessments on the performance of both optimal and greedy search. The simulations provide evidence that the Expected Rule and Constrained Expected Rule functions are very precise, leading to low search effort with AO* and near optimal results with greedy search.

A novel aspect of this research is the use of a data set in the conversion of a decision table to a decision tree. Each record in a data set contains values for all the observable inputs. The conclusion reached for a particular joint input state is not required because the data set is not used for induction. To increase realism, the data generation technique used in this research captures input dependencies associated with an expert system's rules. Using a data set serves two purposes. First, a data set provides estimates of the probability of an input state, conditioned on the states of one or more other inputs. Estimation of such probabilities is necessary because the objective is to minimize the expected input cost incurred by the system to reach a recommendation. Second, a data set eliminates the need to assess rule probabilities. The probability of a rule is the probability that the rule will be used to provide a recommendation. Assessing rule probabilities is particularly difficult when the rules overlap. The above two reasons (i.e., the need to estimate conditional input probabilities and overlapping rule probabilities) make the availability of a data set a prerequisite for using our approach.

The remainder of this paper is organized as follows. In Section 1, we describe related work on decision table conversion and sequential decision making models for expert systems. In Section 2, we define the problem and present an example using a simple credit granting expert system.
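The first use of the data set, estimating conditional input-state probabilities, can be sketched as a relative-frequency calculation over records. The field names and records below are invented for illustration; they are not the paper's data.

```python
# Hypothetical sketch: estimating conditional input-state probabilities
# from a data set of records (one value per observable input).
# Field names and records are invented for illustration.
records = [
    {'income': 'H', 'employed': 'Y', 'references': 'G'},
    {'income': 'H', 'employed': 'N', 'references': 'G'},
    {'income': 'M', 'employed': 'Y', 'references': 'B'},
    {'income': 'M', 'employed': 'Y', 'references': 'G'},
]

def cond_prob(target, given, data):
    """Relative-frequency estimate of P(target | given); both arguments
    are dicts mapping input name -> state."""
    matches = [r for r in data if all(r[k] == v for k, v in given.items())]
    if not matches:
        return None  # the conditioning event has no support in the data
    hits = sum(all(r[k] == v for k, v in target.items()) for r in matches)
    return hits / len(matches)

print(cond_prob({'employed': 'Y'}, {'income': 'M'}, records))  # 1.0
print(cond_prob({'references': 'G'}, {}, records))             # 0.75
```

In practice such estimates would be computed over a much larger record set; the point here is only that no recorded conclusions are needed, just input values.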
In Section 3, we formulate the problem as an AND/OR tree search and describe the heuristics and their properties. In Section 4, we describe the simulation results using the heuristics with both optimal and greedy search. Section 5 summarizes the paper.
1. Related Work

Conversion of decision tables into decision trees has been studied by a number of researchers, primarily in the context of converting a functional specification into a computer program. A number of search procedures have been proposed including branch and bound (see Reinwald and Soland[22]), dynamic programming (see Schumacher and Sevcik[23]), AND/OR graph search (see Martelli and Montanari[13] and Pattipati and Alexandridis[20]), and greedy search (see Dos Santos and Mookerjee[7] and Ganapathy and Rajaraman[8]). Most of the previous work does not support cost dependencies and overlapping rules. In addition, previous work lacks empirical results on reasonable-size problems. The work by Ganapathy and Rajaraman[8] includes overlapping rules with two shortcomings: (i) only approximate solutions are proposed, and (ii) an exponential number of probability estimates are required. The previous work by Dos Santos and Mookerjee[7] does not include optimistic heuristics, informed optimal search algorithms, cost dependencies, overlapping rules, or detailed comparisons with reasonable-sized expert systems.

There is considerable research on economic design of expert systems using different knowledge sources and objectives (see Mannino and Mookerjee[12]). Moore and Whinston[17] developed an abstract model for information acquisition in sequential decision making systems that has inspired a number of other studies. Using a classified data set as a knowledge source (i.e., induction), algorithms that minimize costs are described by Mookerjee and Mannino[16] and Nunez[19]. Other algorithms have been developed by Mookerjee and Dos Santos[14], Turney[27], and Breiman et al.[4] to maximize value (tradeoff costs and benefits) using a classified data set.
Cost minimization is reported by Heckerman, Breese, and Rommelse[10] using a belief network as a knowledge source. Most of these models use greedy search because the stopping condition is not certain, as it is with a rule base.
2. Example and Problem Definition

In this section, we define the information acquisition problem and present an example to aid in visualizing the problem. This example is extended in Section 3 to depict the solution approaches.

2.1 Example

Consider the simple credit granting expert system in Table 1. Note that there are 4 inputs (income, education, employment, and references) that the user may have to provide and three recommendations (grant loan, refuse loan, and investigate further) that the system makes. The expert system can be transformed into a mixed entry decision table (Table 2) where the top rows indicate the inputs, the bottom rows indicate the recommendations, and the columns indicate the rules. An "*" indicates that the decision in that row is made if the inputs take the indicated values. A dash "-" indicates that the input is not needed for that rule. Hence the first rule means that if an applicant's income is high, a loan should be granted regardless of the values of the other inputs.

In the translation to a decision table, intermediate outputs are suppressed. Rules with intermediate outputs in the left hand side are eliminated. For example, Rule 1, containing "Sound-Financial-Status" in its left hand side, is eliminated. Intermediate outputs in the right hand side are changed to final outputs. For example, "Sound-Financial-Status" in Rule 5 is changed to Grant-Loan.

Our objective is to generate an optimal decision tree for a decision table. In a decision tree, the first input to collect is the root, the second input depends on the value of the first input,
and so on. Leaf nodes are recommendations made by the system. In a decision tree for the credit granting problem (Figure 2), the first input is income. If its value is high, then a decision can be made immediately. If the value is medium, then the education level of the applicant should be assessed, and so on.

Table 1: Knowledge Base for a Simple Credit-Granting System

Rule 1: If Sound-Financial-Status then Grant-Loan
Rule 2: If Future-Repayment-Potential then Investigate-Further
Rule 3: If Doubtful-Repayment-Potential then Investigate-Further
Rule 4: If Poor-Financial-Status then Refuse-Loan
Rule 5: If Income = 'H' then Sound-Financial-Status
Rule 6: If Income = 'M' and Employed and Education ≥ 'Bach' then Sound-Financial-Status
Rule 7: If Income = 'M' and not Employed and Education ≥ 'Bach' then Future-Repayment-Potential
Rule 8: If Income = 'M' and Employed and Education < 'Bach' then Doubtful-Repayment-Potential
Rule 9: If Income = 'L' and Education < 'Bach' and References = 'G' then Future-Repayment-Potential
Rule 10: If Income = 'L' and not Employed and References = 'G' then Future-Repayment-Potential
Rule 11: If Income = 'L' and References = 'B' then Poor-Financial-Status
Rule 12: If Income = 'M' and not Employed and Education < 'Bach' then Poor-Financial-Status
Rule 13: If References = 'G' and Employed and Education ≥ 'Bach' then Sound-Financial-Status
Table 2: Decision Table for the Credit-Granting System

Inputs           | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 | R9
Inc (I1)         | H  | -  | M  | M  | M  | L  | L  | L  | M
Edc ≥ Bach (I2)  | -  | Y  | N  | Y  | Y  | N  | -  | -  | N
Emp (I3)         | -  | Y  | Y  | N  | Y  | -  | N  | -  | N
Ref (I4)         | -  | G  | -  | -  | -  | G  | G  | B  | -
Decisions        |    |    |    |    |    |    |    |    |
Grant (D1)       | *  | *  |    |    | *  |    |    |    |
Refuse (D2)      |    |    |    |    |    |    |    | *  | *
Investigate (D3) |    |    | *  | *  |    | *  | *  |    |
[Figure 2: A Decision Tree for the Credit Granting System — the root tests income (I1): a high income leads immediately to Grant (D1); a medium income leads to tests on education (I2) and employment (I3); a low income leads to tests on references (I4) and further inputs.]

We seek to generate the decision tree that results in the smallest expected cost (time of operation, number of questions, cost of tests, ...). In some cases, determining the optimal tree is quite easy. For example, if the number of inputs is small, all possible decision trees can be enumerated and evaluated. If all inputs are necessary in every rule (no dashes in the table), all strategies are equally costly. In most cases, determining an optimal tree requires a careful search process using knowledge of the input costs and input state probabilities. Note that the optimal tree is constrained by its associated rule base in that the recommendations made by the decision
tree must match those of the associated rule base. For example, if income is high, both the expert system and decision tree should recommend granting the loan.

2.2 Problem Extensions

We describe two extensions to the expert system redesign problem. The cost model extension supports a more general model of costs that is common in practice. The extension to rule base characteristics supports current development practices.

2.2.1 Cost Modeling Extensions

We do not require that input costs are independent. The cost of collecting two inputs (say A and B) can be less than the sum of the individual input costs, but must be at least the maximum of the individual costs:

Cost(A, B) ≤ Cost(A) + Cost(B)
Cost(A, B) ≥ max(Cost(A), Cost(B))

The first inequality reflects economies due to jointly collecting A and B. The second inequality asserts that the cost of A or B alone is not larger than the joint cost. Economy in joint costs represents a fixed cost that is common to a number of inputs such as opening an engine block or taking a blood sample. Fixed costs appear frequently in expert systems used in medicine, aircraft maintenance, semiconductor manufacturing, and loan processing.

Fixed costs are especially common when test results are delayed (i.e., not immediately available) (see Turney[27]). For example, a physician may send a patient to a lab for blood work, and results may not be available for some time. In Turney[27], all tests in a delayed test group must be performed together. We relax this requirement by allowing the fixed cost to change. For example, if three tests can be performed on the same blood sample, there is a fixed
cost to perform the tests. If only two tests are performed, the same fixed cost applies. If later in the diagnosis (after other intervening tests) the third test is conducted, a higher fixed cost can be imposed. The higher cost might represent the additional cost of a return visit.

The delayed test group idea is a special kind of order dependency. A more general form is one where the cost of a group of inputs could change with every permutation of the inputs. However, we have found no evidence that such forms of order dependency exist in practice. Another form of order dependency is ordering constraints. For example, an ordering constraint may require that input A is collected before input B. Ordering constraints can be specified in our model.

2.2.2 Rule Base Extensions

We extend the characteristics of the rule base to support overlapping rules, as these occur in most rule bases. For example, Rules 1 and 2 in Table 2 are both satisfied when income is high, education is at least a bachelor's degree, and the applicant is employed with good references. Overlapping rules can lead to computational complexities when assessing rule probabilities. Our solution technique in Section 3.4 provides an efficient way to handle overlapping rules.

In addition to the overlapping extension, we make an important assumption about the rule base. We assume that the rules are not overspecified. This assumption is reasonable because expert system design emphasizes parsimony, and there are a number of tools to support this objective (see Coenen and Bench-Capon[6] for a review of tools). By separating rule design (finding the most parsimonious rules) from rule implementation (finding the best decision tree), we obtain a search space that is more economical to search. Without this assumption, considerable search effort is needed to simplify the rules.
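The cost model of Section 2.2.1 can be sketched as variable costs plus fixed costs shared by groups of inputs. The input names and cost figures below are illustrative assumptions (they match the worked example used later in the paper), not a prescribed implementation.

```python
# Minimal sketch of a joint cost function JC with economies from shared
# fixed costs (input names and cost figures are illustrative).
VAR_COST = {'I1': 10, 'I2': 5, 'I3': 2, 'I4': 7}
FIXED_COST = [({'I1', 'I3'}, 10), ({'I2', 'I4'}, 5)]  # group -> shared fixed cost

def jc(inputs):
    """Variable cost of each input, plus each fixed cost whose group is
    touched by the set of inputs (paid once per group)."""
    inputs = set(inputs)
    variable = sum(VAR_COST[i] for i in inputs)
    fixed = sum(f for group, f in FIXED_COST if group & inputs)
    return variable + fixed

# The two inequalities from the text hold for this construction:
a, b = {'I1'}, {'I3'}
print(jc(a | b))          # 22: the shared fixed cost of 10 is paid once
print(jc(a) + jc(b))      # 32: the fixed cost is paid twice when costs are summed
print(max(jc(a), jc(b)))  # 20
```

Paying each fixed cost at most once guarantees subadditivity, while the nonnegative variable costs guarantee the joint cost is at least the cost of either input alone.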
2.3 Problem Definition

We now present a formal definition of the information acquisition strategy problem informally presented in Section 2.1. We begin with a set of definitions followed by the objective and constraints of the model.

Let:
- X be the set of inputs (X_i is an input, i = 1, ..., p)
- x_ij be an abbreviation for X_i = j (i.e., state j of input X_i, where j = 1, ..., q)
- D be a set of decisions
- R be a set of rules (R_k is a rule, k = 1, ..., r); R_k is a pair (x, d), where x is a set of x_ij's (the left hand side) and d ∈ D (the right hand side)
- R_k↑x be the left hand side component of rule R_k
- R_k↑d be the right hand side component of rule R_k
- Γ(x) be the set of inputs in the set of attribute-value pairs x. Example: Γ({x_1T, x_4F, x_3T}) = {X_1, X_4, X_3}
- JC(Y) be the joint cost of acquiring the set of inputs Y, Y ⊆ X
- T be a decision strategy; T is a tree where inputs label the non-leaf nodes, input states label the arcs, and decisions label the leaf nodes
- Paths(T) be a function that provides the paths of tree T
- P be a set of paths (P_i is a path, i = 1, 2, ..., π); P_i is a path in T. A path is a pair (x̄, d) where x̄ is a list of x_ij's and d ∈ D
- P_i↑x be the left hand side component of path P_i
- P_i↑d be the right hand side component of path P_i
- Set(x̄) be a function that converts a list into a set
- Root(S) be a function that returns the root of tree S
- ST(S, x̄) be a function that returns the entire subtree of tree S below x̄
- append(x_ij, x̄) be a function that concatenates x_ij to the end of x̄
- IC(x̄, r) be the incremental cost of acquiring input r after acquiring the inputs in Set(x̄):
  IC(x̄, r) = JC(Γ(Set(x̄)) ∪ {r}) − JC(Γ(Set(x̄)))
- EC(S, x̄) be the expected cost of subtree ST(S, x̄):
  EC(S, x̄) = IC(x̄, r) + Σ_{j∈J} P(r_j | x̄) EC(ST(S, append(r_j, x̄)))
  where r = Root(ST(S, x̄)), r_j denotes state j of input r, and J is the set of states in which r can be observed;
  EC(S, x̄) = 0 if ST(S, x̄) is a leaf node.

To simplify the specification, we assume that JC is the minimum cost (fixed plus variable) of acquiring the set of inputs Y. To extend JC to accommodate delayed test groups, Y will be a list.

The objective of the model is to determine the decision tree with minimum expected cost. The consistency and completeness constraints ensure that the decision tree and rule base provide the same decisions for any set of problems.

Objective:

Min_T [EC(T, φ)], where φ is the null path.

Constraints:

1. Completeness: Every rule of R is contained in at least one path of T.
   ∀R_k ∈ R, ∃P_j ∈ Paths(T) such that (R_k↑x ⊆ Set(P_j↑x)) ∧ (R_k↑d = P_j↑d)

2. Consistency: Every path of T contains at least one rule of R.
   ∀P_j ∈ Paths(T), ∃R_k ∈ R such that (Set(P_j↑x) ⊇ R_k↑x) ∧ (R_k↑d = P_j↑d)

3. Input Ordering: Input X_i cannot be acquired unless its set of predecessor inputs PRE(X_i) has been acquired.
   ∀X_i, ∀P_j ∈ Paths(T), if X_i ∈ Set(P_j↑x) then PAR(X_i, P_j↑x) ⊇ PRE(X_i)
   where PAR(element, list) returns the elements in list up to element.

Problem Difficulty

The information acquisition problem is difficult. Hyafil and Rivest[11] proved that constructing an optimal binary decision tree is an NP-complete problem. In the above model, the number of choices for T grows exponentially with the number of inputs. For the first input, there are p choices to evaluate. For the second input, there are (p−1)q choices for each of the p choices. For the ith input, there are p(p−1)(p−2)...(p−i+1)q^(i−1) choices. The sum of the choices is larger than the binomial series (1+q)^p. This is an upper bound on the number of choices because a problem's constraints (the rule set) have not been taken into account. The only restriction is that inputs are not repeated on a path of a tree.
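The expected cost recursion EC can be sketched as follows. The two-input strategy, costs, and probabilities below are hypothetical, and input states are treated as independent only to keep the illustration short; the real model conditions each probability on the states already observed.

```python
# Sketch of the expected cost recursion EC(S, x): the incremental cost of
# the root input plus the probability-weighted cost of each subtree.
# Inputs, costs, probabilities, and the tree itself are hypothetical.
VAR_COST = {'A': 4.0, 'B': 3.0}

def jc(inputs):
    # Joint cost with no fixed-cost economies, for simplicity.
    return sum(VAR_COST[i] for i in inputs)

# Internal nodes are (input, {state: subtree}); leaves are decisions.
TREE = ('A', {'T': 'approve',
              'F': ('B', {'T': 'approve', 'F': 'reject'})})
PROB = {'A': {'T': 0.5, 'F': 0.5}, 'B': {'T': 0.5, 'F': 0.5}}

def ec(tree, collected=frozenset()):
    if not isinstance(tree, tuple):
        return 0.0                                  # leaf: a decision, no further cost
    root, branches = tree
    inc = jc(collected | {root}) - jc(collected)    # incremental cost IC
    return inc + sum(PROB[root][state] * ec(sub, collected | {root})
                     for state, sub in branches.items())

print(ec(TREE))  # 4 + 0.5*0 + 0.5*3 = 5.5
```

Passing the set of already-collected inputs down the recursion is what lets IC capture the fixed-cost economies when a richer jc is substituted.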
Assumptions

We make several assumptions about the set of rules R. These assumptions ensure that the rule base is valid and the solution to the problem is optimal.

1. The set of rules is internally consistent: if for any state vector z (a set of input-value pairs) there are multiple rules that satisfy the vector (that is, R_k↑x ⊆ z), then these rules have the same right hand side.

2. The set of rules is fully compressed: for any state vector matching a set of rules R′ (R′ ⊆ R), there does not exist a rule that (1) satisfies the state vector, (2) contains a subset of inputs of any rule in R′, and (3) can be inferred from R′. In other words, R contains all minimal rules to solve every problem.

3. The proposed model is appropriate for data directed queries in which the user wants to match against all rules. For goal directed queries, the rules for the goal and non goal must meet assumption 2 above.
3. Solution Methods

In this section, we formulate the generation of an optimal decision tree as a search problem using the AO* algorithm (see Pearl[21]). Following the requirements of the AO* algorithm, we develop several heuristics and discuss their properties.

3.1 Search Space Representation and Search Algorithm

The generation of an optimal decision tree can be visualized as a search through a space of AND/OR trees. Figure 3 displays a partial AND/OR tree for the credit granting problem. There are three kinds of nodes in Figure 3: i) the start node, ii) input nodes such as I1 and I2, and iii) value nodes such as H, M, and L beneath I1. Although not shown in Figure 3, goal nodes such as Grant
placed beneath the value node H are another kind of node. The OR nodes are either the start node or value nodes. The AND nodes, depicted as rectangles with thick lines below, are the input nodes. A decision tree is a hyper path beginning with start and ending with the set of goal nodes. On a hyper path, all children of each AND node are included. A valid decision tree provides the same recommendations as its underlying set of rules as defined in Section 2.

In some previous work (see Martelli and Montanari[13]), the search space is represented as an AND/OR graph rather than an AND/OR tree. The AND/OR graph representation supports jointly simplifying the rule base and finding the least cost decision tree. Given the assumption of fully compressed rule bases, our representation only supports finding the least cost decision tree. With some minor modifications, our search algorithm can also simplify the rule base. However, our heuristics will not be strictly optimistic. In most practical cases, we feel that the effect on finding the optimal solution would be insignificant. The disadvantage of the AND/OR graph representation is that the search space is usually much larger.

[Figure 3: Partial AND/OR Tree for the Credit Granting System — beneath the start node are input nodes (e.g., I1); beneath each input node are its value nodes (e.g., H, M, and L beneath I1); and beneath each value node are the input nodes that may be acquired next (e.g., I2, I3, and I4).]
We use the AO* algorithm to search the AND/OR space depicted in Figure 3. AO* is an "informed" optimal search algorithm that is guided by a heuristic function f that estimates the cost of the best solution at a given node. AO* is guaranteed to be optimal (see Bagchi and Mahanti[1] and Chakrabarti, Ghose, and DeSarkar[5]) as long as the heuristic f is admissible. A sufficient condition for admissibility is that the heuristic f is optimistic: it underestimates the expected cost of the optimal solution at every node. This optimistic property allows pruning. The more closely that f approximates the optimal expected cost, the greater the pruning ability of AO*. If f exactly estimates the optimal cost, only nodes on optimal hyper paths will be expanded by AO*.

3.2 Optimistic Heuristics

In our problem, the heuristic f is an additive function consisting of two parts. The current cost function g(N) computes the cost from start to the current node N under consideration. The look-ahead function h(N) optimistically estimates the optimal expected cost of the remaining subtree. As a path is extended, the h values are revised to become closer to the optimal expected cost.

Before presenting look-ahead functions, some further notation is introduced. Let:
- N_ij be an OR node labeled as x_ij
- N_i* be an AND node labeled as input X_i (the subscript "*" represents all states of the input X_i)
- N_G be a goal node
- Π(N) be the set of OR nodes in the path from start to node N (N can be an OR or AND node), excluding start
- Φ(N) be the set of AND nodes in the path from start to AND node N
Using the above notation, the f and g functions are defined as follows:

f(N_i*) = g(N_i*) + h(N_i*), where N_i* is an AND node   (1)

f(N_ij) = g(N_i*) + h(N_ij), where N_ij is an OR node   (2)

g(N_i*) = JC(Γ(Π(N_i*)) ∪ {X_i})   (3)
We define two h functions for our search problem: i) the Expected Rule function (h2) and ii) the Constrained Expected Rule function (h3). The Expected Rule function is described here while the Constrained Expected Rule function is defined in the following subsection. In the empirical testing of these functions, we add a third look-ahead function, namely, the Min Rule function (h1), for comparison purposes. The Min Rule function is easy to compute, but does not approximate the remaining optimal cost precisely.

The Expected Rule function approximates the remaining optimal cost as the expected cost of the remaining rules in a partition. To ensure an optimistic estimate, the Expected Rule function uses the minimum cost rule matching each state vector and ignores sequential constraints that would occur in a tree. We prove that the Expected Rule function is optimistic in Appendix 1.

Let:
- κ(N_i*) be a function that provides the set of inputs remaining at node N_i*: κ(N_i*) = X − Φ(N_i*)
- Z(y) be a function that provides the set of input state vectors that can be generated from the set of inputs y; z ∈ Z(y) is a particular input state vector
- ξ(z) be a function that provides the set of rules in R that are consistent with the input state vector z:
  ξ(z) = {R_k ∈ R | ¬∃ z_ij ∈ z such that match(z_ij, R_k) = false}, where
  match(z_ij, R_k) = true, if the input of z_ij is not in Γ(R_k↑x)
                   = true, if z_ij ∈ (R_k↑x)
                   = false, otherwise

h2(N_ij) = Σ_{z ∈ Z(κ(N_i*))} P(z | Π(N_ij)) [C(z, N_ij) − JC(Γ(Π(N_ij)))]   (4)

h2(N_i*) = Σ_{j=1}^{q} P(x_ij | Π(N_ij)) h2(N_ij)   (5)

C(z, N_ij) = Min_{R_k ∈ ξ(Π(N_ij) ∪ z)} [JC(Γ(R_k↑x) ∪ Γ(Π(N_ij)))]   (6)
We depict the h2, g, and f functions by continuing with the credit granting example. Assume the following input acquisition costs:

Variable costs: I1 = 10, I2 = 5, I3 = 2, I4 = 7
Fixed costs: {I1, I3} = 10, {I2, I4} = 5

The joint costs are then the sum of the fixed and variable costs. For example, JC(I1, I3) = 22, JC(I2, I4) = 17, JC(I1, I2, I3) = 32, and JC(I2, I3, I4) = 29.

Let us evaluate the h2, g, and f functions for I1 at start. The function g(I1) evaluates to JC(I1) = 20 because there are no other inputs above I1. To compute the look-ahead function h2(I1), we need the state probabilities of I1, the input state vectors for the remaining inputs I2, I3, and I4, and the mapping between the state vectors and rules. We use probabilities of 0.6, 0.3, and 0.1 for states "H", "M", and "L", respectively. Table 3 shows the input state vectors from the function Z(κ(I1*)) that need to be enumerated in the h2 function. Table 4 shows the incremental cost of the minimum matching rules for the "M" and "L" states of I1. We do not show the minimum rules for I1H because a decision can be reached with no additional inputs.
Table 3: Input State Vectors Computed by Z(κ(I1*))

Index | I2 | I3 | I4
1     | Y  | Y  | G
2     | Y  | Y  | B
3     | Y  | N  | G
4     | Y  | N  | B
5     | N  | Y  | G
6     | N  | Y  | B
7     | N  | N  | G
8     | N  | N  | B

Table 4: Minimum Rules for State Vectors *

      State Vectors
Node | 1       | 2       | 3       | 4       | 5       | 6       | 7       | 8
I1M  | R5 (12) | R5 (12) | R4 (12) | R4 (12) | R3 (12) | R3 (12) | R9 (12) | R9 (12)
I1L  | R2 (19) | R8 (12) | R7 (14) | R8 (12) | R6 (17) | R8 (12) | R7 (14) | R8 (12)

* The parenthesized numbers are computed by the expression C(z, I1j) − JC(Γ(Π(I1j))) using the inputs of the noted rule.

With this background, the h2(I1) function is evaluated as shown below. For this example, we assume that input state vectors are equally likely. However, the expressions for the AO* heuristics do not make this assumption.
h2(I1*) = Σ_{j ∈ {H, M, L}} P(I1j) · h2(I1j)

h2(I1H) = Σ_{z ∈ Z(κ(I1*))} P(z | Π(I1H)) [C(z, I1H) − JC(Γ(Π(I1H)))]

Thus, h2(I1H) = 0 because R1, with an incremental cost of 0, is consistent with all state vectors.

h2(I1M) = 12 (all minimum rules in Table 4 have an incremental cost of 12)

h2(I1L) = (1/8)(19 + 12 + 14 + 12 + 17 + 12 + 14 + 12) = 14

h2(I1*) = 0.6·0 + 0.3·12 + 0.1·14 = 5.00
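The h2 calculation above can be replayed in code. The rule table below is our reconstruction of Table 2 (an assumption, since the printed table is only partially legible), the costs are the example's, and the remaining state vectors are taken as equally likely, as in the example.

```python
from itertools import product

# Replaying the h2 example. VAR/FIXED are the example's input costs; RULES
# is our reconstruction of Table 2 (an assumption), mapping each rule to
# the input values on its left hand side.
VAR = {'I1': 10, 'I2': 5, 'I3': 2, 'I4': 7}
FIXED = [({'I1', 'I3'}, 10), ({'I2', 'I4'}, 5)]
DOMAIN = {'I2': 'YN', 'I3': 'YN', 'I4': 'GB'}
RULES = [
    {'I1': 'H'},                        # R1 -> Grant
    {'I2': 'Y', 'I3': 'Y', 'I4': 'G'},  # R2 -> Grant
    {'I1': 'M', 'I2': 'N', 'I3': 'Y'},  # R3 -> Investigate
    {'I1': 'M', 'I2': 'Y', 'I3': 'N'},  # R4 -> Investigate
    {'I1': 'M', 'I2': 'Y', 'I3': 'Y'},  # R5 -> Grant
    {'I1': 'L', 'I2': 'N', 'I4': 'G'},  # R6 -> Investigate
    {'I1': 'L', 'I3': 'N', 'I4': 'G'},  # R7 -> Investigate
    {'I1': 'L', 'I4': 'B'},             # R8 -> Refuse
    {'I1': 'M', 'I2': 'N', 'I3': 'N'},  # R9 -> Refuse
]

def jc(inputs):
    inputs = set(inputs)
    return (sum(VAR[i] for i in inputs)
            + sum(f for group, f in FIXED if group & inputs))

def h2_or(collected):
    """h2 at an OR node whose collected input values are `collected`
    (a dict); remaining state vectors are equally likely."""
    remaining = [i for i in VAR if i not in collected]
    base = jc(collected)
    total = 0.0
    for combo in product(*(DOMAIN[i] for i in remaining)):
        z = dict(collected, **dict(zip(remaining, combo)))
        # incremental cost of the cheapest rule consistent with z
        total += min(jc(set(rule) | set(collected)) - base
                     for rule in RULES
                     if all(z[i] == v for i, v in rule.items()))
    return total / 2 ** len(remaining)

h2_H, h2_M, h2_L = (h2_or({'I1': s}) for s in 'HML')
h2_root = 0.6 * h2_H + 0.3 * h2_M + 0.1 * h2_L
print(h2_H, h2_M, h2_L)  # 0.0 12.0 14.0
```

Under this reconstruction the computed values reproduce the worked example: h2(I1H) = 0, h2(I1M) = 12, h2(I1L) = 14, and h2(I1*) = 5.0.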
3.3. Loss Considerations

The Expected Rule heuristic is overly optimistic because it does not account for the potential loss due to sequential acquisition of inputs. By counting only the inputs needed to satisfy the cheapest rules, the Expected Rule heuristic ignores extra cost, or loss, that is included in the optimal cost. In this subsection, we revise the Expected Rule function with a sequential acquisition constraint to account for some of this loss. In addition, we use the loss notion to develop a pruning technique that can significantly reduce the search space.

3.3.1 Constrained Expected Rule Function

We revise the Expected Rule function by constraining acquisition to occur through one input. Instead of computing the cost of the cheapest rule for each state vector, we compute the cost of each state vector constrained through each remaining input. The loss is implicit in the constrained cost calculation. In the formal definitions of the Constrained Expected Rule function (h3), CC is the constrained cost given an input, SCC is the constrained cost given an input and state vector, and RSCC is the constrained cost given an input, state vector, and rule.

h3(Ni*) = Σ_{j=1}^{q} P(xij | Π(Nij)) * h3(Nij)    (7)

h3(Nij) = Min_{r ∈ κ(Ni*)} [CC(r, Nij)]    (8)

CC(r, Nij) = Σ_{z ∈ Z(κ(Ni*))} P(z | Π(Nij)) SCC(r, Nij, z)    (9)

SCC(r, Nij, z) = Min_{Rk ∈ ξ(Π(Nij) ∪ z)} [RSCC(Rk, r, Nij, z)]    (10)

RSCC(Rk, r, Nij, z) = 0 if C(z, Nij) − JC(Γ(Π(Nij))) = 0
                    = JC(Γ(Rk↑x) ∪ r ∪ Γ(Π(Nij))) − JC(Γ(Π(Nij))) otherwise    (11)
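To make the constrained cost calculation concrete, here is a hedged Python sketch of RSCC and SCC under simplifying assumptions not made in the paper: rules are plain sets of inputs, costs are additive per input (no joint cost structure), and already acquired inputs contribute zero incremental cost. All names and data are illustrative.

```python
def rscc(rule, r, acquired, cost):
    """Cost to finish satisfying `rule` when input `r` must be acquired next
    (additive per-input costs; acquired inputs are free)."""
    remaining = (set(rule) | {r}) - set(acquired)
    return sum(cost[i] for i in remaining)

def scc(consistent_rules, r, acquired, cost):
    """Cheapest constrained cost over the rules consistent with a case."""
    return min(rscc(rk, r, acquired, cost) for rk in consistent_rules)

cost = {"a": 5, "b": 3, "c": 2}
rules = [{"a", "b"}, {"a", "c"}]       # both assumed consistent with the case
print(scc(rules, "b", {"a"}, cost))    # constraining through b: prints 3
print(scc(rules, "c", {"a"}, cost))    # constraining through c: prints 2
```

Constraining through an input that the cheapest rule does not use (here, b for the rule {a, c}) adds that input's cost on top, which is exactly the loss the revised heuristic captures.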
We depict the Constrained Expected Rule function by continuing with the example from Section 3.2. There is no revision to h2(I1H) because rule R1, with zero loss, is consistent with every state vector. Table 5 shows the constrained costs (RSCC) that underlie the calculations of h3(I1M) and h3(I1L). The final cost does not change because there are always necessary inputs (I2 and I3 in node I1M and I4 in node I1L) in the best rule for each state. Without such always necessary inputs, the Constrained Expected Rule function would have been larger than the Expected Rule function.

Table 5: Constrained Costs Given an Input, State Vector, and Rule*

         Constraining Input
         I2                  I3                  I4
Vector   I1M       I1L       I1M       I1L       I1M       I1L
1        R5 (12)   R2 (19)   R5 (12)   R2 (19)   R5 (19)   R2 (19)
2        R5 (12)   R8 (17)   R5 (12)   R8 (14)   R5 (19)   R8 (12)
3        R4 (12)   R7 (19)   R4 (12)   R7 (14)   R4 (19)   R7 (14)
4        R4 (12)   R8 (17)   R4 (12)   R8 (14)   R4 (19)   R8 (12)
5        R3 (12)   R6 (17)   R3 (12)   R6 (19)   R3 (19)   R6 (17)
6        R3 (12)   R8 (17)   R3 (12)   R8 (14)   R3 (19)   R8 (12)
7        R9 (12)   R6 (17)   R9 (12)   R7 (14)   R9 (19)   R7 (14)
8        R9 (12)   R8 (17)   R9 (12)   R8 (14)   R9 (19)   R8 (12)

* The parenthesized numbers are computed by the function RSCC(Rk, r, Nij, z) where Rk is the cheapest rule (internal cell) for remaining input r (column), the sub-columns (I1M and I1L) are the nodes (Nij), and the state vector z is the row.

3.3.2 Always Necessary Pruning

A small extension of the example depicting the Constrained Expected Rule function leads to insight about reducing the search space. In the evaluation of the OR nodes I1L, I1M, and I1H beneath I1, the search can be significantly reduced because of always necessary inputs. An "always necessary" input is needed in every rule of a partition. For example, I2 and I3 are necessary in every rule in the I1M partition. We demonstrate in Appendix 1 that an input needed in all rules of a node's partition must be the root of an optimal subtree beneath the node. An always necessary input can be immediately chosen without computing the heuristics. Thus, the potential
subtrees containing the other remaining inputs as roots can be immediately pruned, resulting in a significant search space reduction.

3.4. Computational Considerations

The worst case complexities of the Expected Rule function and the Constrained Expected Rule function arise from the requirement to enumerate all possible input combinations (cases). There are q^n possible input state vectors, where q is the number of input states and n is the number of remaining inputs. For each input state vector, we need the set of consistent rules (from ρ remaining rules). For each rule, there are n consistency checks, assuming both the input state vectors and rules are in the same input order. Thus for each input state vector, the worst case for finding the minimum cost rule among the consistent rules is ρ*n. The worst case complexity of the Expected Rule function for an OR node is therefore O(q^n * ρ * n). The worst case complexity of the Constrained Expected Rule function is O(q^n * ρ * n^2) because an additional iteration over the remaining inputs is necessary.

To alleviate the computational difficulty of enumerating all cases and storing all joint probabilities, we propose that the set Z(κ(Ni*)) and the joint probabilities be estimated from a data set. If reliable historical data is available, a sample of this data can be used. Note that historical data does not need to be classified as is necessary for induction algorithms. If reliable historical data is not available, a data set can be generated from a model of the joint distribution of the inputs. When a probability model is available, the generation of the data set is still required to deal with the problem of rule overlap. If a model is not available, a simple model can be built by assuming that inputs are independent. The simple model can be improved using prior knowledge as described by Heckerman[9]. Thus, using a data set supports overlapping rules and more detailed probability models. Generating a data set involves only a small effort (generating data from a distribution model) if historical data is not available.

We now revise our heuristics using a data set instead of the complete set of input state vectors. Let

T be a set of cases
L be a partitioning function; L(T, π) returns the set of cases in T matched by condition π.

Then the Expected Rule function and the Constrained Expected Rule function can be revised as shown below. Note that |L(T, Π(Ni*))| is the total number of cases before partitioning for input i. In (12) and (14) below we divide by the number of cases in L(T, Π(Ni*)) to obtain sample estimates of h2 and h3. In (13) and (16), we iterate through each sample point rather than use the probability of each input state vector as in (4) and (9). These steps follow from the assumption of random sampling wherein the sample points have equal probability. However, we do not assume that the underlying marginal distributions are uniform, nor that the joint distributions are independent.

h2′(Ni*) = (1 / |L(T, Π(Ni*))|) Σ_{j=1}^{q} h2′(Nij)    (12)

h2′(Nij) = Σ_{z ∪ Π(Nij) ∈ L(T, Π(Nij))} [C(z, Nij) − JC(Γ(Π(Nij)))]    (13)

h3′(Ni*) = (1 / |L(T, Π(Ni*))|) Σ_{j=1}^{q} h3′(Nij)    (14)

h3′(Nij) = Min_{r ∈ κ(Ni*)} [CC′(r, Nij)]    (15)

CC′(r, Nij) = Σ_{z ∪ Π(Nij) ∈ L(T, Π(Nij))} SCC(r, Nij, z)    (16)
In the reformulation above, the worst case complexities of h2 and h3 are polynomial in the number of cases, rules, and inputs. The complexity depends more on the number of cases because a large sample (perhaps several thousand cases) may be necessary for a large number of inputs. With an appropriate indexing technique, the search cost in both functions could be reduced further, perhaps to a logarithm of the number of cases.
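The sample-based heuristic h2′ of (12) and (13) can be sketched in Python under simplifying assumptions not made in the paper: cases are dicts of input values, rules are dicts of required input values, and costs are additive per input with no joint (fixed) component. All names and data are illustrative.

```python
def h2_prime(cases, rules, acquired, input_cost):
    """Sample estimate of the Expected Rule heuristic: average, over cases
    consistent with the acquired inputs, of the cheapest remaining cost of
    a rule matching each case."""
    matched = [c for c in cases
               if all(c[i] == v for i, v in acquired.items())]
    if not matched:
        return 0.0
    total = 0.0
    for case in matched:
        # Cost of each rule consistent with the case, counting only
        # the rule inputs not yet acquired.
        costs = [sum(input_cost[i] for i in rule if i not in acquired)
                 for rule in rules
                 if all(case[i] == v for i, v in rule.items())]
        total += min(costs)
    return total / len(matched)

input_cost = {"a": 5, "b": 3, "c": 2}
rules = [{"a": 1, "b": 1}, {"a": 1, "c": 0}]   # rule conditions
cases = [{"a": 1, "b": 1, "c": 0},
         {"a": 1, "b": 0, "c": 0}]
print(h2_prime(cases, rules, {"a": 1}, input_cost))   # prints 2.0
```

Because each sample point is weighted equally, the average over matched cases plays the role of the probability-weighted sums in (12) and (13).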
4. Empirical Comparison

In this section, we empirically investigate the performance of the heuristics for both optimal and greedy search algorithms. For optimal search, we explore the precision-complexity tradeoffs among the heuristics and the largest problem sizes for which optimal search is feasible. For greedy search, we compare the cost difference between greedy and optimal search among the heuristics when input costs vary. In some of the comparisons, the Min Rule function (h1) is used to provide an example of an easy to compute, but less precise, heuristic. The Min Rule function computes a lower bound of the future cost as the cost of the minimum cost rule in a partition of rules.

4.1 Complexity-Precision Tradeoff

We investigate the complexity-precision tradeoffs among the three optimistic heuristics (h1, h2, and h3) using the AO* search algorithm. In Section 3, we demonstrated some coarse results about their precision and computational complexity. Here, we investigate their search performance as the number of inputs increases. As explained in Section 2.2, the number of inputs is the primary factor affecting problem difficulty.

To put the search effort for the three optimistic heuristics in perspective, we compare greedy versus optimal search effort. The greedy search algorithm makes an irrevocable decision at each node using one of the three heuristics. The greedy search algorithm uses the h1, h2, and h3
heuristics but does not update heuristic calculations as AO* does because irrevocable choices are made. In addition, the "always necessary" pruning rule is used before computing the specified heuristic, as in AO*.

To test the heuristics, we generated rules and cases using a program based on specifications described by Bisson[3]. The left hand side of a rule randomly varies in complexity between a minimum and maximum number of conjunctive terms. For simplicity, each rule has a single recommendation in its right hand side and no intermediate outputs. For each class, the number of rules varies uniformly in a given range. The generated rules are consistent, but rules for the same class may (and typically do) overlap. Once rules are generated, cases (fully enumerated input state vectors) are generated using the rules. We generated cases by first selecting a rule using preset rule probabilities. A case is generated by fixing the input values used in the rule and generating the other values from a uniform distribution. Thus, the joint distribution of inputs used in rules contains dependencies while the inputs not used in any rule are independent.

In the first experiment, we compared the Min Rule (h1), Expected Rule (h2), and Constrained Expected Rule (h3) heuristics on small problems (Table 6). The parameters used in generating the results were 40 rules, 200 cases, an average of 3 states per input, and the rule length uniformly varying from three conditions to the number of inputs minus one. As expected, the h1 heuristic performs very poorly as the number of nodes increases rapidly. The seven input problem was the largest size that could be handled for h1. (The experiments were executed on a Gateway P5-90 with 16 MB of RAM running Microsoft Windows 95. The search algorithms were written in Microsoft Visual C++ 4.0.) In contrast, the number of nodes created by h2 and h3 grows slowly.
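Returning to the case generation scheme described above, it can be sketched as follows. This is a hedged illustration: a rule is represented simply as a dict of the input values it fixes, and the rule set, probabilities, and state sets are hypothetical.

```python
import random

def generate_case(rules, rule_probs, states, rng=random):
    """Draw a rule by its preset probability, fix the inputs the rule uses,
    and fill in the remaining inputs uniformly and independently."""
    rule = rng.choices(rules, weights=rule_probs, k=1)[0]
    case = dict(rule)                       # inputs fixed by the chosen rule
    for inp, values in states.items():
        if inp not in case:
            case[inp] = rng.choice(values)  # uniform draw for unused inputs
    return case

# Illustrative setup: two rules over three inputs.
states = {"a": [0, 1], "b": [0, 1], "c": [0, 1, 2]}
rules = [{"a": 1, "b": 0}, {"a": 0}]
rule_probs = [0.7, 0.3]
case = generate_case(rules, rule_probs, states, random.Random(42))
```

As in the paper's scheme, inputs fixed by rules become dependent through the rule choice, while inputs used by no rule remain independent and uniform.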
It is interesting to note that in the seven input case, both h2 and h3 create the minimum number of nodes (the same number as the greedy search algorithms).

Table 6: Optimal Search Effort Results for Small Problems

Inputs   Heuristic   Nodes Created
5        h1          7,499
5        h2          760
5        h3          559
6        h1          37,397
6        h2          855
6        h3          783
7        h1          125,518
7        h2          954
7        h3          954
In the second experiment, we compared h2 and h3 on moderately large problems (Figure 4). The parameter values in this experiment were 100 rules, 200 cases, an average of 3 states per input, and the rule length uniformly varying from 4 to 8 conditions. Each input level represents a different rule set with the specified number of inputs. In Figure 4, notice that the number of nodes created by h3 is close to the number of nodes created by the greedy algorithms, testifying to the precision of the h3 heuristic.
[Figure 4 ("Search Effort (Optimal vs. Greedy)") is a line chart of nodes created (0 to 100,000) versus number of inputs (10 to 20), with four series: AO* H2 Nodes, AO* H3 Nodes, Greedy H2 Nodes, and Greedy H3 Nodes.]
Figure 4: Optimal Search Effort Results on Moderately Large Problems

Because computing times are dependent on hardware and programming characteristics, we only provide selected results in Table 7. Note that our implementation was not highly optimized: it uses extensive recursion, sequential search for partitioning the rule set and cases, and uncompressed storage of the rule and data set partitions in each OR node. In addition, the simulation was conducted with only 16 MB of RAM, resulting in extensive use of virtual memory on the 20 input problem. The processing times confirm that h3 involves considerably more computation than h2. With greedy search, h3 takes more than five times as long. With AO* search, however, the increase is less than twofold because h3 creates about three times fewer nodes than h2.
Table 7: Processing Times for a 20 Input Problem*

Algorithm   Heuristic   CPU Seconds
AO*         h2          464.67
AO*         h3          875.02
GREEDY      h2          31.47
GREEDY      h3          187.29
* CPU times were computed using the clock() function from the Microsoft Foundation Library.

Always necessary pruning had a modest effect on the performance of AO*, as shown in Table 8. The number of "always necessary" nodes declines because rule bases with a fixed number of rules become more sparse as the number of inputs increases. Even at ten inputs, the number of always necessary nodes is a small part of the total nodes. The saved nodes column is the immediate number of nodes not created because of the pruning rule; it does not measure indirect nodes that were not created. For example, if there were nine nodes under evaluation and one node was always necessary, the saved nodes would be eight.

Table 8: Effect of Always Necessary Pruning on h2 Optimal Search

Inputs   Total Nodes   Always Necessary Nodes   Saved Nodes
10       12,906        114                      477
11       30,951        105                      507
12       26,869        44                       272
13       34,659        26                       147
14       32,856        41                       335
15       58,549        91                       765
16       39,031        26                       242
17       89,729        35                       369
18       51,973        25                       302
19       77,905        13                       160
20       93,521        24                       323
4.2 Cost Performance of Greedy Search

There are a number of reasons to investigate the performance of the heuristics using greedy search rather than optimal search. The first reason is that problems may be posed to the system with a varying number of inputs already acquired. For example, when a rule base is used
with a database (as is quite common), some inputs may already be available in the database with near zero acquisition cost. In these situations, an optimal solution may need to be generated for each problem. If problems are solved on-line, there may be a time constraint so that fast generation of a decision tree is necessary. The second reason for investigating greedy search is that informed optimal search may not be possible for large rule bases. The search space grows so quickly that even a highly informed optimal search may be overwhelmed.

It is instructive to understand how greedy choices can lead to non-optimal decision trees. The problem with a greedy choice is that a non-greedy choice (i.e., an input not having the lowest heuristic value) may result in lower cost in the remainder of the tree. This deception is due to differences in partitioning behavior among the inputs. A non-greedy input choice may partition the rules such that there is lower cost in the remainder of the tree. Thus, it is sometimes beneficial to trade off more cost at a higher node for less cost in the remainder of the tree. In the credit granting example, I1 provides a very good partition of rules as there are always necessary inputs in all its states. Even if another input has a lower heuristic value, I1 may be a better choice because of its excellent partitioning behavior.

To see the extent of deception at a coarse level, we studied greedy search performance as input cost variance increases and as the ratio of fixed (joint) to variable cost increases. In the input cost variance experiment, we fix a mean that stays the same across coefficient of variation (CV) levels. For each CV level, we uniformly draw both fixed and incremental costs from the distribution determined by the CV level and the fixed mean. In the fixed to variable ratio experiment, we fixed the mean and CV of the incremental costs. Using a uniform distribution determined by the fixed mean and CV, we drew incremental costs.
Within each group of inputs,
the fixed cost for the group was the average input cost in the group multiplied by the fixed to variable cost ratio. In both experiments, inputs were randomly partitioned into groups sharing the same fixed cost.

Figures 5 and 6 show the results of the two cost experiments. In Figure 5, the worst performance of the heuristics (h2) is less than 1% above the optimal solution. Thus, there does not appear to be much deception for a good heuristic in this problem. Also note that h3 dominates h2: h3 is lower at most points, and h3 has less variation in performance. Figure 6 shows that there is not much deception when the ratio between fixed and variable costs increases. A similar result was seen when the number of inputs increased.

[Figure 5 contains two panels ("Greedy-Optimal Cost Differences" for 15 and 20 inputs), each plotting the percentage difference from optimal (%DiffOpt, roughly 0 to 0.9%) for H2-CostDiff% and H3-CostDiff% against input cost CV (0 to 0.6).]

Figure 5: Cost Performance of Greedy Search as Input Cost Variance Increases
[Figure 6 ("Greedy-Optimal Cost Differences (11 Inputs)") plots the percentage difference from optimal (%DiffOpt, roughly -0.2 to 1.2%) for H2-CostDiff% and H3-CostDiff% against the fixed to variable cost ratio (0 to 10).]
Figure 6: Cost Performance of Greedy Search as Fixed to Variable Cost Ratio Increases

To study deception at a finer level, we devised a simple model of the partitioning process. The model abstracts the partitioning process into equations that measure the loss of selecting an input, because minimizing loss is identical to minimizing cost. The immediate loss (ILoss) of an input is the cost of the input times the number of cases in which the best matching rule does not include the input. The loss of the tree through an input (Loss*(Xi)) is the immediate loss plus the minimum expected loss in the remainder of the tree given the input previously acquired. Deception occurs when the loss due to the greedy choice (Lossg) is greater than the minimum expected loss without any constraints (Loss*(φ)). Equations (17) and (18) measure the effects of one greedy choice followed by optimal choices. Thus, we underestimate the full effect of greedy choices.
Loss*(φ) = Min_i [ILoss(Xi) + Loss*(Xi)]    (17)

Lossg(φ) = ILoss(Xg) + Loss*(Xg), where Xg is the greedily chosen input    (18)
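A toy numeric instance of equations (17) and (18) can show deception directly. The losses are made up, and the greedy rule is simplified here to choosing the input with the smallest immediate loss:

```python
# Illustrative only: two remaining inputs with assumed losses.
iloss = {"x1": 4.0, "x2": 1.0}       # immediate loss ILoss(Xi)
loss_star = {"x1": 2.0, "x2": 7.0}   # optimal remaining loss Loss*(Xi)

loss_opt = min(iloss[x] + loss_star[x] for x in iloss)   # Loss*(phi) = 6.0 via x1
greedy = min(iloss, key=iloss.get)                       # greedy picks x2
loss_greedy = iloss[greedy] + loss_star[greedy]          # Lossg(phi) = 8.0
print(loss_greedy > loss_opt)                            # prints True: deception
```

Here x2 looks cheaper at the current node, but x1 partitions the remaining problem so much better that the greedy choice costs more overall.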
Figure 7 displays simulation results using the simple partitioning model. The graphs are based on 100 trials of randomly generated partitioning processes for each input coefficient of variation (CV) level. Each partitioning process involves a partition and set of subpartitions for each input. Deception was measured at the parent partition for a three input problem. Increasing input cost variance appears to decrease the occurrence of deception (Figure 7a) but has no discernible effect on the amount of deception (Figure 7b). The amount of deception in Figure 7b is computed only over trials with deception. When deception occurs, it can cause a significant increase in cost (as much as about 15% above the optimal).
[Figure 7 contains two panels plotted against input cost CV (0 to 0.6): (a) Occurrence of Deception, the percentage of trials with deception (roughly 0 to 25%), and (b) Amount of Deception, the average percentage difference over deceived trials (roughly 0 to 16%).]
Figure 7: Deception Results in the Simple Partitioning Model

The simple partitioning model provides a more conservative picture of the greedy heuristic performance than our full simulation results. The more conservative results may be due to some pathological partitions occurring in the simple partitioning model. Since the simple model does not use an actual rule base to generate partitions, there is no guarantee that an actual rule base with such partitioning characteristics is likely.
Beyond deception, a final point to study is the robustness of greedy and optimal search when the sample data used to construct the tree differs from field data. This situation may occur when past data is not a good indicator of future data or a good probability model cannot be constructed. Sensitivity of tree construction to probability estimates is an issue with any approach that attempts to minimize expected cost. To test the sensitivity of our heuristics, we constructed trees using both representative and perturbed samples. In the perturbation process, after a rule is initially selected to generate a case, it may be replaced by a different rule. The perturbation probability indicates the fraction of cases that did not use the true rule in the data generation process. For example, a perturbation probability of 0.10 indicates that the true rule is not used for 10% of the cases. After perturbation, the true joint distribution of the inputs no longer matches the distribution in the perturbed data set.

Figure 8 indicates that both greedy and optimal search are only mildly sensitive to probability errors in the sample data. The y-axis shows the percentage increase above the expected cost using the true data set. Figure 8 does not show any clear pattern as the probability of perturbation increases. Optimal search and greedy (H3) search were somewhat more robust than greedy (H2) search.
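The perturbation process described above can be sketched as follows (a hedged illustration: rules are identified by index, and the function name is hypothetical):

```python
import random

def perturb_rule_choice(true_rule, n_rules, p, rng):
    """With probability p, replace the true generating rule (given by index)
    with a different, uniformly chosen rule index."""
    if n_rules > 1 and rng.random() < p:
        alternatives = [i for i in range(n_rules) if i != true_rule]
        return rng.choice(alternatives)
    return true_rule
```

Applying this to each case before its rule-determined inputs are fixed yields a data set in which the stated fraction of cases no longer reflects the true rule, which is how the perturbed samples for Figure 8 are described.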
[Figure 8 ("Robustness Under Sampling Errors," 11 inputs, 100 rules, 200 cases) plots the percentage increase in expected cost (roughly 5 to 14%) against the probability of perturbation (0 to 1) for AO*, GR(H2), and GR(H3).]
Figure 8: Robustness Results
5. Conclusion

We studied the sequential information acquisition problem for rule-based expert systems. This problem involves redesigning an expert system so that its recommendations are preserved while lowering the cost of acquiring the information needed to make them. We characterized the problem as a search through a space of decision trees. Our most significant contribution was to develop and analyze two "optimistic" heuristics providing choices for precision and computational effort. We evaluated the effectiveness of these heuristics for both optimal and greedy search strategies. Simulation experiments demonstrated that the Expected Rule and Constrained Expected Rule heuristics are very precise, leading to excellent performance in both optimal and greedy search.

This study contrasts with previous studies in which rule simplification and strategy formation are jointly performed. When these tasks are performed jointly, search effort is high because good heuristics are more difficult to find. The most significant insight from this work is that separating the tasks of rule simplification and strategy formation facilitates effective search.
Both heuristics (h2 and h3) use the existing rule base to estimate the remaining cost. Without reliance on an existing rule base, good heuristics appear to be more difficult to develop. An interesting extension of the current work is to study the performance of optimal and greedy search when the rules are not fully compressed. It might be possible to modify the heuristics so that an optimal strategy can still be found, without significantly more search effort, even when some compressed rules are missing.

This work is part of our long-term interest in the economics of expert systems. In past work, we have studied economic design considerations in decision tree induction (see Mookerjee and Dos Santos[14]), noise handling (see Mookerjee and Mannino[15]), and case based reasoning (see Mookerjee and Mannino[16]). We are currently studying economic design considerations in Bayesian belief networks and uncertain cost measurement for classification systems. Together, we hope that these studies contribute to a better understanding of economic design factors and improved operating performance of expert systems.
Acknowledgment

We are thankful for financial support provided by the School of Nursing at the University of Washington. We thank Kerry Meyer for her encouragement and Sendi Widjaja for his programming support.
Appendix 1: Propositions about Heuristics

We prove propositions that the Expected Rule (h2) and Constrained Expected Rule (h3) functions are optimistic. We begin with Proposition A1.1 about the Expected Rule function.

Proposition A1.1: The Expected Rule function is optimistic.

The intuition behind this proposition is that when the sequential constraints of a decision tree are imposed, the true cost can never be lower than an estimate that ignores these constraints. Since h2 ignores sequential constraints, it is guaranteed never to overestimate the future cost. We prove that h2 is optimistic (less than or equal to the optimal expected cost) by showing that (i) h2 cannot be greater than the optimal expected cost and (ii) the optimal expected cost can be strictly greater than h2.

The first point is true because h2 chooses the minimum rule to satisfy each input state vector in a partition. More precisely, let z be an input state vector generated using the set of remaining inputs at node Nij, that is, z ∈ Z(κ(Ni*)). For each fully enumerated input state vector (called a case) in the partition, Π(Nij) ∪ z, h2 chooses the rule that minimizes the cost of satisfying the case. Hence, h2 cannot be greater than the optimal expected cost of solving all the cases in the partition.

The second point is true because h2 does not have any sequential constraints. For a given case, the cost of the path on the optimal subtree can be larger than h2 because the paths in the optimal subtree share a common root, and the input at the root may not be required in the minimal rule for that case. Hence, the optimal expected cost can be greater than h2. To depict the second point, assume there are two remaining binary inputs, a and b. There are four remaining cases corresponding to the combinations of the states of a and b. Let R1 and R2
be two remaining rules consistent with the states of the inputs acquired so far, that is, Π(Nij). Let R1 use input a and R2 use input b. Hence, some paths in the optimal subtree will acquire both inputs a and b. However, the Expected Rule function h2 assumes that only a or only b will be acquired (not both).

Proposition A1.2: The Constrained Expected Rule function (h3) is optimistic.

Because the Constrained Expected Rule function refines the Expected Rule function, it is only necessary to show that the refinement always underestimates the expected cost of the optimal subtree. In the optimal subtree, future loss could occur due to the input at the root as well as due to losses from inputs below the root. Since the Constrained Expected Rule function only accounts for loss at the root, it can never exceed the loss in the optimal subtree. Note also that the Constrained Expected Rule function calculates the minimum loss that can occur at the root of the optimal subtree. Hence, the Constrained Expected Rule function is optimistic.
Appendix 2: Always Necessary Pruning

To formalize the always necessary pruning rule, we define a proposition about how an input's loss changes if it is moved higher in a tree. The loss of an input (or AND) node in a path is zero if the input node is needed in the cheapest rule that satisfies the path; otherwise the loss is the incremental cost of the input. The expected loss of an input node is the expected loss over all paths. Proposition A2.1 states an important property about how an input's expected loss changes when the input is moved higher in a tree. Proposition A2.2, defining the Always Necessary pruning rule, follows from Proposition A2.1.

Proposition A2.1: The expected loss of an input node never decreases by moving it higher in the tree.

By moving an input higher, it is acquired on more paths of the tree than before. The number of paths in which it is not needed can only increase (never decrease). Hence its loss can only increase (never decrease).

Proposition A2.2: If an input is required in all rules that are consistent with the path from Start to Nij (that is, Π(Nij) ∪ Nij), then it must be the root of an optimal subtree below Nij.

Consider two inputs that are candidates for acquisition beyond the node Nij: one which is always required (say, x) and another that is required in only some rules (say, y). Input x should be acquired before input y. From Proposition A2.1, if y is acquired before x, its loss will increase (never decrease) and hence the overall cost of the tree will increase (never decrease). On the other hand, the loss of x is zero, regardless of where x is acquired. Hence, always necessary inputs should be acquired before ones that are required in only some rules.
Next consider the case where there is more than one always necessary input. The cost of the tree does not change with the order in which always necessary inputs are acquired. This is because the loss of these inputs is zero regardless of where in the tree they are acquired. Hence, always necessary inputs can be acquired in any order as long as they are acquired before inputs that are required in only some rules.
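In code, the pruning test of Proposition A2.2 reduces to a set intersection over the rules of a partition. A minimal sketch, with rules represented as plain sets of inputs and a hypothetical example partition:

```python
def always_necessary_inputs(consistent_rules):
    """Inputs used by every rule in a node's partition: any such input can
    be chosen immediately, without heuristic evaluation (Proposition A2.2)."""
    rule_inputs = [set(rule) for rule in consistent_rules]
    return set.intersection(*rule_inputs) if rule_inputs else set()

# Illustrative partition in the spirit of Section 3.3.2, where I2 and I3
# appear in every rule (the rule contents here are made up).
partition = [{"I2", "I3", "I5"}, {"I2", "I3"}, {"I2", "I3", "I4"}]
print(sorted(always_necessary_inputs(partition)))  # prints ['I2', 'I3']
```

When the returned set is nonempty, any of its members can be acquired next and the subtrees rooted at all other remaining inputs can be pruned.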
References

1. A. Bagchi and A. Mahanti, 1985. "AND/OR Graph Heuristic Search Methods," Journal of the ACM 32, 1, 28-51.
2. V. Barker and D. O'Connor, 1989. "Expert Systems for Configuration at Digital," Communications of the ACM 32, 3 (March), 298-318.
3. H. Bisson, 1991. "Evaluation of Learning Systems: An Artificial Data-Based Approach," in Proceedings of the European Working Session on Machine Learning, Y. Kodratoff (ed.), Springer-Verlag, Berlin, pp. 463-481.
4. L. Breiman, J. Friedman, R. Olshen, and C. Stone, 1984. Classification and Regression Trees, Wadsworth Publishing, Belmont, CA.
5. S. Chakrabarti, S. Ghose, and S. DeSarkar, 1988. "Admissibility of AO* When Heuristics Overestimate," Artificial Intelligence 34, 97-113.
6. F. Coenen and T. Bench-Capon, 1993. Maintenance of Expert Systems, The A.P.I.C. Series, Number 40, Academic Press, San Diego, CA.
7. B. Dos Santos and V. Mookerjee, 1993. "Expert System Design: Minimizing Information Acquisition Costs," Decision Support Systems 9, 161-181.
8. S. Ganapathy and V. Rajaraman, 1973. "Information Theory Applied to the Conversion of Decision Tables to Computer Programs," Communications of the ACM 16, 9 (September), 532-539.
9. D. Heckerman, 1997. "Bayesian Networks for Data Mining," Data Mining and Knowledge Discovery 1 (March).
10. D. Heckerman, J. Breese, and K. Rommelse, 1995. "Decision Theoretic Troubleshooting," Communications of the ACM 38, 3 (March), 49-57.
11. L. Hyafil and R. Rivest, 1976. "Constructing Optimal Binary Decision Trees is NP-Complete," Information Processing Letters 5, 1 (May), 15-17.
12. M. Mannino and V. Mookerjee, 1997. "Sequential Decision Models for Expert System Design," IEEE Transactions on Knowledge and Data Engineering 9, 5 (September-October), 675-687.
13. A. Martelli and U. Montanari, 1978. "Optimizing Decision Trees Through Heuristically Guided Search," Communications of the ACM 21, 12 (December), 1025-1039.
14. V. Mookerjee and B. Dos Santos, 1993. "Inductive Expert System Design: Maximizing System Value," Information Systems Research 4, 2 (June), 111-140.
15. V. Mookerjee and M. Mannino, 1995. "Improving the Performance Stability of Inductive Expert Systems Under Input Noise," Information Systems Research 6, 4 (December), 328-356.
16. V. Mookerjee and M. Mannino, 1997. "Redesigning Case Retrieval Systems to Reduce Information Acquisition Costs," Information Systems Research 8, 1 (March), 51-68.
17. J. Moore and A. Whinston, 1986. "A Model of Decision Making with Sequential Information Acquisition - Part I," Decision Support Systems 2, 4, 285-307.
18. H. Newquist, 1990. "No Summer Reruns," AI Expert 10, 5 (October), 65-66.
19. M. Nunez, 1991. "The Use of Background Knowledge in Decision Tree Induction," Machine Learning 6, 231-250.
20. K. Pattipati and M. Alexandridis, 1990. "Application of Heuristic Search and Information Theory to Sequential Fault Diagnosis," IEEE Transactions on Systems, Man, and Cybernetics 20, 4 (July/August), 872-887.
21. J. Pearl, 1984. Heuristics: Intelligent Search Strategies for Computer Problem Solving, Addison-Wesley.
22. L. Reinwald and R. Soland, 1966. "Conversion of Limited-Entry Decision Tables to Optimal Computer Programs," Journal of the ACM 13, 3 (July), 339-358.
23. H. Schumacher and K. Sevcik, 1976. "The Synthetic Approach to Decision Table Conversion," Communications of the ACM 19, 6 (June), 343-351.
24. K. Schwayder, A. Kenney, and R. Ainslie, 1972. "Decision Tables: A Tool for Tax Practitioners," The Tax Advisor 3, 6 (June), 336-345.
25. R. Smith and C. Scott, 1991. Innovative Applications of Artificial Intelligence, Volume 3, AAAI Press, Menlo Park, CA.
26. R. Tiong and T. Koo, 1991. "Selecting Construction Formwork: An Expert System Adds Economy," Expert Systems Journal 3, 1 (Spring), 5-16.
27. P. Turney, 1995. "Cost Sensitive Classification: Empirical Evaluation of a Hybrid Genetic Decision Tree Induction Algorithm," Journal of Artificial Intelligence Research 2, 369-409.