REDESIGNING CASE RETRIEVAL TO REDUCE INFORMATION ACQUISITION COSTS

Vijay S. Mookerjee and Michael V. Mannino
Department of Management Science, DJ-10
University of Washington, Seattle, WA 98195
[email protected] and [email protected]

August 2, 1996

Abstract

Retrieval of a set of cases similar to a new case is a problem common to a number of machine learning approaches such as nearest neighbor algorithms, conceptual clustering, and case based reasoning. A limitation of most case retrieval algorithms is their lack of attention to information acquisition costs. When information acquisition costs are considered, cost reduction is hampered by the practice of separating concept formation and retrieval strategy formation. To demonstrate this claim, we examine two approaches. The first approach separates concept formation and retrieval strategy formation; to form a retrieval strategy in this approach, we develop the CRlc (case retrieval loss criterion) algorithm, which selects attributes in ascending order of expected loss. The second approach jointly optimizes concept formation and retrieval strategy formation using a cost based variant of the ID3 algorithm (ID3c). ID3c builds a decision tree wherein attributes are selected using entropy reduction per unit information acquisition cost. Experiments with four data sets are described in which algorithm, attribute cost coefficient of variation, and matching threshold are factors. The experimental results demonstrate that (i) jointly optimizing concept formation and retrieval strategy formation has substantial benefits, and (ii) using cost considerations can significantly reduce information acquisition costs, even if concept formation and retrieval strategy formation are separated.

1. Introduction

In recent years, large databases of cases have become an important part of many inductive expert systems. A number of machine learning approaches using case histories have been proposed, including nearest neighbor algorithms [Aha, Kibler, and Albert, 1991], conceptual clustering [Gennari, Langley, and Fisher, 1989], and case based reasoning [Kolodner, 1991]. Usage of these algorithms is reported in areas such as industry and occupation code classification [Creecy et al., 1992], real estate appraisal [Gonzalez and Laureano-Ortiz, 1992], market surveillance [Barletta and Buta, 1991], assembly planning [Zarley, 1991], and sales prediction [Stottler, 1994].

We motivate the problem studied here with an example of a case based system designed to support customers with problems using a backup tape drive for a personal computer (see Figure 1). This hypothetical system is similar to reported help desks for the VMS operating system [Simoudis, 1992] and personal computer software [Breese and Heckerman, 1995]. The first step in developing the system is to cluster the cases into categories to identify faults such as “Incomplete Installation”, “Incompatible Driver”, “Incompatible Parallel Port”, and “Tape Drive Malfunction”. Solutions recorded for cases in the “Incomplete Installation” cluster could include “Reinstall Tape Drive Software” and “Reconfigure Tape Drive Software”. The second step is to form concepts for each cluster. Concept formation generates functions or concept definitions that assign new cases to clusters. For example, the concept definition for the “Incomplete Installation” cluster may use rules with attributes such as Operating_System_Version, Tape_Drive_Version, Loads_Ok, and Menu_Missing.


The final step in developing the help desk is to determine a retrieval strategy that specifies how information should be collected. A retrieval strategy may be represented as a total order or a context (a partial order). A total order is a list of attributes to collect, for example: Tape_Drive_Version, Loads_Ok, and Menu_Missing. A context is a decision tree in which the next attribute collected depends on the values observed for the previous attributes. For example, a context is: first collect Loads_Ok, then collect Menu_Missing if Loads_Ok is true, else collect Tape_Drive_Version. The support engineer at the help desk may query a user until a matching cluster of cases can be identified. The system would retrieve the most similar cases in the cluster that match the new case. The engineer would then adapt the solutions to determine appropriate actions. Adaptation is typically left to the user because humans have been found to be better than computers at adapting cases to solve new problems [Kolodner and Simpson, 1989; Allen, 1994]. After determining the solution for the new case, the system may add unique cases and solutions to the case base for future consulting use.

The tape drive example is typical of a case based system that involves sequential decision making. When a user calls about a problem, the details are not known until a support engineer asks questions and conducts diagnostic probes. Some attributes, such as the software version and tape drive model, are easy to obtain. Other attributes, such as asking the user to check for a parallel port conflict, may be more difficult. Still other attributes may require the support engineer to log on to the remote system to generate diagnostic information.
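To make the two representations concrete, the sketch below encodes a total order as a plain list and a context as a nested decision tree. The encoding and the helper function are illustrative assumptions, not taken from the paper.

```python
# A minimal sketch of the two retrieval-strategy representations.
# A total order is simply the sequence of attributes to collect.
total_order = ["Tape_Drive_Version", "Loads_Ok", "Menu_Missing"]

# A context is a decision tree: the next attribute to collect depends on
# the values observed so far. Nested dicts are one simple encoding.
context = {
    "attribute": "Loads_Ok",
    "branches": {
        True:  {"attribute": "Menu_Missing", "branches": {}},
        False: {"attribute": "Tape_Drive_Version", "branches": {}},
    },
}

def next_attribute(strategy, observed):
    """Return the next attribute to collect given the values observed so far."""
    node = strategy
    while node and node["attribute"] in observed:
        node = node["branches"].get(observed[node["attribute"]])
    return node["attribute"] if node else None

# next_attribute(context, {}) -> "Loads_Ok"
# next_attribute(context, {"Loads_Ok": True}) -> "Menu_Missing"
```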


[Figure 1 shows the help desk architecture: a retrieval strategy (collect Loads Ok; if yes, collect Menu Missing; if no, collect Tape Drive Version) gathers attributes of a new case, which is matched against concept definitions for the clusters Incomplete Installation, Incompatible Driver, Incompatible Parallel Port, and Tape Drive Malfunction; retrieved cases are adapted (e.g., Re-Install Software) and the new case and solution are stored.]

Figure 1. Case Based Systems for Tape Drive Help Desk

Help desks are a special case of troubleshooting where the goal is to repair a faulty device such as an automobile engine or commercial software. Case based systems are increasingly being used for troubleshooting applications because previous cases are a good explanatory tool and case bases can be easier to develop and maintain than rule based systems [Allen, 1994]. Reports of case based systems for troubleshooting are described in [Simoudis, 1992; Breese and Heckerman, 1995; Heckerman et al., 1995].

In this paper, we are concerned with concept and retrieval strategy formation. Specifically, we examine the practice of forming the retrieval strategy for a given set of concepts. The most important contribution of the paper is to demonstrate that jointly optimizing the tasks of concept formation and retrieval is superior to optimizing the retrieval strategy independent of the concepts. To our knowledge, this is the first study that clearly demonstrates this design limitation in existing case retrieval algorithms. The alternative presented here, the ID3c algorithm, jointly develops concept definitions and a retrieval strategy in the form of a decision tree. Attributes in the decision tree are selected based on entropy reduction per unit information acquisition cost. The second contribution of this paper is to demonstrate that information acquisition costs can be significantly reduced even if concept formation and retrieval strategy formation are separated. For the second contribution, we develop the CRlc (case retrieval loss criterion) algorithm that finds a retrieval strategy using the notion of the expected loss of an attribute. The expected loss of an attribute is the probability of unnecessarily collecting the attribute times its information acquisition cost.

The rest of this paper is organized as follows. Section 2 reviews research on the economics of case based systems and expert systems. Section 3 discusses approaches that separate concept formation and retrieval strategy formation. Section 4 describes joint concept formation and retrieval strategy formation. Section 5 reports on experiments comparing the different case retrieval approaches. Section 6 summarizes the paper and discusses future research directions.


2. Economics of Case Based Systems

In this section, we review research on economic considerations in expert systems and discuss how the economic performance of a case based system may be measured.

2.1 Economic Considerations in Previous Research

Although most research on case based systems emphasizes computational efficiency and accuracy, attention has recently been paid to reducing information acquisition costs. Simoudis [1992] has developed a retrieval procedure for help-desk retrieval problems. This procedure has two limitations that may lead to higher information acquisition costs than necessary. First, the usefulness of a costly attribute is not measured across a set of potentially matching cases. Second, the retrieval procedure separates concept formation from retrieval strategy formation. The significance of the second limitation will be demonstrated later. Several other studies have used retrieval processes with similar limitations [Hammond, 1986; Koton, 1988].

Although economic considerations are uncommon in the case based systems area, they are more prevalent in other kinds of expert systems. Moore and Whinston [1986, 1987] have proposed a decision theoretic framework applicable to a variety of deductive expert systems. For rule-based expert systems, the focus has been to reduce information costs without affecting decision making performance [Pattipati and Alexandridis, 1992; Dos Santos and Mookerjee, 1993]. Similar objectives have been used to develop a retrieval strategy for Bayesian belief networks [Breese and Heckerman, 1995; Heckerman et al., 1995]. For inductive expert systems, both cost minimization [Nunez, 1991] and value maximization [Mookerjee and Dos Santos, 1993] have been attempted.


2.2 Measuring Economic Performance

From an economic standpoint, a case based system may be evaluated in terms of its expected cost to support problem solving. This cost is the sum of two costs: (i) expected information acquisition cost and (ii) expected classification cost. Information acquisition cost includes only the direct costs of supplying attributes requested by the system to retrieve a set of similar cases. It excludes the costs of constructing the case base because such costs are fixed and more typically associated with knowledge engineering. The expected classification cost is the sum of the expected costs of correct and incorrect assignment of cases to clusters.

The use of classification costs in evaluating a case based system is quite complex. After a case is assigned to a cluster, two activities occur before the case can be solved. First, a set of most similar cases is chosen from the cluster. Second, these cases are adapted by the user to solve the problem. Since two non-identical cases can be assigned to the same cluster, the two sets of most similar cases may be different. Even if the most similar sets are identical, the adaptation process can lead to different solutions. Due to these complications, assignment to the same cluster does not ensure the same solution. Hence the eventual costs of correct and incorrect assignments can be difficult to assess.

Given the difficulties in specifying classification costs, we use expected classification accuracy as a measure of system performance. Expected classification accuracy is estimated by the proportion of correct assignments made by the system in a sample of unseen cases. An external source, such as an expert, determines whether an assignment is correct. From the preceding discussion, two distinct measures of system performance emerge: (i) information acquisition cost and (ii) classification accuracy. In this study, we attempt to reduce expected information acquisition costs without sacrificing classification accuracy.
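As a small illustration of these two measures, the sketch below computes the average acquisition cost and the classification accuracy from a sample of holdout assignments. The data layout and function name are our assumptions.

```python
# Each holdout assignment is an (acquisition_cost, correct) pair, where
# `correct` comes from an external source such as an expert.
def performance(assignments):
    n = len(assignments)
    avg_cost = sum(cost for cost, _ in assignments) / n
    accuracy = sum(1 for _, correct in assignments if correct) / n
    return avg_cost, accuracy

# performance([(12.0, True), (8.0, False), (10.0, True)]) -> (10.0, 0.666...)
```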


3. Separation of Concept Formation and Retrieval

In this section, we present background on concept formation in case based systems. This discussion is followed by a definition of the case retrieval problem. We then present two heuristic solutions to this problem: CRf, the baseline algorithm for the study, and CRlc, the loss criterion algorithm.

3.1 Concept Formation

Concept formation, or case indexing as it is sometimes called [Kolodner, 1991], is the problem of organizing cases to enable the efficient retrieval of similar cases. In many case based systems, the concept definition for a cluster is referred to as the norm of the cluster. A norm consists of a set of attribute-value pairs. A cluster and its norm are referred to as a node. An attribute-value pair in a norm is usually selected using two probabilities [Kolodner, 1991; Becker, 1973; Hall, 1989]: (i) predictive probability and (ii) predictable probability. These probability measures have been used in a number of working systems such as EPAM [Feigenbaum, 1963], UNIMEM [Lebowitz, 1987], COBWEB [Fisher, 1987], CLASSIT (an extension of COBWEB) [Gennari, Langley, and Fisher, 1989], and Mediator [Kolodner and Simpson, 1989].

The predictive probability can be denoted by P(Nk | Ai = vij), where Ai is an attribute, vij is the jth state of the attribute, and Nk is the kth node. The predictive probability of an attribute-value pair is estimated by the number of cases of a node matching the pair divided by the number of cases across all nodes matching the pair. Thus, a predictive probability of 1 means that all cases with this attribute value belong to the node.


The predictable probability can be denoted by P(Ai = vij | Nk). This probability is estimated as the number of cases at a node matching the attribute-value pair divided by the total number of cases at the node. Thus, a predictable probability of 1 means that every case of the node has this value for the attribute.

The probability measures for an attribute-value pair must exceed specified construction thresholds to be used in a norm. For example, a construction threshold of 0.67 for the predictable probability is used in Mediator [Kolodner and Simpson, 1989]. Construction thresholds are similar in purpose to pruning in decision tree induction: pruning leads to more general rules (that is, rules with fewer attribute-value pairs) that often have higher classification accuracy than specialized rules.

We extend the tape drive example described in the introduction to demonstrate norm formation. Table 1 shows a number of attributes that may be useful in diagnosing a tape drive problem. Hypothetical predictable and predictive probability measures for some attribute-value pairs are shown in Table 2.

Table 1. Input Attributes and Descriptions

Attribute Name              Description
Operating_System            Multi-valued
Tape_Drive_Model            Multi-valued
Backup_Software_Version     Multi-valued
Loads_Ok                    Boolean
Menu_Missing                Boolean
Drive_Recognized            Yes, No, Partial
Port_Conflict               Boolean
Bi-directional_Port         Boolean
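A minimal sketch of how these two probabilities, and a norm, could be estimated from a case base follows. The data layout, function names, and the use of a >= threshold test are our assumptions; the worked example after Table 2 uses construction thresholds of 0.67.

```python
from collections import defaultdict

# `cases` is a list of (attribute-value dict, node label) pairs.
def norm_probabilities(cases):
    node_count = defaultdict(int)    # cases per node
    pair_count = defaultdict(int)    # cases per (node, attribute, value)
    value_count = defaultdict(int)   # cases per (attribute, value), all nodes
    for attrs, node in cases:
        node_count[node] += 1
        for a, v in attrs.items():
            pair_count[(node, a, v)] += 1
            value_count[(a, v)] += 1
    # Predictive: P(node | attr = value); predictable: P(attr = value | node).
    predictive = {k: c / value_count[(k[1], k[2])] for k, c in pair_count.items()}
    predictable = {k: c / node_count[k[0]] for k, c in pair_count.items()}
    return predictive, predictable

def build_norm(node, predictive, predictable, threshold=0.67):
    """Keep a pair in the norm only if both probabilities meet the threshold."""
    return [(a, v) for (n, a, v) in predictable
            if n == node
            and predictable[(n, a, v)] >= threshold
            and predictive[(n, a, v)] >= threshold]
```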


Table 2. Predictive and Predictable Probability Measures for Two Nodes

Incomplete Installation (II)            Incompatible Driver (ID)
P(value|II); P(II|value)                P(value|ID); P(ID|value)
Loads_Ok = No [0.70; 0.90]              Loads_Ok = Yes [0.99; 0.30]
Menu_missing = Yes [0.68; 0.90]         Menu_missing = No [0.99; 0.10]
Drive_Recognized = No [0.80; 0.30]      Drive_Recognized = Partial [0.95; 0.67]
Port_Conflict = No [0.20; 0.20]         Port_Conflict = Yes [0.70; 0.75]
Bi-directional_Port = Yes [0.5; 0.1]    Bi-directional_Port = No [0.85; 0.50]

Let us assume that construction thresholds of 0.67 are used for both the predictive and predictable probabilities. Applying these thresholds to the “Incomplete Installation” node, we reject the attribute-value pairs Drive_Recognized = No, Port_Conflict = No, and Bi-directional_Port = Yes. Thus the norm for “Incomplete Installation” is represented by the following conjunction of attribute-value pairs: [Loads_Ok = No, Menu_missing = Yes]. Similarly, the norm for the “Incompatible Driver” node is: [Drive_Recognized = Partial, Port_Conflict = Yes].

3.2 Case Retrieval Problem

The case retrieval problem involves finding an optimal information acquisition order. An information acquisition order consists of two orders: a node order and a set of attribute orders, one for each node. A node order specifies the sequence in which nodes should be considered for matching. An attribute order specifies the sequence in which norm attributes should be collected to match the norm. An optimal information acquisition order has the least expected cost among all possible orders. More precisely, the case retrieval problem is defined as:


Objective:

Minimize over η and α:  EAC(S, N, η, α, f, Cost)

where
EAC(S, N, η, α, f, Cost) is the expected attribute acquisition cost,
S is a set of cases,
N is a set of nodes,
η is a node order,
α is a set of attribute orders,
f is a matching function; f maps a case and a norm to a Boolean,
Cost is a function; Cost maps an attribute to its attribute acquisition cost.

Comment 1. The above model does not include classification accuracy in either the objective function or the constraints. If norms do not overlap, then the information acquisition order will not affect classification accuracy. (Two norms overlap if there are cases that can match both norms.) However, even with overlapping norms, the order does not typically affect classification accuracy. In a later section, we experimentally observe that a cost based ordering has a slightly positive impact on classification accuracy. Since this impact is not substantial, the impact of information acquisition order on classification accuracy is ignored.

Comment 2. The specific matching function (f) used in this study is known as X of N [Hanson and Bauer, 1986]. In X of N matching, a node is matched if at least X of the N terms in the norm match the attribute-value pairs of the case. The search continues to the next node in the order when fewer than X attribute-value pairs match the current norm. The search terminates unsuccessfully if all nodes are searched without a match. The value X divided by N is known as the matching threshold.¹

¹ The term “construction threshold” is used to mean a threshold value used in the formation of norms. The term “matching threshold” is used to mean the threshold value used during retrieval when a case is matched to a norm.

Comment 3. The solution space for this problem is much too large to enumerate. A feasible solution consists of a node order and a set of attribute orders. If there are p nodes and q attributes per norm, then there are p! node orders and, for each node, q! attribute orders. Hence, there are a total of (q!)^p · p! feasible solutions.

In subsections 3.3 and 3.4, we describe two heuristic algorithms (CRf and CRlc) to determine a good information acquisition order. These algorithms assume: (i) attribute acquisition costs are independent, and (ii) nodes are non-hierarchical and mutually exclusive. Each algorithm has two phases, one to construct the node order and the other to construct the set of attribute orders. Before presenting the algorithms, we introduce some basic notation.

Notation
Nm ∈ N: node m, m = 1..p,
Freq(Nm): number of cases in the cluster at node Nm,
TC = Σ_{m=1..p} Freq(Nm): total number of cases in the clusters,
Norm(Nm): Norm is a function that provides the norm of a node,
Aj ∈ A: jth attribute, j = 1..n,
Attrs(Norm(Nm)): Attrs is a function that provides the set of attributes in the norm of a node,
Stage i: a condition when i nodes have been selected, i = 0..p−1,
NR(i): set of nodes remaining at stage i,
NC(i): set of nodes chosen in previous stages; for all i, NR(i) ∪ NC(i) = N.
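Comment 2's X of N matching is compact enough to state in code. A sketch under the definitions above (the function name is ours):

```python
from math import ceil

def matches(case, norm, mt):
    """X of N matching: a node matches when at least X = ceil(N * mt) of the
    norm's N attribute-value pairs agree with the case."""
    hits = sum(1 for attr, value in norm.items() if case.get(attr) == value)
    return hits >= ceil(len(norm) * mt)

# With a 3-attribute norm and mt = 0.66, two matching pairs suffice.
```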


3.3 The CRf Algorithm

The CRf algorithm uses two simple heuristics to order the nodes and the attributes within a node. Nodes are sorted in descending order by the number of cases at the node: the first node searched is the one with the most cases and hence the most likely to match a new case. Within each node, norm attributes are sorted in descending order of predictable probability: the first attribute acquired is the one most likely to match a case if the case is a member of the node. In the CRf algorithm, the next node and attribute are selected using the node frequency selection heuristic (NFSH) and the attribute frequency selection heuristic (AFSH):

NFSH(i) = Max_m (Freq(Nm)), Nm ∈ NR(i)    (1)

AFSH(i, Nm, α) = Max_j (ASP(Aj, Nm, i)), Aj ∈ AR(Nm, α)    (2)

where
α is the set of attributes that have already been selected in the norm of node m,
AR(Nm, α) = Attrs(Norm(Nm)) − α is the set of remaining (not selected) attributes,
ASP(Aj, Nm, i) is the attribute stage probability of attribute Aj ∈ Attrs(Norm(Nm)) at stage i:
  ASP(Aj, Nm, i) = P[Aj = V | Nm] (the predictable probability) if Aj ∉ Attrs(Norm(Nk)) for all Nk ∈ NC(i);
  ASP(Aj, Nm, i) = 1 otherwise.

The complexity of the CRf algorithm is governed by the complexity of sorting p nodes and sorting an average of q attributes per norm. Formally, the complexity is O(p log p + p·q log q), where the complexity of sorting n items is n log n.

The CRf algorithm is simple and efficient among algorithms that do not use cost information. In addition, the retrieval strategy in the CRf algorithm approximates the search used in many case based systems, including COBWEB [Fisher, 1987], MEDIATOR [Kolodner, 1988], and UNIMEM [Lebowitz, 1987]. In these systems, the order in which nodes are searched depends on the frequency of cases in the nodes.
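Because CRf amounts to two sorts, it can be sketched in a few lines. Function names and data layout are ours; for simplicity the sketch ignores the stage-dependent ASP adjustment that treats already-collected attributes as certain.

```python
def crf_order(norms, freq, predictable):
    """Order nodes by descending frequency (NFSH) and, within each node,
    order norm attributes by descending predictable probability (AFSH).
    `norms` maps node -> list of attributes; `predictable` maps
    (node, attribute) -> P[attribute value | node]."""
    node_order = sorted(norms, key=lambda n: freq[n], reverse=True)
    attr_orders = {
        n: sorted(norms[n], key=lambda a: predictable[(n, a)], reverse=True)
        for n in node_order
    }
    return node_order, attr_orders
```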


3.4 The CRlc Algorithm

Like the CRf algorithm, the CRlc algorithm heuristically computes a node order followed by an attribute order for each node. Unlike the CRf algorithm, it uses cost information to construct the orders. Another difference is that CRf considers only how likely a match is at a particular node, whereas CRlc also considers how useful the attributes are for matching at other nodes.

The CRlc algorithm uses heuristics to greedily search for node and attribute orderings. Greedy means that node and attribute selections are irrevocable decisions; that is, there is no backtracking. At each step, the algorithm chooses the node (attribute) that minimizes the heuristic value. An optimal solution cannot be guaranteed by greedy search: sometimes, choosing a node (attribute) with a larger heuristic value may lead to a lower overall cost than that of a greedy selection process.

Because of the detailed nature of the CRlc algorithm, we first present the node ordering component of CRlc followed by its attribute ordering component. For the node ordering component, we begin with the basic heuristic and then extend it to account for threshold matching. We then present the entire node ordering algorithm and analyze its computational complexity. Finally, we present the attribute ordering component.


3.4.1 Node Ordering

In CRlc, nodes are arranged in ascending order by a heuristic that we call the loss criterion. At the initial stage (that is, when selecting the first node), the node with the smallest loss criterion value is selected. At the next stage, the loss criterion values of the remaining nodes are recomputed and the node with the lowest value is chosen. Thus for p nodes there are p − 1 node selection stages.

The loss criterion value of a node is the sum of its attribute loss criterion values times the probability that the node will not match (the failure probability). The failure probability is one minus the node probability, where the node probability is the number of cases in the node’s cluster divided by the total number of cases. The loss criterion of an attribute is its loss probability times its cost. The loss probability of an attribute is the probability of not needing the attribute to assign a case to a node. The cost of obtaining an attribute depends on the stage of ordering the nodes: the cost at stage i is zero if the attribute was collected in a previous stage; otherwise, the cost is the given attribute cost.

Formal definitions of the node loss criterion and the attribute loss criterion are given below. We begin with some new notation:

NRE(Aj, i): set of remaining nodes at stage i whose norms exclude Aj = {Nm | Nm ∈ NR(i) ∧ Aj ∉ Attrs(Norm(Nm))},
Cost(Aj): Cost is a function that provides the cost to collect attribute Aj,
SCost(Aj, i): SCost is a function that provides the cost of acquiring attribute Aj at stage i:
  SCost(Aj, i) = 0 if Aj ∈ ∪_{Nm ∈ NC(i)} Attrs(Norm(Nm));
  SCost(Aj, i) = Cost(Aj) otherwise.


LC(Nm, i): loss criterion of node Nm at stage i (defined below),
ALC(Aj, i): loss criterion of attribute Aj at stage i (defined below),
ALP(Aj, i): loss probability of attribute Aj at stage i (defined below).

The loss criterion LC is formally defined as:

LC(Nm, i) = [1 − Freq(Nm)/TC] · Σ_{Aj ∈ Attrs(Norm(Nm))} ALC(Aj, i)    (3)

ALC(Aj, i) = ALP(Aj, i) · SCost(Aj, i)    (4)

ALP(Aj, i) = Σ_{Nm ∈ NRE(Aj, i)} Freq(Nm)/TC    (5)

At stage i, the node ordering algorithm chooses the node that minimizes LC(Nm, i) over all nodes in the set NR(i).

3.4.1.1 Example

To depict the loss criterion, consider the following hypothetical example with three nodes representing N1 = “incomplete installation,” N2 = “incompatible driver,” and N3 = “tape drive failure.” Assume that the attributes A1 (Port_Conflict) and A5 (Bi-directional_Port) are difficult to acquire; low cost attributes are A2 (Loads_Ok), A3 (Menu_Missing), and A4 (Drive_Recognized). Calculation results and expressions for the loss criterion are shown below. In the first stage, N3 is chosen because it has the lowest loss. In the second stage, only nodes N1 and N2 would be considered because node N3 has already been selected. Note that CRf would select node N2 in stage 1 because it has the highest frequency of cases.


The norms contain the following attributes (the attribute values are omitted here): Attrs(Norm(N1)) = {A1, A2}, Attrs(Norm(N2)) = {A1, A4, A5}, Attrs(Norm(N3)) = {A2, A3}.
Freq(N1) = 25, Freq(N2) = 45, Freq(N3) = 30 (so TC = 100).
Cost(A1) = 20, Cost(A2) = 3, Cost(A3) = 5, Cost(A4) = 6, Cost(A5) = 8.

The loss criterion for node 1 is:
LC(N1, 0) = [ALP(A1, 0)·SCost(A1, 0) + ALP(A2, 0)·SCost(A2, 0)] · [1 − Freq(N1)/TC]
ALP(A1, 0) = Freq(N3)/TC = 0.3; ALP(A2, 0) = Freq(N2)/TC = 0.45
LC(N1, 0) = (0.3·20 + 0.45·3) · 0.75 = 5.51

The loss criterion for node 2 is:
LC(N2, 0) = [ALP(A1, 0)·SCost(A1, 0) + ALP(A4, 0)·SCost(A4, 0) + ALP(A5, 0)·SCost(A5, 0)] · [1 − Freq(N2)/TC]
ALP(A1, 0) = Freq(N3)/TC = 0.3
ALP(A4, 0) = Freq(N1)/TC + Freq(N3)/TC = 0.55
ALP(A5, 0) = Freq(N1)/TC + Freq(N3)/TC = 0.55
LC(N2, 0) = (0.3·20 + 0.55·6 + 0.55·8) · 0.55 = 7.54

The loss criterion for node 3 is:
LC(N3, 0) = [ALP(A2, 0)·SCost(A2, 0) + ALP(A3, 0)·SCost(A3, 0)] · [1 − Freq(N3)/TC]
ALP(A2, 0) = Freq(N2)/TC = 0.45
ALP(A3, 0) = Freq(N1)/TC + Freq(N2)/TC = 0.7
LC(N3, 0) = (0.45·3 + 0.7·5) · 0.7 = 3.40

3.4.1.2 Threshold Matching

The attribute loss criterion defined in equation (4) does not reflect the use of threshold matching in case retrieval. Recall that with threshold matching only a fraction of the norm attributes need be matched. Thus, even if an attribute is an element of a norm, it may not be needed to make a matching decision. The number of norm attributes and the matching threshold determine whether an attribute is needed to make a matching decision. For example, if there are three norm attributes and the matching threshold is 0.66, then only two attributes are needed to make a matching decision.

Consider an attribute Aj whose attribute loss criterion ALC(Aj, i) needs to be evaluated at stage i. There are two sets of nodes remaining at this stage: (i) those whose norms do not contain Aj, denoted by NRE(Aj, i), and (ii) those whose norms contain Aj, denoted by NRI(Aj, i). The probability of making a classification decision without needing Aj is given by the sum of: (i) the probability of making a decision in the set NRE(Aj, i), and (ii) the probability of making a decision in the set NRI(Aj, i) without needing Aj. Let:

XEm denote the event that a classification decision has been made at node Nm in the set NRE(Aj, i) (that is, the matching threshold at the norm of node Nm has been exceeded),
XIm denote the corresponding event for node Nm in the set NRI(Aj, i),
P(XEm) = Freq(Nm)/TC, Nm ∈ NRE(Aj, i),
P(XIm) = Freq(Nm)/TC, Nm ∈ NRI(Aj, i),
Y denote the event that the attribute Aj has not been collected,
PE = probability that a decision is made in the set NRE(Aj, i) and Aj is not collected
   = P(XEm ∩ Y) = P(XEm)·P(Y | XEm) = P(XEm), since P(Y | XEm) = 1,
PI = probability that a decision is made in the set NRI(Aj, i) and Aj is not collected
   = P(XIm ∩ Y) = P(XIm)·P(Y | XIm),
where P(Y | XIm) is the probability that attribute Aj is not needed given that the norm of node Nm is matched.


We estimate P(Y | XIm) as the number of minimal conjunctions in Nm that do not contain Aj divided by the total number of minimal conjunctions in Nm.² A conjunction is minimal if it does not contain any more attributes than the number required by the matching threshold. The sets of minimal conjunctions are denoted by:

CONJ(Aj, Nm) = {CONJq | CONJq is a minimal conjunction of the norm of node Nm and Aj ∉ CONJq}, and
CONJ(Nm) = {CONJq | CONJq is a minimal conjunction of the norm of node Nm}.

Hence,

P(Y | XIm) = FRAC(Aj, Nm) = |CONJ(Aj, Nm)| / |CONJ(Nm)|
           = C(|Norm(Nm)| − 1, ⌈|Norm(Nm)|·MT⌉) / C(|Norm(Nm)|, ⌈|Norm(Nm)|·MT⌉)
           = (|Norm(Nm)| − ⌈|Norm(Nm)|·MT⌉) / |Norm(Nm)|

where |X| is the cardinality of set X, C(n, k) is the number of k-element subsets of an n-element set, FRAC(Aj, Nm) is a function that provides the fraction of minimal conjunctions in the norm of a node that do not contain Aj, and MT is the matching threshold, MT ≤ 1.

² An equivalent assumption is that, given that node Nm has matched, the match could have occurred at any of the minimal conjunctions with equal probability. This assumption is required to compute the loss criterion in polynomial time.


We revise the attribute loss probability in equation (5) to account for threshold matching effects. In the revised expression, the probability of not needing the attribute in norms that contain it (the With Probability, WP) is added to the expression in equation (5). Thus the revised definition of attribute loss probability is:

ALP(Aj, i) = Σ_{Nm ∈ NRE(Aj, i)} Freq(Nm)/TC + WP(Aj, i)    (6)

where

WP(Aj, i) = Σ_{Nm ∈ NRI(Aj, i)} FRAC(Aj, Nm) · Freq(Nm)/TC    (7)

Continuing with the previous example, we show the impact of the new ALP expression on LC(N1, 0); note that LC(N2, 0) will also be affected. Assume that MT is 0.66. In the revised calculation, N2 alone contributes to the right hand side of equation (7) for the first norm attribute of node N1.

LC(N1, 0) = [ALP(A1, 0)·SCost(A1, 0) + ALP(A2, 0)·SCost(A2, 0)] · [1 − Freq(N1)/TC]
ALP(A1, 0) = Freq(N3)/TC + WP(A1, 0) = 0.3 + WP(A1, 0)
WP(A1, 0) = (Freq(N2)/TC) · FRAC(A1, N2) = 0.45 · 0.33 = 0.15
ALP(A2, 0) = Freq(N2)/TC = 0.45
LC(N1, 0) = (0.45·20 + 0.45·3) · 0.75 = 7.76

Note that the loss criterion value of a node will always increase (never decrease) due to partial matching. With partial matching there is a positive probability that an attribute in a norm will not be needed; hence the ALP for attributes should increase due to partial matching, and thus the LC for the node should increase. In the above example, the loss criterion value for node 1 increased from 5.51 to 7.76 due to partial matching with a matching threshold of 0.66.
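The loss criterion computations above are easy to reproduce in code. The sketch below (our function names, not the paper's) implements equations (3) through (7) for the running example and recovers the values 5.51, 7.54, 3.40, and, with MT = 0.66, the revised 7.76.

```python
from math import ceil

# Running example of Section 3.4.1.1.
norms = {"N1": ["A1", "A2"], "N2": ["A1", "A4", "A5"], "N3": ["A2", "A3"]}
freq = {"N1": 25, "N2": 45, "N3": 30}
cost = {"A1": 20, "A2": 3, "A3": 5, "A4": 6, "A5": 8}
TC = sum(freq.values())

def frac(attr, node, mt):
    """FRAC: fraction of minimal conjunctions of a norm that omit attr."""
    n = len(norms[node])
    k = ceil(n * mt)                 # pairs needed to reach the threshold
    return (n - k) / n

def alp(attr, remaining, mt=None):
    """Attribute loss probability, eq. (5); eq. (6) when a threshold is given."""
    p = sum(freq[m] / TC for m in remaining if attr not in norms[m])
    if mt is not None:               # WP term of eq. (7)
        p += sum(frac(attr, m, mt) * freq[m] / TC
                 for m in remaining if attr in norms[m])
    return p

def lc(node, remaining, mt=None):
    """Node loss criterion, eqs. (3) and (4)."""
    fail = 1 - freq[node] / TC
    return fail * sum(alp(a, remaining, mt) * cost[a] for a in norms[node])

rem = list(norms)                    # stage 0: nothing selected yet
# lc("N1", rem) ≈ 5.51, lc("N2", rem) ≈ 7.54, lc("N3", rem) ≈ 3.40,
# so N3 is selected first; lc("N1", rem, mt=0.66) ≈ 7.76.
```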


3.4.1.3 Algorithm and Analysis

Figure 2 depicts the CRlc node ordering algorithm. The algorithm is simple because most of the computation occurs in calculating the loss criterion (LC). The outer loop iterates over the node selection stages. The inner loop computes the loss criterion for each node not selected in a previous stage. After processing each stage, the attribute costs of the norm attributes of the selected node are set to zero.

Input: N, the set of nodes
Output: S, the list of nodes ordered by the loss criterion
Procedure
1. S := φ;
2. For i := 0 to |N| − 2
   2.1 BestLC := HIGHVALUE;
   2.2 M := Remove(N, S); {M is the set of remaining nodes obtained by removing the nodes in list S from set N}
   2.3 For each m ∈ M
       2.3.1 NewLC := LC(m, i);
       2.3.2 If NewLC < BestLC then
             2.3.2.1 BestNode := m;
             2.3.2.2 BestLC := NewLC;
   2.4 S := Append(S, BestNode); {append BestNode to S}
   2.5 Assign 0 to the costs of the norm attributes of BestNode;
3. S := Append(S, Remove(N, S)); {append the last remaining node to S}
4. Return S;

Figure 2. Algorithm CRlc Node Ordering

The complexity of the CRlc node ordering algorithm is dominated by the loss criterion computation. The outer loop (step 2) executes p − 1 times, once for each node except the last node in the ordering. The inner loop (step 2.3) executes p/2 times on average, because on average p/2 nodes have not yet been selected. There are two implied loops in the LC computation: to calculate the LC for a node, the attribute loss criterion is calculated r times, assuming an average of r attributes per norm; and to compute the attribute loss criterion, each unselected node must be visited, an average of p/2 nodes. Thus, the complexity of the CRlc node ordering algorithm is O(rp³).

3.4.2 Attribute Ordering

For each selected node, an attribute ordering can be computed using a loss criterion calculated for the attributes of the node’s norm. This loss criterion is the probability of not needing the attribute to make a matching decision at the node (the attribute loss probability) times the cost of the attribute. The attribute loss probability is the sum of the probabilities of the minimal conjunctive terms of the norm that do not contain the attribute of interest but exceed the matching threshold. Computing and storing conditional probabilities for minimal conjunctive terms would make the attribute loss criterion exponentially complex. Therefore, we have implemented a simple heuristic in its place: attributes are ordered by one minus their predictable probability, times their cost. The predictable probability is a measure of an attribute’s usefulness given the node. Note that the cost is zero if an attribute has been collected in a previous stage. More precisely, the Attribute Cost Selection Heuristic (ACSH) is defined as:

ACSH(i, Nm, α) = Min_j ((1 − P[Aj = V | Nm]) · SCost(Aj, i)), Aj ∈ AR(Nm, α)    (8)

where P[Aj = V | Nm] is the predictable probability of attribute Aj at node Nm.

The overall complexity of the loss criterion algorithm (CRlc) is O(LCNO + p·LCAO), where p is the number of nodes and LCNO and LCAO are the complexities of the node and attribute ordering components of CRlc. The complexity of LCNO is O(rp³), as described in Section 3.4.1.3. The complexity of LCAO is O(r log r) because, on average, r attributes per norm are sorted by their predictable probability. Therefore, the overall complexity of the loss criterion algorithm is polynomial in the number of nodes and attributes.
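Pulling the pieces of Section 3.4 together, the Figure 2 loop is a straightforward greedy selection. A compact transliteration follows; the paper's implementation was in Think Pascal, so this Python rendering and its parameter choices are ours.

```python
def order_nodes(nodes, lc, zero_costs):
    """Greedy CRlc node ordering (Figure 2): repeatedly pick the remaining
    node with the smallest loss criterion, then zero its norm attribute costs."""
    order, remaining = [], set(nodes)
    while len(remaining) > 1:
        stage = len(order)
        best = min(remaining, key=lambda n: lc(n, remaining, stage))
        order.append(best)
        remaining.discard(best)
        zero_costs(best)             # collected attributes now cost nothing
    order.extend(remaining)          # the last node needs no comparison
    return order

# With the structures of the previous sketch:
# order_nodes(list(norms),
#             lambda n, rem, stage: lc(n, rem),
#             lambda n: cost.update({a: 0 for a in norms[n]}))
```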

4. Joint Concept Formation and Retrieval

In contrast to case based systems, ID3c jointly computes a decision tree that combines concept definitions and an information acquisition context. Attributes are collected in an order that depends upon the case and the strategy prescribed in the decision tree. The ID3c algorithm demonstrates that combining concept formation and retrieval strategy formation can significantly reduce information acquisition costs.

Decision tree induction algorithms, such as ID3 [Quinlan, 1986], typically construct a decision tree using a recursive partitioning approach. The attribute names used to label non-leaf nodes are determined using an attribute selection criterion and a set of cases. Once a non-leaf node has been labeled, q outgoing arcs are created at this node, where q is the number of possible states of the attribute. The set of cases used to label the non-leaf node is then partitioned into q subsets, where the state of the labeling attribute is the same within each subset. Creation of non-leaf nodes continues along each path of the tree until a stopping condition is reached, at which stage a leaf node is created. Leaf nodes are labeled using a classification function. A more detailed description of this induction process is presented in Appendix A1.

The design of the ID3c algorithm is identical to that of the ID3 algorithm except for a modification in the manner in which attributes are selected (the attribute selection criterion). In the ID3 algorithm, attributes are selected based solely upon their information content, measured by the reduction in information entropy [Shannon and Weaver, 1949]; the attribute that provides the highest reduction in information entropy is selected. The ID3c algorithm, on the other hand, selects attributes based upon information content per unit information acquisition cost. Thus, the attribute selection criterion used in the ID3c algorithm is the entropy reduction expression used in ID3 divided by the information acquisition cost of the attribute. A more precise definition of the attribute selection criterion is presented in Appendix A2.

In summary, the attribute selection criterion for ID3c is designed to reduce the expected cost of classifying a case. The corresponding criterion in ID3 is designed to reduce the expected number of attributes needed to classify a case.
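A sketch of ID3c's selection criterion, information gain divided by acquisition cost, is shown below. Function names are ours; Appendix A2 gives the paper's formal definition.

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_per_cost(cases, labels, attr, attr_cost):
    """Entropy reduction from splitting on attr, per unit acquisition cost.
    ID3 would maximize the numerator alone; ID3c maximizes this ratio."""
    base = entropy(labels)
    remainder = 0.0
    for v in set(c[attr] for c in cases):
        subset = [l for c, l in zip(cases, labels) if c[attr] == v]
        remainder += (len(subset) / len(labels)) * entropy(subset)
    return (base - remainder) / attr_cost
```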

5. Experimental Comparison

In this section we describe simulation experiments to study: (i) joint versus separate optimization of concept formation and retrieval (that is, ID3c versus CRlc and CRf), and (ii) frequency based versus cost based case retrieval (that is, CRf versus CRlc). The primary measure of performance of an algorithm is the expected information acquisition cost to assign a case to a cluster. Accuracy and the number of attributes collected are secondary measures. We discuss factors affecting performance, experimental design, experimental data, experimental procedures, and results.

5.1 Factors Affecting Performance

To explore performance differences between the algorithms, we use two quantitative variables or covariates: (i) attribute cost coefficient of variation, and (ii) matching threshold. The qualitative variable, namely algorithm, is coded using two 0−1 indicator variables.


5.1.1 Algorithm

There are three algorithms used in this study: (i) CRf, (ii) CRlc, and (iii) ID3c. We expect CRlc to dominate CRf because the former uses costs in computing an order while the latter does not. We also expect ID3c to dominate CRlc because ID3c jointly optimizes concept formation and retrieval strategy formation whereas CRlc separately optimizes these tasks. Thus we expect the following:

Proposition 1: ID3c should incur lower information acquisition costs than CRlc, which should incur lower costs than CRf.

5.1.2 Attribute Cost Coefficient of Variation

In some situations, certain attributes can be much costlier than others; that is, the attribute cost coefficient of variation can be high. In these situations, it may prove extremely important to select a particular information acquisition strategy. When attribute costs vary, an algorithm that develops concepts and/or a retrieval strategy considering attribute costs should perform at a relative advantage over one that ignores attribute costs. At a higher coefficient of variation, we expect CRf costs to increase relative to CRlc costs because CRlc is sensitive to costs while CRf is not. Similarly, we expect the coefficient of variation to affect the relative performance of ID3c versus CRlc because ID3c considers costs in concept formation whereas CRlc does not.

Proposition 2: The cost performance difference between CRf and CRlc becomes larger as the attribute cost coefficient of variation increases.

Proposition 3: The cost performance difference between CRlc and ID3c becomes larger as the attribute cost coefficient of variation increases.

5.1.3 Matching Threshold

We define the degree of search in a case retrieval algorithm as the average number of attributes collected to match a case. The degree of search can be controlled by varying the matching threshold: for a case to match at a norm, the number of case attributes that match norm attributes must be greater than or equal to the matching threshold. Increasing the matching threshold increases the extent of fit. The use of a matching threshold in case retrieval is similar to pruning in decision tree induction [Quinlan, 1987], although the analogy is not exact because pruning techniques typically employ significance testing whereas matching thresholds do not.

The matching threshold is a factor relevant only to the case retrieval algorithms. Since the node ordering component of CRf does not consider partial matching, its performance relative to CRlc could be poor when there is more potential for partial matching (that is, when the matching threshold is low). However, as the matching threshold increases, the potential for partial matching is reduced and the relative performance of CRf with respect to CRlc can be expected to improve.

Proposition 4: CRlc performs better than CRf over the entire range, but the cost performance difference between CRf and CRlc becomes smaller as the matching threshold increases.

5.2 Experimental Design

The response function for our model is:

E(Cost) = β0 + β1(CV) + β2(MT) + β3(i1) + β4(i2) + {interaction-terms}    (9)

The response variable, E(Cost), is calculated as the average information acquisition cost of assigning an unseen case to a cluster.³ The covariate CV (coefficient of variation of the attribute costs) is chosen between 0 and 0.5. The other covariate, MT (matching threshold), is chosen between 0.5 and 1. The indicator variables i1 and i2 are 0−1 variables coded as follows: CRf: i1 = 1, i2 = 0; CRlc: i1 = 0, i2 = 1; and ID3c: i1 = i2 = 0. (This coding makes ID3c the baseline; since ID3c does not use a matching threshold, its response function contains no MT terms.) Finally, {interaction-terms} represents the following second and third order terms: β5(i1·CV), β6(i2·CV), β7(i1·MT), β8(i2·MT), β9(CV·MT), β10(i1·CV·MT), and β11(i2·CV·MT).

³ We also observe the classification accuracy and the number of attributes collected by an algorithm in these experiments.

5.3 Experimental Data

Four experiments were conducted to investigate propositions 1 through 4. The main difference among these experiments is that different data sets were used. Two of the four data sets were artificially generated and the remaining two were taken from real domains. All the data sets are preclassified, and the classes are non-overlapping. Since we were not interested in the clustering component of case based systems, the cases in a cluster were chosen as those with the same class.

The artificial data sets were generated by a program based on specifications described in [Bisson, 1991]. The data set generator can control the number of cases, classes, attributes, and states per attribute, and the complexity of the rule sets for each class. Data set 1 (DS1) contains 4 equally distributed classes and 10 input attributes. Data set 2 (DS2) contains 8 moderately skewed classes and 15 input attributes; half the cases in data set 2 are uniformly distributed between two classes while the remaining cases are uniformly distributed among the other 6 classes. Both artificial data sets share the following characteristics: (i) the number of cases is 200, (ii) the average number of states per attribute is 3 (between 2 and 5), and (iii) the average size of the rule sets is 2 rules per class with 3 attributes per rule.

The two real domain data sets, Zoo and Lymphography, were selected from the Repository of Machine Learning Databases and Domain Theories [Murphy and Aha, 1991].

We also observe classification accuracy and number of attributes collected by an algorithm in these experiments.

27

data sets have a reasonable number of attributes and classes. In addition, they have only nominal attributes. The Zoo data set has 7 classes, 16 attributes (mostly Boolean), and 101 cases. The Lymphography data set has 4 classes, 18 attributes (mix of Boolean and nominal with a few states), and 148 cases. In this data set, 2 classes are infrequent compared to the other classes. Both real domain data sets have some noise from conflicting cases. Two cases conflict if they have identical values for the input attributes but different values for the class. 4

5.4 Experimental Procedure

The two case retrieval algorithms use norms that are computed by applying norm construction thresholds, namely, the predictable probability and the predictive probability. Rather than use arbitrary construction threshold values, we selected values to achieve maximum classification accuracy in a pilot study. In the pilot study, both norm construction values were independently varied in steps of 0.1 from 0.2 to 0.9. For each combination of values for the thresholds, we drew 66% cases from the data set and constructed norms using these cases. The remaining (34%) cases were assigned to clusters, resulting in one observation for classification accuracy. The average accuracy across 25 splits was then taken. The best values for each data set were used for norm construction thresholds in the main experiments (see Table 3).

Data Set Lymph Zoo DS1 DS1

4

Table 3. Norm Construction Thresholds Predictive Probability Predictable Probability 0.5 0.5 0.5 0.4 0.46 0.32 0.5 0.38

The algorithms and experiments were implemented using Think Pascal on a Quadra 700.

28

The following procedure, recommended by Weiss and Kulikowski (1991), was used to generate observations. A data set was randomly split into a training set (66% of cases) and a holdout set (34% of cases). The training set was used to construct norms for clusters of cases with the same class. The same training set was used to construct node and attribute orders using the CRf and CRlc algorithms, and a decision tree using the ID3c algorithm. The two sets of norms and the decision tree were then used to assign cases in the holdout set. Because the choice of a training set can affect the performance of the three algorithms, the algorithms were run on 25 different, randomly generated training sets. One observation for each of the three algorithms was the average cost across the 25 training sets. To avoid random differences occurring from the choice of the training set, the same 25 training sets were used for each observation. For a given data set, the experiment generated 300 observations, 100 for each algorithm. Each observation was the average cost over 25 splits. To reduce unnecessary variance in the response variable, the values for CV and MT were held constant over the 25 splits of an observation. In addition, since the average cost of the attributes is not a factor of interest, it was held constant over all observations and data sets. 5.5 Results The parameter estimates for the response function in equation (9) for the four data sets are presented in Tables 4, 5, 6 and 7 respectively. In these tables, only those variables that were found to be significant at a P-value of 0.10 or below are shown.

29

5.5 Results

The parameter estimates for the response function in equation (9) for the four data sets are presented in Tables 4, 5, 6, and 7 respectively. In these tables, only those variables that were found to be significant at a P-value of 0.10 or below are shown.

Table 4. Parameter Estimates for Lymph Data (R-square 0.9178; Adj R-sq 0.9149)

Variable    DF   Parameter Est.   Std. Error    t-value    P-value
INTERCEPT   1    14.330270        0.73800292    19.418     0.0001
CV          1    -8.471102        2.21343471    -3.827     0.0002
i1          1    45.301786        2.62000580    17.291     0.0001
i2          1    38.962328        2.44214934    15.954     0.0001
CV*i1       1    10.398777        3.83378138    2.712      0.0071
MT*i1       1    -12.392094       3.13026939    -3.959     0.0001
MT*i2       1    -16.056501       3.13026939    -5.129     0.0001

Table 5. Parameter Estimates for Zoo Data (R-square 0.9867; Adj R-sq 0.9864)

Variable    DF   Parameter Est.   Std. Error    t-value    P-value
INTERCEPT   1    11.183819        0.31553049    35.444     0.0001
CV          1    -5.585208        0.94634603    -5.902     0.0001
i1          1    66.447159        1.12017403    59.319     0.0001
i2          1    47.427607        1.04413214    45.423     0.0001
CV*i1       1    3.315684         1.63911941    2.023      0.0440
MT*i1       1    -34.639359       1.33833539    -25.882    0.0001
MT*i2       1    -22.131908       1.33833539    -16.537    0.0001

Table 6. Parameter Estimates for DS1 Data (R-square 0.9714; Adj R-sq 0.9707)

Variable    DF   Parameter Est.   Std. Error    t-value    P-value
INTERCEPT   1    14.153754        0.25084285    56.425     0.0001
CV          1    -5.135621        0.85431220    -6.011     0.0001
i1          1    31.623122        0.73051499    43.289     0.0001
i2          1    19.915360        0.73051499    27.262     0.0001
CV*i1       1    6.016483         1.20817990    4.980      0.0001
CV*i2       1    4.491410         1.20817990    3.718      0.0002
MT*i1       1    -20.399257       0.85431220    -23.878    0.0001
MT*i2       1    -12.066499       0.85431220    -14.124    0.0001

Table 7. Parameter Estimates for DS2 Data (R-square 0.9760; Adj R-sq 0.9755)

Variable    DF   Parameter Est.   Std. Error    t-value    P-value
INTERCEPT   1    16.809552        0.13330435    126.099    0.0001
CV          1    -9.563200        0.45400351    -21.064    0.0001
i1          1    18.138876        0.38821448    46.724     0.0001
i2          1    9.335724         0.38821448    24.048     0.0001
CV*i1       1    9.514370         0.64205792    14.819     0.0001
CV*i2       1    8.084477         0.64205792    12.592     0.0001
MT*i1       1    -14.705913       0.45400351    -32.392    0.0001
MT*i2       1    -12.309786       0.45400351    -27.114    0.0001


Table 8 shows the response functions for the various algorithms and data sets. To obtain a specific response function (for example, for CRf and Lymph), set i1 = 1 and i2 = 0 in equation (9) and substitute parameter values from Table 4. Table 8 shows that the ID3c algorithm was cheaper than the CRlc algorithm, which in turn was cheaper than the CRf algorithm. Hence, proposition 1 is supported.

Table 8. Response Functions

Lymph:
E(Cost) = 59.66 + 1.92*CV − 12.39*MT    (CRf)
E(Cost) = 53.29 − 8.47*CV − 16.05*MT    (CRlc)
E(Cost) = 14.33 − 8.47*CV               (ID3c)

Zoo:
E(Cost) = 77.62 − 2.27*CV − 34.64*MT    (CRf)
E(Cost) = 58.60 − 5.58*CV − 22.13*MT    (CRlc)
E(Cost) = 11.18 − 5.58*CV               (ID3c)

DS1:
E(Cost) = 45.77 + 0.88*CV − 20.39*MT    (CRf)
E(Cost) = 34.06 − 0.64*CV − 12.06*MT    (CRlc)
E(Cost) = 14.15 − 5.13*CV               (ID3c)

DS2:
E(Cost) = 34.94 − 0.05*CV − 14.71*MT    (CRf)
E(Cost) = 26.14 − 1.48*CV − 12.31*MT    (CRlc)
E(Cost) = 16.80 − 9.56*CV               (ID3c)

Table 8 reveals support for proposition 2, but only partial support for proposition 3. Concerning proposition 2, note that as CV increases, the performance difference between CRf and CRlc becomes larger. Concerning proposition 3, note the differential effect of CV in the artificial and real data sets. An increase in CV causes the performance of CRlc to deteriorate relative to ID3c in DS1 and DS2. In contrast, the performance difference is not affected by CV in Zoo and Lymph. Note, however, that in Zoo and Lymph, ID3c dominates CRlc even at low levels of CV. This finding implies that a high CV is not required for ID3c to outperform CRlc.

For the Zoo, DS1, and DS2 data sets, increasing MT reduces the performance difference between CRf and CRlc. As MT becomes close to one, the effect of partial matching on the case retrieval algorithms vanishes; since CRf ignores partial matching, its performance becomes closer to CRlc at high values of MT. For the Lymph data set, however, this effect of MT does not hold. Thus proposition 4 is supported for 3 of the 4 data sets.

To depict the magnitudes of the performance differences, we show the impact of MT and CV on the cost and accuracy of the three algorithms for the Lymph data set. Each point in these graphs is the average of 25 splits. Figure 3 shows that MT has no substantial effect on the cost difference between the two case retrieval algorithms: in some parts of the range the difference increases, whereas in other parts it decreases slightly. The dome shaped curves of Figure 3 can be explained as follows. Initially, as MT increases, more attributes need to be collected to match a case, so the average cost per case increases. However, increasing MT beyond a point causes more early node failures, leading to a decrease in the average cost per case.

Figure 4 shows that the impact of CV on the relative cost performance of the three algorithms is not substantial. As CV increases, there is a slight increase in the cost performance difference between the two case retrieval algorithms. There is, however, no substantial cost difference between CRlc and ID3c over the entire CV range.

Figures 5 and 6 demonstrate the effect of MT and CV on accuracy. Figure 5 shows that increasing MT causes substantial overfitting by the case retrieval algorithms. At low levels of MT, the case retrieval algorithms have slightly lower accuracy than ID3c. CRlc has better accuracy than CRf at low MT levels because CRlc has a more global strategy for choosing the node order. As MT increases, the accuracy difference disappears as the orders either become the same or cease to affect accuracy. Figure 6 shows that increasing CV has no effect on the accuracy of the case retrieval algorithms. However, when CV increases, ID3c collects more cheap attributes and tends to overfit.


[Figure 3: cost per case versus matching threshold (0.5 to 0.94) for CRf, CRlc, and ID3c.]
Figure 3. Impact of MT on Cost

[Figure 4: cost per case versus coefficient of variation (0 to 0.5) for CRf, CRlc, and ID3c.]
Figure 4. Impact of CV on Cost

[Figure 5: accuracy versus matching threshold (0.5 to 1) for CRf, CRlc, and ID3c.]
Figure 5. Impact of MT on Accuracy

[Figure 6: accuracy versus coefficient of variation (0 to 0.5) for CRf, CRlc, and ID3c.]
Figure 6. Impact of CV on Accuracy

Figures 7 and 8 demonstrate the effect of MT and CV on the number of attributes collected by the different algorithms. As expected, the number of attributes collected is highly correlated with cost. Since we do not vary the average cost in these experiments, at low levels of CV, the cost is almost perfectly correlated with the number of attributes collected. Near the high end of the CV range, ID3c collects slightly more attributes to exploit the availability of cheap attributes.


[Figure 7: number of attributes collected versus matching threshold (0.5 to 1) for CRf, CRlc, and ID3c.]
Figure 7. Impact of MT on Attributes Collected

[Figure 8: number of attributes collected versus coefficient of variation (0 to 0.5) for CRf, CRlc, and ID3c.]
Figure 8. Impact of CV on Attributes Collected

5.6 Discussion

There are two major lessons to draw from this study. First, the substantial cost differences between the algorithms indicate that separating concept formation and retrieval strategy formation is a design limitation. It seems unlikely that an algorithm that separates these tasks can compete with an algorithm that performs them jointly. Second, cost considerations can significantly reduce information acquisition costs even if concept formation and retrieval strategy formation are separated; the substantial cost difference between cost based and frequency based retrieval supports this conclusion. System designers must therefore pay careful attention to the design of the retrieval strategy in case based systems. However, given two possible interventions, (i) designing a better retrieval strategy and (ii) jointly performing concept formation and retrieval strategy formation, the second intervention appears to have the larger impact.

Before leaving this section, we raise two broad issues concerning cost considerations in case based systems. The first issue is the distinction between sequential and parallel acquisition of attributes. As mentioned earlier, the results in this paper apply to case based systems where information can be acquired sequentially and hence information costs are variable. However, even when information costs appear fixed, a variety of factors may require that information costs be treated as variable instead of fixed. These factors include: (i) changes in the competitive environment due to deregulation, (ii) unbundling of product and information costs by firms, (iii) explicit pricing of information by vendors of information services (for example, search agents on the Internet), and (iv) outsourcing of information collection. Thus, reducing variable information costs could become an important addition to a manager’s responsibilities.

The second issue concerns estimating the information costs required by the models developed here. Estimating information costs may not always be easy. Estimation difficulties include: (i) sequential dependencies between the costs of acquiring attributes, (ii) variance in the cost of collecting an attribute, and (iii) the relationship between the quality of information and its cost. Although this study used a simple model of information costs, we believe that our qualitative results will extend to more complex models of information costs.

6. Summary and Conclusion

We studied the problem of incorporating information acquisition costs into case retrieval algorithms. In a number of business and engineering tasks, attribute costs are significant and unequal, and information acquisition may occur sequentially. A retrieval strategy that ignores the cost of acquiring information may therefore be suboptimal. A major difficulty with lowering costs in case based systems is that concept formation and retrieval strategy formation are separated. To study the implications of this limitation, we developed two cost sensitive algorithms, CRlc and ID3c, representative of separate and joint concept formation and retrieval strategy formation, respectively. We experimentally


compared the cost sensitive algorithms to a frequency based algorithm (CRf). Our results demonstrated that the cost sensitive algorithms produced significantly lower costs than CRf and that ID3c's costs dominated CRlc's. A useful extension of this research would be to study the performance of instance-based algorithms, focusing on whether information acquisition costs can be lowered without significantly reducing accuracy. Instance-based algorithms pose a special challenge because they search the entire space of concept definitions to return the K nearest neighbors using a distance measure rather than a threshold. The challenge is to develop a decision theoretic framework that evaluates whether more search in the norm space is worthwhile given the accuracy achieved and the cost incurred so far. Other extensions could include the use of economic considerations in the design of other machine learning techniques. Future research should also address topics such as the effects of different measurement assumptions for costs and benefits, functional relationships between attribute accuracy and acquisition cost, and acquisition costs that depend on the order in which attributes are acquired. Such research would increase the effectiveness of machine learning techniques for business decision making.


Appendix A1

Formally, the induction process can be described using the following definitions. Let:

D: the set of cases in the training set,
A1, A2, .., An: the n observable attributes that may be used to classify an object,
xi1, xi2, .., xiq: the q possible states for attribute Ai,
c1, c2, .., cm: the m possible ways in which an object may be classified,
L: a subset of cases in the training set, referred to as the "current set,"
k(L): a classification function that determines how a leaf node is labeled,
g(Ai, L): a criterion value for attribute Ai given L,
w: a cut-off value that determines when a leaf node should be created.

The steps in the induction algorithm are described below. Initially, no nodes have been created.

Step 1. Set L = D.
Step 2. Using L, select attribute Aj such that g(Aj, L) ≥ g(Ai, L) for i = 1, 2, .., n. If g(Aj, L) ≤ w, go to Step 5.
Step 3. Create a non-leaf node labeled Aj and generate q arcs originating at this node, one for each state of Aj; the arcs are labeled xjk for k = 1, 2, .., q.
Step 4. For each arc xjk, determine M ⊆ L such that Aj = xjk for every case in M. Set L = M and go to Step 2.
Step 5. Create a leaf node. Label this leaf node cλ such that k(L) = cλ, where cλ ∈ {c1, c2, .., cm}.
Step 6. If leaf nodes have been created in all paths of the tree, stop; otherwise, return to Step 4.
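To make the control flow concrete, the following Python sketch implements the loop above under stated assumptions: each case is represented as an (attribute_dict, class_label) pair, and the criterion g, classification function k, and cutoff w are supplied as parameters. All names are illustrative and do not come from the paper.

# A minimal sketch of the induction loop in Steps 1-6. The case
# representation (each case is an (attribute_dict, class_label) pair)
# and all names here are illustrative assumptions.

def induce(cases, attributes, g, k, w):
    """Recursively build a decision tree from the current set of cases (L)."""
    # Step 2: select the attribute Aj maximizing the criterion g(Aj, L).
    best = max(attributes, key=lambda a: g(a, cases)) if attributes else None
    if best is None or g(best, cases) <= w:
        # Step 5: create a leaf node labeled by the classification function k(L).
        return {"leaf": k(cases)}
    # Step 3: create a non-leaf node with one arc per observed state of Aj.
    node = {"attribute": best, "arcs": {}}
    remaining = [a for a in attributes if a != best]
    for state in {attrs[best] for attrs, _ in cases}:
        # Step 4: partition the current set on Aj = xjk and recurse (Step 2).
        subset = [c for c in cases if c[0][best] == state]
        node["arcs"][state] = induce(subset, remaining, g, k, w)
    return node

Calling induce(D, [A1, .., An], g, k, w) corresponds to Step 1, and the recursion terminates exactly where the stopping rule g(Aj, L) ≤ w fires.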


Complexity

The complexity of the above induction process is exponential in the number of attributes but polynomial in the size of the training set [Quinlan, 1986]. The maximum number of partitions that can be evaluated is O(q^n), where q is the average number of states per attribute and n is the total number of attributes. This maximum occurs in a complete tree in which each path is of length n-1; in practice, induced trees are much smaller. For each partition evaluated, a sequential scan of the training set is required, and the average partition size is proportional to TC, the number of cases in the training set. Therefore, the complexity of the induction process is O(TC ⋅ q^n).
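As a purely hypothetical illustration of this bound, suppose q = 3 states per attribute, n = 10 attributes, and TC = 1,000 training cases. The worst case is then on the order of 1,000 ⋅ 3^10 ≈ 5.9 × 10^7 case examinations; in practice, the induced tree evaluates far fewer partitions.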


Appendix A2

There are three important factors that must be considered in the design of an induction algorithm: (a) the attribute selection criterion, (b) the stopping rule, and (c) the classification function. The attribute selection criterion determines the attribute used to label a non-leaf node. The stopping rule determines when it is no longer beneficial to create non-leaf nodes in a path of the tree. The classification function determines how a leaf node should be labeled. In this appendix, we describe the attribute selection criterion, stopping rule, and classification function used by the ID3c algorithm.

Attribute Selection Criterion

Formally, the attribute selection criterion in the ID3c algorithm can be expressed as:

g(Ai, L) = [EN(L) − EN(Ai, L)] / Cost(Ai),

where

EN(L) = −Σ_{r=1}^{m} f(cr, L) log2 f(cr, L) is the initial entropy,

EN(Ai, L) = Σ_{k=1}^{q} f(Ai = xik, L) EN(L | Ai = xik) is the expected entropy after observing Ai,

EN(L | Ai = xik) = −Σ_{r=1}^{m} f(cr, L | Ai = xik) log2 f(cr, L | Ai = xik) is the entropy in the partition L | Ai = xik,

f(cr, L) is the proportion of cases in L belonging to class cr,

f(Ai = xik, L) is the proportion of cases in L where attribute Ai = xik, and

f(cr, L | Ai = xik) is the proportion of cases in the subset of L with Ai = xik that belong to class cr.

The algorithm selects the attribute with the highest criterion value, that is, it selects Aψ such that g(Aψ, L) ≥ g(Ai, L) for i = 1, 2, .., n. Thus the ID3c algorithm chooses the attribute with the maximum information content per unit dollar spent in observing the attribute. The attribute selection criterion used by ID3c is designed to achieve a level of accuracy comparable to ID3 at lower cost.

The Stopping Rule

The stopping rule of the algorithm is the same as that of the ID3 algorithm, with the cutoff value w set to zero. Thus, the stopping rule for ID3c is: stop if g(Ai, L) = 0 for all unobserved attributes.

The Classification Function

The classification function used by the algorithm is the same as that used by the ID3 algorithm: it labels a leaf node with the class having the highest frequency in the current set of cases. Thus, the classification function k(L) used by ID3c is: choose cλ such that f(cλ, L) ≥ f(cr, L) for r = 1, 2, .., m.
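To make these three components concrete, here is a small Python sketch under the same illustrative case representation as in Appendix A1; the helper names and the cost dictionary are assumptions for exposition, not the authors' implementation.

import math
from collections import Counter

# Sketch of ID3c's selection criterion, stopping rule, and classification
# function. Assumes each case is an (attribute_dict, class_label) pair and
# cost maps each attribute to its acquisition cost; names are illustrative.

def entropy(cases):
    """EN(L): entropy of the class distribution in the current set of cases."""
    counts = Counter(label for _, label in cases)
    total = len(cases)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def g(attr, cases, cost):
    """g(Ai, L): entropy reduction per unit acquisition cost of attr."""
    total = len(cases)
    after = 0.0  # EN(Ai, L): expected entropy after observing attr
    for state in {attrs[attr] for attrs, _ in cases}:
        subset = [c for c in cases if c[0][attr] == state]
        # Each partition's entropy is weighted by f(Ai = xik, L).
        after += (len(subset) / total) * entropy(subset)
    return (entropy(cases) - after) / cost[attr]

def select_attribute(attributes, cases, cost):
    """Pick the attribute maximizing g; None signals the stopping rule (w = 0)."""
    if not attributes:
        return None
    best = max(attributes, key=lambda a: g(a, cases, cost))
    return best if g(best, cases, cost) > 0 else None

def classify_leaf(cases):
    """k(L): label a leaf with the most frequent class in the current set."""
    return Counter(label for _, label in cases).most_common(1)[0][0]

With cost[attr] = 1 for every attribute, g reduces to ID3's ordinary entropy-reduction criterion, so in this sketch ID3c differs from ID3 only in how it prices information.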


References

Aha, D., Kibler, D., and Albert, M. "Instance-Based Learning Algorithms," Machine Learning 6 (1991), 37-66.

Allen, B. "Case Based Reasoning: Business Applications," Communications of the ACM 37, 3 (March 1994), 40-42.

Barletta, R. and Buta, P. "Market Surveillance Using Case Based Reasoning," in Proceedings of the First International Conference on AI Applications on Wall Street, New York, October 1991.

Becker, J. "A Model for the Encoding of Experiential Information," in Computer Models of Thought and Language, R. Schank and K. Colby (eds.), Freeman, San Francisco, CA, 1973, pp. 396-435.

Bisson, H. "Evaluation of Learning Systems: An Artificial Data-Based Approach," in Proceedings of the European Working Session on Machine Learning, Y. Kodratoff (ed.), Springer-Verlag, Berlin, F.R.G., 1991.

Breese, J. and Heckerman, D. "Decision-Theoretic Case-Based Reasoning," forthcoming in IEEE Transactions on Systems, Man, and Cybernetics, August 1995; also available as Microsoft Technical Report MSR-TR-95-03.

Creecy, R., Masand, B., Smith, S., and Waltz, D. "Trading MIPS and Memory for Knowledge Engineering," Communications of the ACM 35, 8 (August 1992), 48-64.

Dos Santos, B. and Mookerjee, V. "Expert System Design: Minimizing Information Acquisition Costs," Decision Support Systems 9 (1993), 161-181.

Feigenbaum, E. "The Simulation of Verbal Learning Behavior," in Computers and Thought, E. Feigenbaum and J. Feldman (eds.), McGraw-Hill, New York, 1963.

Fisher, D. "Knowledge Acquisition via Incremental Conceptual Clustering," Machine Learning 2 (1987), 139-172.

Gennari, J., Langley, P., and Fisher, D. "Models of Incremental Concept Formation," Artificial Intelligence 40 (September 1989), 11-62.

Gonzalez, A. and Laureano-Ortiz, R. "A Case-Based Reasoning Approach to Real Estate Property Appraisal," Expert Systems with Applications 4 (1992), 229-246.

Hall, R. "Computational Approaches to Analogical Reasoning: A Comparative Analysis," Artificial Intelligence 39 (1989), 39-120.

Hammond, K. Case Based Planning: An Integrated Theory of Planning, Learning, and Memory, Ph.D. Thesis, Yale University, 1986.

Hanson, S. and Bauer, M. "Machine Learning, Clustering, and Polymorphy," in Uncertainty in Artificial Intelligence, L. Kanal and J. Lemmer (eds.), North-Holland, Amsterdam, 1986.

Heckerman, D., Breese, J., and Rommelse, K. "Troubleshooting under Uncertainty," Communications of the ACM 38, 3 (March 1995), 49-57.

Kolodner, J. "Improving Human Decision Making through Case-Based Decision Aiding," AI Magazine, Summer 1991, 52-67.

Kolodner, J. and Simpson, R. "The MEDIATOR: An Analysis of an Early Case-Based Problem Solver," Cognitive Science 13, 4 (1989), 507-549.

Koton, P. Using Experience in Learning and Problem Solving, Ph.D. Thesis, Massachusetts Institute of Technology, 1988.

Lebowitz, M. "Experiments with Incremental Concept Formation: UNIMEM," Machine Learning 2 (1987), 103-138.

Mookerjee, V. and Dos Santos, B. "Inductive Expert System Design: Maximizing System Value," Information Systems Research 4, 2 (August 1993), 111-140.

Moore, J. and Whinston, A. "A Model of Sequential Decision Making, Part I," Decision Support Systems 2, 4 (1986), 285-307.

Moore, J. and Whinston, A. "A Model of Sequential Decision Making, Part II," Decision Support Systems 3, 1 (1987), 47-72.

Murphy, P. and Aha, D. UCI Repository of Machine Learning Databases, University of California, Irvine, Department of Information and Computer Science, 1991.

Nunez, M. "The Use of Background Knowledge in Decision Tree Induction," Machine Learning 6 (1991), 231-250.

Pattipati, K. and Alexandridis, M. "Application of Heuristic Search and Information Theory to Sequential Fault Diagnosis," IEEE Transactions on Systems, Man, and Cybernetics 20, 4 (July/August 1990), 872-887.

Quinlan, J. "Induction of Decision Trees," Machine Learning 1 (1986), 81-106.

Quinlan, J. "Simplifying Decision Trees," International Journal of Man-Machine Studies 27 (1987), 221-234.

Shannon, C. and Weaver, W. The Mathematical Theory of Communication, University of Illinois Press, 1949 (reprinted 1964).

Simoudis, E. "Using Case-Based Retrieval for Customer Technical Support," IEEE Expert 7, 5 (October 1992), 7-12.

Simoudis, E. and Miller, J. "The Application of CBR to Help Desk Applications," in Proceedings of the Workshop on Case Based Reasoning, 1991, pp. 25-36.

Stottler, R. "CBR for Cost and Sales Prediction," AI Expert, August 1994.

Weiss, S. and Kulikowski, C. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems, Morgan Kaufmann, San Mateo, CA, 1991.

Zarley, D. "A Case-Based Process Planner for Small Assemblies," in Proceedings of the Case-Based Reasoning Workshop, Washington, D.C., May 1991, pp. 363-373.