ITERATE: A Conceptual Clustering Method for Knowledge Discovery in Databases 

Gautam Biswas, Jerry Weinberg, and Cen Li Dept. of Computer Science Box 1679, Station B Vanderbilt University Nashville, TN 37235. Tele: (615)-343-6204 e-mail: [email protected]

April 16, 1995

Partially supported by grants from Amoco Research Labs, Tulsa, OK, Arco Labs, Richardson, TX, and Ecopetrol, Colombia.

1 Introduction

As the field of Artificial Intelligence (AI) matures, researchers are turning more and more to real world applications. With the widespread use of computers, it is estimated that the amount of information collected in the world doubles every 20 months. A substantial amount of this data comes from studying the operations of complex engineering systems (e.g., manufacturing lines, nuclear plants), geological operations in oil and mineral prospecting, data collected by satellites and space missions, and medical data collected from patients and laboratory experiments. It is becoming increasingly important to devise sophisticated schemes for finding interesting concepts and relations between concepts in this large amount of potentially useful data. Frawley, et al.[16] cite examples of a number of forward looking companies that are developing tools and techniques to analyze their databases for interesting and useful patterns. For example, American Airlines uses knowledge discovery techniques to periodically search its frequent flyer database to find profiles of its better customers and target them for specific promotions. General Motors uses classification methods to study its automotive troubleshooting databases and derive diagnostic expert systems for its different car models. A.C. Nielsen is working with packaged-goods manufacturers to study the effects of targeted promotions on sales at supermarkets.

Are there potential applications in the domain of hydrocarbon exploration and geology? It is common knowledge that drilling costs for new offshore prospects are in the range of $30-40 million and that the chances of a site being an economic success are 1 in 10. Advances in drilling technology and data collection methods have led oil companies and ancillary companies to collect large amounts of geophysical and geological data from exploration sites and production wells over the last 20-25 years. These data are now being organized into large company databases. The question is: can this vast amount of history from previously explored fields be systematically applied to evaluating new plays and prospects? A possible solution may be to develop the capability for retrieving historical data for the purpose of finding analogs for current prospects. Statistical risk analysis techniques may then be employed to compute distributions of possible hydrocarbon volumes for these prospects, and this may form the basis for developing more formal and objective prospect evaluation and ranking schemes.

We use this scenario as motivation for studying discovery techniques applied to databases. Section 2 reviews important concepts that apply to knowledge discovery or database mining techniques. Section 3 briefly summarizes numeric and non numeric clustering schemes. Section 4 discusses ITERATE, a conceptual clustering algorithm that forms an important component of a database mining system. Section 5 illustrates the effectiveness of ITERATE in concept formation, and Section 6 contains a summary and directions for future work in this area.

2 What is Knowledge Discovery?

Frawley et al.[16] define knowledge discovery as "the non-trivial extraction of implicit, previously unknown, and potentially useful information in data." This suggests a generic architecture for a discovery system. At the core of the system is the discovery method, which computes and evaluates groupings, patterns, and relationships in the context of a problem solving task. The groupings, patterns, and relationships are derived from raw data extracted from a database, or from a preprocessed form of the raw data. Preprocessing may be done by statistical or by knowledge-based techniques. Furthermore, feature descriptions may be numeric or non numeric.

The traditional method for data description is to describe data objects as vectors of feature-value pairs[11]. Features represent properties of objects that are relevant to the problem solving task. For example, if we wish to classify automobiles in terms of the speeds they can achieve, then body weight, body shape, and engine size are relevant features, but the color of the car body is not. In some cases, feature values may be represented in a structured or hierarchical form, indicating that the values can be expressed at different levels of detail. Real world data descriptions are usually made up of combinations of numeric and non-numeric features. For example, if one looks at geological data, features such as age, porosity, and permeability are likely to be numeric, whereas descriptors such as rock type and facies structure are non numeric or nominal-valued. Therefore, it becomes important to deal with algorithms that can work with a combination of numeric and nominal valued data. Further, data descriptions in real-world data sets are often incomplete, i.e., features may have missing values. It is important to study the nature of the missing values and tailor the discovery algorithm to account for their effects. We discuss this issue in more detail later in the chapter. It is often an interesting challenge to extract raw data from a database and cast it into a form required by the chosen discovery algorithm. We do not deal with database organization and manipulation issues in this chapter.

Depending on the discovery method used, the knowledge produced may be in different forms:

- data objects organized into groups or categories, where each group represents a relevant concept in the problem solving domain. Inductive discovery methods in this category are called clustering methods,

- classification rules that identify a group of objects that have the same label (e.g., structural traps, deltaic facies) or differentiate among groups of objects that have different labels (deltaic versus submarine fan facies). These methods are termed classification methods, and

- descriptive regularities, qualitative or quantitative, among sets of parameters drawn from object descriptions. Inductive methods in this category are called empirical discovery methods.

A wide variety of algorithms drawn from statistics, numerical clustering, numerical taxonomy, and machine learning can be used as discovery methodologies. Our focus is on clustering methods for the discovery of concept hierarchies from real-world data. In relatively unknown and unexplored domains, the knowledge discovery task can be described in terms of three key subtasks: (i) feature selection, (ii) discovery, and (iii) interpretation. Feature selection deals with issues for characterizing the data to be studied. Discovery deals with the process of grouping the data objects into clusters or groups based on the similarity of properties among the objects. The goal is to derive more general concepts and rules that describe the problem solving or classification task. The task of interpretation involves determining whether the induced concepts are useful for the problem solving tasks that the user is interested in. This task involves the examination of the intensional description of a class in the context of background knowledge about the domain. If class structures have probabilistic definitions, a class description is defined in terms of the conditional probability that a feature Ai will take on a particular value Vij, i.e., P(Ai = Vij | Ck). The higher the probability, the more important the feature is for describing or characterizing the class. Interpretation would involve determining the relevance and usefulness of these important features (or sets of important features) in a problem solving context.

3 Overview of Clustering Methods

Traditional approaches to cluster analysis (numerical taxonomy) represent the objects to be clustered as points in a multi-dimensional metric space and adopt distance metrics, such as the Euclidean and Mahalanobis measures, to define dissimilarity between objects. Cluster analysis methods take on one of two different forms:

1. parametric methods: they assume that the probability distribution structure of each class is known, and the clustering algorithm estimates the individual distribution parameters from the mixture densities[10], and

2. non parametric methods: they make no explicit assumption about the distribution of the classes, but describe class structures as mean vectors in the multi-dimensional space[21].

For example, the basic ISODATA algorithm[1] starts with randomly chosen mean vectors to represent cluster centers, assigns data objects to the class with the closest mean, recomputes the mean vectors as the average of the samples in the class, and repeats this process till the mean vectors converge. Variations and extensions to this basic scheme have been developed, such as determining the appropriate number of classes k when it is not known beforehand[21].
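As a concrete illustration of the non parametric iteration just described, the following is a minimal k-means style sketch; the toy data, the class count k, and the random initialization are illustrative assumptions and omit ISODATA's full split/merge machinery:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Basic ISODATA-style loop: assign each point to the nearest mean,
    recompute the means, and repeat until the means stop changing."""
    rng = random.Random(seed)
    means = rng.sample(points, k)               # randomly chosen initial mean vectors
    for _ in range(iters):
        # assignment step: nearest mean by squared Euclidean distance
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, means[c])))
            clusters[j].append(p)
        # update step: mean of the samples assigned to each class
        new_means = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else means[j]
                     for j, cl in enumerate(clusters)]
        if new_means == means:                  # converged
            break
        means = new_means
    return means, clusters

# toy usage: two well-separated 2-D groups
data = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
centers, groups = kmeans(data, k=2)
```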

Conceptual clustering schemes represent objects as vectors of nominal valued attributes, and typically rely on non parametric probabilistic measures to define groupings. CLUSTER/2[27] bases its criterion function on measures of common attribute values within a cluster, non-intersecting attribute values between clusters, and the simplicity of the conjunctive expression describing a cluster. UNIMEM[22] builds a classification tree based on a Hamming distance measure (the Hamming distance is the same as the Manhattan metric when all feature values are considered to be binary, i.e., present or absent), which mainly focuses on intra-cluster similarity. WITT[19] defines intra-cluster similarity in terms of the strength of pairwise attribute relationships (co-occurrences) that exist within a cluster or group. The strength of these relationships is defined in terms of correlational measures that are represented as contingency tables. AUTOCLASS[8] and COBWEB[11] define classes as a probability distribution over the attributes of the objects. AUTOCLASS, a parametric scheme, adopts the Bayesian classification approach. Fisher's COBWEB uses the category utility measure developed by Gluck and Corter[18] to predict the preferred level of categorization of a group of objects.

The creation of a taxonomy is the end result of cluster analysis. Conceptual clustering approaches attempt to go a step further by relating intermediate nodes of a classification hierarchy to salient concepts in the domain of interest[11]. This is usually accomplished by characterizing each group in terms of a general description. In general, this is a complex task because the nature of the groupings depends on the nature of the bias used in the learning algorithm[7, 29]. The bias has been defined as "the set of all factors that collectively influence hypothesis definition" (i.e., the nature of the clusters or groupings formed). These factors include the definition of the space of hypotheses and the definition of the algorithm that searches the space of concept descriptions[7]. These in turn can be expressed in terms of a criterion function chosen to influence grouping and concept formation, and the control structures that guide the search for derivation of structure in the data.

4 Conceptual Clustering: The ITERATE System

Research conducted by our group has led to the development of ITERATE, a conceptual clustering algorithm that works with combinations of numeric and non numeric data. The primary motivations for developing ITERATE were to extend previous conceptual clustering algorithms (e.g., COBWEB[11]) to generate stable and maximally distinct partitions[4] and to produce an efficient algorithm for an interactive data analysis tool. Like other conceptual clustering algorithms, ITERATE builds a concept tree from domain objects or instances represented as vectors of attribute-value pairs, but tries to mitigate the effect of incremental control structures. The algorithm exploits information on the entire object set in creating the object hierarchy. More specifically, it adopts an ordering operator that preorders the object sequence to exploit the biases of the criterion function in forming maximally distinct classes in the initial classification tree[4]. The tree is generated in a breadth-first manner; therefore, class probabilities at a parent node are allowed to stabilize before child nodes are created.

4.1 Evaluation Functions and Control Structure

ITERATE's overall control structure can be summarized as follows:

1. Order the data sequence based on an anchored dissimilarity ordering (ADO) scheme,

2. Generate a hierarchical concept tree using the partition score measure,

3. Choose a representative set of concepts from the hierarchy to create an initial class partition, and

4. Consider objects one by one, and based on the category match measure redistribute objects to the most similar class. Repeat this step till no objects change class.

The intuition behind the ITERATE control structure is to use the results of a hierarchical clustering scheme to choose an initial starting point for a partitional clustering scheme.

4.1.1 Category Utility and the ADO algorithm

ITERATE uses the category utility measure as the criterion function for determining partitions in the hierarchy in Step 2. Gluck and Corter[18] showed that the expected-score measure of category utility is linked to a probability matching strategy, given by P(Ai = Vij | Ck)^2, i.e., guessing the value Vij of attribute Ai for an object based on the likelihood of Ai taking on value Vij given that the object belongs to class Ck. The category utility measure for a class Ck is defined as:

CU_k = P(Ck) \sum_i \sum_j [ P(Ai = Vij | Ck)^2 - P(Ai = Vij)^2 ]     (1)

This represents the increase in the number of attribute values that can be correctly predicted for class k over the expected number of such predictions given no class information (the P(Ai = Vij)^2 term). Given a partition structure with K classes, the partition score is defined as (1/K) \sum_{k=1}^{K} CU_k. Gluck and Corter demonstrated the efficacy of this measure in predicting the preferred level of categorization given a pre-existing hierarchy.

Fisher adapted this probability matching measure to develop a conceptual clustering algorithm called COBWEB[11] that, given a set of objects expressed as feature-value vectors, builds a classification tree. COBWEB uses an incremental approach to forming the hierarchy by incorporating data objects one at a time into a current hierarchy. A data object is placed into a level of the hierarchy using one of two operations: (i) classify the data object into an existing class, or (ii) create a new class. The operation that produces a partition with the higher partition score is the one applied to update the partition. The data object is recursively classified until the object is placed in a singleton class (a class covering a single instance). Empirically, Fisher demonstrates that the method does well in modeling the basic level phenomena, and the resulting classification trees perform admirably in the flexible prediction task[12, 13].

One of the well studied characteristics of COBWEB and related algorithms is their order dependency. The incremental nature of the control results in different data orders often producing different classification trees. Note that the CU represents a trade-off between size, P(Ck), and cohesiveness or predictive accuracy of feature values, \sum_i \sum_j [P(Ai = Vij | Ck)^2 - P(Ai = Vij)^2], of a class or category. The term P(Ck) causes a bias toward larger categories, and the consecutive, incremental presentation of a group or groups of highly similar data objects may skew the partition structure. A number of studies[6, 17, 26] have examined the order-dependency phenomenon of COBWEB and its effects on classification tree structure and concept formation. Gennari, et al.[17] and McKusick and Langley[26] have established that the COBWEB control structure and evaluation function are oriented toward "maximizing predictive accuracy, (and) the hierarchies it constructs may not reflect the underlying class structure of the domain"[26]. What this implies is that COBWEB is likely to produce spurious intermediate nodes in the classification trees it generates[26], which can cause unnecessary fragmentation in the final partitions. Fragmentation makes it difficult to interpret and extract useful information from a partition structure.
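A minimal sketch of equation (1) and the partition score for nominal-valued objects; the attribute-value dictionary encoding of objects is an assumption made for illustration:

```python
from collections import Counter

def value_probs(objects):
    """P(Ai = Vij): relative frequency of each attribute-value pair."""
    counts = Counter((a, v) for obj in objects for a, v in obj.items())
    n = len(objects)
    return {av: c / n for av, c in counts.items()}

def category_utility(cluster, all_objects):
    """CU_k = P(Ck) * sum_ij [ P(Ai=Vij | Ck)^2 - P(Ai=Vij)^2 ]  (equation 1)."""
    p_ck = len(cluster) / len(all_objects)
    cond = value_probs(cluster)          # P(Ai = Vij | Ck)
    prior = value_probs(all_objects)     # P(Ai = Vij)
    return p_ck * (sum(p * p for p in cond.values()) -
                   sum(p * p for p in prior.values()))

def partition_score(partition, all_objects):
    """Partition score: the average of CU_k over the K classes of the partition."""
    return sum(category_utility(c, all_objects) for c in partition) / len(partition)

# toy usage
objs = [{"odor": "p", "cap": "x"}, {"odor": "p", "cap": "f"},
        {"odor": "n", "cap": "b"}, {"odor": "a", "cap": "b"}]
print(partition_score([objs[:2], objs[2:]], objs))
```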


By creating an extreme case of uneven growth, i.e., an intermediate partition structure with one large class and a number of singletons, a systematic and formal analysis revealed that, as far as placing the next object is concerned, the larger class is actually favored by a factor that equals the size of the class plus the size of the partition. A more complete analysis of this phenomenon is discussed elsewhere[30].

A number of attempts have been made to modify the basic COBWEB control structure to reduce the order dependency of the tree structure generated. Most of these come under the auspices of reclassification methods[15]. The COBWEB system described in [11] introduced split and merge operators to mitigate order effects. The split operator divides a category in a partition into its sub-categories, thereby promoting them in the hierarchy. Conversely, the merge operator combines a number of categories into a more general super-category, thereby demoting the combined categories in the hierarchy. Both operators are applied locally within each partition, with the merge operator being applied to all pairwise combinations of categories. The operation that results in the best improvement in the partition score over that of the original partition is selected. McKusick and Langley[26] have experimented with promote operators, which extend the split and merge operators used in the COBWEB algorithm. This operator promotes a class or grouping that is more similar to an ancestor than to its direct parent, and allows subsequent redistribution to place these objects in proper categories lower down in the hierarchy.

A different viewpoint was adopted in developing ITERATE, where the order dependency property was exploited by manipulating the data order so that the cohesiveness factor plays a more important role in the early partition formation process. Previous work showed that interleaved orders (i.e., orders in which objects from different classes are presented in sequence, in an attempt to obtain a maximally dissimilar ordering among the objects) produce better classification trees and better final groupings in terms of the rediscovery task and interpretation task[6]. To achieve this, data objects at each node are ordered using the anchored dissimilarity ordering (ADO) algorithm, a variation of a best-case ordering algorithm developed by Xu[31], which adopts a complete-link approach in generating data orderings. The object chosen to be next in the order is the one that maximizes the sum of the Manhattan distances between it and the previous n objects in the order. The Manhattan distance between two objects defined by nominal-valued attributes is simply the number of differences in their attribute-value pairs. The window size, n, is user defined, and empirically corresponds to the actual number of classes expected in the data.
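A minimal sketch of the ADO preordering described above, under the same attribute-value dictionary encoding; the anchor object and window size n are user-supplied assumptions:

```python
def manhattan(a, b):
    """Number of attribute-value pairs on which two nominal objects differ."""
    keys = set(a) | set(b)
    return sum(a.get(k) != b.get(k) for k in keys)

def ado_order(objects, n, anchor=0):
    """Anchored dissimilarity ordering: starting from an anchor object, repeatedly
    append the remaining object that maximizes the summed Manhattan distance to
    the previous n objects already in the order."""
    remaining = list(objects)
    order = [remaining.pop(anchor)]
    while remaining:
        window = order[-n:]
        best = max(range(len(remaining)),
                   key=lambda i: sum(manhattan(remaining[i], w) for w in window))
        order.append(remaining.pop(best))
    return order

# toy usage with two obvious groups; n = expected number of classes
objs = [{"color": "red", "size": "s"}, {"color": "red", "size": "s"},
        {"color": "blue", "size": "l"}, {"color": "blue", "size": "l"}]
print(ado_order(objs, n=2))   # objects from the two groups come out interleaved
```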


Figure 1: CU values along a path of the classification tree

4.1.2 Creating the Classification Tree

This step of the algorithm is adapted from the original COBWEB algorithm with some modifications. A primary difference in the implementation is that this algorithm exploits the fact that all objects to be clustered are available beforehand; therefore, as a preprocessing step, all prior probabilities, P(Ai = Vij), are precomputed. Rather than adopting the depth-first approach used by COBWEB, where every object is classified down to its instance-level (i.e., singleton) node, the current version generates the tree level by level; i.e., all objects are classified into the first level of the tree, then the second level of the classification tree is generated, and so on. This approach allows the conditional probabilities at intermediate nodes, i.e., P(Ai = Vij | Ck), to stabilize before the classification tree is extended below that node. This control structure reflects the divisive approach to hierarchical clustering.
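A minimal sketch of the classify-or-create step used to grow one level, with the prior probability terms precomputed once over the whole object set; averaging the score over the children follows Fisher's partition utility and is an assumption of this sketch rather than a statement of ITERATE's exact implementation:

```python
from collections import Counter

def sq_prob_sum(objs):
    """sum over attribute-value pairs of P(Ai = Vij)^2 estimated from a set of objects."""
    counts = Counter((a, v) for o in objs for a, v in o.items())
    n = len(objs)
    return sum((c / n) ** 2 for c in counts.values())

def level_score(children, prior_sq, n_total):
    """Average over the children of P(Ck) * (sum_ij P(Ai=Vij|Ck)^2 - prior terms)."""
    return sum(len(c) / n_total * (sq_prob_sum(c) - prior_sq)
               for c in children) / len(children)

def build_level(objects):
    """One level of tree growth: each object is placed either into the existing
    child that yields the best score or into a newly created child."""
    prior_sq = sq_prob_sum(objects)       # P(Ai = Vij)^2 terms, precomputed once
    children = []
    for obj in objects:
        options = [children[:i] + [children[i] + [obj]] + children[i + 1:]
                   for i in range(len(children))]
        options.append(children + [[obj]])            # create a new child class
        children = max(options,
                       key=lambda opt: level_score(opt, prior_sq, len(objects)))
    return children
```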

4.1.3 Extracting the Initial Partition

Extraction of the base level is performed by comparing the CU values (equation 1) of classes or nodes along a path in the classification tree. For any path from the root to a leaf of a classification tree this value initially increases, and then drops (see Fig. 1). Classes from the levels below the point where the CU value falls can be considered to overfit the data, and are therefore not useful for the clustering task. The algorithm ensures that a node at which CU_k peaks will be included in the initial partition. A salient feature of this algorithm is that it tends to be conservative. In other words, it prefers more specific to more general concepts (clusters) in its choice of the initial partition. Along a particular path from root to leaf, if one child of a node has a greater CU value than the node itself, this node is not picked to be a component of the initial partition. If the CU value decreases for other children of this node, those nodes are picked for the initial partition. Once a node is picked along a path, no other nodes below this node can be included in the initial partition. Thus the algorithm ensures that no concept subsumes another.
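A minimal sketch of this peak-finding rule for selecting the initial partition; the Node structure and the toy tree are illustrative:

```python
class Node:
    """A node of the classification tree: its CU value and its children."""
    def __init__(self, cu, children=()):
        self.cu = cu
        self.children = list(children)

def initial_partition(node, picked=None):
    """Pick the nodes where CU peaks along root-to-leaf paths: a node is kept only
    if no child improves on its CU; children that do improve are explored further,
    so no picked concept subsumes another."""
    if picked is None:
        picked = []
    better = [c for c in node.children if c.cu > node.cu]
    if not better:
        picked.append(node)              # CU peaks here on every path through this node
    else:
        for c in better:
            initial_partition(c, picked)
        for c in node.children:
            if c.cu <= node.cu:          # CU already fell for these siblings; keep them
                picked.append(c)
    return picked

# toy tree: CU rises from the root, then falls below the second level
tree = Node(0.2, [Node(0.5, [Node(0.3), Node(0.4)]), Node(0.1)])
print(len(initial_partition(tree)))      # -> 2 concepts: the 0.5 node and the 0.1 node
```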

4.1.4 Iterative Redistribution

ITERATE's redistribution operator reassigns data objects in the initial partition to groups so as to optimize a chosen outcome function. This criterion function, the category match measure[14], measures the similarity between a data object and a class:

CM_dk = P(Ck) \sum_{i,j: Ai \in {A}_d} [ P(Ai = Vij | Ck)^2 - P(Ai = Vij)^2 ]     (2)

for data object d and class k. This measure assumes that each data object has only one value per attribute (represented by the restriction Ai \in {A}_d in the above equation). Category match measures the increase in the expected predictability of class Ck for the attribute values present in data object d. The factor P(Ck) favors larger groupings; therefore, it is this term that promotes the absorption of smaller classes into larger ones, especially when the second term, the predictability factor, is not significantly different. The redistribution operator assigns d to the class for which CM_dk is maximum. In case of a tie where the data object's current class is a contender, the object is retained in the same class; otherwise ties are broken arbitrarily. A redistribution iteration consists of determining each object's assignment and updating the partition based on the assignments. The redistribution operation is re-applied till quiescence.

The category match measure forms the basis for a global redistribution operator. The effect of data ordering on the groupings formed can be explained in terms of the criterion function converging to a local maximum (or minimum). Iterative redistribution, because it allows redistribution over the entire class structure, can mitigate this problem to some extent. It is well known, though, that this problem (converging to a local as opposed to a global maximum) can, in general, only be eliminated by exhaustive search.
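A minimal sketch of equation (2) and the redistribution loop, again assuming attribute-value dictionaries; ties are resolved in favor of the object's current class by requiring a strict improvement before a move:

```python
from collections import Counter

def probs(objs):
    """Relative frequency of each attribute-value pair in a set of objects."""
    counts = Counter((a, v) for o in objs for a, v in o.items())
    return {av: c / len(objs) for av, c in counts.items()}

def category_match(obj, cluster, prior, n_total):
    """CM_dk = P(Ck) * sum over the object's attribute-values of
    [P(Ai = Vij | Ck)^2 - P(Ai = Vij)^2]   (equation 2)."""
    cond = probs(cluster)
    return (len(cluster) / n_total) * sum(cond.get(av, 0.0) ** 2 - prior[av] ** 2
                                          for av in obj.items())

def redistribute(partition, objects):
    """Reassign every object to the class with maximum CM, repeating until no
    object changes class (quiescence)."""
    prior = probs(objects)
    changed = True
    while changed:
        changed = False
        for obj in objects:
            cur = next(i for i, c in enumerate(partition) if any(o is obj for o in c))
            scores = [category_match(obj, c, prior, len(objects)) if c else float("-inf")
                      for c in partition]
            best = max(range(len(partition)), key=scores.__getitem__)
            if scores[best] > scores[cur]:          # move only on strict improvement
                partition[cur] = [o for o in partition[cur] if o is not obj]
                partition[best].append(obj)
                changed = True
    return [c for c in partition if c]

# toy usage: the singleton {"odor": "p"} class is absorbed into the larger class
objs = [{"odor": "p"}, {"odor": "p"}, {"odor": "p"}, {"odor": "n"}]
print(redistribute([[objs[0], objs[1]], [objs[3]], [objs[2]]], objs))
```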

5 Experimental Results

We demonstrate the effectiveness of ITERATE in concept formation by applying it to a number of nominal valued, numeric valued, and combined numeric and non-numeric valued real world data sets. The classes generated by ITERATE were studied along a number of dimensions (see [4] for details). Interpretation of class structure was based on two conditional probability measures associated with the attribute-value pairs of a class (Ck): (i) predictability, P(Ai = Vij | Ck), and (ii) predictiveness, P(Ck | Ai = Vij).

Figure 2: Mushroom Data: Classification Hierarchy

Predictability measures the likelihood of observing an attribute-value for a class, and predictiveness measures the likelihood of an attribute-value implying the particular class. As concepts become more precise, the predictability of attribute-values tends to increase, whereas predictiveness values tend to decrease. Often the product of the predictability and predictiveness measures is used to indicate the importance of attribute-values in a class, i.e., the higher this value, the more important the feature-value is in defining the concept associated with this class. This is apparent in the Mineral data and PLAYMAKER rule model studies described below.

5.1 Non Numeric Data Analysis

We used three nominal valued data sets: Mushroom, Mineral, and the PLAYMAKER rule base.

Mushroom Data

The first data set was extracted from the mushroom database in the UCI repository. The mushroom database consists of mushroom descriptions represented by 22 nominal attributes (some of them are listed in Table 1), belonging to one of two classes: edible and poisonous. There are 23 species of mushrooms in this data set, and we randomly picked a subset of 200 data objects, making sure 100 of them were poisonous and 100 were edible. Further, the database source notes that there are no simple rules for determining the edibility of a mushroom.

The mushroom data produced an interesting 2 class structure: 28 of the edible mushrooms grouped with the 100 poisonous mushrooms, and the remaining edible mushrooms formed a class by themselves. Table 1 shows that the poisonous mushrooms have homogeneous properties but the edible mushrooms have more diverse properties; in fact, some of them look more similar to poisonous mushrooms (note that the edible mushrooms were randomly drawn from different classes).

Table 1: Feature-value distributions: Mushroom Data

Feature name          Poisonous   Edible
cap shape             x, f        x, b, f, s
cap surface           y, s        y, s, f
cap color             n, w        y, w, n, g
bruises               t           mostly t, some f
odor                  p           a, l, n
gill spacing          c           mostly c, some w
gill size             n           mostly n, some b
stalk shape           e           mostly e, some t
stalk root            e           lot c, some e, b, r
stalk surf bel ring   s           lot s, some y
ring type             p           mostly p, some e
population            s, v        mostly n, s, some v, y, a
habitat               g, u        mostly g, m, some u, p, d

To continue the clustering process further and create a hierarchical structure (Fig. 2), we ran ITERATE again on both the 128 and 72 object groups created in the first run. Run 3 was executed on the 118 object group, and run 4 was used to generate clusters from the 110 object group created in run 3. The end result was a clean split: the poisonous mushrooms broke down into two groups, and the edible mushrooms split into five (two groups from the 72 object group produced in run 1, and one each from the remaining runs). It is interesting to observe that the edible mushrooms whose feature-values overlap with the poisonous mushrooms first grouped together with those objects, but later split up to form their own individual groups. It is also interesting to note that the split in the poisonous group in run 4 was based solely on cap color; all other feature values were quite uniformly distributed between the two groups.

Mineral Data

The mineral data set contains 39 different minerals from ten different groups based on chemical composition (see Table 2). Each mineral is described in terms of optical properties, such as color, form, cleavage, relief, birefringence, and interference figure. Given that the features correspond to optical properties, it was not clear beforehand whether these features could recreate the chemical groupings.


Table 2: Mineral Data: Grouping by Chemical Composition

Class  Type          Minerals
D1     Halide        Halite, Fluorite
D2     Oxide         Periclase, Corundum, Rutile, Cassiterite, Spinel
D3     Hydroxide     Diaspore, Brucite, Gibbsite
D4     Carbonate     Calcite, Dolomite, Magnesite, Siderite
D5     Sulfate       Barite, Celestite, Anhydrite, Gypsum
D6     Phosphate     Monazite, Apatite, Dahhlite
D7     Silica        Quartz
D8     Feldspar      Orthoclase, Sanidine, Microclase, Albite, Oligoclase, Andesine, Labradorite, Bytownite, Anorthite
D9     Feldspathoid  Leucite, Nepheline, Cancrinite, Sodalite
D10    Inosilicate   Enstatite, Hypersthene, Diopside, Augite

Therefore, analysis of this data set illustrates the process of characterizing and interpreting the groupings formed, i.e., Step 3 in the exploratory data analysis task. The six classes ITERATE produced are discussed below. It is important to note that the optical features used to characterize the data objects do not produce the chemical groupings shown in Table 2. Instead, these features produce a partition that corresponds to the crystal structure of the minerals. Class 1 contains all the isometric/isotropic crystals; Class 2 contains the tetragonals; Class 4 contains the feldspar monoclinic, triclinic, and orthorhombic structures; Class 5 contains the pyroxene monoclinic, triclinic, and orthorhombic structures; and Class 6 is made up of the hexagonals. It may seem unusual that gibbsite, which is monoclinic, forms its own singleton class (Class 3). However, it is known that hydroxides often absorb water and exhibit variations in their structural properties. Though classes 3, 4, and 5 have similar crystal structures, their significant features (i.e., those with high predictability and predictiveness values) have different values (see Table 3 for a comparison of classes 4 and 5). For example, the feldspar group and the sulfates (Class 4) are mostly colorless though they may exhibit some clouding, whereas the pyroxenes exhibit various shades of color in their thin sections. Other significant features that exhibit differences are cleavage, relief, and birefringence.


Table 3: Predictability and Predictiveness of Features: Mineral Data

                       Class C4                        Class C5
Feature   Value    P(A:V|C)  P(C|A:V)    Value      P(A:V|C)  P(C|A:V)
color     cl       0.86      0.6         cln, etc.  0.2       1.0
form      ahsh     0.36      0.71        pr         0.4       1.0
cleav     p001     0.43      1.0         pl110      0.8       0.8
relief    lo       0.5       1.0         hi         1.0       0.71
biref     wk       0.86      0.63        st         0.4       0.67
extinc    plcl     0.21      0.75        pl         0.6       0.33
int fig   bi       0.93      0.72        bi         1.0       0.28
lown      one      0.5       0.41        thr        0.8       0.57
opt       pos      0.5       0.44        pos        0.8       0.25

Table 4: Geological attributes for Facies structure

depositional setting               DSE      primary bedding type          PBT
primary bedding shape              PBS      vertical sediment variation   VSV
downdip sediment association       DSA      updip sediment association    USA
interbedded sediment association   ISA      sediment type                 STY
lithology                          LTY      paleomarker                   PMA
bedding thickness                  BTH      fauna                         FNA
aerial geometry                    AGM      sediment texture              STX
paleoenvironment indicator         PEI      sediment structure            SST

PLAYMAKER: Building Rule Models

The application of ITERATE to the rules of PLAYMAKER, a system for characterizing hydrocarbon plays, illustrates an application of knowledge discovery techniques to improving complex problem solving. The concept hierarchy generated by clustering the rules into a hierarchy of rule models is used with the task-specific reasoning architectures of MIDST (an expert-system building tool and run-time environment) to develop a more efficient and focused reasoning mechanism. We illustrate the methodology for constructing the rule model hierarchy in this section. Details of the MIDST system and the more effective reasoning mechanisms are presented elsewhere[2, 3, 5, 9, 23].

We created a rule model hierarchy by running ITERATE recursively on a set of 144 facies rules that represented 14 different facies structures. The sixteen geological attributes listed in Table 4 form the LHS conditions for these rules. When rules conclude multiple hypotheses, they were split into multiple objects, one for each conclusion. This increased the total number of objects to be clustered to 188. An example of a rule and its corresponding object description appears below.

|rule122| (dep_set shelf) (a_geometry clinoform) (downdip_sed_asso shale) ---> ((facies delta) .5)


Figure 3: Rule Model Hierarchy: Facies

rule122 (FAC DEL) (DSE SHF) (PBT NA) (LTY NA) (STY NA) (ISA NA) (SST NA) (AGM CLI) (DSA SHL) (BTH NA) (PMA NA) (USA NA) (VSV NA) (STX NA) (PBS NA) (FNA NA) (PEI NA)

where FAC is facies, DEL is delta, SHF is shelf, CLI is clinoform, and SHL is shale. The object description starts with the rule number and the conclusion (facies delta). NA is used as a placeholder to indicate that the particular rule did not use the corresponding attribute. Note that the rule's conclusion was itself used as an attribute of the data object, to account for the expert's reasons for forming the rules. This differs from other work where the conclusion is used as a label for a supervised learning algorithm such as ID3[28]. Also note the large number of systematic missing values; in fact, for this rule set, 75-80% of the values were missing.

Given a problem domain and a relevant set of features that describe the domain, object descriptions in real world data are often incomplete. Incomplete object descriptions can be attributed to different phenomena: (i) random missing feature values, caused by random glitches in a measuring device or by random observation and recording errors, and (ii) systematic missing feature values, which are more a function of the properties of the data object itself. Consider a physician recording patient data to form a diagnosis. What data the physician collects differs from patient to patient, and it is driven by the physician's hunches about the nature of the patient's illness. We focus on a method for handling systematic missing values.
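A minimal sketch of the rule-to-object encoding illustrated above; the attribute list and helper names are assumptions, and the point is the large number of systematic NA values such an encoding produces:

```python
# Hypothetical attribute abbreviations (a subset of Table 4), used for illustration.
ATTRS = ["DSE", "PBT", "PBS", "VSV", "DSA", "USA", "ISA", "STY",
         "LTY", "PMA", "BTH", "FNA", "AGM", "STX", "PEI", "SST"]

def rule_to_objects(rule_id, conditions, conclusions):
    """Turn one rule into one attribute-value object per conclusion: unused
    attributes get the placeholder NA, and the conclusion itself is kept as
    an extra attribute (FAC) of the object."""
    objects = []
    for concl in conclusions:                 # rules with several conclusions are split
        obj = {"RULE": rule_id, "FAC": concl}
        obj.update({a: "NA" for a in ATTRS})  # systematic missing values
        obj.update(conditions)                # overwrite the attributes the rule uses
        objects.append(obj)
    return objects

# the rule122 example from the text, with abbreviated attribute names assumed
print(rule_to_objects("rule122",
                      {"DSE": "SHF", "AGM": "CLI", "DSA": "SHL"},
                      ["DEL"]))
```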

For the interpretation task, the presence of a large proportion of missing values distorts the object set, increasing the difficulty of forming meaningful classes. Biswas, et al.[4] discuss the counterintuitive results produced in experiments with a large number of systematic missing values. One solution is to consider a missing value as an additional "unknown" value associated with an attribute (cf. AUTOCLASS[8]), but in data sets where missing values tend to dominate this method fails. To overcome this problem, we redefine the CU measure so that the comparison between a concept node and its parent is made only over the set of attributes that are common to both nodes. Therefore, the term for the parent node in the CU calculation is summed only over the attributes i present in the child node. In other words, the CU function is defined to measure the increase in predictability only for those attributes that have observed values in the child node; unobserved attributes are ignored. Empirical results demonstrating the effectiveness of this scheme are presented in [4].

The resultant rule model hierarchy is shown in Figure 3. Each node (number in parentheses) defines a rule model, which is described in terms of its associated attribute values. For example, the DSE (depositional setting) attribute has two values in the node 6 rule model: shelf and slope. The rule model is further characterized by the predictability and predictiveness values of its attribute-values (see Table 7). We used the product of the predictability and predictiveness measures to indicate the importance of attribute-values in a rule model class. When this product falls below a predefined threshold for all attribute-values that define a rule model, that rule model was considered irrelevant and dropped from the rule model hierarchy. This was used as a pruning procedure to determine the leaf nodes of the hierarchy.

The resulting concept hierarchy is used during a problem solving session to focus PLAYMAKER's evidence gathering scheme. Details of the method used for selecting a rule model based on the current hypothesis are given in [24]. The performance of the system using the ITERATE-generated rule models was evaluated on 9 test cases. These test cases were compiled from case books of real data by our geological experts. The results are compared against the previous method of focusing attention, which basically chose rules that supported the leading hypotheses. The experiments centered on two issues: (i) how much more efficient is the rule model scheme than the previous scheme, and (ii) are there any structural differences in the consultation dialogues between the two schemes? Efficiency was measured in terms of the number of queries the system had to ask before it got to its final conclusions. Efficiency improves if the system can identify the more important and relevant attributes first to establish the desired conclusions. Then unimportant and irrelevant attributes that do not pertain to the established set can be ignored. This not only reduces the queries asked, but is also likely to produce more focused final conclusions in terms of the relative ranking by belief values.

Table 5: Query Saving

Case no.       1  2  3  4  5  6  7  8  9  Avg
Query saving   3  2  2  3  3  2  2  2  2  2.33

Table 6: Number of Change-overs

Case no.             1  2  3  4  5  6  7  8  9  Avg
Without rule models  0  2  3  0  2  4  1  0  0  1.33
With rule models     0  0  1  0  0  1  2  2  0  .67

We determined empirically that a consultation could be terminated if the belief values of the relevant hypotheses did not change over the last three queries. The actual number of queries saved for the nine test cases is listed in Table 5. From an efficiency viewpoint, 2.33 queries/case represents a 14% speedup on average, which is significant but not a tremendous gain. However, a study of the structure of the consultation process revealed another primary difference between the two versions of the system. In the case of the rule model version, sets of evidence (queries) were picked in a particular context defined by the selected rule models. This makes it easier to develop a consultation process where sets of queries can be presented to the user simultaneously, rather than following the one-query-at-a-time format. From a practical viewpoint, this results in a much quicker consultation process. It also enables the user to focus on sets of relevant evidence at once. We use a measure, the number of change-overs in the rankings of the primary hypotheses (Table 6), to illustrate the ability to focus on the right context and conclusions early. The performance improvement is 100% on average. In addition to providing speedup, the more structured case presentations indicate more clarity in the reasoning process and, therefore, a better ability of the system to explain its own reasoning. This property is key to achieving overall reliability and effectiveness in a complex decision making process.

This rule model hierarchy differs from the textbook clastic facies hierarchy (marine, transitional, and continental). The rule node structures are more mixed, indicating the similarities between different facies structures (e.g., submarine fan and basin, delta and slope, and fluvial and lacustrine). Recognition of these common characteristics helped focus problem solving. On the other hand, single concepts, e.g., delta, occurred as multiple rule models, indicating that these concepts have different characteristics and representations that may correspond to different geographical regions. These differences were exploited to achieve more focused querying of the user, thus producing more effective dialogue with the user.

Table 7: Rule Model Characterization. For each attribute-value pair in rule-model nodes 6, 24, and 25, the table lists P1 (predictability), P2 (predictiveness), and P3 = P1 * P2.

5.2 Numeric Data Processing

The next step was to extend the ITERATE algorithm to handle numeric data and data sets that have combined numeric and non numeric feature descriptions. For continuous-valued attributes, the probability of each continuous data attribute is determined by its density function instead of by counts. Like CLASSIT[17], ITERATE incorporated the probability density function (pdf) into the criterion function by replacing the summation of squared probabilities in the CU function (equation 1) with the integral of the squared probability density function (pdf^2) for the numeric valued attributes. The overall CU for a class is thus made up of two components: one for the non numeric features and one for the numeric features. CLASSIT assumed the pdf of numeric-valued features to be normally distributed and exploited this to compute the integral of pdf^2 for the numeric feature terms. ITERATE did not assume a normal distribution for the pdf, but estimated it using a variation of the Parzen window approach, the nearest-neighbor approach[10].

A problem with numeric features is that as the feature value description for a class becomes more precise, the pdf for that feature becomes more and more peaked, and the contribution of this feature to the CU measure increases quickly. In the case of CLASSIT, the variance of the distribution becomes smaller and smaller. In both cases, however, the integral of pdf^2 becomes very large (for a single-valued distribution it would be infinite), and therefore the contribution to the CU measure from this feature tends to dominate all others. In general, this can have the following detrimental effects:

1. The domination of one attribute renders all others insignificant. This is not desirable.

2. Class formation is a tradeoff between size (the P(Ck) term) and cohesiveness (the P(Ai = Vij | Ck)^2 term). For nominal-valued data both these terms have an upper bound of 1, so the tradeoff between the two is easy to establish. On the other hand, if we increase the upper bound on the cohesiveness term, the size factor P(Ck), which still has an upper bound of 1, becomes insignificant. This tends to produce very fragmented class structures.

3. Once a few attributes begin to dominate by making large contributions to the CU function, an even worse situation occurs: multiple child nodes exhibit the same large value in their utility function for incorporating additional objects. Under this situation, the assignment of an object to a child node becomes arbitrary.

Gennari et al.[17] introduced the concept of an acuity measure to bound the contribution of any one feature to the CU measure. Whereas this measure makes intuitive sense, it is not clear why different features should be bounded by the same acuity value, or whether the acuity measure should change from level to level in the classification tree. To address this problem, a number of dynamic bounds were tried in ITERATE, but empirical studies and analysis demonstrated that none of them could prevent fragmentation of the data and the generation of arbitrary and spurious classes[25]. Our results illustrate the problems that algorithms like ITERATE and CLASSIT have with numeric data. The primary problem is the inability to normalize and bound the density functions in computing the category utility and partition score values. However, this problem does not arise when we deal with purely non-numeric data. Therefore, an alternate method was adopted for dealing with continuous-valued data, or combinations of continuous- and nominal-valued data, in ITERATE: a systematic method for discretizing the numeric-valued features of the data sets so that information that is important and useful for classification purposes is retained.
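To make the domination effect concrete: for a normally distributed feature, the integral of the squared density is 1/(2*sigma*sqrt(pi)), which grows without bound as the class-conditional standard deviation sigma shrinks. A small numeric illustration:

```python
import math

def normal_pdf_sq_integral(sigma):
    """Integral of the squared normal density: 1 / (2 * sigma * sqrt(pi))."""
    return 1.0 / (2.0 * sigma * math.sqrt(math.pi))

# as a class description becomes more precise (sigma -> 0), this term dominates
for sigma in (1.0, 0.1, 0.01, 0.001):
    print(sigma, normal_pdf_sq_integral(sigma))
```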

Figure 4: Results on Data Having Four Distinct Peaks. (a) histogram of the data; (b) plot of the within-group variances for different values of k.

5.2.1 Discretization

The method for generating a discretization of continuous-valued features is based on a minimizing-within-group-variance scheme[20]. In image processing applications, this scheme has been applied to bimodal distributions with well separated peaks to create binary images. Since the distribution around each peak is fairly homogeneous (a bell-shaped curve), a good way of dividing up such a feature into two groups is to find a dividing value, t, such that the resulting sum of the within-group variances, W, of the two groups is minimized. The dividing point, t, that minimizes the sum of the within-group variances is equivalent to the one that maximizes the between-group variance,

sigma_B = \sum_{i=1}^{2} (mu_i - mu)^2,     (3)

where mu_i is the mean of each of the two separated groups and mu is the mean of the original group of data. We have generalized this algorithm to generate the best k-partition of a feature distribution. This is equivalent to partitioning a feature into k groups which correspond to k discrete peaks in its histogram of values. The algorithm first finds the best breaking point to separate an attribute into two distinct groups, then the best two points to separate the attribute into three groups, and so on. After each iteration, the sum of the within-group variances of the separated groups is calculated, and the W's are plotted. An attribute having k well-defined peaks has its first minimum W corresponding to the desired k-way split. The algorithm stops when the first minimum W is obtained. Figure 4(a) illustrates a histogram of a feature with four peaks. In this case, the first minimum W obtained corresponds to k = 4 in Figure 4(b), thus indicating a four-value discretization for this feature.
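A minimal sketch of the two-group (k = 2) case: exhaustively try each observed value as the dividing point t and keep the one minimizing the sum of the two within-group variances (the generalization searches over k - 1 cut points in the same way); the data and names are illustrative:

```python
from statistics import pvariance

def best_threshold(values):
    """Find the dividing value t minimizing the sum of the within-group
    variances of the two groups {v <= t} and {v > t}."""
    xs = sorted(values)
    candidates = sorted(set(xs))[:-1]   # a cut at the maximum would leave one group empty

    def w(t):
        left = [v for v in xs if v <= t]
        right = [v for v in xs if v > t]
        return pvariance(left) + pvariance(right)

    return min(candidates, key=w)

# toy usage: a clearly bimodal feature
feature = [1.0, 0.9, 1.1, 1.2, 5.0, 5.1, 4.9, 5.2]
print(best_threshold(feature))   # -> 1.2, the cut between the two peaks
```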

Figure 5: Results on Data Close to a Normal Distribution. (a) histogram of the data; (b) plot of the within-group variances for different values of k.

Table 8: Results obtained from CLUSTER, COBWEB-3, and ITERATE

Data set   CLUSTER                  COBWEB-3                        Discretized ITERATE
Iris       C1(seto:50)              C1(seto:19, vers:47, virg:50)   C1(seto:50)
           C2(vers:2, virg:36)      C2(seto:31, vers:3)             C2(vers:48, virg:4)
           C3(vers:48, virg:14)                                     C3(vers:2, virg:46)
Heart                               C1(absent:131, present:25)      C1(absent:149, present:39)
Disease                             C2(absent:33, present:114)      C2(absent:15, present:100)

For some data sets, a feature distribution may have no major peaks, or just one major peak, as in the case of a normally distributed feature. In such cases, the sum of the within-group variances for k = 0 (no split) will always be smaller than the subsequent ones. Figure 5 illustrates such a case; this data was generated from a normal distribution. Under such circumstances, we use the standard deviation to break the data up into five equal intervals. Note that the number five was an arbitrary choice.

5.2.2 Numeric Data Results

To gain a better understanding of the different algorithms, empirical studies were conducted on a purely numeric data set and a mixed numeric/non-numeric data set. The first, the well known Iris data set, contains 150 objects. Each object is described in terms of 4 numeric attributes: sepal length, sepal width, petal length, and petal width. The objects are equally distributed among the three classes: seto(sa), vers(icolor), and virg(inica). From previous studies, it is known that the setosa class is distinct, but versicolor and virginica are mixed.

To demonstrate the ability of the discretization method on mixed data, an interpretation study was conducted on the heart disease database of the Cleveland Clinic (maintained at the UCI repository for machine learning databases; we thank Robert Detrano, M.D., Ph.D., the principal investigator for the data collection). Heart disease is the build-up of plaque on the coronary artery walls that restricts blood flow to the heart muscle, a condition termed "ischemia". The end result is a reduction or deprivation of the necessary oxygen supply to the heart muscle. The data set consists of 303 patient instances defined by 14 attributes: (i) age, (ii) gender, (iii) pain, (iv) resting blood pressure, (v) cholesterol, (vi) fasting blood sugar, (vii) resting ecg, (viii) maximum heart rate achieved during stress test, (ix) chest pain during stress test, (x) ECG ST segment slope on exertion, (xi) ECG ST depression relative to rest, (xii) fluoroscopy indication of calcification of major cardiac arteries, (xiii) thalium scan, and (xiv) angiographic status of cardiac artery patency. The last attribute is a direct indication of the presence of heart disease, and is used in reported studies as the predicted attribute. Similarly, in this study, it is used as a post-evaluation indicator for the interpretation study. The numeric attributes are (i), (iv), (v), (viii), and (x). The data comes from five classes: people with no heart disease and people with four different degrees (severities) of heart disease.

The first column of Table 8 shows the results obtained by running the numeric clustering algorithm CLUSTER[10], a variation of the ISODATA mean-square-error clustering algorithm, on the Iris data. The Euclidean metric was used to compute the distance between objects in the multi-dimensional space. The numeric results indicate that the data is mixed, and does not separate out well into individual clusters. The results obtained from COBWEB-3 are shown in column 2, and the results generated by running ITERATE on the discretized data are shown in column 3. For the Iris data, ITERATE with discretized data generates exceptionally good results. Not only do the setosa objects separate into their own class, but virginica and versicolor are also separated into two groups, each with a dominating class label. This result can be explained by the characteristics of the data. The first two attributes, sepal length and sepal width, are not very informative for separating the data. The third and fourth features, petal length and petal width, are the ones that actually determine the classification structure. The distribution of the petal length is clearly bimodal, and the discretization algorithm picks up this fact and separates the values of this feature into two groups (labels the values of the attribute with two discrete values). The distribution of the petal width is also quite clear: it is a mixture of three normal distributions.


Again, the discretization algorithm finds the two threshold values that best separate the data into three groups. Figure 6 illustrates the feature-value distributions of each of the four features for the three algorithms. It is clear that COBWEB-3 (i.e., CLASSIT) is unable to exploit the differences between classes and generates two classes with overlapping feature definitions. On the other hand, both CLUSTER and ITERATE with discretized values exploit the petal length and petal width features to form three distinct classes. The ITERATE classes seem to have better separated feature definitions.

For the heart disease data, the discretization step was performed on the five numeric attributes, and the data set was clustered using ITERATE. The result is a two class partition: one class covered 115 patient cases and the other 188 patient cases. Patient profiles were created using the relevant attribute-values of each class. Relevancy was determined by predictability P(Ai = Vij | Ck) and predictiveness P(Ck | Ai = Vij). The two profiles are presented in Table 9. Each of the profiles was interpreted in terms of the risk and presence of heart disease.

Values in Class 1 are consistent with patients having low risk and the likely absence of heart disease. The fact that these patients showed no symptoms of ischemia during exercise (stress test) is significant in diagnosing the absence of heart disease. Specifically, they were able to attain higher heart rates (feature viii) without distress during exercise, showed none of the typical changes in the ECG which are present when the heart is not recovering well due to an oxygen deficiency (features x and xi), and did not exhibit any pain on exertion (feature ix). In addition, patients in this class had indicators that suggest they are at low risk for heart disease: normal resting ECG, normal fasting blood sugar, and no artery calcification seen on fluoroscopy. The presence of atypical anginal/non-anginal pain (feature iii) could be related to a variety of health factors (not just heart disease).

Values of Class 2 are consistent with patients who are at risk for and have heart disease. A significant feature that unified this group was demographics: males in the age range 60-68 years. Statistical studies bear out that this group has a higher incidence of heart disease. Calcification of cardiac arteries (feature xii) also increases the risk for this population, since it contributes to the narrowing of the arteries. More significantly, the patients exhibit multiple symptoms directly indicative of heart disease. Specifically, they show signs of cardiac ischemia on exertion during their stress test: inability to tolerate a high heart rate (feature viii), chest pain (feature ix), and ECG changes that are indicative of the heart muscle not recovering well (features x and xi). The lack of anginal pain (feature iii) is interesting, but not necessarily contradictory to the presence of heart disease, since there is no direct indication of the severity of the heart disease present.

Figure 6: Feature distributions of the IRIS data (sepal length, sepal width, petal length, petal width): (i) results from CLUSTER; (ii) results from ITERATE; (iii) results from COBWEB-3.

Figure 7: Two feature distributions of the Heart Data. (a) maximum heart rate achieved during stress test; (b) cholesterol.

Looking at the class labels confirms the above interpretation: Class 1 is dominated by patients who are not at risk or have not experienced heart disease, whereas Class 2 is dominated by patients who are confirmed to have had heart disease. From the data discovery and interpretation viewpoint, this study reveals that the discretization methodology does retain the significance of important numeric features. An example is feature viii, the stress test heart rate. Its distribution for the two classes is illustrated in Figure 7(a). Note that the mean value for the maximum heart rate achieved is 158.4 (standard deviation = 35.6) for Class 1 and 135.2 (standard deviation = 28.3) for Class 2. Looking at the discrete representation, values 2, 3, and 4 dominate for Class 1, whereas values 0, 1, and 2 dominate for Class 2, indicating some overlap but a distinct difference. On the other hand, cholesterol, a superficial numeric feature, has overlapping distributions for the two classes (Figure 7(b)). The mean for this feature is 244.9 (standard deviation = 77.8) for Class 1 and 249.7 (standard deviation = 143.4) for Class 2, implying no significant difference. A similar result is obtained when one compares the discretized value distributions.

Comparing the performance of ITERATE-with-discretization versus COBWEB-3, one sees distinct differences for the Iris data, but comparable results for the Heart Disease data. An interesting observation is that the discretization method based on minimizing within-group variances did not apply to any of the numeric features of the Heart Disease data, but was applied to dividing up two of the Iris data features. In other words, the numeric features of the Heart Disease data are more or less normally distributed; therefore, the discretization process has no additional information to exploit in defining feature values. This issue needs to be studied in greater detail in subsequent work.


Feature                          Class-1 Values                   Class-2 Values
(i)    age                       40-68 years                      60-68 years
(ii)   gender                    female/male                      male
(iii)  pain                      atypical anginal/non-anginal     asymptomatic
(iv)   resting blood pressure    93-122 (systolic)                no relevant value
(v)    cholesterol               no relevant value                no relevant value
(vi)   fasting blood sugar       normal                           no relevant value
(vii)  resting ecg               normal                           no relevant value
(viii) stress test heart rate    139-184 bpm                      70-137 bpm
(ix)   chest pain on exertion    no                               yes
(x)    ecg ST depression         0-1 mm                           2-3 mm
(xi)   ecg ST slope              upsloping                        flat
(xii)  artery calcification      0 arteries                       1 artery
(xiii) thallium scan             normal                           reversible defect

Table 9: Heart disease patient profiles

6 Conclusions

In this chapter, we have presented the generic architecture for a knowledge discovery system and discussed, in some detail, the ITERATE system, which performs knowledge discovery by organizing data objects into relevant concept hierarchies. Concept hierarchies are useful for categorization and classification tasks in relatively unknown domains, but they may also be applied directly to improving problem-solving performance, as we illustrated with the PLAYMAKER knowledge base. Our next step is to extend the ITERATE system into a toolbox where: (i) preprocessing routines provide access to large databases and assist the user in the feature selection task, (ii) automated discovery methods are included as clustering algorithms, and (iii) graphic interfaces allow the user to study the structure hypothesized by the clustering algorithm and to interpret the meaning and efficacy of the concepts generated from a specific problem solving perspective. In typical situations, where little is known about the nature of the concepts the user is looking for, it may be necessary to iterate through this three-step process, experimenting with different subsets of features and comparing the characteristics of the groupings and concepts obtained.

Acknowledgements: The authors acknowledge the work of graduate student Gyesung Lee, who directly contributed to this research. Glenn Koller of Amoco Research Labs also provided valuable input. In addition, Lori Weinberg provided the cardiology expertise to interpret the Heart Disease data.

References

[1] G.H. Ball. "Data Analysis in the Social Sciences: What about the details?" Proc. of the AFIPS Fall Joint Computer Conference, pp. 533-560, 1965.
[2] G. Biswas and T.S. Anand. "Using the Dempster-Shafer scheme in a mixed-initiative expert system shell," Uncertainty in Artificial Intelligence, L. Kanal and J. Lemmer (eds.), Elsevier Science, pp. 223-239, 1989.
[3] G. Biswas, et al. "PLAYMAKER: A knowledge-based approach to characterizing hydrocarbon plays," International Journal of Pattern Recognition and Artificial Intelligence, vol. 4, pp. 315-339, 1990.
[4] G. Biswas, et al. "ITERATE: A conceptual clustering algorithm that produces stable clusters," in review, IEEE Trans. on Pattern Analysis and Machine Intelligence, 1993.
[5] G. Biswas, G. Lee, and J. Weinberg. "Concept formation using ITERATE: Building rule models for efficient reasoning," Proc. Applications of AI 1993: Knowledge-Based Systems in Aerospace and Industry, pp. 2-13, Orlando, FL, April 1993.
[6] G. Biswas, J. Weinberg, Q. Yang, and G. Koller. "Conceptual Clustering and Exploratory Data Analysis," Proc. of the Eighth Intl. Workshop on Machine Learning, Evanston, IL, pp. 591-595, June 1991.
[7] W. Buntine. "Myths and Legends in Learning Classification Rules," Proc. AAAI-90, Boston, MA, pp. 736-742, 1990.
[8] P. Cheeseman, et al. "AutoClass: A Bayesian Classification System," Proc. Fifth International Conference on Machine Learning, Ann Arbor, MI, pp. 54-64, 1988.
[9] D.K. Cheong, et al. "PLAYMAKER: a knowledge based expert system," Geobyte, vol. 7, no. 6, pp. 28-41, 1992.
[10] R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. John Wiley, New York, NY, 1973.
[11] D.H. Fisher. "Knowledge acquisition via incremental conceptual clustering," Machine Learning, 2:139-172, 1987.
[12] D.H. Fisher. "A computational account of basic level and typicality effects," Proceedings of the Seventh AAAI, pp. 233-238, Morgan Kaufmann, San Mateo, CA, 1988.
[13] D.H. Fisher. "Noise-tolerant conceptual clustering," Proceedings of the Eleventh IJCAI, pp. 825-830, Morgan Kaufmann, San Mateo, CA, 1989.
[14] D.H. Fisher and P. Langley. "The structure and formation of natural categories," Technical Report CS-90-05, Dept. of Computer Science, Vanderbilt University, 1990.
[15] D.H. Fisher, L. Xu, and N. Zard. "Ordering Effects in Clustering," Proceedings of the Ninth International Conference on Machine Learning, pp. 163-168, Morgan Kaufmann, San Mateo, CA, 1992.
[16] W.J. Frawley, et al. "Knowledge Discovery in Databases: An Overview," in Knowledge Discovery in Databases, G. Piatetsky-Shapiro and W.J. Frawley (eds.), AAAI/MIT Press, Menlo Park, CA, pp. 1-27, 1991.
[17] J.H. Gennari, P. Langley, and D. Fisher. "Models of incremental concept formation," Artificial Intelligence, 40:11-61, 1989.
[18] M. Gluck and J. Corter. "Information, uncertainty, and the utility of categories," Proceedings of the Seventh Annual Conf. of the Cognitive Science Society, pp. 283-287, Irvine, CA, 1985.
[19] S.J. Hanson and M. Bauer. "Conceptual clustering, categorization, and polymorphy," Machine Learning, 3:343-372, 1989.
[20] R.M. Haralick and L.G. Shapiro. Computer and Robot Vision, pp. 14-29, Addison-Wesley, 1992.
[21] A.K. Jain and R.C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ, 1988.
[22] M. Lebowitz. "Experiments with incremental concept formation: UNIMEM," Machine Learning, 2:103-138, 1987.
[23] G. Lee and G. Biswas. "A New Version of MIDST for Building PLAYMAKER: A Knowledge-Based System for Characterizing Hydrocarbon Plays," Technical Report 92-00, Department of Computer Science, Vanderbilt University, Nashville, TN, 1992.
[24] G. Lee and G. Biswas. "Concept Formation using ITERATE: Building rule models for efficient reasoning," to appear in IEEE Expert, 1993.
[25] C. Li, J.B. Weinberg, and G. Biswas. Technical Report CS-93-13, Dept. of Computer Science, Vanderbilt University, 1993.
[26] K.B. McKusick and P. Langley. "Constraints on tree structure in concept formation," Proc. 12th Intl. Joint Conf. on Artificial Intelligence, Sydney, Australia, pp. 810-816, August 1991.
[27] R. Michalski and R.E. Stepp. "Learning from observation: conceptual clustering," in Machine Learning: An Artificial Intelligence Approach, R. Michalski, J. Carbonell, and T. Mitchell (eds.), pp. 331-364, Tioga Press, Palo Alto, CA, 1983.
[28] J.R. Quinlan. "Discovering rules by induction from large collections of examples," in Expert Systems in the Micro-Electronic Age, D. Michie (ed.), Edinburgh University Press, 1979.
[29] P.E. Utgoff. Machine Learning of Inductive Bias. Kluwer Academic Publishers, 1986.
[30] J.B. Weinberg and G. Biswas. "The Order Bias in Concept Formation," in review, Twelfth National Conference on Artificial Intelligence, Seattle, WA, July 1994.
[31] L. Xu. "Improving Robustness of the COBWEB Clustering System," M.S. Thesis, Department of Computer Science, Vanderbilt University, Nashville, TN, December 1991.
