A MULTIMODAL APPROACH TO TERM EXTRACTION USING A RHETORICAL STRUCTURE THEORY TAGGER AND FORMAL CONCEPT ANALYSIS

Peter W. Eklund (1) and Rudolf Wille (2)

(1) Knowledge, Visualisation and Ordering Laboratory, Department of Computer Science, The University of Adelaide, Adelaide 5005, Australia
(2) Fachbereich Mathematik, Technische Hochschule Darmstadt, Schlossgartenstr. 7, D-64289 Darmstadt, Germany

This work reports a visual direct-manipulation approach to information filtering with a knowledge extraction strategy based on Rhetorical Structure Theory and Formal Concept Analysis. The work is multimodal in two ways: (i) it uses a text tagger to identify key terms in free text, and these terms are then used as indexation filters over the free text; (ii) it aims to normalise the content of a number of sources into a single knowledge base. The aim is automated extraction of semantic content from texts derived from different sources, merging them into a coherent knowledge base.

We use Rhetorical Structure Theory (RST) (7) to automate the identification of discourse markers in a number of natural language texts dealing with a single subject matter. Marcu (8, 10) has shown that RST can be used for the automated mark-up of natural language texts. Marcu introduced discourse trees, useful for storing information about rhetorical structure, and has shown how the identification of discourse markers in prototypical texts can be automated with 88% precision (9), compared to discourse markers identified by human analysts. We adapt Marcu's algorithm in our research. Although our work draws on recent results from natural language processing, progress in that field is not the objective. The work is motivated by the analysis of texts generated by different sources, their translation to a formal knowledge representation, and their consolidation into a single knowledge corpus. Our interest is in the analysis of this corpus to determine the reliability of information obtained from the agencies (11) and then to visually "order" and navigate the knowledge.

Our current work involves a technique called Formal Concept Analysis (FCA) (15, 16, 17, 18, 4) for browsing and retrieving text documents (12, 19, 3, 1). FCA is characterized as a propositional knowledge representation technique, i.e., it can only express monadic relations. Nowak (11) has shown that the FCA framework can be adapted to deal with multiple agent sources to determine consistencies or contradictions that result from the analysis of multiple agents. One of the complications facing lattice-based belief revision theories (such as Nowak's) is the algebraic treatment of the n-ary relations necessary to describe the real world. Recently, Wille (6) has shown that FCA can be used to represent binary relations. This allows the bi-partite knowledge representation scheme of conceptual graphs (CGs) (13, 14) to be encoded and decoded as concept lattices. Wille shows that a set of formal contexts, called the Power Context Family (PCF), can be used to represent a knowledge base of CGs (6). Furthermore, an algorithm to transform CGs into PCFs has been implemented by Groh (5).

The work described in this paper parses two natural language texts, reference materials on the same topic from two separate textbooks, and builds two discourse trees using Marcu's algorithm. These discourse trees are transformed into a PCF; this forms the consolidation of the two discourse trees. From this structure a CG knowledge base can be constructed. One of the most challenging problems is to know where the "head" or "nucleus" of any single CG begins and ends. A "nucleus" from RST can become the query graph against the power context family (PCF). This provides us, algorithmically, with the starting point for a constructive algorithm that progressively adds to the results of the query graph: RST provides a solution to identifying CG heads. Using Groh's algorithm, one can then use the discourse "head" as a query graph from which to generate a CG knowledge base.

Rhetorical Structure Theory (RST)

Rhetorical Structure Theory (RST) is a discourse theory in natural language processing (NLP). Its aim is to identify parts of sentences as either "nucleus" or "satellite". Every nucleus can have a number of satellites in various categories: elaboration, explanation, justification, generalization, etc.³ The assumptions that underlie Marcu's algorithm for building discourse trees from unrestricted texts are: (i) elementary units (called discourse markers) do not overlap in complex texts; (ii) rhetorical relations hold between these units; (iii) relations are either paratactic or hypotactic⁴; (iv) the abstract structure of texts is binary, meaning that it consists recursively of nucleus and satellite pairings. To illustrate the ideas, consider the following text from Marcu (9):

Although it is possible to generalize from positive examples only,¹ negative examples are important in preventing the algorithm from overgeneralizing.² Not only must the learning concept be general enough to cover all positive examples; it also must be specific enough to exclude all negative examples.³

"Although" marks a CONCESSIVE relation between satellite 1 and the nucleus, either 2 or 3, and the colon an ELABORATION between satellite 3 and the nucleus, either 1 or 2. The convention is that hypotactic relations are represented using first-order predicates of the form rhet_rel(NAME, satellite, nucleus) and that paratactic relations have the form rhet_rel(NAME, nucleus1, nucleus2). A correct representation of the above is:



rhet_rel(CONCESSION, 1, 2) ∨ rhet_rel(CONCESSION, 1, 3)
rhet_rel(ELABORATION, 3, 1) ∨ rhet_rel(ELABORATION, 3, 2)

Despite the ambiguity of the above, the overall rhetorical constraints associate only a single discourse tree with the text. Marcu's algorithm automates the generation of such discourse trees.
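To make the idea concrete, the following small Python sketch (our own illustration, far simpler than Marcu's constraint-satisfaction procedure) enumerates the alternative hypotheses above as (name, satellite, nucleus) triples and keeps only combinations in which no text unit acts both as a satellite and as a nucleus; that toy constraint already singles out the reading CONCESSION(1, 2) with ELABORATION(3, 2).

# Toy sketch: much simpler than Marcu's constraint-satisfaction procedure.
from itertools import product

CONCESSION = [("CONCESSION", 1, 2), ("CONCESSION", 1, 3)]     # (name, satellite, nucleus)
ELABORATION = [("ELABORATION", 3, 1), ("ELABORATION", 3, 2)]

def consistent(relations):
    # Reject any combination in which a unit is both a satellite and a nucleus.
    satellites = {sat for _, sat, _ in relations}
    nuclei = {nuc for _, _, nuc in relations}
    return satellites.isdisjoint(nuclei)

for combo in product(CONCESSION, ELABORATION):
    if consistent(combo):
        print(combo)    # only CONCESSION(1, 2) together with ELABORATION(3, 2) survives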

Formal Concept Analysis

Formal Concept Analysis is a theory of concept formation derived from lattice and order theory. In engineering and AI terms, it is an unsupervised learning technique. The best way to understand FCA is to consider a simple example. In FCA, we always build a "cross table" or context in which all the objects and their properties are enumerated. Definitive papers on the subject of FCA are found in (15).

³ There are 34 well-accepted rhetorical relations.
⁴ Paratactic relations hold between text units of equal importance, i.e. several nuclei. Hypotactic relations are those that hold between a nucleus (essential for the writer's purpose) and a "satellite", which is non-essential.

              SIZE                   DISTANCE      MOON
              small  medium  large   near  far     yes  no
  Mercury       x                      x                 x
  Venus         x                      x                 x
  Earth         x                      x            x
  Mars          x                      x            x
  Jupiter                       x            x      x
  Saturn                        x            x      x
  Uranus               x                     x      x
  Neptune              x                     x      x
  Pluto         x                            x      x

In this example from Davey and Priestley (2), the objects are the planets of our solar system. Planets have properties or attributes fixed in terms of their distance from the sun, their size, and whether or not they have a moon. The table is read by looking at the object in the i-th row and checking whether or not it exhibits the property in the j-th column.

A concept is an ordered pair (A, B) where A is a subset of the 9 planets of the solar system and B is a subset of the 7 listed properties or characteristics. This means that a formal concept denoted (A, B) consists of just those objects possessing the set of attributes in B. For example, take the object Earth and consider its attributes: B = {small, near, yes}. Now ask: what other planets possess all the attributes in B? The answer is Mars. So one concept in this context is ({Earth, Mars}, {small, near, yes}). It has as extension the set of planets {Earth, Mars} and as intension the set of properties {small, near, yes}. Note the other concepts that result from this context in the same way: ({Mercury, Venus}, {small, near, no}); ({Jupiter, Saturn}, {large, far, yes}); ({Uranus, Neptune}, {medium, far, yes}); ({Pluto}, {small, far, yes}). Thus, although there are 7 attributes and 9 objects, there are only 5 distinct object concepts, i.e. concepts generated from a single object.

It is usual to regard a concept as more general than another if its extent is a superset of the other concept's extent. This means we can define an order on concepts: (A1, B1) ≤ (A2, B2) iff A1 is a subset of A2. The consequence is a partial ordering over concepts, and this ordering has the properties of a complete lattice.

A context is a triple (G, M, I) where G and M are sets and I is a subset of the Cartesian product of G and M, i.e. I ⊆ G × M. Elements of G and M are objects and attributes respectively. (g, m) ∈ I, or gIm, says that the object g has the attribute m. The reason G and M are so named is that they derive from the German for object (Gegenstände) and attribute (Merkmale). For A ⊆ G and B ⊆ M define A′ = {m ∈ M | gIm for all g ∈ A} and B′ = {g ∈ G | gIm for all m ∈ B}. Hence A′ is the set of attributes common to all objects in A, and B′ is the set of all objects possessing every attribute in B. A formal concept of the context (G, M, I) is a pair (A, B) such that A′ = B and B′ = A. The set of all formal concepts of (G, M, I) is denoted B(G, M, I). Again following the German tradition of the formalism, the B stands for the German "Begriff". For concepts (A1, B1) and (A2, B2) in B(G, M, I) we write (A1, B1) ≤ (A2, B2) if A1 ⊆ A2. The important theoretical result is that this turns out to be equivalent to B2 ⊆ B1. This implies that the structure (B(G, M, I), ≤) is a complete lattice, called the concept lattice of (G, M, I). Concepts are placed in a lattice structure in which the meet and join of any combination of elements are given by definition. This concept lattice not only contains concepts corresponding to each object but also concepts corresponding to the meets and joins of other concepts. The lattice can express all relationships between attributes. For example, the lattice can represent the relationship that if an object has one attribute then it must have another specified attribute. It is the capability to express such relationships that makes the lattice a powerful algebraic structure.
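To make the derivation operators concrete, here is a small Python sketch (our own code, not part of the paper) that computes every formal concept of the planets context by closing each subset of objects; it is the naive algorithm and is adequate only for a context this small. Note that it produces all concepts of the lattice, including those arising as meets and joins, not just the five object concepts listed above.

from itertools import combinations

CONTEXT = {  # object -> set of attributes, transcribed from the cross table above
    "Mercury": {"small", "near", "no"},   "Venus":   {"small", "near", "no"},
    "Earth":   {"small", "near", "yes"},  "Mars":    {"small", "near", "yes"},
    "Jupiter": {"large", "far", "yes"},   "Saturn":  {"large", "far", "yes"},
    "Uranus":  {"medium", "far", "yes"},  "Neptune": {"medium", "far", "yes"},
    "Pluto":   {"small", "far", "yes"},
}
ATTRIBUTES = set().union(*CONTEXT.values())

def common_attributes(objects):      # A': attributes shared by every object in A
    return set.intersection(*(CONTEXT[g] for g in objects)) if objects else set(ATTRIBUTES)

def common_objects(attributes):      # B': objects possessing every attribute in B
    return {g for g, atts in CONTEXT.items() if attributes <= atts}

concepts = set()
for r in range(len(CONTEXT) + 1):
    for subset in combinations(CONTEXT, r):
        intent = common_attributes(set(subset))    # A'
        extent = common_objects(intent)            # A'' = (A')'
        concepts.add((frozenset(extent), frozenset(intent)))

for extent, intent in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))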

[Figure: line diagram of the concept lattice for the planets context; nodes are labelled with the attributes (size: small/medium/large, distance: near/far, moon: yes/no) and the planets they introduce.]

For example, referring to the context described in the previous table, we can look at the midpoint of the lattice: reading its extent and intent from the diagram, as the basic theorem on concept lattices allows, this midpoint is ({Earth, Mars, Pluto}, {small, yes}). The concept lattice provides a basic analysis of a context: it yields an appropriate classification of the objects and at the same time indicates the implications between attributes. The concept lattice provides the focus for a visual metaphor used to navigate a knowledge base of terms extracted from texts, and this is the basis of our work on information filtering and text navigation using FCA (12, 19, 3, 1).
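Continuing the sketch above (again our own illustrative code), the subconcept order of the concept lattice can be checked directly on extents:

def is_subconcept(c1, c2):
    # (A1, B1) <= (A2, B2) iff A1 is a subset of A2 (equivalently, B2 is a subset of B1).
    (extent1, _), (extent2, _) = c1, c2
    return extent1 <= extent2

earth_mars = (frozenset({"Earth", "Mars"}), frozenset({"small", "near", "yes"}))
midpoint = (frozenset({"Earth", "Mars", "Pluto"}), frozenset({"small", "yes"}))
print(is_subconcept(earth_mars, midpoint))   # True: ({Earth, Mars}, ...) lies below the midpoint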

An Algebraic Treatment of Conceptual Graphs

In this section we show how FCA and CGs are related. Firstly, we define an abstract conceptual graph as a structure A := (V, E, ν, C, κ, θ) for which:

1. V and E are finite sets and ν is a mapping of E to the union of V^k for k = 1, ..., n (n ≥ 2), so that (V, E, ν) can be considered a finite directed multi-hypergraph with vertices from V and edges from E (we define |e| = k if ν(e) = (v1, ..., vk));
2. C is a finite set and κ is a mapping of V ∪ E to C such that κ(e1) = κ(e2) always implies |e1| = |e2| (the elements of C may be understood as abstract concepts);
3. θ is an equivalence relation on V.

For abstract conceptual graphs to be related to formal contexts in Formal Concept Analysis, we introduce the power context family K := (K1, ..., Kn) (n ≥ 2) with Kk := (Gk, Mk, Ik) (k = 1, ..., n) such that Gk ⊆ (G1)^k. Let C_K be the union of B(Kk) for k = 1, ..., n. Now A is an abstract conceptual graph over the power context family K if C = C_K, κ(V) ⊆ B(K1), and κ(e) ∈ B(Kk) for all e ∈ E with |e| = k. A realization of such an abstract conceptual graph A in K is defined to be a mapping ρ of V to the power set of G1 for which ∅ ≠ ρ(v) ⊆ Ext(κ(v)) for v ∈ V, ρ(v1) × ... × ρ(vk) ⊆ Ext(κ(e)) for e ∈ E with ν(e) = (v1, ..., vk), and v1 θ v2 always implies ρ(v1) = ρ(v2). The pair (A, ρ) is then called a realized conceptual graph of K, or a concept graph of K.

In the extended version of the paper this section provides an example of a PCF and of how CGs can be generated from it (6). An important question is how to obtain all concept graphs of a given power context family. Wille suggests transforming "known" concept graphs of K. The six construction rules that form the basis of Groh's implementation (5) are reported in Wille (6). Wille notes that it is desirable to have a general construction yielding a few concept graphs of K from which interesting concept graphs of K can be easily derived by transformation rules. We hypothesise that a discourse analysis of the original source texts will provide the base concept graphs in the knowledge base construction.
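As a rough illustration of these definitions (our own Python sketch, not Groh's implementation, and with the simplification that vertices and edges are labelled by single attributes rather than by formal concepts), a power context family can be stored as one formal context per arity, and the realization condition checked directly:

from dataclasses import dataclass
from itertools import product

@dataclass
class FormalContext:
    objects: set        # G_k: for k >= 2 these are k-tuples over G_1
    attributes: set     # M_k
    incidence: set      # I_k, stored as a set of (object, attribute) pairs

    def extent(self, attribute):
        return {g for g in self.objects if (g, attribute) in self.incidence}

def realizes(rho, vertex_labels, edge_labels, edges, contexts):
    # rho: vertex -> non-empty set of objects from G_1; contexts: arity k -> FormalContext K_k.
    for v, label in vertex_labels.items():
        if not rho[v] or not rho[v] <= contexts[1].extent(label):
            return False        # violates: {} != rho(v) subset-of Ext(kappa(v))
    for e, vertices in edges.items():
        allowed = contexts[len(vertices)].extent(edge_labels[e])
        if not set(product(*(rho[v] for v in vertices))) <= allowed:
            return False        # violates: rho(v1) x ... x rho(vk) subset-of Ext(kappa(e))
    return True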

FCA/CG Algorithm

Groh's implementation (5) of Wille's algorithm (6) creates the PCF for the query exactly in the way described above. Objects that fit into a vertex of the query have to be collected from the PCF of the CG. In the PCF of the query the vertices are the objects and the relations, as in the PCF of the CG, are the attributes. The simplest approach is a loop over all relations in the PCF of the query: the vertices in each relation are obtained from the PCF of the query simply by taking the extent of that relation, while the possible objects are obtained by taking the extent of the same relation in the PCF of the CG. The objects for a vertex satisfying the query are those collected from every relation in which that vertex appears. A good structure in which to store this information is again a formal context; the satisfying objects are just the extent of the bottom element of its lattice. A summary of the steps presented in this work is given below; a small illustrative sketch follows the list.

INPUT: two or more texts T_i
1. Determine the set D of all discourse markers and the set U_T of elementary textual units in T_i.
2. Hypothesize a set of relations R between the elements of U_T.
3. Use a constraint satisfaction procedure to determine all the discourse trees T of T_i.
4. Assign a weight to each of the discourse trees and determine the tree(s) with maximal weight.
5. Convert all discourse trees T into a single power context family K.
6. Identify all nuclei N from the discourse trees T.
7. For each nucleus in N construct a concept graph (A, ρ).
8. Maximally join to each concept graph (A, ρ) all compatible rhetorical relations from the power context family K.
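The following Python sketch (our own illustration of the loop just described, with hypothetical toy data) collects, for each query vertex, the objects that fit it in every relation of the query by intersecting the corresponding extents taken from the PCF of the CG knowledge base:

def satisfying_objects(query_relations, kb_extents):
    # query_relations: relation name -> tuple of query vertices appearing in it.
    # kb_extents:      relation name -> set of object tuples (the relation's extent in the CG's PCF).
    candidates = {}
    for relation, vertices in query_relations.items():
        tuples = kb_extents.get(relation, set())
        for position, vertex in enumerate(vertices):
            objects_here = {t[position] for t in tuples}
            candidates[vertex] = candidates[vertex] & objects_here if vertex in candidates else objects_here
    return candidates

# Hypothetical toy data: a single binary ELABORATION relation in the knowledge base.
kb = {"ELABORATION": {("version_space", "candidate_elimination"),
                      ("version_space", "bidirectional_search")}}
query = {"ELABORATION": ("x", "y")}
print(satisfying_objects(query, kb))
# {'x': {'version_space'}, 'y': {'candidate_elimination', 'bidirectional_search'}}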

Conclusion

The full paper presents the results of the 8 steps above to illustrate how discourse markers are tagged using Marcu's algorithm, using text from two artificial intelligence textbooks that deal with the version space algorithm (see Appendix). These inputs are normalised into a power context family (PCF), collapsing the contents of both sources into a single algebraic structure. This structure is then used to create a single conceptual graph knowledge base using a constructive algorithm that relies on identifying the nuclei of each discourse tree using Marcu's algorithm.

References

1. C. Carpineto and G. Romano. A lattice conceptual clustering system and its application to browsing retrieval. Machine Learning, 24:95-122, 1996.
2. B.A. Davey and H.A. Priestley. Introduction to Lattices and Order. Cambridge University Press, 1990.
3. A. Fall and G. Mineau, editors. Dealing with Large Contexts in Formal Context Analysis: A Case Study Using Medical Texts. Lecture Notes in Computer Science, Vancouver, 1997. Springer Verlag.
4. B. Rich, G. Ellis, B. Levinson and J.F. Sowa, editors. A Triadic Approach to Formal Concept Analysis. Number 954 in Lecture Notes in Computer Science, Berlin, 1995. Springer Verlag.
5. Bernd Groh. An application of FCA and conceptual graphs. Technical report, Fachbereich Mathematik, 1997.
6. M. Keeler, H. Delugach and D. Lukose, editors. Conceptual Graphs and Formal Concept Analysis. Number 1XXX in Lecture Notes in Computer Science, Berlin, 1997. Springer Verlag.
7. W.C. Mann and S.A. Thompson. Rhetorical structure theory: Toward a functional theory of text organisation. Text, 8(3):243-281, 1988.
8. Daniel Marcu. Building up Rhetorical Structure Trees. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI), pages 1069-1074, 1996.
9. Daniel Marcu. The rhetorical parsing of natural language texts. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI), in press, 1997.
10. Daniel Marcu. The Rhetorical Parsing, Summarization and Generation of Natural Language Texts. PhD thesis, Department of Computer Science, 1997.
11. Chris Nowak. Conceptual Belief Revision. PhD thesis, Department of Computer Science, The University of Adelaide, 1997. Unrevised version available at http://www.cs.adelaide.edu.au/nowak/PHD.ps.
12. P. Eklund, G. Ellis and G. Mann, editors. Application of Formal Concept Analysis to Information Retrieval using a Hierarchically Structured Thesaurus. Lecture Notes in Computer Science, Sydney, 1996. The University of New South Wales.
13. J. Sowa. Conceptual Structures: Information Processing in Mind and Machine. Addison-Wesley, 1983.
14. J. Sowa. Conceptual graphs summary. In T. Nagle, J. Nagle, L. Gerholz and P. Eklund, editors, Conceptual Structures: Current Theory and Practice, chapter 1, pages 3-52. Ellis Horwood, 1992.
15. R. Wille. Restructuring lattice theory: an approach based on hierarchies of concepts. In I. Rival, editor, Ordered Sets, pages 445-470. D. Reidel, Dordrecht, 1982.
16. R. Wille. Line diagrams of hierarchical concept systems. International Classification, 11:77-86, 1984.
17. Rudolf Wille. Tensorial decomposition of concept lattices. Order, 2:81-95, 1985.
18. Rudolf Wille. Concept lattices and conceptual knowledge systems. In Fritz Lehmann, editor, Semantic Networks in Artificial Intelligence. Pergamon Press, Oxford, 1992. Also appeared in Computers & Mathematics with Applications, 23(2-9):493-515, 1992.
19. Justin Zobel, editor. Text Retrieval for Medical Discharge Summaries using SNOMED and Formal Concept Analysis. Lecture Notes in Computer Science, Sydney, 1996. The University of New South Wales.

Appendix

The first source text for the experiment is as follows:

ALGORITHM: CANDIDATE ELIMINATION

Given: A representation language and a set of positive and negative examples expressed in that language.
Compute: A concept description that is consistent with all the positive examples and none of the negative examples.
1. Initialize G to contain one element: the null description (all features are variables).
2. Initialize S to contain one element: the first positive example.

3. Accept a new training example. If it is a positive example, first remove from G any descriptions that do not cover the example. Then, update the S set to contain the most specific set of descriptions in the version space that cover the example and the current elements of the S set. That is, generalize the elements of S as little as possible so that they cover the new training example. If it is a negative example, first remove from S any descriptions that cover the example. Then, update the G set to contain the most general set of descriptions in the version space that do not cover the example. That is, specialize the elements of G as little as possible so that the negative example is no longer covered by any of the elements of G.
4. If S and G are both singleton sets, then if they are identical, output their value and halt. If they are both singleton sets but they are different, then the training cases were inconsistent. Output this result and halt. Otherwise, go to step 3.

Let us trace the operation of the candidate elimination algorithm. Suppose we want to learn the concept of "Japanese economy car" from the examples in Figure 17.12. G and S both start out as singleton sets. G contains the null description (see Figure 17.11), and S contains the first positive training example. The version space now contains all descriptions that are consistent with this first example.

G = {(x1, x2, x3, x4, x5)}
S = {(Japan, Honda, Blue, 1980, Economy)}

Now we are ready to process the second example. The G set must be specialized in such a way that the negative example is no longer in the version space. In our representation language, specialization involves replacing variables with constants. (Note: the G set must be specialized only to descriptions that are within the current version space, not outside of it.) Here are the available specializations:

G = {(x1, Honda, x3, x4, x5), (x1, x2, Blue, x4, x5), (x1, x2, x3, 1980, x5), (x1, x2, x3, x4, Economy)}

The S set is unaffected by the negative example. Now we come to the third example, a positive one. The first order of business is to remove from the G set any descriptions that are inconsistent with the positive example. Our new G set is:

G = {(x1, x2, Blue, x4, x5), (x1, x2, x3, x4, Economy)}

We must now generalize the S set to include the new example. This involves replacing constants with variables. Here is the new S set:

S = {(Japan, x2, Blue, x4, Economy)}

At this point, the S and G sets specify a version space (a space of candidate descriptions) that can be translated roughly into English as: "The target concept may be as specific as 'Japanese, blue economy car', or as general as either 'blue car' or 'economy car'." Next, we get another negative example, a car whose origin is USA. The S set is unaffected, but the G set must be specialized to avoid covering the new example. The new G set is:

G = {(Japan, x2, Blue, x4, x5), (Japan, x2, x3, x4, Economy)}

We now know that the car must be Japanese, because all of the descriptions in the version space contain Japan as origin. Our final example is a positive one. We first remove from the G set any descriptions that are inconsistent with it, leaving:

G = {(Japan, x2, x3, x4, Economy)}

We then generalize the S set to include the new example:

S = {(Japan, x2, x3, x4, Economy)}

S and G are both singletons, so the algorithm has converged on the target concept. No more examples are needed.

There are several things to note about the candidate elimination algorithm. First, it is a least-commitment algorithm. The version space is pruned as little as possible at each step. Thus, even if all the positive training examples are Japanese cars, the algorithm will not reject the possibility that the target concept may include cars of other origin, until it receives a negative example that forces the rejection. This means that if the training data are sparse, the S and G sets may never converge to a single description; the system may learn only partially specified concepts. Second, the algorithm involves exhaustive, breadth-first search through the version space. We can see this in the algorithm for updating the G set. Contrast this with the depth-first behaviour of Winston's learning program. Third, in our simple representation language, the S set always contains exactly one element, because any two positive examples always have exactly one generalization. Other representation languages may not share this property.

The version space approach can be applied to a wide variety of learning tasks and representation languages. The algorithm above can be extended to handle continuously valued features and hierarchical knowledge. However, version spaces have several deficiencies. One is the large space requirement of the exhaustive, breadth-first search mentioned above. Another is that inconsistent data, also called noise, can cause the candidate elimination algorithm to prune the target concept from the version space prematurely. In the car example above, if the third training instance had been mislabeled (-) instead of (+), the target concept of "Japanese economy car" would never be reached. Also, given enough erroneous negative examples, the G set can be specialized so far that the version space becomes empty. In that case, the algorithm concludes that no concept fits the training examples.

One solution to this problem [Mitchell, 1978] is to maintain several G and S sets. One G set is consistent with all the training instances, another is consistent with all but one, another with all but two, etc. (and the same for the S set). When an inconsistency arises, the algorithm switches to G and S sets that are consistent with most, but not all, of the training examples. Maintaining multiple version spaces can be costly, however, and the S and G sets are typically very large. If we assume bounded inconsistency, i.e., that instances close to the target concept boundary are the most likely to be misclassified, then more efficient solutions are possible. Hirsh [1990] presents an algorithm that runs as follows. For each instance, we form a version space consistent with that instance plus other nearby instances (for some suitable definition of nearby). This version space is then intersected with the one created for all previous instances. We keep accepting instances until the version space is reduced to a small set of candidate concept descriptions. (Because of inconsistency, it is unlikely that the version space will converge to a singleton.) We then match each of the concept descriptions against the entire data set, and choose the one that classifies the instances most accurately.

Another problem with the candidate elimination algorithm is the learning of disjunctive concepts.
Suppose we wanted to learn the concept of "European car", which, in our representation, means either a German, British, or Italian car. Given positive examples of each, the candidate elimination algorithm will generalize to cars of any origin. Given such a generalization, a negative instance (say, a Japanese car) will only cause an inconsistency of the type mentioned above. Of course, we could simply extend the representation language to include disjunctions. Thus, the concept space would hold descriptions such as "blue car of German or British origin" and "Italian sports car or German luxury car". This approach has two drawbacks. First, the concept space becomes much larger and specialization becomes intractable. Second, generalization can easily degenerate to the point where the S set contains simply one large disjunction of all positive instances. We must somehow force generalization while allowing for the introduction of disjunctive descriptions. Mitchell [1978] gives an iterative approach that involves several passes through the training data. On each pass, the algorithm builds a concept that covers the largest number of positive training instances without covering any negative training instances. At the end of the pass, the positive training instances covered by the new concept are removed from the training set, and the new concept then becomes one disjunct in the eventual disjunctive concept description. When all positive training instances

have been removed, we are left with a disjunctive concept that covers all of them without covering any negative instances. There are a number of other complexities, including the way in which features interact with one another. For example, if the origin of a car is Japan, then the manufacturer cannot be Chrysler. The version space algorithm as described above makes no use of such information. Also, in our example, it would be more natural to replace the decade slot with a continuously valued year field. We would have to change our procedures for updating the S and G sets to account for this kind of numerical data.
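For illustration, the two update operations used in the trace above can be sketched in Python on the five-slot car descriptions, writing variables as the strings "x1" to "x5". This is our own code, and the concrete negative and positive instances below are hypothetical, since Figure 17.12 is not reproduced here; they are chosen so that the outputs match the trace.

def is_variable(value):
    return value.startswith("x")              # variables are written "x1" ... "x5"

def generalize(description, positive):
    # Replace constants with variables so that the description covers the positive example.
    return tuple(d if is_variable(d) or d == p else f"x{i + 1}"
                 for i, (d, p) in enumerate(zip(description, positive)))

def specializations(general, seed, negative):
    # Minimal specializations: fill one variable slot with the value from the current
    # specific hypothesis (seed), provided that slot rules out the negative example.
    return [general[:i] + (s,) + general[i + 1:]
            for i, (g, s, n) in enumerate(zip(general, seed, negative))
            if is_variable(g) and not is_variable(s) and s != n]

seed = ("Japan", "Honda", "Blue", "1980", "Economy")        # the first positive example
negative = ("Japan", "Toyota", "Green", "1970", "Sports")   # hypothetical second example
print(specializations(("x1", "x2", "x3", "x4", "x5"), seed, negative))
# -> the four specializations of the null description listed in the trace above
print(generalize(seed, ("Japan", "Toyota", "Blue", "1990", "Economy")))
# -> ('Japan', 'x2', 'Blue', 'x4', 'Economy'), the generalized S set of the trace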

The second source text for the experiment is as follows:

THE CANDIDATE ELIMINATION ALGORITHM

This section presents three algorithms (Mitchell 1982) for searching the concept space. These algorithms rely upon the notion of a version space, which is the set of all concept descriptions consistent with the training examples. These algorithms work by reducing the size of the version space as more examples become available. The first two algorithms reduce the version space in a specific-to-general direction and a general-to-specific direction, respectively. The third algorithm, called candidate elimination, combines these approaches into a bi-directional search. In this section we describe and evaluate these algorithms.

These algorithms are data driven; they generalize based on regularities found in the training data. Also, in using training data of known classification, these algorithms perform a variety of supervised learning. As with Winston's program for learning structural descriptions, version space search uses both positive and negative examples of the target concept. Although it is possible to generalize from positive examples only, negative examples are important in preventing the algorithm from overgeneralizing. Not only must the learning concept be general enough to cover all positive examples; it also must be specific enough to exclude all negative examples. In the space of Figure 12.5, one concept that would cover all sets of exclusively positive instances would simply be obj(X, Y, Z). However, this concept is probably too general, because it implies that all instances belong to the target concept. One way to avoid overgeneralization is to generalize as little as possible to cover positive examples; another is to use negative instances to eliminate overly general concepts.

Specific-to-general search maintains a set, S, of hypotheses, or candidate concept definitions. To avoid overgeneralization, these candidate definitions are the maximally specific generalizations from the training data. A concept, c, is maximally specific if it covers all positive examples, none of the negative examples, and for any other concept, c′, that covers the positive examples, c ≤ c′. We define specific-to-general search as:

Begin
  Initialize S to the first positive training instance;
  N is the set of all negative instances seen so far;
  For each positive instance p
  Begin
    For every s in S, if s does not match p, replace s with its
      most specific generalizations that match p;
    Delete from S all hypotheses more general than some other hypothesis in S;
    Delete from S all hypotheses that match a previously observed negative instance in N;
  End;
  For every negative instance n
  Begin
    Delete all members of S that match n;
    Add n to N to check future hypotheses for overgeneralization;
  End;
End

We may also search in a general-to-specific direction. This algorithm maintains a set, G, of maximally general concepts that cover all of the positive and none of the negative instances. A concept, c, is maximally general if it covers none of the negative training instances, and for any other concept, c′, that covers no negative training instance, c ≥ c′. In this algorithm, negative instances lead to the specialization of candidate concepts; the algorithm uses positive instances to eliminate overly specialized concepts.

Begin
  Initialize G to contain the most general concept in the space;
  P contains all positive examples seen so far;
  For each negative instance n
  Begin
    For each g in G that matches n, replace g with its
      most general specializations that do not match n;
    Delete from G all hypotheses more specific than some other hypothesis in G;
    Delete from G all hypotheses that fail to match some positive example in P;
  End;
  For each positive instance p
  Begin
    Delete from G all hypotheses that fail to match p;
    Add p to P;
  End;
End

In this example, the algorithm uses background knowledge that size may have values {large, small}, color may have values {red, white, blue}, and shape may have values {ball, brick, cube}. This knowledge is essential if the algorithm is to specialize concepts by substituting constants for variables. The candidate elimination algorithm combines these approaches into a bi-directional search of the version space. As we shall see, this bi-directional approach has a number of benefits for the learner. The algorithm maintains two sets of candidate concepts: G is the set of maximally general candidate concepts, and S is the set of maximally specific candidates. The algorithm specializes G and generalizes S until they converge on the target concept. The algorithm is defined as follows:

Begin
  Initialize G to be the most general concept in the space;
  Initialize S to the first positive training instance;
  For each new positive instance p
  Begin
    Delete all members of G that fail to match p;
    For every s in S, if s does not match p, replace s with its
      most specific generalizations that match p;
    Delete from S any hypothesis more general than some other hypothesis in S;
    Delete from S any hypothesis not more specific than some hypothesis in G;
  End;
  For each new negative instance n
  Begin
    Delete all members of S that match n;
    For each g in G that matches n, replace g with its
      most general specializations that do not match n;
    Delete from G any hypothesis more specific than some other hypothesis in G;
    Delete from G any hypothesis more specific than some hypothesis in S;
  End;

  If G = S and both are singletons, then the algorithm has found a single concept
    that is consistent with all the data and the algorithm halts;
  If G and S become empty, then there is no concept that covers all positive
    instances and none of the negative instances;
End
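For comparison with the trace of the first source text, here is a compact Python rendering of the bi-directional update above (our own code, not the textbook's), reusing the is_variable, generalize and specializations helpers sketched after the first source text. The within-set pruning of non-maximal and non-minimal hypotheses is reduced to the cross-checks between S and G, which suffices for this example, and the training data are hypothetical:

def matches(description, instance):
    return all(is_variable(d) or d == i for d, i in zip(description, instance))

def candidate_elimination(examples):
    # examples: list of (instance, is_positive) pairs; instances are 5-slot tuples.
    first_positive = next(inst for inst, positive in examples if positive)
    S = [first_positive]                            # maximally specific candidates
    G = [("x1", "x2", "x3", "x4", "x5")]            # the most general concept in the space
    for instance, positive in examples:
        if positive:
            G = [g for g in G if matches(g, instance)]
            S = [generalize(s, instance) for s in S]
            S = [s for s in S if any(matches(g, s) for g in G)]
        else:
            S = [s for s in S if not matches(s, instance)]
            G = [h for g in G for h in              # assumes S is non-empty by now
                 ([g] if not matches(g, instance) else specializations(g, S[0], instance))]
            G = [g for g in G if any(matches(g, s) for s in S)]
    return S, G

examples = [                                        # hypothetical data in the spirit of Figure 17.12
    (("Japan", "Honda", "Blue", "1980", "Economy"), True),
    (("Japan", "Toyota", "Green", "1970", "Sports"), False),
    (("Japan", "Toyota", "Blue", "1990", "Economy"), True),
    (("USA", "Chrysler", "Blue", "1980", "Economy"), False),
    (("Japan", "Honda", "White", "1980", "Economy"), True),
]
print(candidate_elimination(examples))
# -> S and G both converge to ('Japan', 'x2', 'x3', 'x4', 'Economy')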

Combining the two directions of search into a single algorithm has several benefits. The G and S sets summarize the information in the negative and positive training instances respectively, eliminating the need to save these instances. For example, after generalizing S to cover a positive instance, the algorithm uses G to eliminate concepts in S that cover negative instances. Because G is the set of maximally general concepts that do not match any negative training instances, any member of S that is more general than some member of G must match some negative instance. Similarly, because S is the set of maximally specific generalizations that cover all positive instances, any new member of G that is more specific than a member of S must fail to cover some positive instance and may also be eliminated.

An interesting aspect of candidate elimination is its incremental nature. An incremental learning algorithm accepts training instances one at a time, forming a usable, although possibly incomplete, generalization after each example. This contrasts with batch algorithms, which require all training examples to be present before they may begin learning. Even before the candidate elimination algorithm converges on a single concept, the G and S sets provide usable constraints on that concept: if c is the goal concept, then for all g ∈ G and s ∈ S, s ≤ c ≤ g. Any concept that is more general than some concept in G will cover negative instances; any concept that is more specific than some concept in S will fail to cover some positive instances. This suggests that instances that have a "good fit" with the concepts bounded by G and S are at least plausible instances of the concept.