Adaptive Features of Machine Learning Methods

Petr Berka
Abstract: This paper gives a survey of (symbolic) machine learning methods that exhibit significant features of adaptivity. The paper discusses incremental learning, learning in dynamically changing domains, knowledge integration, theory revision, case-based reasoning and inductive logic programming.
Index Terms: adaptivity, machine learning
I. INTRODUCTION

Let us start this survey of adaptive features of machine learning methods with the working definition of smart adaptive systems as formulated within the IST-2000-29207 project EUNITE. A smart adaptive system:
1. can adapt to a changing environment,
2. can adapt to similar settings without explicitly being ported to them,
3. can „grow“ into an application (I understand this feature as the ability to improve the behavior of a system over time).
Can machine learning methods fulfill some (or all) of these requirements? When looking for a definition of machine learning (ML) in textbooks on machine learning or data mining, we can find statements like: „The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.“ [33] or „Things learn when they change their behavior in a way that makes them perform better in the future.“ [57] When comparing these statements with the working definition of smart adaptive systems given above, we can conclude that machine learning inherently exhibits adaptive features. The more complicated answer to the question „are machine learning methods adaptive?“ that we try to formulate in this paper is to look for approaches that make symbolic ML methods even smarter and more adaptive than they are in general.

P. Berka is with the Laboratory for Intelligent Systems, University of Economics, Prague, and the Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic, Prague (e-mail: berka@vse.cz).
The problems related to the above definition of smart adaptive systems are studied under various names in different subfields of machine learning:
• Incremental learning
• Learning in dynamically changing domains
• Knowledge integration, meta-learning and multistrategy learning
• Knowledge revision and knowledge refinement
• Case-Based Reasoning (CBR) and lazy learning
• Inductive Logic Programming (ILP)
are the keywords of various aspects of our definition of adaptivity. Let us stress here that these keywords are not on the same level of abstraction. Incremental learning and learning in dynamically changing domains are specific problems within the field of machine learning. Knowledge integration, knowledge revision and knowledge refinement are important issues for a broader class of artificial intelligence (AI) methods dealing with knowledge representation and reasoning. CBR and ILP are well established techniques of AI or ML. The aim of this paper is to discuss these problems, notions and methods from a unified viewpoint: the viewpoint of adaptive features with respect to the machine learning paradigm.

II. INCREMENTAL LEARNING

Incremental learning (as opposed to batch learning) is the ability of a learning system to update the knowledge (model) built so far when new training examples are observed. The learner successively takes one example and the current hypothesis as the input and produces a new hypothesis about the target concept as the output. This basic scenario is referred to as iterative learning. Iterative inference can be refined using feedback [28]. An incremental algorithm can improve its behavior over time. Most symbolic ML algorithms work in batch mode; nevertheless, there exist incremental versions of some algorithms as well. Let us mention here some early work on incremental algorithms for top-down induction of decision trees (ID4 [40], ID5 [48]) or for learning decision rules (AQ16 [23], YAILS [43]). Recent developments of incremental learning in the framework of decision trees can be found in [49].
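The iterative protocol described above can be made concrete with a small sketch. The following Python fragment is purely illustrative (the class and its counting scheme are our own toy construction, not ID4, ID5, AQ16 or YAILS): the hypothesis is a table of attribute-value counts that is revised one example at a time, so the model is usable, and improvable, at any point of the example stream.

```python
from collections import defaultdict

class ToyIncrementalLearner:
    """Illustrative incremental learner: the hypothesis is a table of
    (attribute, value, label) counts updated one example at a time."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.labels = set()

    def update(self, example, label):
        """Take one example and the current hypothesis, return a revised one."""
        self.labels.add(label)
        for i, value in enumerate(example):
            self.counts[(i, value, label)] += 1
        return self

    def predict(self, example):
        """Classify by summing the evidence accumulated so far."""
        scores = {lab: sum(self.counts[(i, v, lab)]
                           for i, v in enumerate(example))
                  for lab in self.labels}
        return max(scores, key=scores.get) if scores else None

# Examples arrive one by one; the hypothesis is revised after each of them.
learner = ToyIncrementalLearner()
for x, y in [(("sunny", "hot"), "no"), (("rainy", "mild"), "yes")]:
    print(learner.predict(x))   # prediction with the current hypothesis
    learner.update(x, y)        # incremental revision with the new example
```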
III. LEARNING IN DYNAMICALLY CHANGING DOMAINS

While in incremental learning all examples observed so far are used for learning, in dynamically changing domains only the newest examples are relevant. So the key issue is learning and forgetting.

One of the first systems designed to deal with concept drift is STAGGER [41]. STAGGER learns by updating statistical (Bayesian) measures of logical sufficiency and necessity of a set of description items in a distributed concept representation. The effect is a continual strengthening or weakening of partial hypotheses in reaction to new evidence. One of the features of the system is that it is sensitive to the amount of previous training; the longer the system has been trained on some concept, the more deeply ingrained the concept will be, and the more hesitant the system will be to abandon it and adjust to a new concept.

The family of FLORA systems ([52], [56]) uses a window to keep track of recent training examples during incremental on-line learning. A change of concept is recognized when new examples contradict the hypotheses learned so far. In such a case, the oldest (irrelevant) examples are forgotten. FLORA uses a heuristic algorithm that automatically and dynamically adjusts the size of the window (a schematic sketch of this windowing idea follows at the end of this section). A modification of this approach for continuous domains is described in [26].

To a limited extent, the idea of tracking concept drift in the incremental learning paradigm has been studied in the context of unsupervised learning, especially in concept formation. In the systems COBWEB (an incremental, probabilistic conceptual clustering system described in [17]) and CLASSIT [19], the dynamic modification of the concept description is facilitated by two complementary operators, merge and split. These perform generalization of two existing concepts into one (merge) and specialization of a concept into two (split). A control strategy for COBWEB that uses a window to deal with concept drift is described in [24]. Another extension of COBWEB is ECOBWEB, a concept formation program for the creation of hierarchical classification trees. The system employs multistrategy learning; it includes case-based reasoning capabilities and can also trace a change (i.e., concept drift) in a domain [38]. The AICC (attribute-incremental concept creator) system is based on COBWEB as well [14].

A different viewpoint on this problem is based on so-called context. A context can be understood as [31]:
• the situation in which the examples are observed during learning,
• relevant attributes at present unknown (hidden to the learner),
• attributes that alone do not contribute to classification performance, but that improve classification in combination with other attributes.
One of the first researchers who recognized the effect of context was Michalski ([32]). He suggested a specialized two-tiered representation to represent different aspects of context-dependent concepts (see also [6]).
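The learn-and-forget principle behind window-based drift handling can be sketched as follows. The thresholds, the shrinking heuristic and the scikit-learn-style fit/predict interface of the base learner are our own illustrative assumptions; this is not the actual FLORA algorithm.

```python
from collections import deque

class WindowedLearner:
    """Schematic sketch of window-based drift handling: only the newest
    examples are kept; the window shrinks when recent accuracy drops
    (suspected drift) and grows back while the concept appears stable.
    Thresholds and the base-learner interface are illustrative assumptions."""

    def __init__(self, base_learner_factory, min_size=20, max_size=200):
        self.factory = base_learner_factory      # returns fit/predict objects
        self.window = deque()                    # most recent (x, y) pairs
        self.min_size, self.max_size = min_size, max_size
        self.capacity = max_size
        self.recent_hits = deque(maxlen=30)      # rolling accuracy estimate
        self.model = None

    def observe(self, x, y):
        # 1. Evaluate the current hypothesis on the newly arrived example.
        if self.model is not None:
            self.recent_hits.append(1 if self.model.predict([x])[0] == y else 0)
        # 2. Adjust the window size: shrink on suspected drift, else grow.
        if len(self.recent_hits) == self.recent_hits.maxlen:
            accuracy = sum(self.recent_hits) / len(self.recent_hits)
            if accuracy < 0.6:
                self.capacity = max(self.min_size, self.capacity // 2)
            else:
                self.capacity = min(self.max_size, self.capacity + 1)
        # 3. Store the example and forget the oldest ones outside the window.
        self.window.append((x, y))
        while len(self.window) > self.capacity:
            self.window.popleft()
        # 4. Rebuild the hypothesis from the window content only.
        xs, ys = zip(*self.window)
        self.model = self.factory().fit(list(xs), list(ys))
```

Real systems such as FLORA update the concept description incrementally rather than retraining from scratch and use more elaborate heuristics for adjusting the window; the sketch only conveys the combination of learning and forgetting.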
In batch learning, the basic idea is that the knowledge learned in one context should be applied in a different context (e.g., [44], [45], [21]). The system SPLICE uses a batch learner and so-called contextual clustering to identify stable hidden contexts and the associated context-specific, locally stable concepts [22]. In incremental learning, the identification of concept drift is equivalent to the identification of changes in the context. The current context may be described by a hidden indicator variable. A meta-learning strategy for the identification of a context indicator variable is implemented in the MetaL system ([53], [54]).

IV. KNOWLEDGE INTEGRATION

The knowledge obtained during ML can be integrated either on the representational level or on the operational level.

Knowledge integration on the representational level can be understood as a combination of knowledge obtained from different sets of training examples, or as a combination of domain knowledge and knowledge learned from data. So, integration on the representational level is closely related to incremental learning (first situation, discussed earlier) or to knowledge revision (second situation, discussed later). Knowledge integration on the representational level is also a research topic in knowledge engineering. Here, the problem is how to integrate pieces of knowledge obtained from different experts. This type of problem inspired some work on integrating machine learning and knowledge acquisition ([18], [20]). In both cases, the idea is to use machine learning methods for building knowledge-based systems as an additional tool to knowledge elicitation from a human expert.

Knowledge integration on the operational level can be understood as model combination. The goal is not a single representation but a single decision. Knowledge integration in this second meaning is related to the notions of multistrategy learning and meta-learning. The term multistrategy suggests that several learning methods are used and their results are combined. The combination can be based on some expert knowledge, or on learning at the meta-level (meta-learning). Meta-learning can be defined as learning from information generated by (so-called) base learners. A nice description of these notions can be found, e.g., in [12].

There are three basic approaches to combining multiple models into a single decision: bagging, boosting and stacking. For a comparison of different methods of combining models see, e.g., [15] or [5]. Bagging and boosting both combine the different models by voting, but they differ in the way the individual models are derived. In bagging, the models are created separately from different randomly created subsets of the data. Each model then contributes to the final decision with equal weight. In boosting, the models are created sequentially. Each new model is influenced by the models created previously; it is focused on the examples handled incorrectly so far. The models are combined using weights that reflect each model's performance. A well-known algorithm for boosting is AdaBoost [42]. Stacking (stacked generalization) is a meta-learning method that uses the predictions of base learners as inputs for a meta-learner. The goal is to build a model that predicts the true class from these predictions.
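To make the distinction concrete, the following Python sketch shows the bagging side of this comparison (the function names and the scikit-learn-style fit/predict interface of the base learners are our own assumptions, not a reference implementation): each model is trained on a bootstrap sample of the data and all models vote with equal weight.

```python
import random
from collections import Counter

def bagging_fit(base_learner_factory, X, y, n_models=10, seed=0):
    """Train n_models base learners, each on a bootstrap sample
    (drawn with replacement) of the training data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        models.append(base_learner_factory().fit([X[i] for i in idx],
                                                  [y[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Combine the models by unweighted voting: each model contributes
    to the final decision with equal weight."""
    votes = Counter(m.predict([x])[0] for m in models)
    return votes.most_common(1)[0][0]
```

AdaBoost [42] replaces the uniform bootstrap sampling and equal-weight voting above with error-driven example weights and performance-weighted votes; a stacking meta-learner would instead be trained on the vector of base-learner predictions.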
A multistrategy system that learns diagnostic rules using induction (from examples), deduction (from a domain theory) and abduction (from a causal model) is described in [39].

Meta-learning can be used not only for combining the results of base learners, but also for method selection. The idea is to learn which method (classifier) is suitable for a given problem. This question is studied, e.g., within the Esprit project MetaL. The goal of this project is to create a meta-learning assistant for providing user support in data mining and machine learning. Some results have already been published ([7], [11]). Machine learning research has shown that there does not exist a single algorithm which would be best suited for all tasks. Thus, for a given data mining task, the selection of a promising algorithm (or a promising combination of several algorithms) is a crucial issue. This issue strongly relates to requirement no. 2 of smart adaptive systems: to apply a successful solution (in this case, a method) in similar settings.

V. KNOWLEDGE REVISION AND KNOWLEDGE REFINEMENT

The general setting for knowledge revision is as follows: given incomplete knowledge and a set of examples, modify the knowledge to be consistent with the examples. A well-defined subfield of ML called Inductive Logic Programming (ILP) performs this task in the framework of first-order predicate logic. In ILP, the initial (and resulting) knowledge is in the form of Horn clauses. ILP systems develop predicate descriptions from examples and background knowledge. The examples (E), background knowledge (B) and final descriptions (H) are all described as logic programs. Given background knowledge B, a set of positive examples E+ and a set of negative examples E- (E = E+ ∪ E-), the goal is to find a final description H such that:
1. all positive examples can be inferred from H and B,
2. no negative example can be inferred from H and B,
3. H is consistent with B
(a toy illustration of this coverage test is sketched below). Pioneering research in this area was done by S. Muggleton [34]. By now, there are a number of books (e.g., [29], [13]), papers, conferences etc. on this topic.
The advantages of ILP (in comparison to standard learning based on an attribute-value representation of examples) are:
• the ability to process multi-relational data,
• more compact (more understandable) knowledge H,
• the ability to incorporate domain knowledge.
There exist ILP modifications of most „basic“ ML approaches (decision trees, decision rules, association rules, nearest-neighbor).
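The first two conditions above can be illustrated with a deliberately simplified coverage check. Real ILP systems work with first-order clauses containing variables; the Python sketch below uses only ground (variable-free) Horn clauses and invented predicates, so it illustrates the coverage requirements rather than an actual ILP learner.

```python
def derive(facts, rules, max_rounds=100):
    """Naive forward chaining over ground Horn clauses.
    Each rule is a pair (head, [body atoms]); atoms are plain strings."""
    known = set(facts)
    for _ in range(max_rounds):
        new = {head for head, body in rules
               if head not in known and all(b in known for b in body)}
        if not new:
            break
        known |= new
    return known

def covers(background_facts, background_rules, hypothesis, pos, neg):
    """ILP coverage test on this toy representation: every positive example
    must follow from H and B, and no negative example may follow from H and B."""
    derived = derive(background_facts, background_rules + hypothesis)
    return all(e in derived for e in pos) and not any(e in derived for e in neg)

# Hypothetical family example: B contains parent/female facts, H defines daughter.
B_facts = {"parent(ann,mary)", "female(mary)", "parent(ann,tom)"}
H = [("daughter(mary)", ["parent(ann,mary)", "female(mary)"])]
print(covers(B_facts, [], H, pos={"daughter(mary)"}, neg={"daughter(tom)"}))  # True
```

A real ILP learner searches the space of first-order clauses, e.g. by refining clause bodies, and uses such a coverage test to accept or reject candidate hypotheses.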
To a lesser extent, non-ILP approaches to knowledge revision are studied in the ML community as well. Some algorithms for knowledge revision of propositional rules ([36], [4]) or certainty-factor rules ([30]) have been developed at the University of Texas. Incremental refinement of rules is described in [16].

Knowledge revision and knowledge integration have some common features. In both cases, the question is how to update (modify) existing knowledge when it does not perform well. In the first case, the knowledge is updated on the basis of new examples (so incremental learning performs knowledge revision); in the second case, the knowledge is updated on the basis of new knowledge that uses the same representation formalism. The difference is that knowledge revision (in the above sense) can handle concept drift, whereas knowledge integration aims to retain both old and new knowledge and make them consistent.

VI. ANALOGY AND ADAPTATION

To use a successful past solution and adapt it to a new, similar problem (as required by point 2 of our definition) is the key idea of reasoning by analogy. This principle is known in the context of artificial intelligence as Case-Based Reasoning (CBR). We can find two classes of systems in CBR. One class consists of systems that use feature vectors to represent cases. These systems are close to k-nearest neighbor algorithms and propositional ML. A nice taxonomy of CBR in the context of machine learning (called there instance-based learning) is given in [2]. The other class consists of systems that use complex (or structured) representations of cases. These systems are closely related to classical AI reasoning and problem-solving systems (e.g., for planning, scheduling and design) [37].

A Case-Based Reasoning system works in a cyclic process consisting of four steps ([1]):
1. retrieve (the most similar cases),
2. reuse (the cases to attempt to solve the problem),
3. revise (the proposed solution if necessary),
4. retain (the new solution as a part of a new case).
A key issue in this process is adaptation. „Once a matching case is retrieved, a CBR system should adapt the solution stored in the retrieved case to the needs of the current case.“ ([51])

VII. CONCLUSIONS

The notion of smart adaptive methods is closely related to soft computing. Within this area, neural networks and genetic algorithms represent the machine learning paradigm in a broader sense. Symbolic machine learning methods are usually considered not to be very adaptive. Nevertheless, as shown in this paper, there exist a number of examples illustrating that symbolic machine learning methods can handle various aspects of adaptivity as well.
REFERENCES
[1] Aamodt, A., and Plaza, E.: Case-based reasoning: Foundational issues, methodological variations, and system approaches. Artificial Intelligence Communications, 7, 1994, 39-59.
[2] Aha, D., Kibler, D., and Albert, M.K.: Instance-based learning algorithms. Machine Learning, 6, 1991, 37-66.
[3] Aha, D., and Wettschereck, D.: Workshop on Case-Based Learning: Beyond Classification of Feature Vectors, 9th European Conf. on Machine Learning ECML'97, 1997.
[4] Baffes, P.T., and Mooney, R.J.: Symbolic Revision of Theories with M-of-N Rules. In: Proc. Int. Joint Conf. on Artificial Intelligence IJCAI'93, 1993.
[5] Bauer, E., and Kohavi, R.: An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1/2), 1999, 105-139.
[6] Bergadano, F., Matwin, S., Michalski, R.S., and Zhang, J.: Learning Two-tiered Descriptions of Flexible Contexts: The POSEIDON System. Machine Learning, 8(1), 1992, 5-43.
[7] Bensusan, H., Giraud-Carrier, C., and Kennedy, C.: A Higher-order Approach to Meta-learning. In: Proc. ECML'2000 Workshop on Meta-Learning: Building Automatic Advice Strategies for Model Selection and Method Combination, 2000.
[8] Berka, P.: Towards Knowledge Integration via Knowledge Revision. In: [35].
[9] Berka, P.: Recognizing Reliability of Discovered Knowledge. In: Proc. Principles of Data Mining and Knowledge Discovery PKDD'97, 1997.
[10] Brazdil, P., and Torgo, L.: Knowledge Integration and Learning. LIACC Tech. Report 91-1, 1991.
[11] Brazdil, P., and Soares, C.: A Comparison of Ranking Methods for Classification Algorithm Selection. In: Proc. 11th European Conf. on Machine Learning ECML2000, 2000.
[12] Chan, P.K., and Stolfo, S.J.: Experiments on Multistrategy Learning by Meta-Learning. In: Proc. 2nd Int. Conf. on Information and Knowledge Management, 1993.
[13] De Raedt, L. (ed.): Advances in Inductive Logic Programming. IOS Press, 1996.
[14] Devaney, Ram: Dynamically Adjusting Concepts to Accommodate Changing Contexts. In: [27].
[15] Dietterich, T.G.: An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization. Machine Learning, 40(2), 2000, 139-158.
[16] Fensel, D., and Wiese, M.: Refinement of Rule Set with JoJo. In: Proc. European Conf. on Machine Learning ECML'93, 1993, 378-383.
[17] Fisher, D.: Knowledge Acquisition via Incremental Conceptual Clustering. Machine Learning, 2, 1987, 139-172.
[18] Ganascia, J.G., Thomas, J., and Laubet, P.: Integrating Models of Knowledge and Machine Learning. In: Proc. European Conf. on Machine Learning ECML'93, 1993, 398-401.
[19] Gennari, J., Langley, P., and Fisher, D.: Models of Incremental Concept Formation. Artificial Intelligence, 40, 1989, 11-16.
[20] Graner, N., and Sleeman, D.: A Multistrategy Knowledge and Acquisition Toolbox. In: Proc. 2nd Int. Workshop on Multistrategy Learning, 1993, 107-119.
[21] Harries, M., and Horn, K.: Learning Stable Concepts in Domains with Hidden Changes in Context. In: [27].
[22] Harries, M., Sammut, C., and Horn, K.: Extracting Hidden Context. Machine Learning, 32, 1998, 101-126.
[23] Janikow, C.Z.: The AQ16 Inductive Learning Program: Some Experimental Results with AQ16 and Other Symbolic and Nonsymbolic Programs. Report of the AI Center, George Mason University, 1989.
[24] Killander, F., and Jansson, C.G.: COBBIT – A Control Procedure for COBWEB in the Presence of Concept Drift. In: Proc. European Conf. on Machine Learning ECML'93, 1993, 244-261.
[25] Klenner, M., and Hahn, U.: Concept Versioning: A Methodology for Tracking Evolutionary Concept Drift in Dynamic Concept Systems. In: Proc. 11th European Conference on Artificial Intelligence ECAI-94, 1994, 473-477.
[26] Kubat, M., and Widmer, G.: Adapting to Drift in Continuous Domains. In: Proc. European Conf. on Machine Learning ECML-95, 1995, 307-310.
[27] Kubat, M., and Widmer, G.: Workshop on Learning in Context-Sensitive Domains, Int. Conf. on Machine Learning ICML'96, 1996.
[28] Lange, S., and Zeugmann, T.: Incremental Learning from Positive Data. Journal of Computer and System Sciences, 53(1), 1996, 88-103.
[29] Lavrac, N., and Dzeroski, S.: Inductive Logic Programming: Techniques and Applications. Ellis Horwood, 1994.
[30] Mahoney, J.J., and Mooney, R.J.: Comparing Methods for Refining Certainty-Factor Rule-Bases. In: Proc. 11th Int. Conf. on Machine Learning ML-94, 1994.
[31] Matwin, S., and Kubat, M.: The Role of Context in Concept Learning. In: [27].
[32] Michalski, R.S.: How to Learn Imprecise Concepts: A Method Employing a Two-tiered Knowledge Representation for Learning. In: Proc. 4th Int. Workshop on Machine Learning, 1987, 50-58.
[33] Mitchell, T.: Machine Learning. McGraw-Hill, 1997. ISBN 0-07-042807-7.
[34] Muggleton, S. (ed.): Inductive Logic Programming. Academic Press, 1992.
[35] Nakhaeizadeh, G., Bruha, I., and Taylor, C. (eds.): Proc. of the Preconference Workshop on Dynamically Changing Domains: Theory Revision and Context Dependence Issues, European Conf. on Machine Learning ECML'97, 1997.
[36] Ourston, D., and Mooney, R.: Theory Refinement Combining Analytical and Empirical Methods. Artificial Intelligence, 66, 1994, 311-344.
[37] Plaza, E.: Cases as Episodic Models in Problem Solving. In: [3].
[38] Reich, Y., and Fenves, S.J.: Inductive Learning of Synthesis Knowledge. International Journal of Expert Systems: Research and Applications, 5(4), 1992, 275-297.
[39] Saitta, L., Botta, M., and Neri, F.: Multistrategy Learning and Theory Revision. Machine Learning, 11, 1993, 153-172.
[40] Schlimmer, J., and Fischer, D.: A Case Study of Incremental Concept Induction. In: Proc. 5th Int. Joint Conf. on AI IJCAI'86, 1986.
[41] Schlimmer, J.C., and Granger, R.H.: Incremental Learning from Noisy Data. Machine Learning, 1, 1986, 317-354.
[42] Schapire, R.: Theoretical Views of Boosting and Applications. In: Proc. 10th Int. Conf. on Algorithmic Learning Theory.
[43] Torgo, L.: YAILS: An Incremental Learning Program. LIACC Tech. Report 92.1, 1992.
[44] Turney, P.D.: Exploiting Context when Learning to Classify. In: Proc. 6th European Conf. on Machine Learning ECML'93, 1993, 402-407.
[45] Turney, P.D.: The Identification of Context-sensitive Features: A Formal Definition of Context for Concept Learning. In: [27].
[46] Turney, P.D.: The Management of Context-sensitive Features: A Review of Strategies. In: [27].
[47] Turney, P., and Halasz, M.: Contextual Normalization Applied to Aircraft Gas Turbine Engine Diagnosis. Journal of Applied Intelligence, 3, 1993, 109-129.
[48] Utgoff, P.E.: ID5: An Incremental ID3. In: Proc. 5th Int. Workshop on Machine Learning, 1988.
[49] Utgoff, P.E., Berkman, N.C., and Clouse, J.A.: Decision Tree Induction Based on Efficient Tree Restructuring. Machine Learning, 29, 1997, 5-44.
[50] Watrous, R.L., and Towell, G.: A Patient-adaptive Neural Network ECG Patient Monitoring Algorithm. In: Proc. Computers in Cardiology, 1995.
[51] Watson, I., and Marir, F.: Case-Based Reasoning: A Review. Knowledge Engineering Review, 9, 1994, 327-354.
[52] Widmer, G.: Combining Robustness and Flexibility in Learning Drifting Concepts. In: Proc. 11th European Conference on Artificial Intelligence ECAI-94, 1994, 468-472.
[53] Widmer, G.: Recognition and Exploitation of Contextual Clues via Incremental Meta-learning. In: Proc. 13th Int. Conf. on Machine Learning ICML'96, 1996, 525-533.
[54] Widmer, G.: Tracking Context Changes through Meta-Learning. Machine Learning, 27(3), 1997, 259-286.
[55] Widmer, G.: Learning in Dynamically Changing Domains: Recent Contributions of Machine Learning. In: [35].
[56] Widmer, G., and Kubat, M.: Learning in the Presence of Concept Drift and Hidden Contexts. Machine Learning, 23(1), 1996, 69-101.
[57] Witten, I.H., and Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 1999. ISBN 1-55860-552-5.