Selective Transfer of Neural Network Task Knowledge

Selective Transfer of Neural Network Task Knowledge by Daniel L. Silver

Graduate Program in Computer Science

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Faculty of Graduate Studies The University of Western Ontario London, Ontario June 2000

© Daniel L. Silver 2000

THE UNIVERSITY OF WESTERN ONTARIO
FACULTY OF GRADUATE STUDIES

CERTIFICATE OF EXAMINATION

Chief Advisor

Examining Board

Advisory Committee

The thesis by Daniel L. Silver entitled

Selective Transfer of Neural Network Task Knowledge

is accepted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Date

Chair of Examining Board

ABSTRACT

Within the context of artificial neural networks (ANN), we explore the question: How can a learning system retain and use previously learned knowledge to facilitate future learning? The research objectives are to develop a theoretical model and test a prototype system which sequentially retains ANN task knowledge and selectively uses that knowledge to bias the learning of a new task in an efficient and effective manner. A theory of selective functional transfer is presented that requires a learning algorithm that employs a measure of task relatedness. ηMTL is introduced as a knowledge based inductive learning method that learns one or more secondary tasks within a back-propagation ANN as a source of inductive bias for a primary task. ηMTL employs a separate learning rate, ηk, for each secondary task output k. ηk varies as a function of a measure of relatedness, Rk, between the kth secondary task and the primary task of interest. Three categories of a priori measures of relatedness are developed for controlling inductive bias. The task rehearsal method (TRM) is introduced to address the issue of sequential retention and generation of learned task knowledge. The representations of successfully learned tasks are stored within a domain knowledge repository. Virtual training examples generated from domain knowledge are rehearsed as secondary tasks in parallel with each new task using either standard multiple task learning (MTL) or ηMTL. TRM using ηMTL is tested as a method of selective knowledge transfer and sequential learning on two synthetic domains and one medical diagnostic domain. Experiments show that the TRM provides an excellent method of retaining and generating accurate functional task knowledge. Hypotheses generated are compared statistically to single task learning and MTL hypotheses. We conclude that selective knowledge transfer with ηMTL develops more effective hypotheses but not necessarily with greater efficiency.

The a priori measures of relatedness demonstrate significant value on certain domains of tasks but have difficulty scaling to large numbers of tasks. Several issues identified during the research indicate the importance of consolidating a representational form of domain knowledge.
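The central mechanism described above, a separate learning rate ηk for each secondary task output, scaled by a relatedness value Rk, can be illustrated with a minimal sketch. The code below is a hypothetical one-hidden-layer MTL network trained by batch back-propagation; the task pairing (XOR rehearsed with a down-weighted secondary task), the relatedness values, and all function names are illustrative assumptions, not the thesis's actual prototype or its measures of relatedness.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_eta_mtl(X, Y, relatedness, hidden=8, eta=0.5, epochs=3000, seed=0):
    """Train a one-hidden-layer MTL network; task 0 is the primary task.

    X: (n, d) inputs; Y: (n, t) targets for t parallel tasks;
    relatedness: length-t sequence of R_k values in [0, 1], with R_0 = 1.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    t = Y.shape[1]
    W1 = rng.normal(0.0, 0.5, (d, hidden))   # shared input-to-hidden weights
    W2 = rng.normal(0.0, 0.5, (hidden, t))   # hidden-to-output, one column per task
    eta_k = eta * np.asarray(relatedness)    # per-task output learning rates
    for _ in range(epochs):
        H = sigmoid(X @ W1)                  # shared internal representation
        P = sigmoid(H @ W2)                  # one prediction per task
        delta = (P - Y) * P * (1.0 - P)      # squared-error output deltas
        # Scale each task's error signal by its own eta_k, so a weakly
        # related secondary task contributes proportionally less bias,
        # both at its output weights and, through the back-propagated
        # error, within the shared hidden layer.
        grad_W2 = H.T @ (delta * eta_k)
        grad_W1 = X.T @ (((delta * eta_k) @ W2.T) * H * (1.0 - H))
        W2 -= grad_W2
        W1 -= grad_W1
    return W1, W2

# Tiny demo: primary task XOR rehearsed alongside a partially related
# secondary task (AND), down-weighted via R_1 = 0.3.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[0., 0.], [1., 0.], [1., 0.], [0., 1.]])
W1, W2 = train_eta_mtl(X, Y, relatedness=[1.0, 0.3])
primary = sigmoid(sigmoid(X @ W1) @ W2)[:, 0]
print("primary-task outputs:", np.round(primary, 2))
```

Setting Rk near 0 effectively removes a secondary task's influence, while Rk = 1 recovers standard MTL; this is the selective-transfer dial for which the thesis develops principled a priori measures.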

Keywords: task knowledge transfer, artificial neural networks, sequential learning, inductive bias, task relatedness, knowledge based inductive learning, learning to learn, knowledge consolidation

To my friends and family, especially to my wife, Geri, and my mother, Dora, for their never-ending source of love, support and encouragement. To my parents, Dora and Charlie, for their gift of energy, curiosity, and determination. To my children, Natalie and Monique, for helping me learn how to learn new things.

ACKNOWLEDGEMENTS

The successful completion of this dissertation is due in great part to the support, encouragement, and understanding of my supervisor, Robert E. Mercer. Bob recognized my interest in machine learning, created a research environment and provided intellectual, financial and personal support that never wavered during or after my internship at UWO. The other members of my Ph.D. committee, Ken McRae and Trever Cradduck, provided very helpful information and financial support during the early stages of the research effort. I would like to thank Ken McRae for providing important background information on cognitive science and analogical reasoning, research directions in the study of ANNs and psychology, and references to articles on neurophysiology. I would like to acknowledge the significant contribution made by Trever Cradduck and the Department of Nuclear Medicine, Victoria Campus, of the London Health Science Centre. As part of an on-going effort into methods of automated medical diagnosis, the department provided me with an office, computing and communication equipment, and the financial support necessary for the program. The Department of Computer Science, University of Western Ontario, provided additional financial support, with travel funding from the Faculty of Graduate Studies.

Over the course of my research I have had numerous discussions via telephone and email as well as in person with other researchers. Many have graciously provided feedback on earlier articles and chapters of this dissertation. In particular, I would like to thank: Nathalie Japkowicz, Andre Trudel, Anthony Robins, Sebastian Thrun, Lorien Pratt, Jonathan Baxter, Rich Caruana, Simon Haykin, Charles Ling, Gilbert Hurwitz, and Piotr Slomka. I would also like to thank the staff of the Department of Computer Science, UWO for their assistance and friendship through my graduate years. David Wiseman kept the machines running and solved several long-distance connection problems. Ursula Dutz and Sandra McKay were of great assistance during my stay in London. Janice Wiersma provided much help and support after my departure from the UWO campus by assuring that administrative issues were dealt with quickly and professionally.


TABLE OF CONTENTS

CERTIFICATE OF EXAMINATION ..... ii
ABSTRACT ..... iii
ACKNOWLEDGEMENTS ..... v
TABLE OF CONTENTS ..... vi
LIST OF TABLES ..... xi
LIST OF FIGURES ..... xiii

Chapter 1 INTRODUCTION ..... 1
1.1 Overview of Problem ..... 2
1.2 Research Objectives ..... 2
1.3 Motivation ..... 3
1.4 Research Approach ..... 6
1.5 Overview of the Dissertation ..... 7

Chapter 2 BACKGROUND AND PROBLEM FORMULATION ..... 9
2.1 Background on Inductive Learning and ANNs ..... 9
2.1.1 The Framework of Inductive Learning ..... 9
2.1.2 Inductive Bias and Prior Knowledge ..... 12
2.1.3 Knowledge Based Inductive Learning ..... 15
2.1.4 Learning to Learn Theory ..... 18
2.1.5 Analogical Reasoning ..... 19
2.1.6 Artificial Neural Networks ..... 20
2.2 Survey of Knowledge Transfer in ANNs ..... 26
2.2.1 Fundamental Problems of Knowledge Transfer in ANNs ..... 26
2.2.2 Summary of Previous Surveys ..... 28
2.2.3 Representational vs. Functional Transfer ..... 28
2.2.4 Approaches to Representational Transfer ..... 29
2.2.5 Approaches to Functional Transfer ..... 32
2.2.6 Analogy within Artificial Neural Networks ..... 39
2.3 Major Research Questions ..... 40
2.3.1 Knowledge Retention: Representational or Functional? ..... 41
2.3.2 Knowledge Consolidation: How is it done? ..... 41
2.3.3 Knowledge Transfer: Representational or Functional? ..... 42
2.3.4 Task Relatedness: What is it? How is it used? ..... 44
2.3.5 Sequential Learning: Is it possible in ANNs? ..... 45
2.4 Objectives and Scope of the Research ..... 46
2.4.1 Objectives ..... 46
2.4.2 Scope ..... 47

Chapter 3 FUNCTIONAL TRANSFER and SEQUENTIAL LEARNING ..... 49
3.1 ηMTL - A Basis for Selective Functional Transfer ..... 50
3.1.1 Review of MTL Network Learning ..... 50
3.1.2 Characteristics and Inherent Biases of MTL Networks ..... 54
3.1.3 Inductive Bias = Domain Knowledge + Task Relatedness ..... 57
3.1.4 Inductive Bias, Internal Representation and Related Tasks ..... 59
3.1.5 Alternative Strategies for MTL Based Task Knowledge Transfer ..... 64
3.2 A Theory of Selective Functional Transfer ..... 67
3.3 A Framework for a Measure of Task Relatedness ..... 68
3.3.1 Hints to a Framework for Employing Task Relatedness ..... 68
3.3.2 From Framework to Functional Transfer ..... 69
3.4 The Nature of Task Relatedness ..... 73
3.4.1 Relatedness expressed as a Distance Metric ..... 76
3.4.2 Relatedness as Similarity ..... 77
3.4.3 Relatedness as Shared Invariance ..... 81
3.4.4 Criteria for a Measure of Relatedness ..... 81
3.4.5 An Appropriate Test Domain for a Measure of Relatedness ..... 82
3.5 Measures of Relatedness Explored ..... 83
3.5.1 Static Measures ..... 83
3.5.2 Dynamic Measures ..... 87
3.5.3 Hybrid Measures ..... 99
3.6 Sequential Learning through Task Rehearsal ..... 99
3.6.1 Background on Rehearsal of Task Examples ..... 100
3.6.2 Model for The Task Rehearsal Method ..... 100
3.6.3 An Appropriate Test Domain for Sequential Learning ..... 105
3.7 The Functional Transfer Prototype ..... 106
3.7.1 The ANN Software ..... 107
3.7.2 The TRM Software ..... 109

Chapter 4 EXPERIMENTS WITH ηMTL: SYNTHETIC DOMAINS ..... 111
4.1 The Band Domain ..... 112
4.2 Experiments using the ηMTL Framework - Band Domain ..... 114
4.2.1 Experiment 1: Band Domain - Inductive Bias provided by each Task ..... 116
4.2.2 Experiment 2: Inductive bias from combinations of secondary tasks ..... 122
4.2.3 Experiment 3: The effect of varying Rk values ..... 127
4.2.4 Summary ..... 133
4.3 Experiments using Measures of Relatedness - Band Domain ..... 135
4.3.1 Experiment 4: Performance of Various Measures ..... 135
4.3.2 Experiment 5: Sensitivity to dynamic c parameter ..... 154
4.3.3 Summary ..... 156
4.4 The Logic Domain ..... 159
4.5 Experiments using ηMTL and Measures of Relatedness - Logic Domain ..... 161
4.5.1 Experiment 6: Inductive Bias provided by each Task ..... 161
4.5.2 Experiment 7: Performance of Various Measures ..... 164
4.5.3 Summary ..... 173
4.5.4 Experiment 8: Sensitivity to number of training examples ..... 180
4.5.5 Experiment 9: Sensitivity to dynamic c parameter ..... 181
4.5.6 Summary ..... 184

Chapter 5 EXPERIMENTS WITH TRM: SYNTHETIC DOMAINS ..... 185
5.1 Experiments using the Task Rehearsal Method - Band Domain ..... 185
5.1.1 The Impoverished Band Domain ..... 186
5.1.2 Experiment 10: STL learning of impoverished training sets ..... 191
5.1.3 Ensuring the Generation of Accurate Virtual Examples ..... 192
5.1.4 Experiment 11 - Sequential learning of impoverished Band tasks ..... 194
5.1.5 Accuracy and Value of Virtual Examples ..... 210
5.1.6 Summary ..... 212
5.2 Experiments using the Task Rehearsal Method - Logic Domain ..... 214
5.2.1 The Impoverished Logic Domain ..... 215
5.2.2 Experiment 12: STL learning of impoverished training sets ..... 215
5.2.3 Ensuring the Generation of Accurate Virtual Examples ..... 217
5.2.4 Experiment 13 - Sequential learning of impoverished Logic tasks ..... 218
5.2.5 Accuracy and Value of Virtual Examples ..... 232
5.2.6 Summary ..... 233

Chapter 6 EXPERIMENTS WITH TRM: APPLIED DOMAIN ..... 235
6.1 Medical Diagnostic Modelling and KBIL ..... 235
6.2 Coronary Artery Disease (CAD) Diagnosis ..... 236
6.3 The CAD Domain ..... 237
6.4 Experiment 14: Inductive Bias provided by each Task ..... 241
6.5 Experiment 15: Sequential learning of CAD tasks ..... 244
6.6 Analysis of the Diagnostic Improvement ..... 252
6.7 Summary ..... 253

Chapter 7 DISCUSSION ..... 256
7.1 Discussion of ηMTL Functional Transfer ..... 256
7.1.1 ηMTL as a Framework for a Measure of Relatedness ..... 257
7.1.2 Performance of the proposed Measures of Relatedness ..... 259
7.1.3 Scalability of ηMTL and the measures of relatedness ..... 264
7.1.4 Alternative explanations for the success of ηMTL ..... 271
7.1.5 Related work ..... 272
7.2 Discussion of Task Rehearsal Method of Sequential Learning ..... 274
7.2.1 Performance of the TRM as a sequential learning system ..... 274
7.2.2 Scalability of TRM ..... 279
7.2.3 Alternative explanations for the success of TRM ..... 281
7.2.4 Related work ..... 282
7.3 Advanced Issues and Open Questions ..... 283
7.3.1 Complexity of Task Relatedness and Inductive Bias ..... 283
7.3.2 The Need for Consolidated Domain Knowledge ..... 284
7.3.3 The Need for Functional Transfer for New Learning ..... 286
7.4 Selective Functional Transfer and Other Learning Methods ..... 287
7.4.1 KNN and Knowledge Transfer from Related Tasks ..... 288
7.4.2 Sequential Learning through Selective Functional Transfer with KNN ..... 289

Chapter 8 CONCLUSION ..... 291
8.1 Objectives and Approach of the Research ..... 291
8.2 Major Findings and Contributions ..... 294
8.2.1 General Issues ..... 294
8.2.2 The Prototype Software ..... 296
8.2.3 The ηMTL KBIL ..... 296
8.2.4 The Measures of Relatedness ..... 297
8.2.5 The Task Rehearsal Method ..... 298
8.3 Suggestions for Future Research ..... 299

REFERENCES ..... 303
Appendix A Glossary of Terms and Acronyms ..... 312
Appendix B Probably Approximately Correct Learning ..... 315
Appendix C Mathematical Details ..... 318
C.1 Mutual Information of Secondary Task with respect to Primary Task ..... 318
C.2 Mutual Information of Hidden Node with respect to a Task Output ..... 320
VITA ..... 322

LIST OF TABLES

3.1 Matrix of alternative strategies for MTL based task knowledge transfer. ..... 65
4.1 Summary of the training, validation and test sets of examples for each task of the Band domain. ..... 113
4.2 Experiment 1: The inductive bias provided by each secondary task of the Band Domain. ..... 118
4.3 Experiment 2: The inductive bias provided by groups of secondary tasks of the Band Domain. ..... 126
4.4 Experiment 3: The effect of manually varying measures of relatedness on the Band domain. ..... 131
4.5 Experiment 4: The performance of various measures of relatedness under ηMTL on the Band domain. ..... 138
4.6 Experiment 4: Examples of the ηk values for each of the secondary tasks of the Band domain at the beginning of training and at the point of minimum validation error. ..... 139
4.7 Experiment 5: Sensitivity of hypothesis development to the dynamic c parameter. ..... 155
4.8 Experiment 4: Ranking of the various measures of relatedness used on the Band domain. ..... 158
4.9 Description of the tasks of the Logic Domain. ..... 160
4.10 Summary of the training, validation and test examples for the eight tasks of the Logic domain for one of the experimental runs. ..... 161
4.11 Experiment 7: The performance of various measures of relatedness under ηMTL on the Logic domain. ..... 168
4.12 Experiment 9: Sensitivity of hypothesis development to the dynamic c parameter. ..... 183
5.1 Summary of the training, validation and test sets of examples for each task of the impoverished Band domain. ..... 187
5.2 Comparison of the original and impoverished training examples for T0 of the Band domain. ..... 187
5.3 Experiment 11: Results of TRM sequential learning on the impoverished Band domain. ..... 197
5.4 Experiment 11: Results of statistical tests between hypotheses developed under TRM on the impoverished Band domain. ..... 198
5.5 Summary of the number of unknown and remaining training examples for each of the eight tasks of the impoverished Logic domain. ..... 216
5.6 Experiment 13: Results of TRM sequential learning on the impoverished Logic domain. ..... 220
5.7 Experiment 13: Results of statistical tests between hypotheses developed under TRM on the impoverished Logic domain. ..... 221
6.1 Input attributes and diagnostic targets for the CAD Domain. ..... 238
6.2 The fictitious diagnostic tasks for the CAD domain. ..... 239
6.3 Summary of the data sets for the seven tasks of the CAD domain. ..... 240
6.4 Experiment 15: Results of statistical tests between hypotheses developed under TRM on the CAD domain. ..... 248

LIST OF FIGURES

2.1 The basic framework for inductive learning. ..... 10
2.2 The framework for knowledge based inductive learning. ..... 16
2.3 An artificial neuron. ..... 21
2.4 An example of how the same ANN can represent different functions depending on the value of the connection weights. ..... 23
2.5 The multi-layer feed-forward network and the back-propagation algorithm. ..... 24
2.6 An ANN generates a trajectory through weight space as it learns a new task. ..... 25
2.7 A Multiple Task Learning (MTL) network. ..... 36
3.1 Prototype of a simple 3-layer multiple task learning (MTL) network capable of computing any continuous function. ..... 51
3.2 Prototype of a complex 4-layer MTL network capable of computing an arbitrary function. ..... 52
3.3 Prototype of a more complex 5-layer MTL network. ..... 53
3.4 Training error versus the number of batch iterations for 4 hypotheses while learning all 14 non-trivial logic functions of 2 variables within an MTL network. ..... 56
3.5 An idealized representation of various hypothesis spaces under STL and MTL networks and an optimal hypothesis, h0, for the primary task, T0. ..... 61
3.6 Mean number of test set misclassifications by the primary task versus the number of hidden nodes within an MTL network. ..... 63
3.7 Percent misclassifications by primary task hypotheses versus variation in Rk for all secondary tasks. ..... 74
3.8 A function space showing the proximity of a primary task T0 to secondary tasks Tl and Tk. ..... 76
3.9 Three functions created from the composition of four Fourier series components. ..... 79
3.10 Three hypotheses created from the composition of four MTL hidden node features. ..... 80
3.11 The 3-dimensional surface of Rk = tanh(2.65 relk) with training error, Ek, set to the mean cross-entropy. ..... 92
3.12 The 3-dimensional surface of Rk = tanh(2.65 relk) but with training error, Ek, dampened. ..... 93
3.13 A model for the Task Rehearsal Method. ..... 101
4.1 The Band domain shown within its 2-variable input space. ..... 113
4.2 The training, validation and test example sets for the primary task of the Band domain depicted within the 2-variable input space. ..... 115
4.3 Experiment 1: The inductive bias provided by each secondary task of the Band Domain. ..... 119
4.4 Experiment 1: STL results on the Band domain. ..... 123
4.5 Experiment 1: MTL results on the Band domain. ..... 124
4.6 Experiment 2: The inductive bias provided by groups of secondary tasks of the Band Domain. ..... 126
4.7 Experiment 2: Learning under MTL with R0 = R5 = R6 = 1.0 while R1-4 = 0.0. ..... 128
4.8 Experiment 2: Learning under MTL with R0 = R2 = R5 = 1.0 while all others are 0.0. ..... 129
4.9 Experiment 3: The effect of manually varying measures of relatedness on the Band domain. ..... 132
4.10 Experiment 4: The learning effectiveness of ηMTL on the Band domain. ..... 140
4.11 Experiment 4: The learning efficiency of ηMTL on the Band domain. ..... 140
4.12 Experiment 4: ηMTL, with Rk based on the static |r| measure. ..... 146
4.13 Experiment 4: ηMTL, with Rk based on the static MI measure. ..... 147
4.14 Experiment 4: ηMTL, with Rk based on the dynamic cos measure. ..... 148
4.15 Experiment 4: ηMTL, with Rk based on the dynamic cos measure. ..... 149
4.16 Experiment 4: ηMTL, with Rk based on the hybrid |r| + cos measure. ..... 150
4.17 Experiment 4: ηMTL, with Rk based on the hybrid |r| + cos measure. ..... 151
4.18 Experiment 4: ηMTL, with Rk based on the hybrid MI + cos measure. ..... 152
4.19 Experiment 4: ηMTL, with Rk based on the hybrid MI + cos measure. ..... 153
4.20 Experiment 5: Sensitivity of hypothesis development to the dynamic c parameter. ..... 156
4.21 The ideal 11-10-8 network configuration for all tasks of the Logic domain. ..... 160
4.22 Experiment 6: The inductive bias by each secondary task of the Logic Domain using a 11-6-8 network. ..... 164
4.23 Experiment 6: The inductive bias by each secondary task of the Logic Domain using a 11-32-8 network. ..... 165
4.24 Experiment 7: Learning effectiveness on the Logic domain using a 11-6-8 network. ..... 171
4.25 Experiment 7: Learning effectiveness on the Logic domain using a 11-32-8 network. ..... 172
4.26 Experiment 7: STL results on the Logic domain. ..... 175
4.27 Experiment 7: MTL results on the Logic domain. ..... 176
4.28 Experiment 7: ηMTL, with Rk based on the dynamic cos measure. ..... 177
4.29 Experiment 7: ηMTL, with Rk based on the static |r| measure. ..... 178
4.30 Experiment 7: ηMTL, with Rk based on the hybrid |r| + cos measure. ..... 179
4.31 Experiment 8: Sensitivity of hypothesis development to number of training examples. ..... 182
4.32 Experiment 9: Sensitivity of hypothesis development to the dynamic c parameter. ..... 184
5.1 Impoverished training examples for the 7 tasks of the Band domain within their 2-variable input space. ..... 190
5.2 Experiment 10: Learning effectiveness of hypotheses developed by STL for all 7 tasks of the impoverished Band domain. ..... 192
5.3 Experiment 10: Classification of a test set for the (a) task T3 and (b) task T0 by STL hypotheses developed directly from the impoverished training examples. ..... 193
5.4 Experiment 11: Comparison of TRM sequential learning trials on the impoverished Band domain. ..... 199
5.5 Experiment 11: Sequential learning of T3 under TRM and MTL. ..... 204
5.6 Experiment 11: Sequential learning of T3 under TRM and ηMTL with Rk based on the hybrid |r| + cos measure of relatedness. ..... 205
5.7 Experiment 11: Sequential learning of T0 under TRM and MTL. ..... 206
5.8 Experiment 11: Sequential learning of T0 under TRM and ηMTL with Rk based on the hybrid |r| + cos measure of relatedness. ..... 207
5.9 Experiment 11: Classification of a test set by STL, MTL and ηMTL hypotheses for task T3 of the impoverished Band domain. ..... 208
5.10 Experiment 11: Classification of a test set by STL, MTL and ηMTL hypotheses for task T0 of the impoverished Band domain. ..... 209
5.11 An example of the ability of the TRM to generate accurate virtual examples from impoverished Band domain training sets. ..... 211
5.12 Relearning of T0 of the Band domain, under the TRM but with only the 10 training examples that have target values. ..... 213
5.13 A comparison of hypotheses developed for T0 of the Band domain from Experiments 4 and 11. ..... 213
5.14 Experiment 12: STL learning effectiveness for all 8 tasks of the impoverished Logic domain. ..... 217
5.15 Experiment 13: Bar graphs of learning effectiveness on the impoverished Logic domain. ..... 222
5.16 Experiment 13: Sequential learning of T1 under TRM and MTL. ..... 227
5.17 Experiment 13: Sequential learning of T1 under TRM and ηMTL with Rk based on the hybrid |r| + cos measure of relatedness. ..... 228
5.18 Experiment 13: Sequential learning of T1 under TRM and ηMTL with Rk based on the dynamic cos measure of relatedness. ..... 229
5.19 Experiment 13: Sequential learning of T0 under TRM and MTL. ..... 230
5.20 Experiment 13: Sequential learning of T0 under TRM and ηMTL with Rk based on the hybrid |r| + cos measure of relatedness. ..... 231
5.21 A comparison of hypotheses developed for T0 of the Logic domain under the TRM from different numbers of virtual examples per secondary task. ..... 233
6.1 Experiment 14: Inductive bias provided by each task of the CAD domain to the vamc task. ..... 243
6.2 Experiment 15: Performance results from sequential learning on the CAD domain tasks. ..... 247
6.3 Experiment 15: TRM sequential learning of the vamc task using ηMTL and the |r| + cos measure of relatedness. ..... 251
6.4 Experiment 15: Detailed performance statistics for vamc hypotheses developed under the TRM. ..... 254
7.1 Learning effectiveness with a large number of unrelated secondary tasks. ..... 268
7.2 Learning effectiveness with a large number of related secondary tasks. ..... 270

Chapter 1 INTRODUCTION

Over the last two decades of the 20th Century, an important area of research associated with Artificial Intelligence has been concerned with the construction of computer systems that improve with experience. This field of research has become known as Machine Learning [Mitc97]. Much progress has been achieved in machine learning since the early 1980s. The most noticeable and practical result is the wide-spread application of machine learning software in science, business and industry. Commercial products that use machine learning software have helped create new terms such as data mining and intelligent agents and encouraged new enterprises engaged in activities such as automated knowledge discovery and user profiling. Consequently, machine learning has caught the imagination of a new generation of young scientists and professionals as well as corporate decision makers.

Despite these impressive results, from an academic perspective there is still much for machine learning to accomplish. If, ultimately, the goal is to develop a machine that is capable of learning at the level of the human mind, the journey has only just begun. This thesis takes one step in that journey by exploring an outstanding question in machine learning that deserves further attention: How can a learning system retain learned knowledge and use that knowledge to facilitate future learning? A better understanding of this question and the testing of possible solutions is at the heart of this dissertation.

This introductory chapter is divided into five sections. Section 1.1 provides an overview of the research problem. Section 1.2 defines the general objectives of the research. Section 1.3 presents the motivation for this dissertation. Section 1.4 describes the approach that was taken. Finally, Section 1.5 provides an overview of the subsequent chapters and the structure and flow of the document.


1.1 Overview of Problem

The vast majority of machine learning research has focused on the tabula rasa approach of inducing a model of a classification task from a set of supervised training examples. Consequently, most machine learning systems do not take advantage of previously acquired task knowledge when learning a new and potentially related task. Unlike human learners, these systems are not capable of sequentially improving upon their ability to learn tasks. The ability to learn sequences of tasks would be of benefit from both a theoretical and practical perspective. Learning theory tells us that the development of a sufficiently accurate model from a practical number of examples depends upon an appropriate inductive bias, one source of which is knowledge that can be derived from the models of previously learned tasks [Mitc80]. We will refer to this knowledge as prior task knowledge. Furthermore, a learning system that makes use of prior knowledge can train more efficiently and require fewer training examples [Baxt95b]. Currently, there is no adequate theory of how task knowledge can be retained and then selectively transferred when learning a new task [Thru97a, Caru97a].

From a practical perspective, many applications of machine learning systems, such as data mining and intelligent agents, suffer from a deficiency of training examples and could benefit from the use of prior task knowledge. For example, a more accurate medical diagnostic model could be developed from a small sample of patients if related diagnostic models were available and accessible to the learning system. Alternatively, the user profile for a new email user could be learned more rapidly if prior knowledge of similar user profiles were considered by the learning component of the mail tool.

1.2 Research Objectives

This thesis investigates the retention and integration of task knowledge after it has been induced and its selective recall and use when learning a new task. Integration of prior task knowledge will be referred to as consolidation. Selective recall and use of prior knowledge will be referred to as transfer. The thesis focuses on systems of artificial neural networks (ANNs) that are capable of sequentially learning a series of tasks from the same problem domain using knowledge consolidation and transfer.

The research reported here has two objectives. The first objective is to develop a theoretical model of knowledge transfer that uses previously induced task knowledge to minimize the number of training examples required for learning a new task to an acceptable level of generalization accuracy and to decrease the training time for that task. The second objective is to build a prototype system that tests the theoretical model. The prototype system will be tested against specially designed synthetic domains of tasks to verify the

theory. The system will also be applied to a practical decision making problem in the field of medicine where there is an important need for the accumulation and use of domain knowledge.

1.3 Motivation

This research proposes new methods of sequential consolidation and transfer of task knowledge within the context of artificial neural networks and tests these methods against synthetic and real-world problem domains. Primarily, the perspective is that of computer and cognitive science. However, it is unavoidable to consider the results from other fields that have contributed to the development of machine learning theory and ANN learning algorithms. Therefore, motivation comes from several fronts: cognitive science, psychological and physiological evidence, advances in computational learning theory, and the desire to solve real-world problems.

Cognitive Science. The question of how the human mind is able to store and later utilize acquired task knowledge is central to this research. As a child, a person first learns to drive a scooter and then a tricycle. Later, he learns his balance on a bicycle and the rules for driving on a public highway. As a young adult the person learns to drive various motorized vehicles such as cars and motorcycles. By the time the person is 30 years of age he has acquired and consolidated knowledge concerning the control of a number of vehicles. The sequence of task learning, from simple to complex, provides the learner with not only knowledge of how to drive each machine but also knowledge of the commonalities and differences across a domain of vehicles. This domain knowledge can be used to learn more easily how to drive any new vehicle. Persons who have acquired a vast array of domain knowledge and are able to use it effectively when solving new problems are often referred to as experts. If a machine learning system is to emulate the ability of human learning to use domain knowledge, it must have a method of acquiring such knowledge in the first place. It makes sense that this method be, itself, a learning process or, better stated, a meta-learning process [JL83, Holl89].
As in the case of a child, a learning system should start as a pure inductive learner having no background knowledge of the problem domain. This makes for a slow and unsure beginning that is typified by trial and error. However, as tasks are learned successfully, knowledge of the domain increases and the learning system should no longer pursue naive models that clearly fly in the face of previous experience. Therefore, motivation for this research comes from the desire to create machine learning systems that benefit from learning many different tasks over a lifetime.

Psychological and neurophysiological evidence. Psychological evidence [Harl49, Marx44, Ward37] indicates that the effectiveness and efficiency of the mammalian brain in learning a new task is closely related to knowledge of similar tasks. If the task is similar to previous tasks, a positive transfer of task knowledge will occur. If the task is dissimilar to previous experience, it is likely that a negative transfer of knowledge will occur. Furthermore, there is an abundance of evidence that during the learning process humans and animals develop not only specific discriminative models but also a sensitivity to similar structural relations among the input stimuli [Keho88]. Research led by James McClelland has influenced the research direction of this thesis. In [McCl94], McClelland discusses the process of memory consolidation:

"... we suggest that the neocortex may be optimized for the gradual discovery of the shared structure of events and experiences, and that the hippocampal system is there to provide a mechanism for the rapid acquisition of new information without interference with previously discovered regularities. After this initial acquisition, the hippocampal system serves as teacher to the neocortex: That is, it allows for the reinstatement in the neocortex of representations of past events, so that they may be gradually acquired by the cortical system via interleaved learning. We equate this interleaved learning process with consolidation, and we suggest that it is necessarily slow so that new knowledge can be integrated effectively into the structured knowledge contained in the neocortical system."

In addition, there is evidence that, by way of bi-directional connections, the neocortex provides information to the hippocampus during short-term learning that may positively influence the generation of episodic models [Squi92].

Advances in computational learning theory. Theoretical developments point to the need for inductive bias during learning [Mitc80].
In fact, without a source of guidance during the learning process, there is little hope of ever producing a machine that can learn most real-world tasks. Domain knowledge has been recognized as a major source of inductive bias. The challenge is to discover a method for retaining knowledge from previous learning experiences and for using that knowledge selectively to benefit future learning. Researchers like Paul Utgoff feel that the search for the most appropriate inductive bias is a fundamental part of machine learning. As he discusses in [Utgo86]:

"An inductive concept learning program ought to conduct its own search for appropriate bias. Until programs have such capability, the search for appropriate bias will remain a manual task dependent on a human's ability to perform it.

This gap in understanding constitutes the largest weakness in current methods of machine learning of concepts from examples."

Motivation for this research comes from the desire to find ways to eliminate this weakness.

The desire to solve real-world problems. The complexity of many real-world problems means that the current theories of learning do little to ensure that the models that are developed can be relied upon. For example, pure inductive learning, as formalized by the Probably Approximately Correct (PAC) theory, dictates that for most real-world tasks a very large number of training examples must be used if a reliable model is to be found by a learning system (see Appendix B). The reality is that it is often difficult to find large numbers of examples. Practicalities such as cost, risk to human health, or privacy prohibit the collection of data sets of sufficient size. A common variant of this problem is an unbalanced set of data in which the number of examples for one class is substantially higher than for all other classes. Consequently, most statistical and machine learning studies are ill-posed on the basis that the sample size is smaller than is required to formulate an accurate hypothesis. Nonetheless, human experts have a great deal of success at developing complex models that demonstrate good generalization. One reason for this is the experts' use of knowledge of the problem domain to constrain the search for a practical model. Therefore, a further motive for this research is to overcome the sample complexity requirement of pure inductive learning by acquiring and using domain knowledge.

Machine learning and medical decision making. The field of medicine is ripe for the application of machine learning technology [Scot93]. There have already been a number of successes using inductive decision trees, instance based learning methods, case based reasoning systems, and ANNs [Dets91].
A survey of medical and bio-medical journals since 1989 produced over 700 articles on the application of neural networks in biomedical research and clinical diagnosis (for examples see [Baxt91b, Baxt91a, Asti92, Akay93, Boon93, Daws94]). At the 1995 World Congress on Neural Networks in Washington, DC, there was a two-day session dedicated to the use and regulation of neural networks in medicine. At that session a major point was made that the complexities of medical science today are forcing the use of automated tools for clinical diagnosis [Burk95]. Furthermore, the non-linear interaction of various diagnostic attributes requires the use of sophisticated modelling systems. This is bringing about a shift from data-based analysis to model-based analysis. Typically, the complexity of medical decision making problems means that large numbers of training examples are required to develop reliable diagnostic models using pure inductive learning systems. But realistically, on a per doctor, or per hospital, or even per region

basis, there is an insufficient number of examples to meet this requirement. Therefore, a motive for this research is the desire to create better medical decision making systems that continually acquire and utilize domain knowledge for the development of more accurate diagnostic models.

1.4 Research Approach

As stated in Section 1.2, the objective of this thesis is to investigate the retention and integration or consolidation of task knowledge after it has been induced, and its selective recall or transfer to facilitate the learning of a new task. The focus is on systems of ANNs that are capable of sequentially learning a series of tasks from the same problem domain. The research effort began with a general definition of the problem of knowledge transfer along with a broad set of research objectives. The definition and objectives guided a thorough survey of existing literature and current research, covering:

- task and skill transfer, knowledge consolidation and analogical reasoning from the fields of artificial intelligence, psychology and neuroscience

- important areas of computational learning theory such as probably approximately correct (PAC) learning (see Appendix B) and inductive bias

- mathematical, statistical and psychological definitions of relatedness and similarity, particularly in regard to task knowledge

- specific features of ANNs such as catastrophic interference, stability and plasticity, network weight initialization, and multiple task learning; and

- participation with fellow researchers at related conferences and workshops.

From the breadth and depth of the survey information a more focused research objective was formulated. In addition, the survey information identified two fundamentally different approaches to knowledge transfer: representational transfer and functional transfer. A summary of the surveyed material and the formulation of the specific research objective is presented in Chapter 2. Pursuing, at first, representational transfer, the research took on an experimentalist approach made up of the following sequence of steps:

- the generation of knowledge transfer theory
- the development of a prototype sequential learning system


- experimentation using the prototype on synthetic and real-world domains of tasks

This same approach was subsequently used to investigate functional task knowledge transfer. In total, over 25 task knowledge consolidation and transfer experiments were conducted using two sequential learning systems on either synthetic task domains using computer-generated examples or real-world task domains using collected examples. The difficulties encountered with the representational transfer theory and experimental software are briefly discussed in Section 2.3.5. The more successful functional transfer theory, prototype system, and experiments are presented in Chapters 3 through 6. The importance of both representational and functional transfer and a strategy for integrating the two are discussed in Chapter 7.

1.5 Overview of the Dissertation

The remainder of the thesis is organized into eight chapters. Chapter 2 begins with a summary of the relevant background material on inductive learning, inductive bias and ANNs. Then a survey of recent work on task knowledge transfer and sequential learning in the context of ANNs is presented. The background and survey material is consolidated into a set of major open research questions. Finally, based on a subset of the research questions, the objectives and scope of the research are specified.

Chapter 3 develops a theory of functional knowledge transfer that uses a modified version of the multiple task learning (MTL) ANN method, called ηMTL, and the concept of task rehearsal based on the generation of virtual examples. The theory requires that three requirements be satisfied: (1) that there exists a method of sequentially retaining and transferring functional task knowledge in the form of virtual examples; (2) that ηMTL provides a framework for employing a measure of task relatedness for the selective transfer of knowledge; and (3) that suitable measures of relatedness can be found for a domain such that appropriate transfer occurs. The chapter explores various solutions to each of these requirements and proposes a suitable test domain. The chapter finishes by describing a prototype software system that has been developed.

In Chapter 4, solutions to the second and third requirements of the theory of selective functional transfer are tested on two synthetic domains of tasks of varying degrees of relatedness. Using the prototype system, ηMTL is tested as a framework for selectively transferring knowledge from secondary source tasks to a primary task by manually manipulating the measure of relatedness for each secondary task. Experiments are then conducted on both domains using as many as eight different automated measures of relatedness. The results from the experimental runs are compared to one another and to results by single task

learning and standard multiple task learning. The important findings are discussed in the last portion of the chapter.

In Chapter 5, the Task Rehearsal Method (TRM) of sequential learning is tested as a solution to the first subproblem of the theory of functional knowledge transfer. Using the prototype system, experiments using TRM are conducted on the two synthetic domains of tasks introduced in Chapter 4. The prototype system's ability to sequentially retain, consolidate and transfer knowledge is tested using MTL and ηMTL and several measures of relatedness. The results are compared to single task learning for each task of the sequence.

Chapter 6 reports on the application of functional transfer to a real-world medical domain. The subject is the diagnosis of coronary artery disease. Sequential learning of seven tasks tests the ability of the functional transfer prototype to consolidate and transfer knowledge from previously learned tasks.

Chapter 7 discusses the outcomes of the experiments reported in the previous chapters, presents a number of advanced issues and open questions that arise from the research, and considers selective functional transfer using other machine learning systems.

Chapter 8 concludes with a summary of the objectives and approach of the research, a list of the important findings and contributions made, and suggestions for future work. Following Chapter 8, a series of appendices provides a glossary of acronyms and terminology used throughout the document and detailed information on important theory and mathematics discussed in various chapters.
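The selective-transfer mechanism summarized above (a separate learning rate for each secondary task, scaled by its relatedness to the primary task) can be sketched in a few lines. This is an illustrative toy, not the dissertation's ηMTL implementation: for brevity each task output here has its own linear weights, whereas the actual method operates on a back-propagation network with a shared hidden layer, and the relatedness values R are simply supplied by the caller rather than measured.

```python
# Toy sketch of a relatedness-scaled multi-task update.
# W[k] are per-task linear weights, targets[k] the per-task labels,
# R[k] an assumed relatedness in [0, 1] (R[0] = 1.0 for the primary
# task), and eta the base learning rate. Each secondary task's error
# signal is scaled by eta * R[k], so unrelated tasks contribute
# little or no inductive bias.

def relatedness_scaled_step(W, x, targets, R, eta=0.1):
    """One squared-error gradient step for every task output."""
    new_W = []
    for k, w in enumerate(W):
        y = sum(wi * xi for wi, xi in zip(w, x))   # linear output
        err = targets[k] - y
        rate = eta * R[k]                          # selective transfer
        new_W.append([wi + rate * err * xi for wi, xi in zip(w, x)])
    return new_W
```

With R[k] = 0 an unrelated secondary task is effectively switched off, while R[k] = 1 recovers the uniform learning rate of standard multiple task learning.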

Chapter 2
BACKGROUND AND PROBLEM FORMULATION

This chapter presents a summary of relevant background material followed by formulation of the research problem and the objective and scope of the research effort. The first section presents foundational material on inductive learning, inductive bias and artificial neural networks. The second section summarizes a survey of literature on task knowledge transfer and sequential learning within the context of ANNs. The third section consolidates the background and survey material and identifies several key research issues. The final section defines the research problem and the objective and scope of the dissertation.

2.1 Background on Inductive Learning and ANNs

A research program concerned with knowledge transfer and sequential learning in the context of neural networks requires foundational background in inductive and analogical reasoning, computational learning theory, and artificial neural networks. This section presents a summary of relevant material in each of these areas.

2.1.1 The Framework of Inductive Learning

Many phenomena in the world can be expressed as functions which map from a set of input variables, or attributes, to an output variable or a set of output variables. For example, the amount of electrical current flowing through a wire is a function of the impedance of the components in the circuit and the voltage applied. One type of function is a classification task or classifier, f, which maps a set of input values, x, to one of a discrete set of target output values, f(x), normally referred to as classes or categories. For example, the presence of coronary artery disease can be classified based on clinical attributes such as age, gender, and blood pressure, as well as the findings of diagnostic tests that check for abnormal heart rhythm during states of stress and rest. When there are only two classification values, such as "has disease" or "does not have disease", a classification task is sometimes referred to as a concept. This research will concern itself primarily with concept learning.
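A concept in this sense is simply a two-valued function of the input attributes. As a toy illustration (a made-up boolean concept, not the medical task above), a concept over three boolean attributes and its full set of labelled examples might be:

```python
# A toy concept f over three boolean attributes: the class is positive
# exactly when x1 is true and at least one of x2, x3 is true.
def f(x1, x2, x3):
    return int(x1 and (x2 or x3))

# Enumerate the whole instance space X with supervised target classes.
examples = [((x1, x2, x3), f(x1, x2, x3))
            for x1 in (0, 1) for x2 in (0, 1) for x3 in (0, 1)]
```

A training sample in the sense of the next paragraphs is just a random subset of such (input, target) pairs.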

[Figure 2.1 appears here: a block diagram in which the environment supplies training examples to the inductive learning system and testing examples to the induced model of the classifier, which in turn produces the output classification.]

Figure 2.1: The basic framework for inductive learning.

Inductive learning is an inference process which constructs a model or hypothesis, h, of a classification task, f, from a set of training examples. Inductive learning is said to be supervised when each example includes its correct target class as a training signal. This research will deal strictly with supervised inductive learning; henceforth, the use of the word "learning" is meant to imply this supervised format. Figure 2.1 presents the basic framework in which a supervised inductive learning system may be positioned. There is a source of error-free classification examples referred to as the environment. Let X be an instance space of possible input values xi with some fixed probability distribution of occurrence, D. To learn a concept task f, that is a function of the inputs, a random sample according to D is taken from X to create a set of training examples:

S = ((x1, f(x1)), (x2, f(x2)), ..., (xm, f(xm))).

Each training example, i, must consist of a set of input attribute values, xi, as well as a target class value, ti, such that ti = f(xi). The objective of an inductive learning system, L, is to select or develop the hypothesis, h (induced model of classifier), using the training examples, S, such that h approximates the actual task below a desired level of error, ε. Let the true error of the hypothesis be defined as

e(h, f) = D(xi ∈ X | h(xi) ≠ f(xi)).

Then formally,

S ∧ xi ≻ h(xi)   and   e(h, f) < ε

where a ≻ b indicates that b is inductively inferred from a. The expression states that the learning system should use the available training examples to induce a hypothesis, h, such that the portion of examples in X misclassified by h is less than ε. Preferably, as the number of examples m (referred to as the sample size) increases, L should develop a hypothesis h of increasing accuracy. Clearly, if all examples in X are seen by the learning system, given sufficient representation, the system should be able to construct a perfect h, such that h(xi) = f(xi) for all xi ∈ X. The degree of approximation for a hypothesis can be estimated by its classification performance against a set of previously unseen test examples drawn from environment X according to distribution D. This estimate, known as the empirical or generalization error, is some measure e′(h(xi), f(xi)) for all (xi, f(xi)) in the test set. Preferably, the error, e′(h, f), across all test examples is less than or equal to some previously agreed upon value. A hypothesis with good classification performance on test examples is said to exhibit good generalization or high generalization accuracy. Consider a simple learning system which has available a hypothesis space H containing a finite number of distinct hypotheses

H = {h1, h2, h3, ..., hn}.

Assume a training sample S containing m examples of some task f. For each h and for each example of S compare the output of the hypothesis to the supervised target class. If the hypothesis disagrees with any target class, then reject it. A good choice of hypothesis is one that agrees, or is consistent, with the entire sample S, if such a hypothesis exists. Constructing a hypothesis with good generalization is not simply a matter of memorizing the training examples. This would produce an effect known as over-training which is analogous to over-fitting a complex curve to a set of data points. The model becomes too specific to the training examples and does poorly on a set of test examples. Instead, the inductive learning system must discern from the training examples those global regularities which properly discriminate between the classes. This raises an important question: How many examples are required by a learning system before it can ever hope to produce a hypothesis with good generalization? An answer to this was provided by Leslie G. Valiant in 1984 when he proposed the probably approximately correct, or PAC, theory of learning [Vali84].¹ The PAC model of learning characterizes training examples by their statistical properties, and measures the error in the hypothesis produced by a learning system in light of the same statistical properties.

[Footnote 1: This can be viewed as a probabilistic extension of E.M. Gold's identification in the limit paradigm [Gold67].]
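The elimination procedure just described can be made concrete. The sketch below uses an assumed toy hypothesis space (all conjunctions over subsets of three boolean attributes), keeps only the hypotheses consistent with a training sample S, and estimates the error of a surviving hypothesis by exhaustive evaluation over the instance space:

```python
from itertools import product

# Hypothesis space H: one conjunction per subset of three boolean
# attributes (the empty conjunction always predicts positive).
def make_h(mask):
    return lambda x: int(all(x[i] for i in range(3) if mask[i]))

H = [make_h(mask) for mask in product((0, 1), repeat=3)]

def consistent(h, S):
    """Keep h only if it agrees with every training example in S."""
    return all(h(x) == t for x, t in S)

def empirical_error(h, test):
    """Fraction of test examples misclassified by h."""
    return sum(h(x) != t for x, t in test) / len(test)

# Target concept: x0 AND x1. Train on part of the instance space.
f = lambda x: int(x[0] and x[1])
X = list(product((0, 1), repeat=3))
S = [(x, f(x)) for x in X[2:]]          # six labelled training examples
survivors = [h for h in H if consistent(h, S)]
```

Here the six training examples eliminate all but the target conjunction itself; with fewer examples, several consistent hypotheses would remain and further bias would be needed to choose among them.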

Valiant defines algorithm L to be a probably approximately correct learning algorithm² for a classification task f in the hypothesis space H if, for

- a confidence level δ, such that 0 < δ < 1, and
- an error threshold ε, such that 0 < ε < 1,

there exists a positive integer m0, a function of (δ, ε), such that

- for any target concept f ∈ H, and
- for any probability distribution over the example space D(X),

whenever m ≥ m0, then D({S : e(h, f) < ε}) > 1 − δ. Thus, the minimum number of

examples m0 depends upon the values δ and ε, but never on the target classification task f or the unknown distribution D. This means that the desired level of confidence and error can be set even though the target task and the distribution of examples is unknown. Furthermore, Valiant showed that any finite hypothesis space H is potentially learnable and that the number of examples required to learn a consistent hypothesis h is given by:

m ≥ m0 = (1/ε) ln(|H|/δ)

where |H| is the number of hypotheses in the hypothesis space. This is an important theorem since it tells us how many examples are required to have a consistent learning algorithm achieve a level of confidence δ with an error rate less than ε. In fact, this theorem covers all Boolean hypothesis spaces {0, 1}^n for a fixed n. All such spaces are potentially learnable, and any algorithm which consistently learns a hypothesis can be considered PAC compliant.
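The bound is easy to evaluate numerically. The helpers below are illustrative utilities (not from the thesis); the second specializes the bound to the space of all Boolean functions of n inputs, for which |H| = 2^(2^n) and hence ln |H| = 2^n ln 2, making the required sample size grow exponentially in n:

```python
import math

def pac_sample_bound(h_size, eps, delta):
    """Minimum m0 >= (1/eps) * ln(|H| / delta) for a consistent
    learner to be probably (1 - delta) approximately (error < eps)
    correct over a finite hypothesis space of size |H|."""
    return math.ceil((1.0 / eps) * math.log(h_size / delta))

def boolean_space_bound(n, eps=0.05, delta=0.05):
    """Same bound for H = all Boolean functions of n inputs,
    computed via ln|H| = (2**n) * ln 2 to avoid huge integers."""
    ln_h = (2 ** n) * math.log(2.0)
    return math.ceil((ln_h + math.log(1.0 / delta)) / eps)
```

Even for modest n the Boolean-space bound is impractical, which motivates the restriction of the hypothesis space discussed in the next section.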

2.1.2 Inductive Bias and Prior Knowledge

In practice the PAC theory has its limitations. As with all computing engines, machine learning systems must perform using finite time and space resources. Although a PAC learning system is able to theoretically learn a hypothesis to a desired level of accuracy, it may require an unrealistic number of examples, amount of memory and time to compute. Therefore, in practice, a PAC learning system must avoid using an unreasonably large hypothesis space (such as the space of all Boolean equations of n variables) since the search of this space for an accurate hypothesis may be intractable.³ This is in conflict with the fact that most complex real-world problems have large hypothesis spaces.

[Footnote 2: For greater detail the reader is referred to Appendix B.]
[Footnote 3: For this reason, PAC research has concentrated on the discovery of domains (e.g. Boolean conjunctive normal form equations of k terms and n variables, CNF(n, k)) which can be learned to a level of generalization error ε in polynomial time. Such domains are said to be efficiently PAC, or EPAC, learnable.]


For real-world problems, some strategy, or heuristic, must be employed to "intelligently" restrict the hypothesis space, H, thereby making the search process computationally efficient [Hert91]. Any constraint of a learning system's hypothesis space, beyond the criterion of consistency with the training examples, is called inductive bias. All learning systems have a bias that favours certain types of hypotheses over others. ANN learning algorithms naturally favour hypotheses with weight values that are small in magnitude. Inductive decision tree learning algorithms tend to favour hypotheses that are represented by small trees instead of those that are represented by larger, bushier trees. Practitioners have been forced to take an ad hoc approach to the application of machine learning systems: first one algorithm is used (e.g. a neural network) and then another (e.g. an inductive decision tree) in the hopes of finding the appropriate inductive bias for the problem. It has been shown on several occasions that there cannot exist a universal inductive learning algorithm, be it biological or machine, which can perform equally well on all classification tasks for some example space X [Roma94]. Instead, every learner utilizes an inductive bias which favours it in some domains while handicapping it in others. Inductive bias is essential for the development of a hypothesis with good generalization in a tractable amount of time and from a practical number of examples [Mitc80, Mitc97]. Inductive bias characterizes the methods a learning system uses to develop a generalized model of a task, beyond the information found in the available training examples. The inductive bias of a learning system can be considered a learning system's preference for one hypothesis over another. This preference produces a partial ordering over the hypothesis space of the learning system that can be used to reduce the average time to find a sufficiently accurate hypothesis.

Definition: Formally, as per [Mitc97], we define the inductive bias of a learning system L to be the set of assumptions B such that for a set of training examples, S, from an instance space X for a concept f(xi):

B ∧ S ∧ xi ⊢ h(xi)

where a ⊢ b indicates that b can be logically deduced from a. This expression defines inductive bias as the set of additional assumptions B that justifies the inductive inference of a learning system as a provable deductive inference. The inductive bias can be considered correct if (∀xi ∈ X) e(h, f) = 0. Currently, in most learning systems the inductive bias remains fixed. For example, an inductive decision tree system with a preference for smaller trees orders the list of possible hypotheses starting with the smallest possible tree of only one node. Ideally, a learning system is able to change its inductive bias to tailor its preference for hypotheses according

to the task being learned. The ability to change inductive bias requires the learning system to have prior knowledge about some aspect or aspects of the task. Furthermore, it suggests that the accumulation of prior knowledge at the inductive bias level is a useful characteristic for any learning system. Acquiring and using prior knowledge as a source of inductive bias is one of the unsolved problems in learning theory. However, some advances have been made over the last 20 years. Five major classes of inductive bias used by intelligent learners are cited in [Mitc80]: universal heuristics, knowledge of intended use, knowledge of the source, analogy with previously learned tasks, and knowledge of the task domain. All are forms of prior knowledge used to facilitate the search of hypothesis space.

Universal Heuristics. Universal heuristics for learning are methods of bias which force a priori conditions on the induction process independent of the task or problem domain. The most accepted and widely applied universal heuristic is the Law of Economy, often referred to in machine learning as Occam's Razor. The name derives from William of Ockham (1285-1349) who first stated "non sunt multiplicanda entia praeter necessitatem", which can be translated as "entities are not to be multiplied beyond necessity" or "plurality should not be assumed without necessity". This is congruent with the common belief that the simplest explanation of observed facts is often the best explanation. The work of Kolmogorov, Solomonoff, and Levin [Li92, Anth92], Blumer et al. [Blum87], and Rissanen [Riss78, Riss89] has given formal support to this intuition by showing that the optimal method of representing a series of examples is based on a minimization of the description length of the hypothesis. For machine learning, Occam's Razor suggests an a priori strategy for selecting a hypothesis of greatest generalization. It places a cost on the complexity of each potential model of a task, with the more complex models having the greater cost. Thus, the optimum hypothesis is one which minimizes its complexity of structure while maximizing its accuracy of "fit" to the training examples. In ANNs, complexity has often been equated to the number of connections in the network. By reducing the number of connection weights, the hypothesis space H of the network is reduced. Therefore, the theory has been that small networks that fit the training data well are simpler than large networks and, therefore, stand the better chance of generalization. Recently, researchers have recognized problems with interpretations of Occam's Razor [Domi98]. Significant to our discussion is the realization that the amount of structure or representation that a learning system has is only one constraint that can be used to limit the size of the system's hypothesis space.
For example, the complexity of an inductive system's hypothesis space can be reduced by limiting the range of its

15 representational values. A neural network with certain connection weights that are limited to positive values has a simpler hypothesis space than the same network with unlimited connection weight values.
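This last point can be made concrete with a toy enumeration (our illustration, not an experiment from the thesis): restricting a two-input threshold unit to non-negative input weights shrinks the set of boolean functions it can realize from the 14 linearly separable functions to the 6 monotone ones.

```python
import itertools

def truth_table(w1, w2, b):
    # Threshold unit: output 1 when w1*x1 + w2*x2 + b > 0.
    return tuple(int(w1 * x1 + w2 * x2 + b > 0)
                 for x1, x2 in itertools.product([0, 1], repeat=2))

grid = [x / 2 for x in range(-8, 9)]          # weights in [-4, 4], step 0.5
unconstrained = {truth_table(w1, w2, b)
                 for w1 in grid for w2 in grid for b in grid}
nonnegative = {truth_table(w1, w2, b)
               for w1 in grid if w1 >= 0
               for w2 in grid if w2 >= 0
               for b in grid}

print(len(unconstrained), len(nonnegative))
```

Every function representable with non-negative input weights is still representable by the unconstrained network, but not vice versa: the sign restriction has reduced the effective hypothesis space.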

Knowledge of intended use. A learner can benefit from knowledge of how the chosen hypothesis will be used. For example, in many medical diagnostic situations it is of less consequence to predict a false positive (indicating disease where there actually is none) than to predict a false negative (indicating no disease where disease is actually present).
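Such knowledge can be encoded as asymmetric misclassification costs when selecting a decision threshold for a hypothesis. The sketch below uses invented costs and predicted probabilities (purely illustrative, not data from the thesis):

```python
# Choose the probability threshold that minimizes expected misclassification
# cost when a false negative is costlier than a false positive.
cost_fp, cost_fn = 1.0, 10.0   # missing disease is 10x worse (invented costs)

# (predicted probability of disease, true label) -- invented cases
cases = [(0.05, 0), (0.20, 0), (0.35, 1), (0.50, 0), (0.70, 1), (0.90, 1)]

def expected_cost(threshold):
    cost = 0.0
    for p, y in cases:
        pred = int(p >= threshold)
        if pred == 1 and y == 0:
            cost += cost_fp        # false positive
        elif pred == 0 and y == 1:
            cost += cost_fn        # false negative
    return cost

best = min([t / 100 for t in range(1, 100)], key=expected_cost)
print(best, expected_cost(best))
```

Because false negatives are penalized heavily, the minimizing threshold sits well below 0.5, biasing the classifier toward predicting disease when in doubt.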

Knowledge of the source. If the learner knows or suspects that a curriculum is being used, then this information can be used to make logical inferences based on previous learning. Assumptions shared by the teacher and the learner act to constrain the space of available hypotheses.

Analogy with previously learned tasks. If several related tasks have been previously learned, the learner can draw upon constraints inferred from those tasks when learning a new and related task. For example, if a particular input attribute played a negligible role in all of the previous tasks, then its inclusion in the current learning process might be considered of little value. Alternatively, if 3 hidden units of a neural network have been sufficient for learning all previous tasks, then it is likely that 3 will be sufficient for the next task.

Knowledge of the task domain. It is sensible to assume that for all tasks, or at least groups of tasks within a domain, there will exist common constraints which can be brought to bear on any one task. For example, gravity must be considered in playing all games on dry land involving a ball. Thus, to learn a new game there is no reason to consider hypotheses which allow a ball to travel through the air forever without the influence of gravity.

The last two classes of inductive bias identified by Mitchell, analogy with previously learned tasks and knowledge of the task domain, are of primary concern to our research; henceforth they are jointly referred to as task domain knowledge.

2.1.3 Knowledge Based Inductive Learning

We define knowledge based inductive learning, or KBIL, as a learning method which relies on prior knowledge of the problem domain to reduce the hypothesis space which must be searched. Figure 2.2 provides the framework for knowledge based inductive learning. Domain knowledge is a database of accumulated information which has been acquired from previously learned tasks. The intent is that domain knowledge can be used to bias a pure inductive learning system in a positive manner such that it trains in a shorter period of time and produces a more accurate hypothesis with fewer training examples. In turn, new information is added to and potentially consolidated within domain knowledge following the learning of each task. Michalski, in his Inferential Theory of Learning, refers to this cycle of learning and consolidation as constructive inductive learning [Mich93]. In the extreme, where the new classification task to be learned is exactly the same as one learned at some earlier time, the inductive bias should provide rapid convergence to an accurate hypothesis from a minimum number of examples.

Figure 2.2: The framework for knowledge based inductive learning.

Shifting Inductive Bias

During the 1980s, symbolic machine learning methods were developed which used domain knowledge. In his PhD thesis Machine Learning of Inductive Bias [Utgo86], Paul Utgoff provides some of the earliest material on inductive bias. The thesis states that the search for a better inductive bias is a fundamental part of learning and that this search can be mechanized. Utgoff is adamant that machine learning should direct its attention toward learning systems that are able to adjust their inductive biases. He states: "This gap in understanding constitutes the largest weakness in current methods of mechanical learning of concepts from examples". He equates the psychological term learning to learn to the ability of a learning system to discover an appropriate inductive bias. Aspects of inductive bias are defined. A strong bias focuses the learning system on a

relatively small subset of hypotheses whereas a weak bias allows a relatively large subset. A correct bias is one which allows the learning system to discover the correct hypothesis, whereas an incorrect bias does not. The best learning system is one that is able to select a strong and correct inductive bias for any task. Utgoff's thesis focuses on how incorrect bias is identified and changed through a process called RTA: Recommend a shift in bias, Translate into representation formalisms, and Assimilate newly formed hypotheses. Five simplifying assumptions are made. The research focuses on the problem of shifting bias and not selecting an initial bias. Inductive bias is represented as a restricted hypothesis space defined by a constrained formal symbolic language. The research considers only those shifts of bias that weaken a strong bias by adding new elements to the language. Bias is shifted when the space of hypotheses is found not to include any hypothesis consistent with the training examples. The approach is demonstrated in a program called STABB which shifts the bias of a symbolic learning system called LEX.

Explanation Based Learning

One of the most widely studied methods of adjusting inductive bias is Explanation Based Learning (EBL), popularized by [Mitc86]. EBL systems extract general, explanatory rules from examples and record them in a domain knowledge database. During the learning of a new function, EBL rules are used to generate candidate hypotheses. This is not so much an inductive method as it is a derivational approach. The key to success with EBL is developing sufficiently general rules and defining the class of functions for which the same rules can be used. The basic EBL method can be advanced by combining it with an inductive learning component. The domain rules can be used to impose additional constraints which serve to reduce the effective hypothesis space from which a model will be selected by the inductive learning component [Dany89, Moon89].
Cooperatively, the inductive component is able to generate new rules which augment the domain knowledge.

Determinations

Davies and Russell define a related form of domain knowledge influence called determinations [Davi87]. The intent is to acquire knowledge of the problem domain in the form of functional dependencies which act as relevant guidelines for future inductive learning. Determinations specify constraints, such as "all chairs are less than 2 feet tall", to restrict the choice of possible future hypotheses concerning furniture. Such dependencies are learned by using a search algorithm to select the smallest but consistent set of attribute conditions

which hold for a particular target class.

Inductive Logic Programming

Inductive Logic Programming (ILP) is a major area of research for domain knowledge based inductive learning [Mugg91, Quin90]. One of the most promising methods used in ILP is inverse resolution, which takes the statement

    domain knowledge ∧ hypothesis ∧ input attributes ⊢ target class

and creates a backwards proof in search of an appropriate hypothesis. This has great potential since it has the ability to construct new predicates not supplied by training examples. However, the challenge has been finding heuristics for the tractable construction of the backwards proof.
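The determinations-style search described earlier, selecting a smallest consistent set of attribute conditions for a target class, can be sketched as a breadth-first search over attribute subsets. The furniture records and attribute names below are invented for illustration and are not taken from [Davi87]:

```python
from itertools import combinations

# Toy records: (attributes, is_chair). Invented for illustration only.
data = [
    ({"legs": 4, "height_ft": 1.5, "colour": "red"},  True),
    ({"legs": 4, "height_ft": 1.8, "colour": "blue"}, True),
    ({"legs": 4, "height_ft": 2.5, "colour": "red"},  False),  # table
    ({"legs": 3, "height_ft": 1.5, "colour": "blue"}, False),  # stool
]

def consistent(attrs):
    """True if the attribute subset separates positives from negatives,
    i.e. no positive and negative example share the same projection."""
    pos = {tuple(x[a] for a in attrs) for x, y in data if y}
    neg = {tuple(x[a] for a in attrs) for x, y in data if not y}
    return pos.isdisjoint(neg)

all_attrs = sorted(data[0][0])
# Searching subset sizes in increasing order yields a smallest consistent set.
smallest = next(set(c) for k in range(1, len(all_attrs) + 1)
                for c in combinations(all_attrs, k) if consistent(c))
print(smallest)
```

No single attribute separates chairs from non-chairs in this toy data, so the search settles on a two-attribute dependency, the kind of compact constraint a determination would record for later inductive learning.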

2.1.4 Learning to Learn Theory

At the 1995 Neural Information Processing Systems conference a workshop on knowledge transfer in inductive systems was held. The workshop was entitled "Learning to Learn: Knowledge Consolidation and Transfer in Inductive Systems" [Silv95a]. The workshop resulted in the generation of a seminal collection of articles covering task knowledge transfer [Thru97b]. The overview article for this text defines the term learning to learn. It sets the stage by defining the term learning according to [Mitc93]. Given:

- a task,
- a set of training examples, and
- a performance measure,

a computer system is said to learn if its performance at the task improves with the number of training examples. The term learning to learn is then defined as follows. Given:

- a family of tasks,
- a set of training examples for each task, and
- a performance measure for each task,

a computer system is said to learn to learn if its performance at each task improves with the number of training examples and with the number of tasks in the family. The implication is that by learning many tasks from the same family, prior knowledge about all tasks is both acquired and used as inductive bias. This definition requires that some form of knowledge is transferred between the tasks such that there is a positive impact on the performance of the hypotheses developed for the tasks. When task domain knowledge is used to bias an inductive learning system, a transfer of knowledge occurs from one or more source tasks to a target or primary task. Thus, the problem of selecting an appropriate bias is transformed into the problem of selecting the appropriate task knowledge for transfer. When the use of domain knowledge results in a more accurate hypothesis than could have been achieved with only the training examples, a positive inductive bias or positive transfer is said to have occurred. Conversely, a negative inductive bias or negative transfer of knowledge results in a hypothesis of lesser accuracy.

2.1.5 Analogical Reasoning

The transfer of task knowledge is closely linked to the concept of analogy and reasoning by analogy. This is by no means a new area of research, having been studied in the fields of philosophy, psychology, and linguistics for several decades. In artificial intelligence and machine learning, symbolic methods of analogical reasoning have been researched since the 1960s. An excellent survey of analogy is provided by Hall [Hall89]. Much of the following discussion relates to that article. Hall uses the term analogical mapping to describe the transfer of task knowledge from a source domain to a target domain. Hall specifies an abstract framework for analogy composed of four components:

- recognition: the identification or indexing of analogous source knowledge, given a target task description;

- elaboration: the process of transfer (or mapping) between the source knowledge and the target task, usually requiring some systematic extension of the source knowledge;

- evaluation: the appraisal of the transfer in terms of its usefulness to the target task; if necessary, the transfer must be rejected on the grounds that it hinders the learning process; and

- consolidation: the integration of the target task results back into the store of source knowledge.

Hall is careful to indicate that elaboration is the most important component of the analogy process, and that elaboration and evaluation are often a tightly-coupled and iterative subprocess. In [Hick90] a fifth component is added, that of gathering feedback on the success

of the analogical process and the associated effort (cost). This can be used to modify the other components. For the purposes of this thesis, analogy is defined as the process of transferring knowledge from a source of domain knowledge to a target task and the subsequent consolidation of new target knowledge back into the source. The transfer of task knowledge combines the recognition, elaboration, and evaluation components. It is important to distinguish between two classes of analogy, namely inter-domain analogy and intra-domain analogy. Inter-domain analogy is defined as analogy between source knowledge in one domain and a target task in another domain, whereas intra-domain analogy is defined as analogy between source knowledge and a target task in the same domain. The major difference between these two classes of analogy is that intra-domain analogy will always require elaborative extension, whereas inter-domain analogy may only require an appropriate attribute mapping. Although Hall is not specific about the class, for the most part he discusses inter-domain analogy. In contrast, this thesis is primarily concerned with intra-domain knowledge transfer and, therefore, aspects of intra-domain analogy.
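Hall's four components can be summarized as an abstract interface that any analogical learning system would have to realize. The sketch below is our construction for exposition, not code from [Hall89]; the method names and parameters are illustrative:

```python
# A minimal sketch (our illustration) of Hall's four-component framework
# for analogy as an abstract interface.
from abc import ABC, abstractmethod

class AnalogicalLearner(ABC):
    @abstractmethod
    def recognize(self, target_description, domain_knowledge):
        """Index into domain knowledge for analogous source knowledge."""

    @abstractmethod
    def elaborate(self, source_knowledge, target_task):
        """Map (and, if needed, systematically extend) source knowledge
        onto the target task."""

    @abstractmethod
    def evaluate(self, transferred_knowledge, target_task):
        """Appraise the transfer; reject it if it hinders learning."""

    @abstractmethod
    def consolidate(self, target_results, domain_knowledge):
        """Integrate target-task results back into the source store."""
```

Casting the framework as an interface makes the coupling explicit: elaborate and evaluate would typically be called in an iterative loop, exactly the tightly-coupled subprocess Hall describes.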

2.1.6 Artificial Neural Networks

ANNs use a method of inductive learning based on computational models of biological neurons and networks of neurons as found in the central nervous system of humans and animals. ANN modelling systems take advantage of massive numbers of parallel processing devices (typically the parallelism is simulated on a serial processor) which work cooperatively to solve a problem. Unlike traditional computer systems, the parameters which define the global computation of an ANN are distributed throughout the network and are inherently adaptive. For the above reasons, ANNs are often referred to as parallel distributed processing (PDP) systems.

The History of ANN Research

ANN research is as old as that of the modern computer. In 1890, Dr. William James defined the first theories of the neuronal process of learning, which inspired researchers in the early part of the century [Ande88]. During the Second World War, researchers such as McCulloch, Pitts, and Hebb developed the first mathematical models of neural networks and subsequently ran the first analog and digital computer simulations. This optimistic period was crowned by Frank Rosenblatt, who designed the Perceptron learning algorithm and proved a theorem for its convergence to a correct hypothesis. Unfortunately, in 1969, Marvin Minsky and Seymour Papert wrote a brilliantly authored book simply entitled Perceptrons which formally proved the limitations of Rosenblatt's learning algorithm when applied to

non-linear problems such as XOR. This resulted in a period of disenchantment which began in 1969 and lasted for approximately 15 years. In 1985 and 1986, several researchers independently discovered the solution to training more complex networks of Perceptrons. The dominant group was composed of Rumelhart, McClelland, Williams, and Hinton, who published an influential set of books on their ground-breaking research entitled Parallel Distributed Processing, or PDP [Rume86b]. The hallmark of the PDP Group was their interdisciplinary approach involving mathematics and theory of computation, psychology, philosophy, neuro-physiology, and traditional AI. Over the last 15 years, ANN research and development has exploded all over the world, with interest coming from educational institutions, industry, business, and the military. There are numerous applications, from bomb detection and medical diagnostic systems to "data mining" for knowledge in large corporate databases and user profiling within intelligent agents.


Figure 2.3: An artificial neuron.

The anatomy of an artificial neural network. There are a wide range of ANNs which have been developed; however, the basic structure of an artificial neuron remains the same (observe Figure 2.3). The basic function of the biological neuron is to integrate its inputs from other neurons and to generate an output value as a function of this input. Learning is believed to be achieved by modifying the effectiveness of the individual connections between the neurons. In our artificial neuron unit (which will be simulated by computer software), the effectiveness of an input, xi, from some other neuron is determined by the weight, wi, of the connection from that neuron.


Each unit has an additional input which is referred to as a bias, xb. This term bias should not be confused with the inductive bias defined earlier and discussed throughout the rest of the document. The input value for the bias is always a constant (typically 1); however, its weight, wb, is modified during the process of learning. Input integration is accomplished by an input function, which for a unit j is most often a simple summation, Ij = Σi xi wij. The output of the neuron j is produced by pushing the value of Ij through some activation function. The most commonly used activation is the sigmoid function given by

    yj = 1 / (1 + e^(-Ij))

where yj is the output of unit j. A graph of the sigmoid function is shown in Figure 2.3. The function maps its input to the interval (0,1), becoming asymptotic as the absolute value of Ij = Σi xi wij increases. The behaviour of an artificial neuron, and therefore an ANN, depends upon three fundamental aspects: (1) the input and activation functions of the unit (neuron structure), (2) the input connectivity from other neurons (network architecture), and (3) the weight on each of the input connections. Given that the first two aspects are fixed, the behaviour of the ANN is defined by the current values of the weights. We can, therefore, consider the connection weights to be a representation of any classification task we train the network to model. For example, Figure 2.4 shows two networks, each consisting of only one active neuron, which are identical in neuron structure and network architecture. However, their weight values differ such that network (a) models the logical OR function whereas network (b) models the AND function. Therefore, ANNs can be defined as networks of simple processors that store their task representation in the strengths of their connections.
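The computation of a single unit, summation of weighted inputs followed by the sigmoid activation, can be sketched directly. The weight values below are those quoted for the OR and AND networks of Figure 2.4:

```python
import math

def neuron(inputs, weights, bias_weight):
    # Ij = sum_i xi*wij, plus the bias input (constant 1) times its weight.
    total = sum(x * w for x, w in zip(inputs, weights)) + bias_weight
    return 1.0 / (1.0 + math.exp(-total))   # sigmoid activation

# Weight values quoted in Figure 2.4 for the OR and AND networks.
OR_W, OR_B = (5.39, 5.39), -3.22
AND_W, AND_B = (5.29, 5.29), -7.42

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    y_or = neuron(x, OR_W, OR_B)
    y_and = neuron(x, AND_W, AND_B)
    print(x, round(y_or, 3), round(y_and, 3))
```

Thresholding the outputs at 0.5 reproduces the OR and AND truth tables, illustrating that only the connection weight values distinguish the two functions.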
For the purposes of this research this definition is important because it dictates that, with all other factors being fixed, any form of inductive bias must ultimately affect the weights of the connections.

Multi-layer Feed-forward ANNs

Due to the mass of research and development over the last 10 years, there are a variety of ANN families. In this thesis the family of multi-layer feed-forward ANNs characterized by Figure 2.5 will be used. This network consists of an input layer and an output layer of neurons, as well as one or more hidden layers (although one hidden layer is all that will be considered in this research). The layers are connected in a strictly feed-forward fashion: input nodes to hidden nodes to output nodes. The network accepts continuously valued inputs and uses a global error signal as the primary source of supervisory feedback. To classify an example, the set of attribute values are presented to the input nodes. Each input node forwards the value on to all nodes in the hidden layer. The hidden nodes

[Figure 2.4 depicts two single-neuron networks over inputs X and Y with bias input b = 1: (a) a logical OR network with wX = wY = 5.39 and wb = -3.22; (b) a logical AND network with wX = wY = 5.29 and wb = -7.42.]
Figure 2.4: An example of how the same ANN can represent different functions depending on the value of the connection weights. The ANN models for the boolean logic OR and AND functions differ only in the values of the connection weights. The weights shown are taken from actual networks developed for the functions.

compute their activations and forward them on to all nodes in the output layer. The activation value(s) produced by the output node(s) indicate(s) the class of the example. One can view this type of network as re-representing the original input attribute values, first by the activation values of the hidden nodes and then by the activation values of the output nodes.


Figure 2.5: The multi-layer feed-forward network and the back-propagation algorithm.

To learn a task using ANNs of the type shown in Figure 2.5, the weights of the connections must be adjusted to produce the hypothesis with greatest generalization. The most widely used learning algorithm for this type of network is the back-propagation of error algorithm [Rume86b]. Although the differential calculus of the back-propagation algorithm is complex, the concept is simple. For each example or group of training examples, the error measure, E, between the actual output of the network and the target output is backward propagated from the output layer down through each of the hidden layers. The change in each weight is expressed in terms of the rate at which the error changes as the weight changes, ∂E/∂w. At each node, each incoming connection weight is adjusted to minimize the error contributed by that weight to the global error. The change in weight, Δw, is directly proportional to ∂E/∂w. Thus, the process of learning is one of iteratively presenting the training examples and making small weight changes to reduce the error. The algorithm stops when the error across all the training examples reaches some agreed upon minimum. This learning process can be described as gradient descent through weight space in search of a set of weight values that minimizes the error for all training examples. To prevent over-fitting and poor generalization, the number of hidden node connections is often constrained in some predetermined fashion (which can be considered a form of prior knowledge or inductive bias), a validation

or tuning set of examples is used to monitor over-fitting, or a weight-cost term is employed in the back-propagation weight update equation to reduce irrelevant weights automatically to zero. The most common error measure or cost function, E, that is used with the back-propagation algorithm is the sum of squared errors. It can be shown that minimizing the sum of squared errors seeks the maximum likelihood hypothesis under the assumption that the training data can be modelled by normally distributed noise added to the target function values. If the target function is a concept task and noise is also expected in the input attribute values, then the cross-entropy cost function is better suited [Mitc97]. Error measures will be discussed in greater detail in Chapter 3. A neural network learning algorithm iteratively updates the connection weights of the network based on the set of examples from which it is trained. At each update the weight vector represents the state of the ANN and the function which it computes. One can think of the iterative update process as a trajectory through weight space. This view is depicted in Figure 2.6. In light of this, we define a good initial representation as a set of weights which statistically require fewer training iterations than the average random set of initial weights.
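The gradient descent process described above can be sketched in its simplest form: a single sigmoid unit trained on logical OR under the sum of squared errors, with the update w ← w - η·∂E/∂w applied in full-batch fashion. This is an illustrative sketch, not the thesis software, and a real multi-layer network would also propagate the error through hidden layers:

```python
import math

examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]  # logical OR
w = [0.0, 0.0, 0.0]                      # [w1, w2, bias weight]
eta = 0.5                                # learning rate

def output(x):
    I = w[0] * x[0] + w[1] * x[1] + w[2]     # weighted sum plus bias
    return 1.0 / (1.0 + math.exp(-I))        # sigmoid activation

def sse():
    return sum((y - output(x)) ** 2 for x, y in examples)

e0 = sse()
for _ in range(2000):                    # iterate: w <- w - eta * dE/dw
    grad = [0.0, 0.0, 0.0]
    for x, y in examples:
        o = output(x)
        d = -2 * (y - o) * o * (1 - o)   # dE/dI for one example
        grad[0] += d * x[0]
        grad[1] += d * x[1]
        grad[2] += d * 1.0               # bias input is constant 1
    w = [wi - eta * gi for wi, gi in zip(w, grad)]

print(round(e0, 3), round(sse(), 3))
```

The sequence of weight vectors produced by the loop is exactly the trajectory through weight space depicted in Figure 2.6; starting from a good initial representation would simply begin that trajectory closer to the converged weight vector.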


Figure 2.6: An ANN generates a trajectory through weight space as it learns a new task. A good initial representation can be considered a point along this trajectory which is closer, in terms of the number of training iterations, to the optimal weight vector than the average random set of weights.


2.2 Survey of Knowledge Transfer in ANNs

Sequential learning and task knowledge consolidation and transfer in neural networks have been areas of inquiry since approximately 1987. This section presents the fundamental problems which have been identified and the previous efforts to solve these problems.

2.2.1 Fundamental Problems of Knowledge Transfer in ANNs

Catastrophic Interference

McCloskey wrote of the difficulty a neural network has in maintaining the knowledge of one association after learning another [McCl89]. The earlier association can be completely "forgotten" by the neural network. McCloskey used the term catastrophic interference for this phenomenon. Previously, Grossberg referred to this problem as the stability-plasticity problem of ANNs [Gros87]. The phenomenon holds true after learning one classification task and then attempting to learn another using the same network. The network will show poor generalization performance on the original task after learning the second. This is at the heart of the problem of sequential learning in ANNs and is often cited as a major drawback of connectionist theories of cognition [Fodo88]. One approach to solving the problem of catastrophic interference when sequentially learning the categories of a single classification task is to reduce the overlap in hidden unit representation by trading off distributed representation for local representation [Fren91, Shar94b, Shar94a, McRa93, Gros87]. This can be accomplished by forcing the internal representations to be as orthogonal to one another as possible. In the extreme, of course, one hidden node is used to represent each training example and little or no generalization occurs. In [Fren94a] there is an effort to make the best of both distribution and orthogonality of representation through a method referred to as context biasing. Context biasing influences the internal representation of each new paired-association in terms of the average Hamming distance between class examples. Such efforts have provided insight into methods of sequentially learning several classification tasks.

Problems with Literal Transfer

Taking the connection weights from a trained source network and placing them, as is, into a target net is defined as literal transfer of task representation.
One might think that an ANN trained for one task would provide a good starting point for learning what would seem to be a closely related task (e.g. learning the logical XOR function starting from the logical OR representation). This is often not the case. Within the context of

the back-propagation algorithm, Pratt has demonstrated that inappropriately transferred weight values of high magnitude will cause prolonged training times [Prat93a].

Sensitivity to Initial Conditions

A related problem of ANNs is their sensitivity to initial conditions [Kole90]. An ANN learning algorithm is a non-linear dynamical system, and as such it is possible for the search process to "fall into" a hypothesis which does not have the lowest possible generalization error. Such a hypothesis can be said to be a local minimum and, therefore, sub-optimal. It has been debated in [Hamm95] and [Shar95] that the problem of local minima appears to be worse in theory than it is in practice. Networks which are initialized to small random values tend to do well on average. However, when a network is initialized to high magnitude weight values as a consequence of literal transfer, the probability of "falling into" a local minimum increases [Hert91]. The probability of a local minimum increases because high magnitude weights place the outputs of the network nodes at the extremes of the sigmoid activation function.

Multiplicity of Task Representation

The representational language of a multi-layer ANN is the weights of its connections. Each weight is free to take on any positive or negative real value. As with many languages, this freedom of representation means that ANNs have the ability to represent the same function in different ways. From a knowledge consolidation and transfer perspective, the multiplicity of representation creates two major problems:

- Firstly, if prior task knowledge can be represented in different ways, how can a learning system consolidate that knowledge for efficient storage within a domain knowledge database?

- Secondly, regardless of whether prior task knowledge is or is not consolidated, how can a learning system index into domain knowledge for the information most relevant, or related, to the learning of a new task?

We will refer to indexing into domain knowledge as the selective transfer of task knowledge. Consolidation of and indexing into domain knowledge have been studied within symbolic learning systems, such as EBL [Towe90], since the mid-1980s. However, except for the authors cited in this section, there has been little study of consolidation and indexing involving ANN learning systems.


2.2.2 Summary of Previous Surveys

A study of the literature on task knowledge transfer and sequential learning, within the context of ANNs, shows that various efforts have been made toward solutions to the above problems. These solutions have been grouped into various categories by previous authors. In [Prat93b] a distinction is made between direct and indirect methods of transfer. Direct transfer is the placement of neural network parameters (typically connection weight values) from a source network into a target network with or without modification. Indirect methods of transfer involve the use of domain knowledge stored in a form, such as first order symbolic logic, which is not directly usable by an ANN. In [Thru97b] two families of approaches are discussed. The first family partitions the parameter space of the learning system into task-specific parameters and parameters common across all tasks. This family is further subdivided into four sub-families: recursive functional decomposition, piecewise functional composition, learning declarative or procedural bias, and learning various control parameters. The second family of approaches develops structural constraints that reduce the effective hypothesis space of the learning system to a region that is more optimal for learning a new task. Various other distinctions have been made between approaches to knowledge transfer [Thru97b, Prat96]. These include:

- emphasis on generalization accuracy vs. emphasis on speed of learning;
- application oriented research vs. cognitive model oriented research;
- incremental sequential learning vs. simultaneous parallel learning;
- unselective transfer vs. selective transfer; and
- literal vs. non-literal transfer.

To some extent, each of these distinctions will be discussed in the following sections.

2.2.3 Representational vs. Functional Transfer

We choose to define categories of task knowledge transfer to separate the issue of transfer from the process of sequential task learning. This is important because a number of the transfer approaches can be used for either simultaneous or sequential learning. In [Silv96b] we define the difference between two forms of task knowledge transfer: representational and functional. This distinction has been used as the basis of a survey article on knowledge transfer [Prat96].

The representational form of transfer involves the direct or indirect assignment of known task representation (weight values) to a new task. In this way the learning system is initialized in favour of a particular region of hypothesis space. We consider this to be an explicit form of knowledge transfer from a source task to a target task. Since 1990 numerous authors have discussed methods of representational transfer [Fahl90, Prat93a, Ring93, Shar92, Shav90, Sing92, Towe90]. Representational transfer often results in substantially reduced training time with no loss in generalization performance. In contrast to representational transfer is a form we define as functional. Functional transfer does not involve the explicit assignment of prior task representation to a new task; rather, it employs the use of implicit pressures from supplemental training examples [AM95, Sudd90], the parallel learning of related tasks constrained to use a common internal representation [Baxt95b, Caru95], or the use of historical training information (most commonly the learning rate or gradient of the error surface) to augment the standard weight update equations [Mitc93, Naik93, Thru94b, Thru95a]. These pressures serve to reduce the effective hypothesis space in which the learning system performs its search. This form of transfer has its greatest value from the perspective of increased generalization performance. Certain methods of functional transfer have also been found to reduce training time (measured in number of training iterations). Chief among these methods is the parallel multiple task learning paradigm explored recently by Caruana and Baxter [Baxt95b, Caru95]. The following sections survey the representational and functional forms of knowledge transfer within the context of ANNs. References are provided to the relevant material.

2.2.4 Approaches to Representational Transfer

Representational transfer involves the direct or indirect assignment of a known task representation (weight values) to a new task. The learning system is initialized in favour of a particular region of hypothesis space, which reduces learning time without loss of generalization accuracy. We consider this to be an explicit form of knowledge transfer from a source task to a target task.
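As a concrete illustration of this idea, the following is a minimal sketch of representational transfer: the common hidden-layer weights of a trained source network are copied into a target network, and only the task-specific output weights are re-initialized to small random values. The function names and network layout are our own illustrative choices, not a particular published implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_network(n_in, n_hidden, n_out, rng):
    """Random initial weights for a single-hidden-layer network."""
    return {"W_hidden": rng.uniform(-0.5, 0.5, (n_hidden, n_in)),
            "W_out": rng.uniform(-0.5, 0.5, (n_out, n_hidden))}

def literal_transfer(source_net, n_out_target, rng):
    """Direct literal transfer: copy the source hidden-layer weights
    unchanged into the target network; only the task-specific output
    weights are re-randomized to small values."""
    n_hidden = source_net["W_hidden"].shape[0]
    return {"W_hidden": source_net["W_hidden"].copy(),
            "W_out": rng.uniform(-0.1, 0.1, (n_out_target, n_hidden))}

source = init_network(n_in=8, n_hidden=4, n_out=1, rng=rng)
target = literal_transfer(source, n_out_target=1, rng=rng)
```

Back-propagation then proceeds on the target task from this initialization rather than from random weights, which is what places the search in a favoured region of hypothesis space.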

Direct, Literal Methods of Transfer

In [Prat93a] two methods of direct transfer are defined: literal and non-literal. Direct literal transfer is the placement of neural network parameters (typically weight values) from a source network into a target network with no intermediate modification of those parameters. This does not necessitate the use of separate source and target networks; more often it involves the use of the same network or sub-network for the learning of related tasks at two different times. Each method of direct literal transfer must supply a solution for overcoming the problem of catastrophic interference discussed earlier.

Compositional methods. Divide and conquer methods are common in AI. In ANN learning, approaches based on this paradigm have been referred to as modular [Waib89] or compositional learning [Sing92, Sing94, Prat91]. The emphasis is on decomposing a large task into a set of smaller tasks. A set of small networks each learn a component or sub-task using several of the input attributes. The set of small networks is then organized to provide a good initial representation for learning the larger task. In this way the knowledge of the sub-task networks is transferred to the larger task. This can be seen as a form of literal transfer. New connection weights (randomized to small initial values) used to bridge the sub-task networks help overcome any problems with catastrophic interference.

Incremental methods. An approach closely related to compositional learning is known as incremental learning. In this case there is an attempt to build incrementally upon previously acquired task knowledge, starting with simple concepts and moving on to more complex problems. The Cascade Correlation Network (CCN) is perhaps the best example of this [Fahl90]. The learning algorithm associated with this network iteratively adds hidden nodes and connections for learning preselected subsets of the training examples which may represent sub-tasks. In this way there is a literal transfer of previously learned sub-task representation on to the next sub-task and finally on to the major task. Romaniuk expands upon the CCN in [Roma94] by using a genetic algorithm to select training subsets optimally for each task. This he calls an Evolutionary Growth Perceptron (EGP) network. Romaniuk is specifically interested in the construction of domain knowledge and the ability to use EGP networks to learn various tasks sequentially in no particular order.
He calls his solution the trans-dimensional learner, which is based on the EGP network. His results for an arbitrary sequence of Boolean output tasks (such as 3-bit adder, 4-bit encoder, and 6-bit parity) are impressive. He suggests that the previously learned tasks develop high-level feature detectors which may be used by subsequent tasks. This is no doubt the case. However, the drawback is the lack of an indexing method into acquired domain knowledge. Without this, every new task, even a relatively simple one, requires additional nodes added to the EGP network. Additional nodes are needed even for tasks that are being learned for the second time.

Direct, Non-literal Methods of Transfer

A direct non-literal transfer of ANN knowledge is the placement of neural network parameters from a source network into a target network following some form of intermediate modification of those parameters. The modification process can be considered equivalent to the elaboration step of Hall's analogical reasoning formulation. An elaboration of the source network's parameters is based on an a priori heuristic meant to benefit the target network. The parameters can be the weight values of a source network modified to better suit the target task. This is clearly a form of representational transfer and will be discussed here. Alternatively, the transferred parameters may be used to guide the back-propagation algorithm's search. For example, the learning rate or momentum coefficient might be transferred. We consider this a functional form of knowledge transfer because explicit representational knowledge is not involved. The transfer of search parameters will be discussed in the section on functional transfer approaches.

Learning algorithms for ANNs are always dynamic and non-linear in nature and subject to the whim of initial conditions (initial weight values). This has led to the perception that ANN representations are ad hoc and unsystematic in nature and that relationships between learned task representations are of no future value. Two investigators have looked at methods of elaborating the weights of a source task to generate a good initial representation for a target task: Agarwal et al. [Agar92] and Pratt [Prat93a]. Both take an approach based on the linear discriminant functions constructed by the hidden nodes of the network. Agarwal prescribes a method for retaining source network performance while perturbing the node hyperplanes to accommodate the training data for the target task. This works if the two tasks are sufficiently "close" or related to one another. Lorien Pratt has studied the dynamics of the back-propagation learning algorithm and methods of transferring the weights of a source task to generate a good initial representation for a target task.
In [Prat93a, Prat94a, Prat94b] she describes the Discriminability-Based Transfer (DBT) method. The method selects good initial weight values for transfer from a source network over poor values by examining the information-theoretic value of the hidden node hyperplanes. A calculation based on Shannon's information theory determines the contribution a source network hyperplane will make in properly separating the target training data. Those weights associated with a hyperplane of high discrimination score are kept. Those weights associated with a hyperplane of low discrimination are randomized to small values. Pratt demonstrates on several real-world learning problems the success of this method over either random initial weights or direct literal transfer from a source task.

There are three limitations which must be pointed out in the above work. First, the contribution made by the hidden-to-output node weights is not considered in Pratt's research. In [Prat93b] this is acknowledged and a reference is made to [Shar92], where the impact of varying the degree to which hidden-to-output weights are transferred is determined to be significant for some of the functions examined. Second, due to the dynamics of gradient descent, the initial combination of hidden node hyperplane positions may be more important than their individual discriminant abilities. Lastly, the research relies on the manual selection of a single related source task. No consideration is given to methods of combining the knowledge from several related source tasks. It should be noted that Pratt's original work has been somewhat improved upon and reported in [Prat94d, Prat94c].

Indirect Methods of Transfer

Indirect methods of task knowledge transfer involve the use of domain knowledge stored in a form not directly usable by an ANN. The transfer of source knowledge to a new task requires a transformation of the format of the domain knowledge to the parameters or weights of the target network. The consolidation of new task knowledge with previously learned tasks necessitates a transformation back into the format of domain knowledge. Thus, indirect methods of consolidation and transfer promote a hybrid architecture.

KBANN - Knowledge-based ANNs. Some of the first efforts to transfer knowledge into an ANN were by researchers with backgrounds in Explanation-Based Learning (EBL). Shavlik and Towell made the first advances in this area with their EBL-ANNs [Shav89, Shav90] and later their knowledge-based networks (KBANN) [Towe90]. These systems demonstrate the mutual advantage of using the symbolic data of the EBL domain knowledge to overcome, on the one hand, the ANN's need for large numbers of examples, and on the other hand, the EBL requirement for complete domain knowledge. They developed a method of configuring an ANN and initializing its weights based on EBL symbolic rules. The ANN can then refine this representation using induction on the available training examples. A major part of this effort is mapping the symbolic EBL rules to the appropriate ANN configuration and weight values.
Towell and Shavlik have explored methods of extracting symbolic rules from the refined ANN [Towe91, Towe93], the objective being to consolidate the EBL domain knowledge with new rules produced by the ANN. KBANN promotes a hybrid architecture composed of an ANN inductive learning component and a symbolic domain knowledge database.

2.2.5 Approaches to Functional Transfer

Functional transfer does not involve the explicit assignment of prior task representation to a new task; rather, it employs implicit pressures selected from domain knowledge. These pressures serve to reduce the effective hypothesis space in which the learning system performs its search. This form of transfer has its greatest value from the perspective of increased generalization performance. Certain methods of functional transfer have also been found to reduce training time.
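One such implicit pressure is to supplement the training set with examples that encode a known property of the target function, an idea developed in the hints literature surveyed below. The sketch below is our own illustration and assumes, for the sake of the example only, that the target function is known to be invariant under some input transformation:

```python
import numpy as np

def add_invariance_hints(X, y, transform):
    """Augment a training set with hint examples: for each (x, y) pair,
    add (transform(x), y), encoding the prior knowledge that the target
    function is invariant under `transform`."""
    X_hint = np.array([transform(x) for x in X])
    return np.vstack([X, X_hint]), np.concatenate([y, y])

X = np.array([[0.2, 0.7], [0.9, 0.1]])
y = np.array([1.0, 0.0])
# Assumed (illustrative) invariance: the label is unchanged if the two
# input attributes are swapped.
X_aug, y_aug = add_invariance_hints(X, y, lambda x: x[::-1])
```

The augmented set constrains any hypothesis fit to it toward the invariant region of hypothesis space, without ever touching the network's representation directly.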

Implicit Pressures from Supplemental Training Examples

Learning from Hints. Yaser Abu-Mostafa is credited with introducing an alternative method of supplying domain knowledge to a neural network, referred to as hints [AM93]. Others who have used variants of this method are [Sudd90, Fawc92]. Abu-Mostafa generalizes the concept of learning from examples to one of learning from hints. A hint is defined to be any property of the target function which can be expressed as one or more pieces of training data. Thus, normal training examples are one type of hint. Other types are examples that provide knowledge of the task's invariance properties and its monotonicity. A fundamental assumption is that all hints are strongly related to the primary task and will therefore contribute positively toward the development of an effective hypothesis. The process of training an ANN with hints requires the minimization of a more complex error signal, E(h, f), which is a function of the individual errors for each hint type. A learning schedule is used to balance the impact of learning the various hints. The generation of hints can be difficult; it requires sufficient external knowledge of the task and its domain. Abu-Mostafa presents a systematic method for developing examples for a number of different hint types. Furthermore, in [AM95] he formalizes how hint examples can be used to reduce the effective Vapnik-Chervonenkis (VC) dimension, which measures the size of the hypothesis search space [Anth92]. Thus, through the use of hints the hypothesis space is constrained in an appropriate manner.

Meta-learning of Search Parameters

A number of researchers have taken an approach that advocates the meta-learning of neural network search control parameters such as the learning rate (also referred to as the step-size) or the momentum coefficient. The meta-models of search control parameters are used to control learning of a new task.
As discussed in the previous section, this can be considered a direct, non-literal method of knowledge transfer.

Interactive Tandem Networks. Robert French is interested in methods of preventing catastrophic interference when a network is used to learn various classification categories (e.g., types of furniture) [Fren94b]. His work has been inspired by the writing of McClelland et al. [McCl94] on psychological and physiological studies regarding the interaction between the hippocampal system and the neocortex of the brain. He describes an Interactive Tandem Network (ITN) composed of two back-propagation tandem networks: one referred to as short-term memory, or STM, and the other referred to as long-term memory, or LTM. Building on earlier work presented in [Fren94a], the author uses the LTM network to meta-learn the internal hidden node features, or prototypes, generated by an STM network trained to recognize a category. While learning a category in the STM network, the LTM network is queried for a hidden node activation prototype to act as an inductive or context bias. The rationale is to take advantage of any LTM knowledge of the category. If the category has been encountered in earlier learning, then the context bias plays a major role in relearning the association quickly. Thus, the LTM can be seen as a database of category representations where the category is used as the query key. If the category has never been learned before, then the effect of the context bias fades as a function of training error. If a new category (e.g., bookshelf) has been learned by the STM, a piece of logic external to the network selects a hidden node activation prototype which is guaranteed to be distributed, yet as orthogonal as possible, to the previous category prototypes. The LTM network is then trained to associate the new category with the new prototype. French shows the ability of the ITN to facilitate the relearning of previously learned categories but provides no discussion on the ability of the network to facilitate the learning of new categories. Given that the choice of category prototype representation is arbitrary, no systematic consolidation method seems to have been employed.

Meta-Neural Networks. In [Naik92, Naik93] Meta-Neural Networks (MNN) are used to record aspects of back-propagation training on a source task. An MNN is associated with each hidden node in the source network and works in an observing mode during source training and in a guiding mode during target training. In the observing mode, at each iteration of the conventional ANN, the MNNs record trajectory information for each hidden node in the form of a step-size (learning rate) and a weight direction vector (the MNNs are themselves just back-propagation ANNs). This is done several times, starting from various initial random weights.
Thus, the intention is to have the concert of MNNs remember the dynamics of the way a task is learned from various initial conditions. In the guiding mode, a modified back-propagation algorithm is used to train the target network on a new but related task. At each iteration, the MNNs provide their respective hidden nodes with an additional weight update term composed of the step-size and direction vector which is most appropriate based on past experience. The information provided by the MNN network constrains the effective hypothesis space of the target network in a manner that is congruent with the previously learned source task. If the target task can be properly represented within this constrained hypothesis space then a positive transfer will occur; if it cannot be represented then the impact of transfer will be negative. The results of simulations on four-bit Boolean logic functions show positive results in terms of reduced training time and insensitivity to initial random weights when compared to conventional back-propagation. No experiments were conducted with real-world training sets, and generalization accuracy was not reported.
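A toy analogue of the observing and guiding modes can be sketched as follows. The class and its methods are our own simplification (a single shared update memory replayed as a mean, rather than one trained MNN per hidden node), intended only to show the record-then-replay structure:

```python
import numpy as np

class UpdateMemory:
    """Toy analogue of a Meta-Neural Network's two modes: record weight
    update vectors while a source task is trained (observing), then
    replay their mean as an extra update term on a target task (guiding)."""
    def __init__(self):
        self.deltas = []

    def observe(self, delta_w):
        """Observing mode: store one iteration's weight update vector."""
        self.deltas.append(np.asarray(delta_w, dtype=float).copy())

    def guide(self, strength=0.1):
        """Guiding mode: an additional update term derived from the
        remembered source-task trajectory."""
        return strength * np.mean(self.deltas, axis=0)

mem = UpdateMemory()
for step in range(5):  # pretend source-task training iterations
    mem.observe(np.array([0.2, -0.1]) / (step + 1))
extra = mem.guide()    # would be added to the target task's weight update
```

In target training, `extra` would be added to the standard back-propagation update each iteration, biasing the search toward the trajectory that worked for the source task.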

Explanation-based Neural Networks. The Explanation-based Neural Network (EBNN) method for robot learning is introduced in [Mitc93]. EBNN is a neural network analogue to the symbolic EBL framework for knowledge-based learning initiated by Mitchell. The emphasis is clearly on improving the generalization performance of the learning system based on domain knowledge acquired while learning previous tasks. The EBNN research is important, as it provides a method by which impoverished training sets can be used along with domain knowledge to learn a new task. In [Thru93, Thru94a, Thru94b] the authors expand on the initial effort and propose the life-long learning paradigm. For an autonomous agent to be successful it must ultimately have the ability to acquire domain knowledge dynamically from its environment. EBNN works to accomplish life-long learning by using a secondary back-propagation ANN to meta-learn the slope of a desired function at each training example (the derivative of the function at an example output with respect to the input attribute vector). During the training of a new target function, the meta-network generates a slope prediction and an estimate of its accuracy for each training example. This domain knowledge information provides a further constraint to the weight update equation employed in the modified back-propagation neural network (a variant of the Tangent Prop network of [Simm92]) used to learn the target task. The slope information imposes a constraint on the volume of hypothesis space in which search is allowed. If a more accurate hypothesis for the new task resides within this region of space, a positive transfer of knowledge occurs. In [Thru94b] the slope information is proposed as an invariant shared across several tasks of a domain. Experiments show how the EBNN can learn and then transfer spatial invariants required to recognize the same object in several images.
The most interesting aspect of the EBNN method is the way in which it mixes the influence of the slope constraint, referred to as the analytical component, with the standard error derivative, which is called the inductive component. The inductive component remains constant whereas the analytical component is proportioned according to the accuracy of the domain knowledge in classifying the training examples. If the estimated accuracy is low and, therefore, the benefit of using domain knowledge is questionable, then the contribution of the analytical component is small. The work of Mitchell and Thrun has been encouraging for researchers in learning to learn. Their systems are able to demonstrate decreased training times and lower generalization error with each new task learned from the domain. Experimental results show that (1) EBNN outperforms pure inductive learning systems, (2) more accurate domain knowledge (slope constraint) yields more accurate hypotheses, and (3) EBNN learning degrades gracefully as the accuracy of the domain knowledge with respect to the new task decreases. However, there is an important caveat placed on the success of the transfer of knowledge.

The transfer of knowledge from previously learned tasks is not selective. A positive transfer will occur only to the extent that, on average, domain knowledge is able to classify the training examples.

Parallel Learning of Related Tasks

Kehoe points out in [Keho88] that psychological studies of human and animal learning suggest that, besides the development of a specific discriminant function which satisfies the task at hand, there is the acquisition of general knowledge of the structural relationship between input attributes. Kehoe proposes a multi-layer network model which is composed of two associative mapping functions: the first is a common component which maps inputs to an internal representation; the second is a task-specific component which maps the internal representation to outputs. The common component remains available for use in subsequent learning. In terms of analogy, one can describe adjustment of the task-specific component as an elaboration of the common component to the needs of a particular task. Kehoe's concept has been demonstrated by Caruana as multi-task learning (MTL) [Caru93b, Caru93a, Caru95].


Figure 2.7: A Multiple Task Learning (MTL) network. There is an output node for each task being learned in parallel. The representation formed in the lower portion of the network is common to all tasks.

An MTL network uses a feed-forward multi-layer ANN with an output for each task to be learned (observe Figure 2.7). Training examples contain a set of input attributes as well as a target output for each task. The standard back-propagation learning algorithm is used to train all tasks in parallel. Caruana shows how a set of source tasks chosen from a problem domain can be used to learn a common internal representation (the weights in the common portion of a neural network) useful for the subsequent learning of tasks from the same domain. The common portion of the MTL network is the lower section, which maps the input attributes to the internal representation at the hidden nodes.

Important aspects of MTL have been formalized by Baxter's research on learning internal representations [Baxt95b]. Let the environment of the learner be modeled by a pair (T, Q), where T is a set of tasks {T_k} and Q is a probability distribution over T. That is, (T, Q) defines the task domain and the probability of occurrence of any task. Depending upon Q, the learner will be required to develop hypotheses from one of a number of possible hypothesis spaces, {H}. One H may contain hypotheses that are primarily linear in nature whereas other H are of varying degrees of non-linearity. Each H defined by (T, Q) requires an inductive bias appropriate for learning tasks within that H. For learning to become efficient and effective in any one environment an appropriate bias must be discovered. In particular, we are concerned with the inductive bias provided by the shared use of the common feature layer of an MTL network. Baxter has proven that the number of examples required for learning any one task using an MTL network decreases as a function of the number of tasks being learned in parallel [Baxt95a]. Let m be the number of examples required to PAC learn a task T_k to some desired generalization error and confidence interval using an inductive learning system [Vali84].
Consider an MTL network with j hidden nodes in the common feature layer and with W weights in the common portion of the network below this layer. Let t be the number of related tasks selected from the domain for learning within the MTL network. If the input attributes for the tasks can be compressed to a smaller internal set of features represented by the j hidden nodes, it can be shown that the upper bound on the number of examples required to PAC learn any one task T_k using the MTL network is

m = O(j + W/t).

With j and W fixed, the number of examples required to learn T_k decreases with the number of tasks being learned, from a maximum bound of O(j + W) for a single task toward a minimum bound of O(j), since the cost of learning the common representation W is shared across the t tasks. If the tasks use a small set of common features j relative to the input attributes, then W >> j. Under this condition the number of examples required per task decreases most rapidly as t increases. Baxter has also proven that the common internal representation acquired will facilitate

the learning of subsequent tasks sampled from the domain according to the distribution Q. For any particular task, the representation between the common feature layer and the task output constitutes the task-specific component. Because the number of weights in this section of the network is relatively small, the training of a new task from (T, Q) can be accomplished with relatively few examples and with less effort than single task learning. This suggests that MTL networks can provide an important representational form of knowledge transfer as well as functional transfer. Caruana has suggested five reasons for the success of multi-task learning [Caru97a]:

- Statistical data amplification: the true signals common across all tasks have a greater chance of overcoming the noise in training examples.

- Blocking data amplification: internal representations which would normally be blocked due to mutual exclusion of examples are resolved to the best ability of the network.

- Attribute selection: the most important attributes are identified due to the data amplification effects.

- Eavesdropping: internal representations which would normally be difficult to learn for one task may be more easily developed for another task (due to a greater number of examples requiring that specific internal representation).

- Representational bias: given that there are many weight representations for any particular task, it is conjectured that by learning several tasks simultaneously a more globally useful representation will be generated.
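The shared-gradient mechanics behind these effects can be seen in a minimal numpy sketch of one MTL back-propagation step. The layer sizes and function names are our own and purely illustrative, not Caruana's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# One shared (common) hidden layer; one sigmoid output node per task.
n_in, n_hidden, n_tasks = 4, 3, 2
W_shared = rng.normal(0.0, 0.5, (n_hidden, n_in))
W_tasks = rng.normal(0.0, 0.5, (n_tasks, n_hidden))

def mtl_step(x, targets, lr=0.5):
    """One back-propagation step: each task's error gradient updates its
    own output weights, but the gradients of ALL tasks sum into the
    shared weights -- the source of the common-representation pressure."""
    global W_shared, W_tasks
    h = sigmoid(W_shared @ x)
    out = sigmoid(W_tasks @ h)
    d_out = (out - targets) * out * (1.0 - out)   # one delta per task
    d_h = (W_tasks.T @ d_out) * h * (1.0 - h)     # summed task pressure
    W_tasks -= lr * np.outer(d_out, h)
    W_shared -= lr * np.outer(d_h, x)
    return out

x = np.array([1.0, 0.0, 1.0, 0.0])
before = W_shared.copy()
mtl_step(x, targets=np.array([1.0, 0.0]))
```

The line computing `d_h` is where every task's error, related or not, pulls on the common representation; this is precisely the behaviour that selective transfer must moderate.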

The last point is of particular interest to us. Caruana discusses a tidal effect, which can be seen as a mathematical bias, based on a sum of the error gradients from multiple tasks, toward better performing regions of weight space. In fact, there is empirical evidence to indicate that hidden node weight values favoured by several tasks in a domain are favoured by another task from the same domain. The work on parallel learning of multiple tasks has been significant [Baxt95c, Caru97a]. Baxter and Caruana have demonstrated the success of the parallel MTL method on a number of laboratory and real-world problems (e.g., Boolean equations, one-dimensional invariance problems, object recognition, medical diagnosis).

However, from a sequential learning perspective MTL has a major limitation. Parallel transfer in an MTL network occurs due to the pressures of learning several related tasks given the constraint that the majority of the connection weights of each task are shared. Given that the tasks are from a domain of highly related functions, a positive transfer of knowledge will occur from one to the other. However, for a diverse task domain, the MTL methodology provides no mechanism for automatically selecting the most appropriate source tasks for parallel transfer to a primary task of interest. Currently the selection is done manually; that is, the relatedness of the source tasks to the primary task is decided off-line from the learning algorithm. The complexity of this subjective selection process increases with the number of previously learned tasks. If a poor selection has been made, the representations generated during parallel learning may favour several unrelated tasks and thus a negative transfer effect will take place (increased training times and decreased generalization accuracy). In fact, the MTL operator can never guarantee, a priori, that a particular source task will provide a positive transfer of inductive knowledge to the primary task, and the standard MTL algorithm has no method of escaping the pressures of unrelated tasks. A sequential learning system must be capable of selecting the appropriate prior knowledge such that it facilitates the learning of a new function; otherwise the retention of that knowledge is of little value.
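A selective variant, foreshadowing the ηMTL method developed in this thesis, scales each secondary task's learning rate by a measure of its relatedness R_k to the primary task. The sketch below uses hypothetical placeholder R_k values; how R_k is actually measured is the subject of later chapters:

```python
import numpy as np

def task_learning_rates(base_lr, relatedness):
    """Scale each secondary task's learning rate by its relatedness R_k
    to the primary task (R_k in [0, 1]); unrelated tasks then exert
    little pressure on the shared representation."""
    R = np.clip(np.asarray(relatedness, dtype=float), 0.0, 1.0)
    return base_lr * R

# Hypothetical relatedness values for three secondary tasks.
lrs = task_learning_rates(0.5, [1.0, 0.6, 0.05])
```

Scaling η_k rather than discarding a task outright lets the degree of transfer vary continuously between standard MTL (all R_k = 1) and single task learning (all R_k = 0).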

2.2.6 Analogy within Artificial Neural Networks

As discussed in Section 2.1.5, analogy can be decomposed into four components: recognition, elaboration, evaluation, and consolidation. We see knowledge transfer as being primarily concerned with recognition, elaboration and evaluation, with a focus on rapid learning of new tasks within short-term memory. In contrast, we see knowledge consolidation as a long-term memory activity not well suited for rapid learning. Integral to task knowledge transfer is a method of indexing into domain knowledge which involves some measure of relatedness between the target task and the available source tasks. Indexing is at the heart of the recognition component, for it is during this process that certain source knowledge is chosen as best for learning a new task. The alternative to indexing is some form of exhaustive search, which would seem impractical for the rapid deployment of inductive bias. Indexing is also important to the elaboration component. During elaboration the chosen knowledge is extended by some means and transferred to the model of the new task. One method of extending domain knowledge is to generalize across previously learned task knowledge to combine the best of several source tasks. This suggests that a meta-learning process may be required for consolidation and selective transfer of domain knowledge.

The formulation of analogical learning in ANNs is new and currently there are more questions than answers [Bard94a, Bard94b]. In [Gent93] Gentner and Markman posed a challenge to the connectionist community: "How can neural networks be used to develop models of analogy?" They suggested, along the lines of Fodor and Pylyshyn [Fodo88], that neural networks lack the systematicity needed to provide the structural alignment, structural projection, and flexibility evidenced in human analogy and symbolic models of analogy. They feel that any system of analogy must exhibit syntactic concatenative compositionality, as provided by first-order logic, to capture the flexibility inherent in the comparison and generation of complex tasks. We do not agree with this position, and we would like to provide results that support the view that learning by analogy is possible using a system of ANNs. Clues as to how ANNs might be used to model analogy are provided by van Gelder [vanGelder90]. Van Gelder suggests that the flaw in the symbolic argument against ANNs is that syntactic structural compositionality is not the only way to achieve systematicity. He makes the case that ANNs can achieve the same using functional compositionality, in the same manner as various sine waves can be composed to create any sound wave.

2.3 Major Research Questions

This section consolidates the background and surveyed material by presenting several major research questions surrounding selective knowledge transfer and sequential learning. The major research questions are:

1. In what form should previously learned task knowledge be retained within domain knowledge: representational or functional?

2. How can task knowledge be consolidated within domain knowledge for efficient storage and for more efficient and effective transfer?

3. In what form should task knowledge be transferred from domain knowledge to a new target task: representational or functional?

4. What is relatedness? What is task relatedness in the context of inductive learning?

5. What are the theoretical implications of task relatedness with respect to inductive bias?

6. How can related knowledge be selected from domain knowledge before or during the learning of a new task?

7. Can an artificial neural network or system of networks be constructed to sequentially learn, retain and selectively transfer task knowledge?

Certainly, these questions are not independent of one another. A solution to one question will have an impact on the answer to another. For example, the form in which task knowledge is retained within domain knowledge can affect the manner in which the most related domain knowledge is selected when learning a new task.


2.3.1 Knowledge Retention: Representational or Functional?

In what form should previously learned task knowledge be retained within domain knowledge: representational or functional? The simplest method of retaining task knowledge is to save all the training examples for the task. We define the training examples to be a functional form of knowledge [4]. Other methods of retaining functional knowledge of a task involve the storage or modelling of search parameters, such as the learning rate or back-propagation error gradient in ANNs, or the number of allowed branches for an input attribute in a decision tree. An advantage of retaining functional knowledge, particularly the retention of the actual training examples, is the accuracy and purity of the knowledge. A disadvantage of retaining functional knowledge is the large amount of storage space that it often requires. Alternatively, the knowledge of a task can be retained in the form of the representation of an accurate hypothesis developed from the training examples. We define this to be a representational form of knowledge. The representation of a hypothesis involves a description of the representational language (the architecture of the neural network) and the values of the free parameters used by that representation (the weights of the connections between neurons). The advantage of retaining representational knowledge is its compact form relative to the space required for the training examples used to develop the representation. The disadvantage of retaining representational knowledge is the loss of purity and potential loss of accuracy from the original training examples. In this thesis we do not attempt to explore this question fully because we see the best form of retention as being closely tied to the question of knowledge consolidation. However, from a knowledge transfer perspective, retaining knowledge in the representational form of the developed hypotheses has a number of benefits. These will be discussed in Chapter 3.
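The two forms of retention can be contrasted in a small sketch: representational retention stores only the hypothesis weights, while functional knowledge (virtual training examples, as in the task rehearsal method) can be regenerated on demand by querying the stored hypothesis. All names here are our own illustration:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def stored_hypothesis(W_hidden, W_out):
    """Representational retention: keep only the weights of an accurate
    hypothesis, not the training examples used to develop it."""
    return lambda x: sigmoid(W_out @ sigmoid(W_hidden @ x))

def virtual_examples(hypothesis, n, n_in, rng):
    """Functional knowledge regenerated on demand: query the stored
    representation with fresh inputs to produce virtual examples."""
    X = rng.uniform(0.0, 1.0, (n, n_in))
    return X, np.array([hypothesis(x) for x in X])

rng = np.random.default_rng(2)
h = stored_hypothesis(rng.normal(size=(3, 4)), rng.normal(size=(1, 3)))
X, y = virtual_examples(h, n=10, n_in=4, rng=rng)
```

The storage trade-off is visible here: the hypothesis occupies only 3x4 + 1x3 = 15 weights, yet can regenerate as many functional examples as later rehearsal requires, at the cost of any error the hypothesis itself carries.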

2.3.2 Knowledge Consolidation: How is it done?

How can task knowledge be consolidated within domain knowledge for efficient storage and for more efficient and effective transfer? It is necessary for a KBIL system to retain task knowledge. However, retention is not sufficient for a life-long learning agent. Knowledge must be consolidated in a systematic fashion for the purposes of efficient storage and for more efficient and effective transfer. Over a long period of time it would be inefficient to retain multiple copies of knowledge for the same task simply because the task was practised on different occasions using different

[4] Clearly, all retained information must have a representation, else it cannot be stored on a computer. The distinction between functional and representational forms of task knowledge is to categorize two broad classes of knowledge for later use in knowledge transfer.

training examples. To some extent the knowledge acquired at the different times will overlap and support each other. A life-long learning system must have a mechanism for integrating the task knowledge it acquires within a domain knowledge database of finite size. The consolidation of task knowledge is equally important for the optimal transfer of domain knowledge to a new target task. Knowledge integration implies a method of indexing into domain knowledge that can reduce search time and increase the probability of selecting the most appropriate knowledge for transfer.

The question of how retained task knowledge can be consolidated is interesting and challenging. In fact, it is the stability-plasticity problem originally posed by Grossberg [Gros87] taken to the level of learning sets of tasks as opposed to learning sets of examples. We see this question as primarily concerned with the storage of task knowledge in long-term memory and as necessarily requiring large quantities of computational time and storage space. Our initial research into representational methods of knowledge transfer explored consolidation to an extent. The difficulties and important findings from that research will be discussed in the next section. The greatest portion of our research efforts has concentrated on the more practical problem of learning new tasks within a short-term memory structure using retained but not consolidated task knowledge. This dissertation reports on this latter research and draws no conclusions regarding long-term knowledge consolidation.

2.3.3 Knowledge Transfer: Representational or Functional?

In what form should learned task knowledge be transferred from domain knowledge to a new target task - representational or functional? The form in which task knowledge is retained can be separated from the form in which it is transferred. Knowledge can be retained in a functional form and transferred in a representational form. For example, the functional training examples of a task could be saved in domain knowledge. As part of the transfer process these examples could be used to develop an ANN model, the representation of which can be used as a starting point for learning a new task. Alternatively, knowledge can be retained in a representational form and transferred in a functional manner. For example, the representation of an ANN hypothesis could be saved after being induced. As part of functional transfer, the representation could be used to generate functional examples which are used as supplemental training examples for learning a new target task.

Representational transfer has been shown to provide a better than random or default starting point for induction of a new task. The result is that hypotheses are developed in shorter training times and with no loss in generalization accuracy. In the early part of our research we explored selective representational transfer based on a model of two interacting

back-propagation ANNs, a task network (TN) and an experience network (EN) [Silv95b]. With respect to the framework of KBIL shown in Figure 2.2, the TN was the standard inductive learning system, whereas the EN formed the database of domain knowledge. The cycle of interaction was intended to force a representational structure on domain knowledge within the EN that could be used to transfer knowledge to new tasks. A prototype system was developed and experiments were conducted on synthetic and real-world domains. Experiments showed that the prototype had success on small domains of simple linearly separable tasks in terms of more efficient learning. However, on larger domains involving more complex non-linearly separable tasks, we discovered several fundamental problems with our representational transfer approach:

- For our selective representational transfer approach to succeed, the similarity between a new target task and retained source tasks had to be mapped to a measure of representational relatedness. The representational prototype system considered the Euclidean distance between task representations to be an indexing method into the meta-level model of retained network hypotheses. However, there are many weight representations for any one task developed by an ANN, and the diversity of representation grows with the complexity of the task. Ultimately, despite several efforts to constrain the multiplicity of task-level representation by using meta-level knowledge, the prototype system failed.

- Often when the representational transfer prototype was used to learn a sequence of more complex tasks, the sequence had to be ordered a priori to facilitate the transfer of knowledge. The requirement for planning a task curriculum reduces the practical application of the representational transfer approach.

- None of the experiments using the selective representational transfer prototype produced hypotheses that were significantly more accurate than hypotheses produced using a standard neural network that used no knowledge based inductive bias.

In contrast, functional transfer has been shown to reduce the effective hypothesis space of the learning system. If the knowledge transferred is from source tasks that are related to the new target task, then the result is often a hypothesis with better generalization on test examples. We have spent the largest portion of our research effort exploring a functional form of selective knowledge transfer using ANNs. The dissertation focuses on a theory of selective functional transfer and the results of experiments using a prototype system based on that theory.
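The core mechanism of functional transfer, generating supplemental training examples by querying a retained representation, can be sketched as follows. The stand-in model, input dimensionality and example counts are hypothetical; in the thesis the retained representation is a trained back-propagation network.

```python
import random
random.seed(0)

def stored_hypothesis(x):
    """Stand-in for a retained ANN hypothesis; any trained model would do.
    Here: a fixed decision rule, purely for illustration."""
    return 1 if sum(x) > 1.5 else 0

def generate_virtual_examples(model, n_inputs, n_examples):
    """Functional transfer: query the retained representation with inputs
    to produce (input, target) pairs usable as supplemental training data
    when learning a new target task."""
    examples = []
    for _ in range(n_examples):
        x = [random.random() for _ in range(n_inputs)]
        examples.append((x, model(x)))
    return examples

virtual = generate_virtual_examples(stored_hypothesis, n_inputs=3, n_examples=5)
```

The generated pairs play the role of training examples for a secondary task, even though no original examples for that task were kept.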


2.3.4 Task Relatedness: What is it? How is it used?

What is task relatedness in the context of inductive learning? An issue that repeatedly occurs in the literature is the importance of relatedness to the process of task knowledge transfer. The issue arises when domain knowledge contains a diverse set of retained task knowledge. Some of the domain knowledge may be related to a new target task and some may be unrelated. Knowledge transferred from a closely related source task provides a positive inductive bias that results in a hypothesis of superior performance. Conversely, knowledge transferred from an unrelated source task provides a negative inductive bias that results in a hypothesis of inferior performance. The success of previous ANN knowledge transfer methods has depended upon the vast majority of domain knowledge tasks being closely related to the new target task. Thus, the question of task relatedness has been largely avoided. In [Caru97a] the importance of relatedness to knowledge based inductive learning is recognized and discussed. Relatedness is defined as follows: Two tasks T0 and Tk are related if there exists an algorithm L such that L learns T0 better when given training data for Tk as well, and if there is no modification to L that allows L to learn T0 this well when not given the training data for Tk. We agree with the intent of this definition; however, we find it too strong. For example, an additional modification to L could be as simple as learning T0 with another task Tl (or modifications that are equivalent to Tl) such that L produces better hypotheses. Certainly, this should not mean that Tk is no longer related to T0. A definition of relatedness is needed that considers degrees of relatedness between tasks such as T0, Tk and Tl. In Chapter 3 we present such a definition.

What are the theoretical implications of task relatedness with respect to inductive bias?
The major theoretical implication of task relatedness to inductive bias is that previously learned and retained task knowledge is only one portion of the inductive bias of a KBIL system. The second and equally important portion of inductive bias is the relatedness of domain knowledge to the new target task. The relationship between domain knowledge and task relatedness is an important aspect of this dissertation. How can related knowledge be selected from domain knowledge before or during the learning of a new task? An answer to this question is critical to the success of a KBIL system. Even if task knowledge has been retained and properly consolidated there must be a method of selectively transferring knowledge based on the new target task. Only by selectively using domain

knowledge in this way can new tasks be quickly learned from limited numbers of examples to high levels of accuracy. In this thesis we develop several measures of task relatedness and present the results of testing the measures experimentally on synthetic and real-world domains.
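One naive way to picture a degree of relatedness between two concept tasks is their empirical agreement over a shared set of inputs. This is only an illustrative toy score, not one of the measures the thesis actually proposes (those are developed in Chapter 3):

```python
def relatedness(task_a, task_b, inputs):
    """A naive degree-of-relatedness score in [0, 1]: the fraction of shared
    inputs on which two concept tasks agree. Illustrative only."""
    agree = sum(1 for x in inputs if task_a(x) == task_b(x))
    return agree / len(inputs)

# Two toy concepts over 2-bit inputs.
T0 = lambda x: x[0] or x[1]          # OR
Tk = lambda x: x[0] and x[1]         # AND
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(relatedness(T0, Tk, inputs))   # agree on (0,0) and (1,1) -> 0.5
```

A score like this already captures the notion of degrees of relatedness missing from the all-or-nothing definition discussed above: identical tasks score 1.0, and unrelated tasks score near chance.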

2.3.5 Sequential Learning: Is it possible in ANNs?

Can an artificial neural network or system of networks be constructed to sequentially learn, retain and selectively transfer task knowledge? The problems of catastrophic interference, sensitivity to initial weight values and the multiplicity of task representations would seem to make ANNs a poor choice as a basis for developing a sequential learning system. However, ANNs have a number of other characteristics that we feel make them a good choice for studying KBIL:

- The distributed representation of ANNs naturally promotes a form of generalization similar to non-linear interpolation. We feel that this is an important characteristic for domain knowledge retention and for the transfer of knowledge from two or more secondary tasks to a primary task.

- Multiple tasks can be learned at the same time within one ANN and share the internal representation that is formed by the learning algorithm. Few other machine learning systems have the ability to learn multiple tasks in this manner without major modifications. Baxter and Caruana have demonstrated that (1) inductive bias will occur from related secondary tasks to a target task within an MTL network and (2) to some extent, an MTL network will discover the relatedness between tasks in terms of their shared use of internal representations.

- Both training examples and ANN task representations (connection weights) can be analysed analytically as well as visually using graphical tools.

- ANN task representations (connection weights) can be represented as numeric vectors and used as input or output by another ANN. This provides the possibility of manipulating or modelling task knowledge within a meta-level ANN.

- Previous research suggests methods of dynamically adjusting ANN learning parameters [Jaco88, Vogl88, Naik92, Thru94b] as a method of inductive bias.

For the above reasons we have decided to pursue knowledge transfer and sequential learning in the context of ANNs.


2.4 Objectives and Scope of the Research

2.4.1 Objectives

Theory. The first objective is to develop a theoretical model and a prototype system which sequentially retains ANN task knowledge and selectively uses that knowledge to bias the learning of a new task in a more efficient and more effective manner than a standard inductive learning system. Specifically:

- Increased Efficiency is defined as follows:

Given P tasks from a domain, each with some fixed number of training examples sufficient to PAC learn each task to some error tolerance e, and one new task T0 with some fixed number of training examples sufficient for PAC learning to e, the system should, on average, show a reduction in the number of batch training iterations to learn T0 with domain knowledge compared to learning T0 without the domain knowledge [5].

- Increased Effectiveness is defined as follows: Given P tasks from a domain, each with some fixed number of training examples sufficient to PAC learn each task to some error tolerance e, and one new task T0 with an insufficient number of training examples for PAC learning to e, the system should produce hypotheses that show, on average, higher generalization accuracy for T0 compared to hypotheses developed without the domain knowledge. As the system's knowledge of the task domain increases there should be a resulting decrease in the number of training examples [6] required to reach a desired level of generalization accuracy for a new task.

Of these two performance factors we consider effectiveness to be the more important. It is reasonable to choose the reliability of a model over its rate of development, provided there is time for all models to develop.

Application. The second objective is to demonstrate the application of the theory in the development of a model for medical decision making. The thesis will show the advantage of an ANN based KBIL system that sequentially learns a domain of medical diagnostic tasks as compared to learning the same tasks with a traditional ANN system that does not accumulate domain knowledge.

[5] Efficiency is based on the number of training iterations and not on training time. It can be argued that the number of training iterations may not be representative of actual processing time for variations in ANN simulators and network architectures running on a serial processing computer. However, this is a function of the implementation and not the back-propagation algorithm itself.

[6] The number of training examples will have a lower bound in accordance with [Baxt95b].

Criteria for success. Due to the dynamic non-linear differential equations involved in back-propagation learning, formal proofs are extremely difficult and in many cases impossible. Therefore, much of the thesis will report empirical results using a prototype system based on the proposed theory. The prototype system, and therefore the theory, will be judged on its ability to select appropriate source knowledge for a new task such that the knowledge produces an inductive bias resulting in:

- more efficient learning - the average number of training iterations over several learning runs (each with different random starting weights) is statistically less than or equal to the average number of iterations under standard back-propagation learning; and

- more effective learning - the average generalization error measured against a test set of data for the same set of runs is statistically less than the average generalization error under standard back-propagation learning.
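A comparison of this kind can be sketched with a standard two-sample statistic over repeated runs. Welch's t statistic below is one reasonable choice for unequal-variance samples; it is an illustration of the style of comparison described above, not the thesis's exact statistical procedure, and the run values are made up.

```python
from statistics import mean, stdev
from math import sqrt

def welch_t(sample_a, sample_b):
    """Welch's t statistic for comparing two sets of learning runs, e.g.
    iteration counts or test-set error under two training methods."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = stdev(sample_a) ** 2, stdev(sample_b) ** 2
    return (mean(sample_a) - mean(sample_b)) / sqrt(va / na + vb / nb)

# Illustrative generalization errors over 5 runs with different random
# initial weights, with and without knowledge transfer.
standard_bp = [0.21, 0.19, 0.23, 0.20, 0.22]
with_transfer = [0.16, 0.15, 0.18, 0.14, 0.17]
t = welch_t(standard_bp, with_transfer)  # large positive t favours transfer
```

The statistic would then be compared against the appropriate t distribution to decide whether the difference in means is significant.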

Specially constructed synthetic task domains will be used for a critical analysis of the prototype software and the theory. This will be followed by experiments involving a real-world domain of medical diagnostic tasks.

2.4.2 Scope

The scope of this thesis is circumscribed by the following:

- The focus will be on the retention and use of previously acquired task knowledge as a basis for inductive bias that leads to more efficient and more effective learning. Thus, the research is interested in improved methods of knowledge based inductive learning. Advances in pure inductive learning are not at issue.

- The consolidation of task knowledge will be considered a secondary issue. Previously learned task knowledge will be retained but not necessarily consolidated. The research will focus on the problem of learning a new task within a short-term memory ANN using the retained task knowledge.

- This research will only consider supervised inductive learning of classification tasks with a single output that can have one of two values. Therefore, the research will focus on the learning of concepts (e.g. "the person has the disease").

- The classification tasks will be high-level learning problems typical of clinical diagnosis in a medical setting or data mining in a business setting. Typically, these problems have 10 to 50 independent input attributes such as age, gender, life-style, demographic or clinical findings. The objective is to develop a model for accurately classifying a

patient or customer. We consider the problems high-level because they do not deal with large quantities of low-level inputs such as image pixels or auditory frequencies.

- The machine learning paradigm of choice will be ANNs for the reasons specified in Section 2.3.5.

- Multi-layer feed-forward neural networks and the back-propagation of error learning algorithm will be used because of their wide range of practical deployment, their use in related work on task consolidation and transfer, the availability of simulation software, familiarity with the learning paradigm, and the speed at which the algorithm can learn as compared with various other classes of ANNs.

- The neural networks will be restricted to networks of three layers including one input layer, one hidden layer and one output layer of nodes. Networks of three layers have been shown to be sufficient for learning any continuous function [Horn89].

- The extension of the knowledge transfer theory to other machine learning methods will be discussed. However, the details of implementation and testing are beyond the scope of the current effort.

- The effect of noisy training and testing examples will not be explicitly considered in the theory. The experiments will be used to examine the effect of certain aspects of noise on the learning system.

Chapter 3

FUNCTIONAL TRANSFER and SEQUENTIAL LEARNING

Two forms of task knowledge retention and transfer, representational and functional, are defined in Chapter 2. This chapter discusses a knowledge based inductive learning (KBIL) system that uses a representational form of knowledge retention but a functional form of knowledge transfer. This system is based on three contributions we have made to the field of machine learning:

- Relatedness between Tasks - The degree of task relatedness in the context of inductive learning and functional transfer is defined. Three categories of a priori measures of relatedness are developed and explored.

- The ηMTL Algorithm - A modified version of the back-propagation algorithm used for MTL, called ηMTL, provides a framework for employing a measure of task relatedness between a primary task and each of the secondary tasks. The measure of relatedness controls the transfer of knowledge (inductive bias) for developing a hypothesis for the primary task.

- A Task Rehearsal Method (TRM) - A method that uses ηMTL to perform sequential learning of a series of tasks from the same domain is presented. The TRM retains task knowledge in the form of network representations. It generates virtual examples which transfer knowledge in a functional form when learning a new task.

The chapter opens with Section 3.1 discussing MTL networks as a basis for selective functional transfer of task knowledge. The details of an MTL network are reviewed and the inherent characteristics and inductive biases of an MTL network are presented. Inductive bias within a KBIL system is defined as a composition of domain knowledge and task relatedness. The relationship between inductive bias, internal representation and related tasks is then described. This leads to four alternative strategies for knowledge transfer within an MTL network based on the form of transfer (representational or functional) and the influence from secondary tasks (fixed or variable).

Section 3.2 defines a theory of functional transfer using an MTL network and the back-propagation algorithm. The validity of the theory depends on finding solutions to three requirements: (1) a method of sequentially retaining and regenerating functional task knowledge, (2) a framework for employing a measure of task relatedness for each secondary task learned within an MTL network, and (3) a measure of relatedness that can automatically promote a positive transfer of knowledge to a primary task. Section 3.3 discusses a solution to requirement (2) regarding a framework for employing a measure of task relatedness within a back-propagation MTL neural network. Section 3.4 begins the discussion of requirement (3) by presenting the nature and complexity of task relatedness. Task relatedness in the context of functional transfer is defined, followed by descriptions of theoretical measures of relatedness and similarity. Section 3.5 completes the discussion of requirement (3) by reporting on several alternative measures of relatedness. Section 3.6 introduces the TRM of sequential learning and discusses the generation and use of virtual examples for functional transfer. The TRM provides a solution to requirement (1) of the theory of functional transfer. Finally, in Section 3.7 the functional transfer (ηMTL) and sequential learning (TRM) prototype software is reviewed. This software is used in the experiments reported in Chapters 4, 5 and 6.

3.1 MTL - A Basis for Selective Functional Transfer

3.1.1 Review of MTL Network Learning

An MTL network uses a feed-forward multi-layer network with an output for each task to be learned. Figure 3.1 shows a simple MTL network which contains a hidden layer of nodes that are shared by all tasks. We will refer to this as the common feature layer. The sharing of the internal representation (the weights of connections) below the common feature layer is the method by which inductive bias occurs within an MTL network. The representational power of the common and task specific portions of an MTL network defines the space of possible hypotheses. Figures 3.2 and 3.3 show more complex MTL networks that have additional hidden layers of nodes within the common and task specific portions of the network. Three-layer networks are capable of approximating any continuous function to an arbitrary level of accuracy [Horn89] and have been found to be sufficient for many non-linearly separable classification problems. It has been shown by [Cybe88] that 4-layer networks having 2 hidden layers are capable of approximating any function (continuous or discontinuous) to an arbitrary level of accuracy. In composite, the 3-layer common portion

and the linear task specific portion of the network in Figure 3.2 provide this representational power. A 3-layer common portion of an MTL network can be useful because it has the ability to generate a tightly constrained common representation that satisfies the needs of the multiple tasks. The network of Figure 3.3 goes one step further by providing task specific portions that contain a hidden layer. The additional hidden nodes give the task specific portions the ability to create non-linearly separable mappings from the common features to the task output nodes.

Figure 3.1: Prototype of a simple 3-layer multiple task learning (MTL) network capable of computing any continuous function. There is an output node for each task being learned in parallel. The representation formed in the lower portion of the network is common to all tasks. All tasks share the hidden layer, which we will refer to as the common feature layer.

MTL training examples are composed of a set of input attributes as well as a target output for each task. The standard back-propagation of error (BP) learning algorithm is used to train all tasks in parallel. The weights w_jk affecting an output node k are adjusted according to the equation

Δw_jk = -η ∂E_k/∂w_jk = η δ_k o_j,

where η is the learning rate parameter, o_j is the input from the hidden layer node j, and δ_k is dependent upon the cost function, E_k, being minimized by the back-propagation algorithm. The sum of squared error (SSE) cost function given by

E_k = Σ_p (t_k - o_k)^2

over all training examples p for task output k is commonly employed. Under the SSE cost



Figure 3.2: Prototype of a complex 4-layer MTL network capable of computing an arbitrary function. The common task domain representation is capable of computing any continuous function. As in the case of the simple MTL network of Figure 3.1, all tasks share the common feature layer. The task specific portions of the network are capable of forming linear combinations of these features.



Figure 3.3: Prototype of a more complex 5-layer MTL network. Both the common representation portion of the network and the task specific portions have a hidden layer of nodes. All tasks share a common feature layer. The task specific portions of the network are capable of forming complex nonlinear combinations of these features.

function, δ_k can be shown to equal (t_k - o_k) o_k (1 - o_k), where t_k is the desired output at node k and o_k is the actual output of the network at node k. Alternatively, the summed cross-entropy (SXE) cost function, given by

E_k = - Σ_p [ t_k log(o_k) + (1 - t_k) log(1 - o_k) ],

can be used. The cross-entropy function seeks the maximum likelihood hypothesis under the assumption that the training example target values are a probabilistic function of their attributes [Mitc97]. Under the cross-entropy cost function, δ_k can be shown to equal (t_k - o_k), where t_k is the desired output at node k and o_k is the actual output of the network at node k. Similarly, the weights w_ij affecting any hidden node j are modified as per the following: Δw_ij = η δ_j o_i, where δ_j is given by o_j (1 - o_j) Σ_k δ_k w_jk, o_j is the output of the hidden node j, and the summation proportions the error from each of the k output nodes in accord with the weights w_jk. This equation can be used for networks of any number of hidden layers below layer i if the appropriate notational substitutions are made.
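The error signals and the output-layer update rule Δw_jk = η δ_k o_j can be checked directly in a few lines of pure Python. The numeric values below are illustrative only:

```python
def delta_sse(t, o):
    """Error signal at an output node under the SSE cost, as derived above."""
    return (t - o) * o * (1 - o)

def delta_sxe(t, o):
    """Error signal at an output node under the cross-entropy (SXE) cost."""
    return t - o

def output_weight_update(eta, t, o, o_j, cost="sse"):
    """One back-propagation step for a weight w_jk feeding output node k:
    delta_w = eta * delta_k * o_j."""
    delta = delta_sse(t, o) if cost == "sse" else delta_sxe(t, o)
    return eta * delta * o_j

def hidden_delta(o_j, deltas_and_weights):
    """delta_j = o_j (1 - o_j) * sum_k delta_k w_jk, proportioning the error
    from each output node k back to hidden node j."""
    return o_j * (1 - o_j) * sum(d * w for d, w in deltas_and_weights)

# Example: target 1.0, output 0.8, hidden activation 0.5, eta = 0.1
print(output_weight_update(0.1, 1.0, 0.8, 0.5))           # SSE update
print(output_weight_update(0.1, 1.0, 0.8, 0.5, "sxe"))    # SXE update
```

Note how the SXE signal (t_k - o_k) does not carry the o_k(1 - o_k) factor, so it does not vanish as the output saturates; this is one practical reason the cross-entropy cost is often preferred for concept learning.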

3.1.2 Characteristics and Inherent Biases of MTL Networks

Assuming that an MTL network has the appropriate amount of internal representation for developing accurate hypotheses for all parallel tasks, the following are important characteristics and inherent inductive biases of the network under the back-propagation algorithm.

Characteristic 1: Hypotheses Depend on Initial Conditions. The development of a hypothesis under the BP algorithm is a gradient descent method that begins from a randomly selected set of connection weights. It is well known that there is no theoretical guarantee that the BP algorithm will converge to the optimal hypothesis for any network configuration [Rume86a]. This remains true for MTL networks. A poor choice of initial weight values can make it impossible for the BP algorithm to discover an acceptable level of error for any, or all, tasks of the network. Fortunately, the problem can be mitigated in most applications by choosing parallel tasks that are closely related to each other [Caru97b].

Characteristic 2: Over-fit to Training Data. The BP algorithm has a tendency to over-fit or over-train the weights of the MTL network to produce hypotheses that are too specific to the training examples and subsequently generalize poorly on a separate test set of examples [Hayk94].

Bias 1: Smooth Interpolation over Training Examples. The values generated at each task output of an MTL network will present a smooth interpolation between training examples. Given two positive training examples, unless a third negative training example falling between the first two is provided, the network will generate a continuous smooth flowing sequence of positive values between the original two examples [Mitc97].

Bias 2: Rapid Learning of Simple Tasks. An artificial neural network is initialized with small random weight values. This initialization favours simple linear hypotheses over non-linear hypotheses in the early part of learning. Subsequently, tasks with outputs that linearly correlate more highly with their inputs will, subject to Characteristic 1, tend to train in fewer iterations than those with lower linear correlation. In fact, given two Boolean functions of two variables, the one which has the highest correlation between its inputs and its output will on average train most rapidly and to the lowest training error. This can be easily verified by learning any set of concept tasks within an MTL network and comparing the time it takes to learn each task with the linear correlation of each task's training example inputs to its outputs. Consider learning all 14 non-trivial Boolean logic functions of 2 variables with an MTL network. The X1 function (where the output is the same as the first input) is most highly correlated to its output at 1.0. The OR, AND and XOR functions follow with correlations of 0.578, 0.578 and 0, respectively. Thus we would expect the X1 function to be learned more rapidly than the others. Figure 3.4 shows the graphs of training error versus number of batch iterations for the four hypotheses. The graphs indicate that the rate of learning of each function generally matches its correlation. A consequence of this bias is that tasks of varying complexity will develop optimal hypotheses at different rates within an MTL network.
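The correlations quoted for X1, OR, AND and XOR can be reproduced directly. The sketch below measures each task's output against the first input (for OR and AND the value is the same against either input); it yields 1/√3 ≈ 0.577 for OR and AND, matching the 0.578 quoted above up to rounding.

```python
from math import sqrt

def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

# Four of the Boolean tasks over the full 2-bit input space.
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
tasks = {
    "X1":  lambda a, b: a,
    "OR":  lambda a, b: a | b,
    "AND": lambda a, b: a & b,
    "XOR": lambda a, b: a ^ b,
}

x1_column = [float(a) for a, _ in inputs]
for name, f in tasks.items():
    outputs = [float(f(a, b)) for a, b in inputs]
    print(name, round(pearson(x1_column, outputs), 3))
```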

Bias 3: The Parallel Learning Constraint. Under the standard back-propagation set of equations the learning rate, η, is a constant, global parameter. Subsequently, the back-propagated error signal from any output node k is considered to be of equal value to all others. In this way, the objective function is minimized simultaneously and democratically for all parallel tasks. At the point of lowest training error, the BP algorithm does its best, subject to Characteristic 1 and Bias 2, to generate hypotheses that average the error across all of the output nodes. This may not be beneficial for all parallel tasks [Caru93b].

Consequences of the Characteristics and Biases

The convergence problem is one which has plagued neural networks since their inception. It has proven to be more of a theoretical problem than a practical one with the BP algorithm because it is often possible to develop a sufficiently accurate model. Nonetheless, the



Figure 3.4: Training error versus the number of batch iterations for 4 hypotheses while learning all 14 non-trivial logic functions of 2 variables within an MTL network. The network had a configuration of 2-3-14, which is sufficient to learn all 14 functions.

problem does have implications for the experimental design of comparative network studies. A single experimental trial can look good or bad depending upon the initial conditions provided by the weights. One must be rigorous in running several trials and comparing the appropriate statistics when conducting experiments with the BP algorithm.

There are several solutions to the problem of over-fitting. The most widely used method, and the one we shall employ, makes use of a validation or tuning set of examples. While a hypothesis is being learned using the training set, the hypothesis is periodically checked against the validation set to estimate its true error in predicting future examples. The BP algorithm halts after a minimum validation error is determined and the network weights are restored to that point. The generalization accuracy of the hypothesis is then measured against a separate test set.

The smooth interpolation across training examples is generally seen as a favourable bias of many learning algorithms. The bias generates hypotheses which work well on many real-world problems. The two remaining inductive biases, the rapid learning of simple tasks and the parallel learning constraint, are biases that must be controlled if the positive transfer of knowledge from secondary tasks to a primary task is to occur. Methods of controlling these biases will be discussed in Section 3.3.
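The validation-set procedure can be sketched as a simple early-stopping loop. The `step` and `validation_error` callables below are hypothetical stand-ins for one BP training iteration and the validation check; in the sketch the "weights" are a single number so the behaviour is easy to follow.

```python
def train_with_early_stopping(step, validation_error, max_iters=1000,
                              patience=25):
    """Train while tracking the weights with the lowest validation error,
    and restore (return) them at the end. Training halts once no new
    minimum has been seen for `patience` iterations."""
    best_err, best_weights, since_best = float("inf"), None, 0
    for i in range(max_iters):
        weights = step(i)                 # one training iteration
        err = validation_error(weights)   # periodic validation check
        if err < best_err:
            best_err, best_weights, since_best = err, weights, 0
        else:
            since_best += 1
            if since_best >= patience:    # minimum validation error found
                break
    return best_weights, best_err

# Toy run: validation error falls, then rises (over-fitting sets in at i=30).
w, e = train_with_early_stopping(lambda i: i,
                                 lambda w: abs(w - 30) / 100.0)
```

The returned hypothesis is the one at the validation minimum; its generalization accuracy would then be measured against the separate test set.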

3.1.3 Inductive Bias = Domain Knowledge + Task Relatedness

The task-level learning problem concerns selecting an appropriate hypothesis given a set of training examples. Similarly, we can define meta-level learning as the problem of selecting a set or space of hypotheses given a set of task examples. That is, tasks can be viewed as training examples at a meta-level of induction. Both learning problems involve inductive inference and require an inductive bias to succeed [Thru96]. A task-level inductive bias prefers an hypothesis, h, from a hypothesis space, H = {h1, h2, ...}, based on assumptions about the domain of examples to be modelled. A meta-level inductive bias [Rend87], referred to as a hyper-bias in [Baxt97], prefers a hypothesis space, H, from a set of hypothesis spaces {H1, H2, ...}, based on assumptions about the domain of tasks to be modelled. Clearly, there is a relation between the two levels of inductive bias because the task-level bias will be constrained by the choice of meta-level bias. Should the learner choose an appropriate meta-level bias, the problem of determining a good task-level bias is simplified. Unfortunately, it is unlikely that there exists a uniquely best meta-level bias any more than there is a uniquely best task-level bias. Consequently, it may seem that the problem of choosing task-level inductive bias has simply been replaced with the problem of selecting an appropriate meta-level bias. However, there is the possibility that the higher order bias

is an easier problem. Baxter has demonstrated this to be correct [Baxt95b]. Consider an MTL network with a sufficient number of hidden nodes (internal representation) for learning t tasks from a diverse domain of tasks. By choosing a meta-level bias that selects the most closely related t tasks, a task-level bias can be learned. The task-level inductive bias is manifested as a beneficial internal representation within the common portion of the network induced from training examples of the multiple related tasks. In fact, it is the sharing of internal representation that really defines the relation between the tasks.

Recall from Section 2.2.6 that Baxter defines the environment of the learning system to be modelled by a pair (F, Q), where F is a set of tasks {Tk} and Q is a probability distribution over F. It is Q, the probability of occurrence of a task, that defines the relatedness of tasks in F, because it is inferred that an environment tends to deliver a group of naturally occurring and related tasks more frequently than others. However, it is often the case that the environment delivers two or more groups of tasks that have equal probability of occurring. In this case, tasks from the different groups will frequently occur but will not be related. This leads to the following refinement of the definition of inductive bias for a KBIL system.

Definition: Let K be the domain knowledge retained by a knowledge based inductive learning system L. Let R be the meta-level bias that selects the most related knowledge from domain knowledge for developing a new task T0. Then we can define domain knowledge based inductive bias BD as:

    K ∧ R → BD.

This states that the inductive bias of a KBIL system is produced from the existing domain knowledge and some meta-level bias R that selects the most related knowledge for learning a particular task. The → operator indicates that the expression is not a deductive inference, because the measures of relatedness we use can depend upon the initial conditions of the ANN. R must originate from a source of background knowledge, other than domain knowledge, that the learning system possesses or is given. As we have mentioned above, it is unlikely that there exists a uniquely best meta-level bias. In fact, variation in R is important because it is the means by which a life-long learning system can focus the support from different and potentially beneficial information within domain knowledge.

The problem for a learning system, L, with a knowledge based inductive bias BD becomes one of selecting or constructing a hypothesis, h, based on a set of examples, S, from an instance space X = {xi} for a concept f(xi), such that:

    BD ∧ S ∧ xi ≻ h(xi)   and   e(h(xi), f(xi)) < ε   ∀ xi ∈ X.

This states that the combination of knowledge based inductive bias (produced from domain knowledge and a meta-level inductive bias), the learning system, and the available training examples should inductively infer a concept hypothesis, h, such that the portion of examples of X misclassified by h is less than ε. The inference is not one of entailment because knowledge based inductive bias is recognized as only a portion of all assumptions required to deduce h. The original definition of inductive bias due to [Mitc97] can now be modified to include domain knowledge based inductive bias as follows.

Definition: Let BD be the inductive bias provided by domain knowledge to a knowledge based inductive learning system and let BO be all other assumptions such that, for a set of training examples, S, from an instance space X = {xi} for a concept f(xi):

    (BO ∧ BD) ∧ S ∧ xi ⊢ h(xi)

Preferably, BO and BD have been chosen such that

    (∀ xi ∈ X) e(h(xi), f(xi)) = 0.

3.1.4 Inductive Bias, Internal Representation and Related Tasks

Functional transfer and inductive bias occur in an MTL network because of the pressures of learning several tasks under the constraint that the connection weights of each task are shared. In [Caru97b] it is demonstrated that MTL under the back-propagation algorithm will force the development of representations at the common feature layer that are mutually beneficial to related tasks. A transfer of knowledge takes place through the shared common features. Due to the mathematical complexity of the BP algorithm, the exact nature of this process is an open question. However, to optimize the transfer of knowledge from several secondary tasks to the primary task, the following must be ensured:

- the MTL network should have an appropriate amount of internal representation, and
- the secondary tasks should be as closely related to the primary task as possible.

Appropriate Amount of Internal Representation

The optimal number of hidden nodes for a Single Task Learning (STL) network is a subject that has been investigated on a number of occasions [Lape88, Bich89, Rama92]. It is generally agreed by researchers that the choice of the amount of internal representation for development of an optimal hypothesis depends upon

- the size of the training set (assuming the examples have been fairly sampled), and
- the complexity of the task.

In particular, we are interested in the case where the training examples are impoverished, either because of their small numbers or because they are a non-representative sample. For real-world problems an impoverished training set is often the case because of the high cost of acquiring examples. In accordance with Occam's Razor, it has been traditionally thought that smaller networks that are able to develop hypotheses for a task are better than larger networks. This has led researchers to find rules for determining the optimal number of hidden nodes a network should have for a task. However, as discussed in Section 2.1.2, the true meaning of Occam's Razor is that the hypothesis space in which search occurs should be kept as small as possible. In [Weig94] it is demonstrated that large networks with an abundance of internal representation can be beneficial for developing more accurate hypotheses if a validation set is used to prevent over-fitting. The larger network provides a richer and more diverse hypothesis space, whereas the validation set acts to constrain that space to a smaller but more effective set of hypotheses. In this way the validation set acts as a form of inductive bias as well. Caruana has reported similar results and has found that the approach works equally well for learning related tasks under MTL [Caru97a].

The constraint of hidden node representation is just as important in the case of MTL as it is with STL, but the strategy for doing so is more complex. As shown in Figure 3.5, constrained internal representation under STL, either by network size or through use of a validation set, fosters the development of a hypothesis from a relatively small hypothesis space, HSTL. The result is a "simple" hypothesis that often provides good generalization. Constrained internal representation under MTL forces the selection of a primary hypothesis from a subspace of a much larger and richer hypothesis space.
The amount of internal representation defines the larger hypothesis space, HMTL. The related (R) and unrelated (UR) secondary tasks being learned in parallel with the primary task constrain the effective hypothesis space to HUR+R. If there is insufficient internal representation within an MTL network, induction will be ineffective because interference will produce poor generalization for at least one of the t tasks. Constrained representation within the hidden layer of MTL is advantageous, but only to the point where secondary tasks begin to interfere negatively with the learning of the primary task. Thus, the choice of an appropriate amount of internal representation within an MTL network depends on

- the size of the training sets (assuming the examples have been fairly sampled),
- the complexity of the tasks within the domain, and
- the degree of relatedness between the tasks within the domain.


Figure 3.5: A graphical representation of various hypothesis spaces under STL and MTL networks and an optimal hypothesis, h0, for the primary task, T0. The STL hypothesis space, HSTL, will be constrained by the amount of internal representation provided within the network. Typically, the amount of internal representation is made as small as possible to foster the development of a simple hypothesis that often provides good generalization. It is possible that a better hypothesis exists but requires a more complex representation. Under MTL, with several parallel tasks being learned, the size of the potential hypothesis space, HMTL, is normally much larger because the amount of internal representation is increased to accommodate all tasks. The pressures of the related (R) and unrelated (UR) secondary task training examples provide an important secondary constraint on the available representation such that the effective hypothesis space is reduced to HUR+R. This will often facilitate the search for a more optimal hypothesis, h0. Ideally, all secondary tasks are closely related to the primary task such that the effective hypothesis space is reduced to HR. However, should all secondary tasks be unrelated to the primary task, then the effective hypothesis space may be negatively constrained to HUR, in which case a poor hypothesis may result.

An MTL network needs at least as many hidden nodes as the best single task learning (STL) network needs for the primary task. As a rule of thumb, [Caru97b] suggests that the number of hidden nodes be approximately equal to the sum of the hidden nodes required for the optimal learning of each task under STL. Figure 3.6 shows the mean number of misclassifications made by network hypotheses with various numbers of hidden nodes (from 2 to 256) developed for a primary task suffering from an impoverished training set of only 10 examples (described in Section 4.1). Six other secondary tasks were being learned by the MTL network. A validation set was used to prevent over-fitting to the training data. The necessary amount of internal representation per task is two hidden nodes. Notice that after the network has 16 or more hidden nodes, the mean accuracy of the primary hypothesis improves significantly. The improved accuracy results from a transfer of knowledge occurring from the more related secondary tasks to the primary hypothesis. As the number of hidden nodes increases, the mean number of misclassifications rises and falls but does not reach the same level as when there are fewer than 14 hidden nodes. This agrees with the findings in [Caru97a]. We conclude that an appropriate amount of internal representation for an MTL network is an amount sufficient or greater than that required to develop hypotheses for all tasks to a desired level of accuracy based on the training examples. A separate validation set of examples can be used to prevent over-fitting under such a configuration. A separate test set of examples must be used to estimate the effectiveness or generalization of the hypothesis to future examples.

Related Secondary Tasks

Internal representation that is beneficial to learning a primary task from an impoverished training set can be guaranteed only if the secondary tasks being learned in parallel are closely related, i.e.
the generation of positive inductive bias can be guaranteed to occur only from related tasks that share internal representation. Figure 3.5 provides an idealization of the following argument. If all secondary tasks are unrelated to the primary task, the hypothesis subspace, HUR, that will be searched is unlikely to contain an optimal hypothesis, h0. Conversely, if all secondary tasks are closely related, the probability of finding an optimal hypothesis within the subspace HR is much greater. If there is a mixture of related and unrelated secondary tasks, the hypothesis subspace HUR+R will be larger than either HR or HUR, thus making the search for an optimal hypothesis more difficult. If there are sufficient training examples and internal representation within the MTL ANN for all tasks, unrelated secondary tasks can expedite learning of a primary task because the secondary tasks will reduce the effective hypothesis space available to the primary task. However, if the primary task has insufficient training examples or there is insufficient internal representation for all


Figure 3.6: Mean number (and 95% confidence) of test set misclassifications by the primary task versus the number of hidden nodes within an MTL network. The misclassification statistic is based on 10 trials with a training set of 10 examples, a validation set of 20 examples and an independent test set of 200 examples. A total of 7 tasks were being learned within the MTL network. A complete description of the domain and the primary task is provided in Section 4.1.

tasks, the use of internal representations by unrelated hypotheses will result in a primary hypothesis of poor generalization performance. Given that the distribution of tasks in the domain is potentially diverse, a framework for ensuring the transfer of knowledge from more related tasks to the primary task is essential. The following section discusses such a framework in the context of back-propagation MTL networks.
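The MTL architecture discussed in this section, a single shared feature layer feeding one output node per task, can be sketched as follows. This is an illustrative toy, not the thesis software: the two tasks (logical AND and OR), the 2-3-2 configuration with bias units, the iteration count, and the learning rate are assumptions chosen so the example runs quickly.

```python
import numpy as np

# Minimal MTL back-propagation sketch: all tasks share one hidden layer,
# so their errors jointly shape the common internal representation.
# Tasks (AND, OR) and all sizes are assumptions for the example.

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
bias = lambda A: np.hstack([A, np.ones((A.shape[0], 1))])  # append bias input

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0, 0], [0, 1], [0, 1], [1, 1]], dtype=float)  # columns: AND, OR

W1 = rng.normal(scale=0.5, size=(3, 3))   # (2 inputs + bias) -> 3 shared hidden
W2 = rng.normal(scale=0.5, size=(4, 2))   # (3 hidden + bias) -> 2 task outputs
eta = 0.5

for _ in range(5000):
    H = sigmoid(bias(X) @ W1)                  # shared feature layer
    O = sigmoid(bias(H) @ W2)                  # one output node per task
    d_out = O - Y                              # cross-entropy output deltas
    d_hid = (d_out @ W2[:-1].T) * H * (1 - H)  # hidden deltas mix both tasks
    W2 -= eta * bias(H).T @ d_out
    W1 -= eta * bias(X).T @ d_hid

preds = (sigmoid(bias(sigmoid(bias(X) @ W1)) @ W2) > 0.5).astype(int)
```

Because the hidden deltas sum the back-propagated error of every output, each task's training examples exert pressure on the representation used by all the others, which is the mechanism of functional transfer described above.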

3.1.5 Alternative Strategies for MTL Based Task Knowledge Transfer

One can select from several alternative strategies when developing a sequential learning system that relies on MTL based task knowledge transfer. Each of these strategies differs largely along the following two lines:

- the choice of representational and/or functional knowledge transfer, and
- the influence of secondary tasks from domain knowledge.

Table 3.1 summarizes four alternative knowledge transfer strategies for an MTL based sequential learning system.

A. Related secondary tasks are manually selected from domain knowledge in the form of functional training examples. The influence of each task on primary task learning is fixed. An MTL network of appropriate internal representation is configured. Learning starts from a random initial representation (small random weights) and the transfer of knowledge is functional during learning.

B. All secondary tasks from the domain are selected from domain knowledge in the form of functional training examples. An MTL network of appropriate internal representation is configured. The influence of each secondary task on primary task learning is allowed to vary in accord with a measure of relatedness between the tasks. Learning starts with a random initial representation (small random weights) and the transfer of knowledge is functional during learning.

C. Related secondary tasks are manually selected from domain knowledge in the form of a previously learned MTL network representation as well as functional training examples. An output node is added for the new task. Also, new hidden nodes may be added. Learning starts from the previously learned MTL representation, providing representational transfer. The influence of each task on primary task learning is fixed. As training proceeds there may be functional transfer from the training examples of the secondary tasks.


D. All secondary tasks from the domain are selected from domain knowledge in the form of a previously learned MTL network representation as well as functional training examples. An output node is added for the new task. Also, new hidden nodes may be added. Learning starts from the previously learned MTL representation, providing representational transfer. The influence of each secondary task on primary task learning is allowed to vary in accord with a measure of relatedness between the tasks. As training proceeds there may be functional transfer from the training examples of the secondary tasks.

                                        Secondary Task Influence
    Form of Transfer                    Constant        Variable
    Functional                          A: MTL          B: ηMTL
    Representational and Functional     C: -open-       D: -open-

Table 3.1: Matrix of alternative strategies for MTL based task knowledge transfer. The table shows four different possibilities that are a function of the form of transfer and the influence of secondary tasks.

Alternative A is standard MTL. If the tasks are closely related and the amount of internal representation is appropriate, a positive transfer of knowledge will occur within an MTL network from the secondary tasks to the primary task. However, from a sequential learning perspective, the MTL methodology has a major deficiency under a diverse task domain. There is no method to automatically select the most appropriate secondary source tasks. This is important to a life-long learning agent working with a diverse domain of tasks. Currently, the selection is done manually; that is, the relatedness of the source tasks to the primary task is decided off-line to the learning algorithm. The complexity of this subjective selection process increases with the number of previously learned tasks. If a poor selection has been made, the common features generated during parallel learning may favor several unrelated tasks. Thus, a negative transfer effect will take place (increased training times and decreased generalization accuracy). The standard MTL algorithm has no method of escaping the pressures of unrelated tasks.

The remaining alternatives of MTL based knowledge transfer are areas for new research. Alternative B, which is labelled ηMTL in Table 3.1, is the subject of our research. The

alternative considers allowing the influence from each of the secondary tasks to vary in accord with a measure of relatedness between a secondary task and the primary task. The intention is to provide sufficient internal representation for developing a hypothesis for the primary task, but not necessarily enough for developing hypotheses for all tasks within domain knowledge. Related research in this area has been conducted by [Mitc93, Naik93, Thru95b]. In the case of Mitchell and Naik, the standard weight update equations are affected by the functional transfer of historic training information (most commonly the learning rate or gradient of the error surface). This results in a positive inductive bias that affects the development of the neural network hypothesis for the primary task. These methods necessarily depend upon historical information that comes from closely related tasks. They are off-line methods of functional transfer. We are interested in an on-line method where the related tasks are being learned (or re-learned) at the same time as the primary task. [Thru95b] considers task relatedness and knowledge transfer in the context of a memory based K-nearest neighbour learning system. The system estimates the degree of relatedness between a primary task and clusters of previously learned tasks based on which cluster's distance metric produces the best preliminary nearest neighbour model. This related material will be covered in greater detail in Chapter 7.

Alternatives C and D, involving the transfer of both representational and functional knowledge, are open areas of research that are currently under investigation by us as well as [Robi96a] and [OSul97]. The approach can be considered an extension of the cascade correlation algorithm [Fahl90] to learning multiple tasks. A single MTL network is augmented with additional representation as each new task is learned. This strategy provides immediate representational transfer of previously learned knowledge.
If the new task is related to one or more of the previously learned tasks, this makes for efficient learning. Training examples for the secondary tasks serve two purposes: they encourage the hypotheses of the secondary tasks to remain accurate, and they provide a source of functional transfer to the new task. The formation of new internal representation for the new task will be influenced by the back-propagated error for the secondary examples as well as by the error generated by the new task examples. Alternative C has the influence of the secondary tasks fixed at the same level as the primary target task. Alternative D allows the influence of the secondary tasks to vary based on some measure of relatedness, as per alternative B. The possibility of using these strategies for domain knowledge consolidation will be discussed in Chapter 7.


3.2 A Theory of Selective Functional Transfer

The above background material has led to a theory of selective functional transfer in the context of back-propagation ANNs.

Definition: A virtual example is a form of functional knowledge that consists of a set of input attributes and an output target value generated from previously learned task representations stored in domain knowledge.
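A minimal sketch of this definition follows. The stored task representation here is a stand-in callable (an assumption for the example); in the thesis it would be a previously learned ANN retained in domain knowledge.

```python
import random

# A "virtual example" pairs sampled input attributes with the output target
# produced by a previously learned task representation. The stored model
# below is a stand-in rule, not an actual thesis hypothesis.

def stored_task_model(x1, x2):
    """Stand-in for a previously learned hypothesis held in domain knowledge."""
    return 1.0 if x1 + x2 > 1.0 else 0.0

def generate_virtual_examples(model, n, rng):
    """Sample input attributes and label them with the stored model's output."""
    virtual = []
    for _ in range(n):
        x = (rng.random(), rng.random())   # sampled input attributes
        virtual.append((x, model(*x)))     # output target from the stored model
    return virtual

rng = random.Random(0)
examples = generate_virtual_examples(stored_task_model, 50, rng)
```

The inputs here are randomly sampled for simplicity; under task rehearsal (Section 3.6) the input attributes would instead come from the training examples of the new primary task.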

Theory:

Knowledge can be selectively transferred at a purely functional level within a neural network by learning a new task in parallel with other tasks from the same domain (as per MTL). To ensure a positive transfer of knowledge, a neural network provides an environment in which a measure of relatedness R can be employed to select and utilize virtual examples from the most related domain knowledge tasks during the learning process. In this way virtual examples serve as an emergent form of knowledge-based inductive bias that constrains the effective hypothesis space of the learning system.

The validity of this theory depends on satisfying the requirements of the following assumptions:

1. Functional task knowledge can be embodied in the form of virtual examples of previously learned tasks. There exists a method of sequentially retaining functional task knowledge and regenerating it for the purposes of transfer when learning a new task.

2. A back-propagation ANN provides a framework for employing an arbitrary measure of task relatedness R for each secondary task. The measure of relatedness can be considered a method of indexing into functional task knowledge (domain knowledge).

3. A measure of relatedness can be chosen to promote a positive transfer of knowledge (positive inductive bias) that contributes to the efficient and effective learning of the primary task. Efficiency and effectiveness are defined in Section 2.4.

The remainder of this chapter addresses each of these requirements and describes a functional transfer prototype system. Section 3.3 develops the mathematical framework for employing a measure of relatedness within a back-propagation neural network, called ηMTL. Section 3.4 presents a theoretical basis for task relatedness and formulates the criteria for a measure of relatedness between two tasks. Section 3.5 introduces several alternative measures of relatedness which will be empirically tested in the next chapter. Section 3.6

describes the task rehearsal method of sequential learning and discusses the generation and use of virtual examples for functional knowledge transfer. Finally, in Section 3.7, the experimental prototype software for ηMTL and the task rehearsal method is described.

3.3 A Framework for a Measure of Task Relatedness

3.3.1 Hints to a Framework for Employing Task Relatedness

Y. S. Abu-Mostafa has systematically researched the use of hints in inductive learning [AM93]. Hints are defined as properties of the primary task or the task domain that are known to be true although independent of the primary training examples, such as monotonicity or symmetry of the output with respect to the inputs. Hints are expressed as virtual examples that are used to train a single task network. In fact, Abu-Mostafa considers primary task training examples to be just another form of hint. All hint task examples must necessarily provide helpful information as to the nature of the primary task, i.e. they are all closely related to the primary task. By minimizing the error across all of the hint examples, a more accurate hypothesis for the primary task will be developed. Clearly, the use of hints is a form of knowledge based inductive learning. Abu-Mostafa formally develops the theory of a new VC Dimension based on the use of hints [AM95]. He shows that the sample complexity for the primary task can be reduced through the use of hint examples. This is in direct agreement with Baxter's work on internal representation that was discussed in Chapter 2.

In [AM95], Abu-Mostafa develops the mathematics for learning a primary task from several related hint task examples. Abu-Mostafa defines an estimate of the true error for the primary task to be:

    Ê = f(E0, E1, ..., Ek, ..., Et)

where Ek is the error on the training examples for hint tasks k = 1, ..., t and E0 is the error on the primary task training examples. Assuming that all hint task examples contribute positively to the learning of the primary task yet vary in their importance, the estimate of the true error for the primary task becomes one of balancing the individual secondary errors. Abu-Mostafa considers a network with a single output node that must conform to all hint examples.
We apply Abu-Mostafa's mathematics to an MTL network with multiple output nodes where there is a node for each secondary hint task. Subsequently, the secondary task examples act as hints for training the common feature layer shared by all tasks of an MTL network. Furthermore, we propose that virtual examples for each secondary task be generated from the training examples for the primary task through a method called task

rehearsal [Silv98] that will be discussed in Section 3.6. For the purposes of the following discussion, it is not important to know the source of examples for the secondary hint tasks, only that those examples are available and accurate.

If we consider Ê to be the objective function to be minimized by the BP algorithm across all task outputs, the gradient descent at an output node k is given by

    Δwjk = −η (∂Ê/∂Ek)(∂Ek/∂wjk)

where η is the learning rate. ∂Ê/∂Ek, the rate of change of the true error estimate for the primary task with respect to the rate of change of the error for task k, can be considered the weight of importance of hint task k relative to the other secondary tasks for learning the primary task. By selecting the right weights of importance for each secondary task, a positive bias is provided to the inductive learning process which can contribute to the development of a more accurate primary task hypothesis. We define this weight of importance to be the measure of relatedness, Rk, between the primary and each of the secondary tasks; that is,

    Rk = ∂Ê/∂Ek   such that   Δwjk = −Rk η (∂Ek/∂wjk).

Abu-Mostafa recognizes that the selection of appropriate values for the Rk's is crucial to weighting the contribution each Ek makes to the true error estimate, and he explores methods of determining appropriate values for each Ek. He suggests the use of either static exact weight values or dynamic variable sequential values. He presents a clever algorithm called the adaptive minimization schedule that achieves the latter by balancing the information contained within the different hint examples, relating the performance on each set of hint examples to the overall error estimate, Ê. The hint that contributes the greatest to Ê gets the most attention. Thus, Abu-Mostafa's adaptive minimization schedule relies on all of the hint tasks being strongly related to the primary task. The result is a positive bias that occurs from the secondary hint tasks to the primary task. Abu-Mostafa does not suggest a general theory for measuring the relative importance, or relatedness, of each hint task to the primary task. Neither does he discuss situations where a hint has been improperly chosen and is unrelated to the primary task (thus delivering negative bias).

3.3.2 From Framework to Functional Transfer

As above, we define an estimate of the true error for the primary task to be:

    Ê = f(E0, E1, ..., Ek, ..., Et).

Let Ê be the objective function to be minimized across the t outputs of an MTL network. The true nature of Ê is unknown. However, from the above equations, it is evident that

    Ê = ∫_{k=0}^{t} Rk ∂Ek

which suggests a simple additive function:

    Ê = R0 E0 + R1 E1 + ... + Rk Ek + ... + Rt Et.

Thus, an appropriate measure of relatedness, Rk, for a secondary source task, Tk, must regulate the impact of the task error, Ek, on the formation of shared internal representation. Because a gradient descent search method will be employed (via the BP algorithm), the value of Ê should be minimal when all Rk Ek are at their lowest levels. One of two approaches can be taken:

- All Rk are set to the same value, sufficient internal representation is provided to learn all t tasks, and all Ek are minimized to sufficiently small values. To learn a primary task with an impoverished training set under this approach, the vast majority of secondary tasks must be strongly related to the primary task.

- R0 is set to some maximum value and each Rk is set or allowed to vary in accord with the degree of relatedness the secondary task, Tk, has to the primary task. The amount of internal representation is sufficient for learning at least some subset of all t tasks. Under this scenario, the majority of secondary tasks do not need to be related to the primary task.
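The second approach's weighted objective can be computed directly from the additive function above; all numeric values below are illustrative.

```python
# Direct computation of the additive error estimate: each task's training
# error E_k contributes in proportion to its relatedness R_k to the primary
# task. The error and relatedness values here are illustrative assumptions.

def error_estimate(errors, relatedness):
    """E_hat = sum over k of R_k * E_k, for the primary (k = 0) and secondary tasks."""
    return sum(R * E for R, E in zip(relatedness, errors))

E = [0.10, 0.20, 0.40, 0.30]   # per-task training errors E_0 .. E_3
R = [1.00, 0.90, 0.10, 0.00]   # R_0 = 1; an unrelated task gets R_k near 0

E_hat = error_estimate(E, R)   # 0.10 + 0.18 + 0.04 + 0.00 = 0.32
```

With R = [1, 1, 1, 1] the estimate reduces to the standard MTL objective over all task errors, while driving secondary Rk toward 0 recovers single task learning on E0 alone.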

Standard MTL is the first approach. The problems with this method have been discussed in Section 3.1.5. This thesis is concerned with the second approach, where there is a diverse set of secondary tasks, some of which may be far less related to the primary task than others. This section presents a framework for employing various measures of relatedness Rk. The framework requires solutions to problems created by two of the inherent inductive biases of an MTL network: the parallel learning constraint and the rapid learning of simple tasks. These will be discussed immediately below. Section 3.4 considers the more difficult issue regarding the selection of an appropriate measure of relatedness, Rk.

Relaxing the MTL Parallel Learning Constraint.

Being the common learning rate across all outputs employed by the back-propagation algorithm, η is normally brought outside the summations of the weight update equation. However, this does not have to be the case. A separate learning rate, ηk, for each output node k can be considered and kept inside the backward propagated error signal δk. Accordingly, a notational modification is in

order. The weights, wjk, affecting an output node k are adjusted according to the equation Δwjk = −ηk (∂Ek/∂wjk) = δk oj, where oj is as described in Section 2.3; however, δk becomes

    δk = ηk (tk − ok)

under the cross-entropy cost function, with ηk being the learning rate parameter specific to output node k. Lower layer weights, wij, affecting any hidden node j are then modified as per the following: Δwij = δj oi, where δj is given as before by

    δj = oj (1 − oj) Σk δk wjk.

Thus, by varying k it is possible to adjust the amount of weight modication associated with any one output of the network1. A separate learning rate, k , provides an opportunity to relax the parallel learning constraint of the MTL network paradigm in accord with a measure of relatedness. The k can be used as a static or dynamic control mechanism over the level of inuence that each parallel task exerts during the induction process. Consider T0 as our primary task of interest in an MTL network where we are uncertain of its relatedness to all of the other parallel tasks T1 : : : Tk  : : : Tt. We require a method of tuning the learning rate k for each parallel task, such that k reects a base value as well as a measure of the relatedness, Rk , between Tk and T0. Formally: k

= f (  Rk):

Through ηk, the relatedness measure should work to tailor the inductive bias from each of the developing parallel tasks so that the most related tasks will have the largest influence on weight modifications. This modified version of the standard back-propagation learning algorithm for MTL will be referred to as the ηMTL method. The success of ηMTL rests on a judicious choice for the function ηk = f(η, Rk) and the ability to measure the relatedness of the parallel tasks to the primary task. Let the learning rate η0 for the primary task, T0, be the full value of the base learning rate η, that is, η0 = η. Then, as per Abu-Mostafa's weight of importance, let the learning rate for any parallel task, Tk, be defined as:

ηk = f(η, Rk) = Rk η,

such that

ηk (∂Ek/∂wjk) = Rk η (∂Ek/∂wjk),

¹The use of an adaptive or separate learning rate at the node or weight level is not a new concept. It has been used for various purposes by other authors, such as [Jaco88, Naik92, Vogl88].

where 0 ≤ Rk ≤ 1 for all k = 1, …, t, thereby constraining the learning rate for any parallel task to be at most η. Notice that if Rk = 1 for all k = 0, …, t, we have MTL learning as per [Baxt95b, Caru95]. Alternatively, if R0 = 1 and Rk = 0 for all k = 1, …, t, we have standard single task learning (STL) for the primary function. In this way, the ηMTL framework generalizes both STL and MTL neural network learning.

The Problem of Rapid Learning. One of the inherent biases of the back-propagation learning algorithm is the rapid learning of tasks which can be developed most easily by the representational language of the network's connection weights. This includes tasks which are functionally less complex than others, such as linearly separable functions versus non-linearly separable functions. The negative result is that a simple but unrelated task can "pull" the common representation of the MTL network toward a sub-optimal area of weight space during the early iterations of training. To discourage this bias, the learning rate ηk must be dampened to prevent the hypothesis for a simple task, Tk, from being learned more quickly than the primary hypothesis [Silv97b]. We propose two methods of approaching this problem. Both are based on detecting when the error of the secondary hypothesis, Ek, drops below the error of the primary hypothesis, E0. The two methods are:
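As a concrete illustration of the relaxed constraint, the following numpy sketch trains a tiny MTL-style network with a separate learning rate per output, ηk = Rk η, kept inside the output error signal. It is an illustration only; the network size, training example, and Rk values are invented for the example, and this is not the thesis implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 2, 4, 3                # output 0 = primary task T0
W1 = rng.normal(scale=0.5, size=(n_in, n_hid))
W2 = rng.normal(scale=0.5, size=(n_hid, n_out))

eta = 0.1                                   # base learning rate
R = np.array([1.0, 0.8, 0.1])               # R_0 = 1; R_k in [0, 1] for secondaries
eta_k = R * eta                             # eta_k = f(eta, R_k) = R_k * eta

def train_step(x, t):
    """One update with eta_k kept inside the output error signal delta_k."""
    global W1, W2
    h = sigmoid(x @ W1)                     # common feature layer
    o = sigmoid(h @ W2)                     # one output per task
    delta_k = eta_k * (t - o)               # delta_k = eta_k * (t_k - o_k)
    delta_j = h * (1 - h) * (delta_k @ W2.T)
    W2 += np.outer(h, delta_k)              # delta w_jk = delta_k * o_j
    W1 += np.outer(x, delta_j)              # delta w_ij = delta_j * o_i
    return o

x = np.array([0.2, 0.9])
t = np.array([1.0, 0.0, 1.0])
for _ in range(500):
    o = train_step(x, t)
# Output 0 (R = 1.0) approaches its target far faster than output 2 (R = 0.1).
```

With R = [1, 0, 0] this sketch reduces to single task learning for T0, and with all Rk = 1 it is standard MTL, mirroring the generalization claim above.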

Continuous: If Ek < E0, then ηk = ηk (Ek/E0)^d, where d is typically set to 2. This reduces the learning rate in accord with the ratio of the errors whenever a secondary task trains more rapidly than the primary task; and

Intermittent: If Ek < E0, then ηk = 0, which temporarily halts the back-propagation of error for a secondary task which trains more quickly than the primary task.

The continuous method has been found to be the better approach, as it provides a smooth variation in the ηk value over the training period and an opportunity for all secondary tasks to provide some degree of inductive bias to the primary task. Furthermore, dampening can be tailored to the domain by adjusting the d parameter. A value of d = 0 turns off dampening, whereas a value of d ≥ 2 emphasizes those differences in training error that are greatest. One should be aware that dampening can often increase the training time of a network, because the formation of related task hypotheses will be slowed as well as those of unrelated tasks.

Combining Rk and ηk Dampening. We prefer to keep the issue of ηk dampening separate from that of the measure of relatedness. However, in reality, the dynamics of the back-propagation algorithm bring the two factors together. Formally:

ηk = dampening(f(η, Rk)).
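The two dampening rules and their combination with Rk can be sketched as a single function. This is a minimal sketch; the function name and calling convention are ours, not the thesis's:

```python
def damped_eta(eta, R_k, E_k, E_0, d=2, method="continuous"):
    """Sketch of eta_k = dampening(f(eta, R_k)); names and signature are ours.

    Dampening applies only when the secondary error E_k has dropped below
    the primary error E_0, i.e. the secondary task is training faster."""
    eta_k = R_k * eta                        # f(eta, R_k) = R_k * eta
    if E_k >= E_0:
        return eta_k                         # secondary not ahead: no dampening
    if method == "intermittent":
        return 0.0                           # halt the secondary task for now
    return eta_k * (E_k / E_0) ** d          # continuous: scale by error ratio

print(damped_eta(0.1, 0.8, E_k=0.2, E_0=0.4))                         # dampened
print(damped_eta(0.1, 0.8, E_k=0.2, E_0=0.4, method="intermittent"))  # 0.0
print(damped_eta(0.1, 0.8, E_k=0.2, E_0=0.4, d=0))                    # d = 0: off
```

Note how d = 0 makes the error-ratio factor equal to 1, turning dampening off, as stated above.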

Figure 3.7 shows the percentage of misclassifications produced by hypotheses developed by two ηMTL network configurations used on the Band Domain (described in Section 4.1) when the primary task has an abundance of training examples (45). The value of Rk for the 6 secondary tasks is varied from 1 toward 0. The continuous dampening method is employed with d = 2. Graph (a) shows the results of a network with one layer of 14 hidden nodes, whereas (b) shows the results of a network with just 2 hidden nodes. In both graphs, the percent of errors declines as the Rk value decreases, thus allowing a more effective primary hypothesis to develop. Eventually, as the Rk values for the secondary tasks drop toward zero, the primary task can be learned. The 2 hidden node network is the more sensitive, because its common feature layer is too highly constrained to represent even moderately accurate hypotheses for all 7 tasks simultaneously.

3.4 The Nature of Task Relatedness

Critical to the transfer of knowledge from a pool of source tasks to a primary task is some measure of relatedness between those tasks [Thru96, Caru97b]. Consider the situation in which the learner has a diverse set of tasks held within domain knowledge, where some of the tasks are far less related to the primary task than others. Also, consider that the learner has some fixed amount of internal representation that is sufficient for learning the primary task examples, but not necessarily sufficient for learning all secondary tasks in domain knowledge. In this context a meta-level bias is required, because interference from less related tasks will prevent the development of an effective hypothesis for the primary task. Some a priori measure of relatedness must be chosen by the learner to assist in the best use of domain knowledge source tasks. The fixed amount of internal representation and the measure of relatedness must work together to create an appropriate task-level inductive bias that will in turn develop an accurate hypothesis, h0. The internal representation establishes the available hypothesis space of the MTL system, whereas the measure of relatedness must select those secondary tasks most beneficial to the development of h0. More related secondary tasks will constrain the available MTL hypothesis space to a sub-space of more effective hypotheses. This brings us to a general definition of task relatedness in the context of functional transfer.

Definition: Let Tk be a secondary task and T0 a primary task of the same domain, with training examples Sk and S0, respectively. The relatedness of Tk with respect to T0 in the

[Figure: two panels plotting percent misclassifications versus Rk value: (a) ηMTL with 14 hidden nodes; (b) ηMTL with 2 hidden nodes.]

Figure 3.7: Percent of test set misclassifications by primary task hypotheses versus the value of Rk used for all secondary tasks, under two different ηMTL network configurations. The misclassification statistic is the mean over 3 trials using a test set of 200 examples. A total of 7 tasks were being learned within each ηMTL network. All tasks, primary and secondary, have 50 training examples and 20 test examples. See the Band Domain experiment section for a complete description of the domain.

context of learning system L, that uses functional knowledge transfer, is the utility of using Sk along with S0 toward the efficient development of an effective hypothesis for T0.

Therefore, the relatedness of Tk can be expressed as a function of the efficiency and effectiveness of using Sk to develop a hypothesis for T0, as compared with the training examples of all other secondary tasks. In this way all secondary tasks can be partially ordered from most related to least related, such that the most related secondary task results in the most effective primary hypothesis, h0, developed in the shortest period of time. There are two dimensions to this graded sense of relatedness: effectiveness and efficiency. We consider effectiveness to be the more important dimension. However, if two secondary tasks assist in developing equally effective hypotheses for T0, the secondary task associated with the most rapidly developed hypothesis should be considered the more related.

The above definition suggests a brute force approach to determining the relatedness between two tasks. The learning system could learn T0 along with every other secondary task and record, for example, the effectiveness of each hypothesis. The secondary tasks could then be graded as to their degree of relatedness. The time complexity of this approach would grow linearly in the number of tasks. If groups of several secondary tasks are considered, then the method becomes impractical, because the time complexity grows with the number of combinations of secondary tasks. This leads to the question: What characteristics of the task examples, or of the hypotheses that develop from these examples, can be used to measure the relatedness of secondary tasks a priori or during the learning process? The remainder of this section explores this issue.

The relatedness between objects is a difficult subject with philosophical ties to the concepts of similarity, analogy, and metaphor. These subjects have been debated for at least two thousand years.
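The brute force procedure can be sketched as follows. This is a hypothetical illustration: `train_and_eval` stands in for actually training an MTL network on each task pair, which is what makes the real procedure expensive; here a cheap target-agreement score is substituted so the sketch runs:

```python
def grade_secondary_tasks(primary, secondaries, train_and_eval):
    """Brute-force grading sketch: learn the primary task once with each
    secondary task and rank the secondaries by the resulting score.

    `train_and_eval(primary, secondary)` is a hypothetical callable that
    would train an MTL network on the pair and return test effectiveness;
    time complexity is linear in the number of secondary tasks."""
    scores = {name: train_and_eval(primary, s) for name, s in secondaries.items()}
    return sorted(scores, key=scores.get, reverse=True)   # most related first

# Toy stand-in for training: score is the fraction of matching target values.
primary = [1, 0, 1, 1, 0]
secondaries = {"T1": [1, 0, 1, 0, 0], "T2": [0, 1, 0, 0, 1], "T3": [1, 0, 1, 1, 0]}
agree = lambda p, s: sum(a == b for a, b in zip(p, s)) / len(p)
print(grade_secondary_tasks(primary, secondaries, agree))   # ['T3', 'T1', 'T2']
```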
However, in the context of machine learning, researchers agree that there are at least three factors involved in a measure of task relatedness:

1. The nature of the task domain;
2. The representational language of the learning system; and
3. The learning (search) method used.

The true nature of the domain is unknown; however, the distribution of tasks will adhere to the probability model (ℱ, Q) as per [Baxt95b]. The representational language of choice is the W connection weights of a 3-layer feed-forward neural network of fixed architecture, similar to that of Figure 3.1. The representation of a hypothesis can be characterized as a point within a W-dimensional weight space. The back-propagation of error algorithm, which employs a gradient descent minimization method, will search for a hypothesis by iteratively adjusting the weights of the neural network. Gradient descent is effectively a strategy for moving through weight space to a point of minimum training or validation error. This section explores the nature of relatedness between tasks in the context of the above.

3.4.1 Relatedness Expressed as a Distance Metric

Consider, as per Figure 3.8, a set of functions ℱ that map from some x ∈ X to some y ∈ Y. T0 can be said to be related to any Tk to some lesser or greater degree depending upon a function d(T0, Tk) that measures the difference between characteristics of T0 and Tk in the context of all other tasks within ℱ. This requires a good choice of characteristics for T0 and Tk, and that d be a metric or pseudo-metric over those characteristics.

[Figure: a function space ℱ over X → Y, showing tasks T0, Tl and Tk, with the distance d between T0 and Tk.]

Figure 3.8: A function space showing the proximity of a primary task T0 to secondary tasks Tl and Tk. Tl is considered more related to T0 because the two functions are closer to each other in accord with a metric or pseudo-metric d(T0, Tl).

Formally, a metric is a function d : ℱ × ℱ → ℝ over a space of tasks, ℱ, satisfying, for some choice of characteristic, for all T0, Tl, Tk ∈ ℱ:

• Self identity: d(T0, Tk) = 0 if and only if T0 = Tk (for a metric) or, less strongly, d(Tk, Tk) = 0 (for a pseudo-metric).

• Symmetry: d(T0, Tk) = d(Tk, T0).

• Triangle inequality: d(T0, Tl) + d(Tl, Tk) ≥ d(T0, Tk).

Thus, a pseudo-metric is a weaker measure that accepts the fact that not all differences between the tasks are measurable by the function d. If an appropriate choice is made for the characteristic(s) of the tasks, hypotheses that are proximal to one another in accord with d(T0, Tk) will make use of similar task-level inductive bias [Thru95a]. For example, Figure 3.8 shows three tasks, T0, Tl and Tk, within a function space (ℱ, d). In accord with the measure d, T0 is closer to, and therefore more related to, Tl than it is to Tk. Notice that the proximity of Tl to T0 does not necessarily translate into their respective outputs being close to the same value. An input of x yields T0(x) = y0, which is closer to Tk(x) = yk than it is to Tl(x) = yl. In fact, two tasks that are functionally opposite to each other can share the exact same common features within an MTL network and, therefore, can be considered closely related.
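For a candidate measure, the three axioms can be checked numerically. The sketch below uses a hypothetical output-disagreement pseudo-metric over three toy boolean tasks; note that, as the text observes, such an output-level d rates a functionally opposite task as maximally distant even though the two may share common internal features:

```python
import itertools

# Three toy tasks over a 2-bit input space (truth tables); Tk = NOT T0.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
T0 = {x: x[0] ^ x[1] for x in X}          # XOR
Tl = {x: x[0] | x[1] for x in X}          # OR (agrees with XOR on 3 of 4 inputs)
Tk = {x: 1 - (x[0] ^ x[1]) for x in X}    # NOT XOR (functional opposite)

def d(Ta, Tb):
    """Hypothetical pseudo-metric: proportion of inputs where outputs differ."""
    return sum(Ta[x] != Tb[x] for x in X) / len(X)

tasks = [T0, Tl, Tk]
for Ta, Tb, Tc in itertools.product(tasks, repeat=3):
    assert d(Ta, Ta) == 0                       # self identity (pseudo-metric)
    assert d(Ta, Tb) == d(Tb, Ta)               # symmetry
    assert d(Ta, Tb) + d(Tb, Tc) >= d(Ta, Tc)   # triangle inequality

print(d(T0, Tl), d(T0, Tk))   # Tl is nearer to T0 than Tk under this d
```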

3.4.2 Relatedness as Similarity

Generally, relatedness can be equated to the concept of similarity. It has long been understood that people solve problems more easily if they have prior experience with a similar problem. In [Vosn89] and [Robi96b] the review of a number of seminal articles makes a distinction between the nature and role of two important forms of similarity:

Surface similarity can be defined as shallow, easily perceived, external similarity. In the context of MTL networks and task relatedness, surface similarity can be redefined as a measure of the external functional similarity between two tasks. Surface similarity can be derived a priori from the training examples available for each of the tasks. Various mathematical measures of similarity might be proposed, such as the degree of fit of respective output values. Meta-level knowledge must be used to select the most appropriate measure of surface similarity. One possibility is the use of a measure that has been successful on a "similar" domain of tasks. Measuring the relatedness of tasks in terms of surface similarity can be approached statistically. The most natural measure of functional similarity between two tasks T0 and Tk is to compute the deviation between the tasks as follows:

d(T0, Tk) = ∫X ρ(T0(x), Tk(x)) dP(x),

where P is the probability distribution of x ∈ X and ρ is some metric or pseudo-metric on the output values of T0 and Tk. The measure d(T0, Tk) expresses the expected difference between T0 and Tk on an example x drawn from X. Two common metrics used for ρ are the linear coefficient of correlation and the Hamming distance. Both of these will be discussed as potential measures of relatedness in Section 3.5.
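A sample-based estimate of this deviation, together with the two metrics just mentioned, can be sketched as follows (the paired target values are invented for the example):

```python
from statistics import mean

# Hypothetical paired target values for T0 and Tk on the same m examples.
t0 = [1, 0, 1, 1, 0, 1, 0, 0]
tk = [1, 0, 0, 1, 0, 1, 1, 0]

# Sample estimate of d(T0, Tk) = E[rho(T0(x), Tk(x))] with rho = |a - b|:
d_hat = mean(abs(a - b) for a, b in zip(t0, tk))

# The two metrics mentioned in the text, on the same binary targets:
hamming = sum(a != b for a, b in zip(t0, tk))

def corr(u, v):
    """Linear coefficient of correlation between paired target sequences."""
    mu, mv = mean(u), mean(v)
    cov = mean((a - mu) * (b - mv) for a, b in zip(u, v))
    su = mean((a - mu) ** 2 for a in u) ** 0.5
    sv = mean((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

print(d_hat, hamming, round(corr(t0, tk), 3))
```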

Structural similarity can be defined as deep, often complex, internal feature similarity. In the context of MTL networks and task relatedness, we define structural similarity as the degree to which two developed or developing hypotheses utilize their shared internal representation (particularly, the common feature layer) to produce accurate approximations of tasks. Structural similarity is not directly dependent on the training examples used to create the hypotheses. Measuring the relatedness of tasks in terms of structural similarity is analogous to measuring how two functions make common use of the components of a Fourier series. Figure 3.9 shows how a set of four Fourier series components can be superposed to create three different functions. Function f1 utilizes components c1, c2 and c4, whereas f2 is composed of c1, c3 and c4, and finally f3 uses only components c2 and c3. Functions f1 and f2 can be said to be related most strongly because they make common use of two components, c1 and c4. Visually, the surface similarity of f1 and f2 is most evident in the number of inflexions in each of their curves as compared to f3.

Figure 3.10 demonstrates how three different hypotheses can make analogous use of the common internal representation of an MTL network. The simple network, reminiscent of Figure 3.1, consists of 1 input, x, one layer of 4 hidden nodes, and 3 output nodes. Each of the four hidden nodes produces a feature value, ci, that is used to generate an output for each hypothesis in accord with the formula

hj(x) = 1 / (1 + e^(−J)), where J = Σi ci wij (summing over the four hidden nodes).

Hypothesis h1 utilizes hidden node features c1, c2, and c4 (the weights from hidden node 3 are zero, whereas the weights from hidden nodes 1, 2 and 4 are arbitrary values). Hypothesis h2 is composed from features c1, c3, and c4, and finally h3 uses only features c2 and c3. Thus, h1 and h2 can be said to be most strongly related by their use of two common features, c1 and c4. The surface similarity between h1 and h2 is not evident; in fact, the two hypotheses seem to be near opposites.

Cognitive researchers believe that knowledge transfer begins by using surface similarity as the preliminary search key into domain knowledge [Robi96b]. Surface similarity selects the most analogous source tasks for transfer of related knowledge. Structural similarity is subsequently used to scrutinize the selected tasks at a finer level of detail, at a level where the internal "hidden" representation and features of the tasks are compared. For this reason, most researchers agree that the essence of task transfer occurs as a measure of structural similarity.
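The weight-sharing pattern of Figure 3.10 can be sketched directly. The feature shapes and weight values below are invented for illustration, and the structural overlap is counted as the number of features given nonzero weight by both hypotheses:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Four hypothetical hidden-node features c_i(x) shared by all outputs.
features = [
    lambda x: sigmoid(3 * x),        # c_1
    lambda x: sigmoid(-2 * x + 1),   # c_2
    lambda x: sigmoid(x - 2),        # c_3
    lambda x: sigmoid(-4 * x),       # c_4
]

# Task-specific output weights w_ij; a zero means the feature is unused.
W = {
    "h1": [1.5, -2.0, 0.0, 1.0],   # uses c_1, c_2, c_4
    "h2": [-1.0, 0.0, 2.0, -1.5],  # uses c_1, c_3, c_4
    "h3": [0.0, 2.0, -1.0, 0.0],   # uses c_2, c_3
}

def h(name, x):
    """h_j(x) = sigmoid(J), J = sum_i c_i(x) * w_ij."""
    J = sum(w * c(x) for w, c in zip(W[name], features))
    return sigmoid(J)

def shared_features(a, b):
    """Structural overlap: features with nonzero weight in both hypotheses."""
    return sum(wa != 0 and wb != 0 for wa, wb in zip(W[a], W[b]))

print(shared_features("h1", "h2"), shared_features("h1", "h3"))
```

Here h1 and h2 share two features while h1 and h3 share only one, matching the grading of structural relatedness described in the text.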

[Figure: four plot panels: (a) Four Fourier series components c1, c2, c3 and c4; (b) f1 composed of components c1, c2 and c4; (c) f2 composed of components c1, c3 and c4; (d) f3 composed of components c2 and c3.]

Figure 3.9: Three functions created from the composition of four Fourier series components.

[Figure: four plot panels: (a) Four MTL network hidden node features c1, c2, c3 and c4; (b) h1 composed of features c1, c2 and c4; (c) h2 composed of features c1, c3 and c4; (d) h3 composed of features c2 and c3.]

Figure 3.10: Three hypotheses created from the composition of four MTL hidden node features.

3.4.3 Relatedness as Shared Invariance

The nature of relatedness between two tasks is not a trivial matter, because it can involve a shared invariance in the input space with no similarity at the outputs. For example, several different alphabetic characters can share rotational, translational, and scale invariance. After the invariants are learned by the common portion of an MTL network, the network encodes an appropriate representation at the common feature layer. With this internal representation in place, the back-propagation algorithm can easily develop the task specific portions to classify the characters. In fact, the complexity of relatedness that can exist between tasks is dependent upon the representational power of both the common and task specific portions of the MTL network. Tasks T0, T1 and Tt of Figure 3.3 can be related in more complex manners than those of either Figure 3.1 or Figure 3.2. The inability of the task specific portions of the networks in Figures 3.1 and 3.2 to represent non-linearly separable functions defines a smaller range of related tasks than that which can be achieved by the task specific portions of the network in Figure 3.3. We will restrict our attention to networks of the form shown in Figure 3.1, leaving more complex networks of the form shown in Figure 3.3 to future research.

3.4.4 Criteria for a Measure of Relatedness

The following considers the criteria for an a priori measure of relatedness, Rk, between two tasks, T0 and Tk, in terms of the available training examples and the hypotheses for those respective tasks within an MTL network. Consider a set of t secondary tasks, Tk, for k = 1, …, t, to be learned within an MTL network along with a primary task, T0, with R0 always set to 1. We propose the following criteria for a measure of relatedness, Rk:

Promotes Positive Knowledge Transfer: Rk must express the utility of Tk toward the efficient and effective learning of T0 within an MTL environment. Rk will promote the effects of positive inductive bias from more related tasks Tk during the development of a hypothesis for the primary task T0. Conversely, Rk will reduce the effects of negative inductive bias from less related tasks Tk. Reduction of negative transfer must hold in the case where most or all Tk are minimally related to T0.

Considers Impoverished Training Sets: A key characteristic of knowledge transfer is the ability to develop effective hypotheses from impoverished training sets (containing fewer examples than required for the desired generalization accuracy). Rk must be computable from impoverished training sets for T0.

Considers Sufficient Training Sets: Rk will not adversely affect the development of the primary hypothesis when there are sufficient training examples for learning T0 to a desired level of generalization accuracy under the single task learning (STL) method.

Based on a Metric: Rk should be based on a metric or pseudo-metric, d(T0, Tk), that can be computed from the training examples of T0 and the virtual examples of secondary task Tk, or from some characteristic of the developing hypotheses for T0 and Tk.

Boundary Values: Rk will have an upper bound of 1, because Tk may be exactly the same task as T0. Similarly, Rk will have a lower bound of 0, because we wish the contribution from unrelated tasks to diminish to zero as a function of their relatedness, and not to arbitrary negative values. This may require that the metric d(T0, Tk) be normalized within the range [0, 1].

3.4.5 An Appropriate Test Domain for a Measure of Relatedness

This section describes the ideal synthetic task domain for testing the functional transfer of task knowledge through the use of a measure of relatedness. We begin at the task level and move on to describe the domain of tasks. Each task will have:

• Two or more input variables which may have either nominal (categorical), ordinal, or continuous values between 0 and 1. The number of variables should not be so large as to make laboratory analysis of the results difficult; however, it should be possible to scale the number of inputs upward;

• One output variable which is binary categorical (dichotomous) in value; each target value is either of class 0 or class 1;

• A set of training examples, if it is a secondary task, of size m sufficient to develop a model satisfying the desired generalization error, ε;

• An impoverished set of training examples, if it is the primary task, of size < m, insufficient to develop a model satisfying the desired generalization error, ε;

• A set of validation examples of size ≈ 20% of m for tuning the network (also referred to as an early stopping set); and

• A set of test examples of size ≥ m for testing the network hypothesis.

The task domain will be composed of:

• One primary task that is preferably a complex non-linearly separable function of its input variables;

• Two or more secondary tasks varying in degrees of relatedness to the primary task and possibly varying in degrees of functional complexity. The number of secondary tasks should not be so large as to make laboratory analysis of the results difficult; however, it should be possible to scale the number of tasks upward; and

• A majority of secondary tasks unrelated to the primary task, forcing the measure of relatedness to overcome a potentially strong negative inductive bias.

3.5 Measures of Relatedness Explored

The previous section presented a framework for, and criteria of, a measure of task relatedness. Based on this theory, we define three categories of measures of relatedness: static, dynamic, and hybrid. Each of these is now discussed.

3.5.1 Static Measures

Static measures are determined prior to learning and reflect task-level surface similarity between the training examples of the primary task and those of each secondary task. Static measures are constant values that do not change during the learning of a task. They create a partial ordering of the available secondary tasks. The selection of static measures must be made external to the MTL system and be based on background knowledge beyond that of domain knowledge. The following static measures are based on mathematical relationships between the output values of the primary task and each secondary task.

Hamming Distance or Proportion Match. The Hamming distance between two binary codes of the same length is defined to be the number of bits of difference between the two codes. The Hamming distance, dk, between the target values of the m training examples for Tk and T0 …
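The passage above is cut off in the source before the measure is fully defined. A plausible sketch follows; the normalization Rk = 1 − dk/m is our assumption about how the definition completes, chosen only so that the result satisfies the [0, 1] boundary criteria of Section 3.4.4:

```python
def hamming_relatedness(t0, tk):
    """Sketch of a Hamming-based static measure of relatedness.

    ASSUMPTION: the normalization R_k = 1 - d_k / m is our guess at how the
    truncated definition completes; the thesis text is cut off at this point."""
    assert len(t0) == len(tk)
    m = len(t0)
    d_k = sum(a != b for a, b in zip(t0, tk))   # Hamming distance d_k
    return 1.0 - d_k / m                        # in [0, 1]; 1 = identical targets

print(hamming_relatedness([1, 0, 1, 1], [1, 0, 0, 1]))   # 0.75
```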

Thus, the minimum number of examples m0 depends upon the values δ and ε, but never on the target classifier task c or the unknown distribution D. This means that the desired levels of confidence and error can be set even though the target task and the distribution of examples are unknown. PAC learning is the best one can hope to achieve within this probabilistic framework. It is possible, although unlikely, that an unrepresentative training sample will be given to the learning algorithm; thus a positive outcome is probable, not definite. Similarly, the extension (or generalization) of the training sample by the learning algorithm can only be considered to be approximately correct.

Proving that a learning algorithm L is PAC can be a difficult task. An important result due to Valiant provides a method of ensuring that any consistent algorithm for learning hypothesis space H by H is in fact PAC. A consistent L is one which generates a hypothesis h = L(S) that properly classifies all of the examples in the training sample S. The set of all hypotheses in H consistent with sample S we will refer to as HS. All HS hypotheses have an error e(h) = 0 on S. The PAC model requires that any valid h has, with probability greater than (1 − δ), an error e(h) < ε. Consider the set of all hypotheses within H that have e(h) ≥ ε; call this set HBAD. We require that there be no h ∈ HS which is also in HBAD, i.e. HS ∩ HBAD = ∅. We can now define the meaning of a potentially learnable H. A hypothesis space H is defined as being potentially learnable if, given a confidence level of δ and an acceptable error level of ε (0 < ε, δ < 1), there is a positive integer m0 = m0(ε, δ) such that, whenever m ≥ m0:

Dm(S : HS ∩ HBAD = ∅) > 1 − δ

for any probability distribution D on X and any c ∈ H; that is, with high probability over samples S of length m drawn according to D, no consistent hypothesis is bad. Valiant showed that any finite hypothesis space H is potentially learnable, and that the number of examples required to learn a consistent hypothesis h is given by:

m ≥ m0 = (1/ε) ln(|H|/δ).

This is a very important theorem, since it tells us how many examples are required to have a consistent learning algorithm achieve a confidence level of δ with an error rate less than ε. In fact, this theorem covers all boolean hypothesis spaces over {0, 1}^n for a fixed n. All such spaces are potentially learnable, and any algorithm which consistently learns a hypothesis can be considered PAC compliant. In general, it can be shown that for any finite hypothesis space, there is a consistent learning algorithm that is PAC. In fact, any such algorithm will be of the class: learning by enumeration. This confirms Gold's early finding that learning by enumeration is uniformly the most efficient method. However, there is still a practical problem with the use of this result. Recall that the class, or space, of all boolean functions of n variables will have a size |H| = 2^(2^n), which grows very rapidly. Even for moderate size n, the length m of the sample S will have to be unreasonably large. This leads us once again to the need for inductive bias: the space of all possible hypotheses H must be reduced using prior knowledge external to the training examples.
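The bound can be computed directly; working with ln|H| rather than |H| avoids overflow for the doubly-exponential boolean spaces just discussed (a sketch, with ε and δ values chosen arbitrarily):

```python
import math

def m0(eps, delta, H_size_log):
    """Sample bound m0 = (1/eps) * ln(|H|/delta), with |H| supplied as ln|H|
    so that huge hypothesis spaces do not overflow a float."""
    return math.ceil((1.0 / eps) * (H_size_log - math.log(delta)))

# Boolean functions of n variables: |H| = 2^(2^n), so ln|H| = 2^n * ln 2.
for n in (3, 5, 10):
    print(n, m0(eps=0.1, delta=0.05, H_size_log=(2 ** n) * math.log(2)))
```

Even at n = 10 the required sample size runs into the thousands, illustrating the rapid growth the text warns about.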

Appendix C
Mathematical Details

C.1 Mutual Information of Secondary Task with respect to Primary Task

Let A be the set of m target values (class identifiers), ai, of a training set for a primary task, T0. Let B be the set of m paired target values, bi, of a training set for a secondary task, Tk. Then the mutual information of the secondary task target values with respect to the primary task target values is defined as:

MI(A, B) = H(A) − H(A|B),

where H(A) is the entropy of the primary target values and H(A|B) is the conditional entropy of A given B.
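The definition can be sketched for binary target sets as follows (helper names are ours; H(A|B) is computed by grouping A's values by the paired value of B):

```python
from collections import Counter
from math import log2

def entropy(values):
    """H(A) over the empirical distribution of a list of target values."""
    m = len(values)
    return -sum((c / m) * log2(c / m) for c in Counter(values).values())

def conditional_entropy(a, b):
    """H(A|B) = sum over b of P(B=b) * H(A | B=b)."""
    m = len(a)
    groups = {}
    for ai, bi in zip(a, b):
        groups.setdefault(bi, []).append(ai)
    return sum(len(g) / m * entropy(g) for g in groups.values())

def mutual_information(a, b):
    return entropy(a) - conditional_entropy(a, b)

A = [1, 0, 1, 1, 0, 0, 1, 0]
print(mutual_information(A, A))                    # identical targets: MI = H(A)
print(round(mutual_information(A, [0] * 8), 3))    # constant B carries no information
```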
