The Task Rehearsal Method of Sequential Learning

Daniel L. Silver (1,2) and Robert E. Mercer (1)

(1) Department of Computer Science, University of Western Ontario, London, Ontario, Canada N6A 3K7
(2) Business Informatics, Faculty of Management, Dalhousie University, Halifax, Nova Scotia, Canada B3H 3J5

email: [email protected]

February 20, 1998

Abstract

A hypothesis of functional transfer of task knowledge is presented that requires the development of a measure of task relatedness and a method of sequential learning. The task rehearsal method (TRM) is introduced to address the issues of sequential learning, namely retention and transfer of knowledge. TRM is a knowledge based inductive learning system that uses functional domain knowledge as a source of inductive bias. The representations of successfully learned tasks are stored within domain knowledge. Virtual examples generated by domain knowledge are rehearsed in parallel with each new task using either the standard multiple task learning (MTL) or the ηMTL neural network methods. The results of experiments conducted on a synthetic domain of seven tasks demonstrate the method's ability to retain and transfer task knowledge. TRM is shown to be effective in developing hypotheses for tasks that suffer from impoverished training sets. Difficulties encountered during sequential learning over the diverse domain reinforce the need for a more robust measure of task relatedness.

Keywords: sequential learning, neural networks, knowledge transfer, inductive bias, task rehearsal, virtual examples


Contents

1 Introduction
2 Background
  2.1 Knowledge Based Inductive Learning
  2.2 Representational vs. Functional Transfer
  2.3 MTL Network Learning
  2.4 Inductive Bias and Internal Representation
  2.5 Rehearsal of Task Examples
3 Theoretical Foundations
  3.1 Hypothesis of Functional Transfer
  3.2 A Dynamic Measure of Relatedness
  3.3 Model for The Task Rehearsal Method
  3.4 An Appropriate Test Domain
4 The Prototype TRM System
  4.1 The ANN Software
  4.2 The TRM Software
5 Experiments on the Band Domain
  5.1 Task Domain
  5.2 Method
  5.3 Results
6 Discussion
7 Conclusion

1 Introduction

Historically, machine learning research has focused on the tabula rasa approach of inducing a model of a classification task given a set of supervised training examples. There has been relatively little work on the storage of task knowledge after it has been induced, its consolidation with previously learned tasks of the same problem domain, and its recall to facilitate the learning of a new task. This is unfortunate, for two reasons. First, it seems certain that human learning relies heavily upon the use of previously learned and related task knowledge. Second, for some time it has been recognized that learning by example without an appropriate inductive bias has practical limitations [Mitc80, Mitc97]. Our research investigates methods of consolidating and transferring task knowledge so as to facilitate the sequential learning of tasks. We focus on systems of artificial neural networks which use prior task domain knowledge to decrease the training time for a new task and/or reduce the number of training examples necessary for acceptable generalization [Silv95, Silv96a]. This form of knowledge-based inductive learning is referred to elsewhere as the transfer of knowledge from one or more source tasks to a target or primary task [Utgo86, Prat93]. The transfer of task knowledge can be considered a major aspect of the problem of learning to learn [Elli65] and has close ties to analogical reasoning [Hall89]. In [Silv96b] we define the distinction between two forms of task knowledge transfer, representational and functional, and present a modified version of the multiple task learning (MTL) method of parallel functional transfer which we call ηMTL. An ηMTL artificial neural network (ANN) biases the induction of a hypothesis for a primary task based on the measure of relatedness to the developing hypotheses for the surrounding parallel tasks.
The transfer of knowledge is completely functional, relying only on a randomly generated common initial representation for each of the parallel hypotheses. In this paper we introduce the task rehearsal method, or TRM, of sequential learning, which can employ either the standard MTL or the ηMTL method of functional transfer. Task rehearsal is an appropriate name for the method since previously learned tasks are re-learned or rehearsed during the learning of each new task. The next section provides appropriate background on knowledge based inductive learning, the functional transfer of knowledge in ANNs, a review of the MTL method, and previous work on rehearsal and virtual examples. Section 3 presents a hypothesis of functional transfer, an outline of a dynamic measure of relatedness, and a theoretical model for the task rehearsal method of sequential learning. In Section 4 a TRM prototype system based on the theoretical model is detailed, including important enhancements to the ηMTL system since [Silv96b]. Section 5 reports on experimentation with the TRM system against a synthetic domain of seven tasks. Sections 6 and 7 conclude with a discussion of results and observations made during the experiment, a summary of the paper, and an outline of future work.

2 Background

2.1 Knowledge Based Inductive Learning

Mitchell [Mitc80] points out and Utgoff [Utgo86] reiterates that inductive bias is essential for the development of a hypothesis with good generalization in a tractable amount of time and with a practical number of examples. In other words, inductive bias is required for efficient and effective learning of most real-world tasks. Mitchell cites five major classes of inductive bias used by intelligent learners: universal heuristics, knowledge of intended use, knowledge of the source, analogy with previously learned tasks, and knowledge of the task domain. All are forms of prior knowledge used to facilitate the search of the learner's hypothesis space. In general, an inductive bias can be said to produce a partial order over the space of hypotheses the learner is about to search (e.g. the universal heuristic, Occam's Razor, suggests that the hypotheses be ordered by their level of complexity). For a recent discussion of the need for inductive bias refer to [Mitc97]. We define knowledge based inductive learning, or KBIL, as a learning method which relies on prior knowledge of the problem domain to reduce the hypothesis space which must be searched. Figure 1 provides the framework for knowledge based inductive learning. Domain knowledge is a database of accumulated information which has been acquired from previously learned tasks. The intent is that this knowledge will bias a pure inductive learning system in a positive manner such that it trains in a shorter period of time and produces a more accurate hypothesis with fewer training examples. In turn, new information is added to, or consolidated within, the domain knowledge database following its discovery. Michalski refers to this as constructive inductive learning [Mich93]. In the extreme, where the new classification task to be learned is exactly the same as one learned at some earlier time, the inductive bias should provide rapid convergence on the optimal hypothesis with very few examples. Formally, given domain knowledge, K, the problem becomes one of selecting or constructing a hypothesis, h, based on a set of examples, S = {(x_i, t_i)}, such that:

K ∧ h ∧ x_i ⊨ t_i

or

domain knowledge ∧ hypothesis ∧ input attributes ⊨ target class.

The last two classes of inductive bias identified by Mitchell, analogy with previously learned tasks and knowledge of the task domain, are of primary concern to our research; henceforth they are jointly referred to as task domain knowledge. When task domain knowledge is used to bias an inductive learner, a transfer of knowledge occurs from one or more source tasks to a target or primary task. Thus, the problem of selecting an appropriate bias is transformed into the problem of selecting the appropriate task knowledge for transfer.

2.2 Representational vs. Functional Transfer

In [Silv96b, Prat96] the difference between two forms of task knowledge transfer is defined: representational and functional. The representational form of transfer involves the direct or indirect assignment of known task representation (weight values) to a new task. We consider this to be an explicit form of knowledge transfer from a source task to a target task. Since 1990 numerous authors have discussed methods of representational transfer [Fahl90, Prat93, Ring93, Shar92, Shav90, Sing92, Towe90] which often result in substantially reduced training time with no loss in generalization performance.

Figure 1: The framework for knowledge based inductive learning. (Diagram: the environment supplies training and testing examples; domain knowledge provides an inductive bias to the inductive learning system, which produces an induced model of the classifier task.)

In contrast to representational transfer is a form we define as functional. Functional transfer does not involve the explicit assignment of prior task representation to a new task; rather it employs the use of implicit pressures from supplemental training examples [AM95, Sudd90], the parallel learning of related tasks constrained to use a common internal representation [Baxt95, Caru95], or the use of historical training information (most commonly the learning rate or gradient of the error surface) to augment the standard weight update equations [Mitc93, Naik93, Thru94, Thru95a]. These pressures serve to reduce the effective hypothesis space in which the learning system performs its search. This form of transfer has its greatest value from the perspective of increased generalization performance. Certain methods of functional transfer have also been found to reduce training time (measured in number of training iterations). Chief among these methods is the parallel MTL paradigm explored recently by Caruana and Baxter [Baxt95, Caru95]. A recent paper by Caruana [Caru97] expresses plans for research into the use of MTL networks for sequential learning. We encourage these efforts as this is a large and exciting area of scientific discovery.

2.3 MTL Network Learning

Kehoe points out in [Keho88] that psychological studies of human and animal learning suggest that besides the development of a specific discriminant function which satisfies the task at hand, there is the acquisition of general knowledge of the task domain. This general knowledge remains available for use in subsequent learning. This concept has been formalized by Baxter [Baxt95] as parallel learning and demonstrated by Caruana [Caru95] with a method called multiple task learning (MTL), which we classify as a functional form of knowledge transfer. An MTL network uses a feed-forward multi-layer network with an output for each task to be learned (see Figure 2). Training examples contain a set of input attributes as well as a target output for each task. The standard back-propagation of error learning algorithm is used to train all tasks in parallel. The weights, w_jk, affecting an output node k are adjusted according to the equation:


Figure 2: A multiple task learning (MTL) network. There is an output node for each task being learned in parallel. The representation formed in the lower portion of the network is common to all tasks.

Δw_jk = -η ∂E_k/∂w_jk = η δ_k o_j, where η is the learning rate parameter, o_j is the input from the hidden layer node j, and δ_k is dependent upon the cost function being minimized by the back-propagation algorithm. The cross-entropy cost function, given by

E_k = - Σ_p [ t_k log(o_k) + (1 - t_k) log(1 - o_k) ]

over all training examples p for task output k, seeks the maximum likelihood hypothesis under the assumption that the training example target values are a probabilistic function of their attributes [Mitc97]. Under the cross-entropy cost function, δ_k can be shown to equal (t_k - o_k), where t_k is the desired output at node k and o_k is the actual output of the network at node k. Similarly, the weights, w_ij, affecting any hidden node j are modified as per the following: Δw_ij = η δ_j o_i, where δ_j is given by o_j (1 - o_j) Σ_k δ_k w_jk, where o_j is the output of the hidden node j and the summation proportions the error from each of the k output nodes in accord with the weights w_jk. This equation can be used for networks of any number of hidden layers below layer i given the appropriate notational substitutions are made. We will deal exclusively with 3-layer networks; therefore we will consider all nodes in layer i as inputs. In preparation for the following sections, it is important to note that under this set of equations the learning rate, η, is a constant, global parameter. Subsequently, the back-propagated error signal from any output node k is considered to be of equal value to all others. At the point of lowest training error, the MTL network does its best to average the error across all of the output nodes.
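The update rules above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' software; the layer sizes, learning rate, and random seed are illustrative choices. A single shared hidden layer feeds one sigmoid output per task, and the cross-entropy deltas δ_k = t_k - o_k drive both the task-specific and common weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 4 inputs, 6 shared hidden nodes, 3 parallel tasks.
n_in, n_hid, n_tasks = 4, 6, 3
eta = 0.1                                    # constant, global learning rate
W_ij = rng.normal(0, 0.5, (n_in, n_hid))     # common (shared) weights
W_jk = rng.normal(0, 0.5, (n_hid, n_tasks))  # task-specific weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mtl_step(x, t, W_ij, W_jk):
    """One back-propagation step over a single example.

    Under the cross-entropy cost, the output error signal is
    delta_k = t_k - o_k, and the hidden error signal is
    delta_j = o_j * (1 - o_j) * sum_k delta_k * w_jk.
    """
    o_j = sigmoid(x @ W_ij)       # shared hidden activations
    o_k = sigmoid(o_j @ W_jk)     # one output per task
    delta_k = t - o_k             # cross-entropy output deltas
    delta_j = o_j * (1 - o_j) * (W_jk @ delta_k)
    W_jk = W_jk + eta * np.outer(o_j, delta_k)   # task-specific update
    W_ij = W_ij + eta * np.outer(x, delta_j)     # common-representation update
    return W_ij, W_jk, o_k

x = rng.random(n_in)
t = np.array([1.0, 0.0, 1.0])    # one target per parallel task
for _ in range(500):
    W_ij, W_jk, o_k = mtl_step(x, t, W_ij, W_jk)
```

Because every task back-propagates through the same W_ij, the hidden layer is pressured toward a representation that serves all outputs at once, which is the functional transfer the section describes.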


2.4 Inductive Bias and Internal Representation

As in [Baxt95], consider the environment of the learner to be modeled by a pair (F, Q) where F is a set of tasks {T_k}, and Q is a probability distribution over F. That is, (F, Q) defines the task domain and the probability of occurrence of any task. Depending upon Q the learner will be required to develop hypotheses from one of a number of possible hypothesis spaces, H. One H may contain hypotheses which are primarily linear in nature while others are of varying degrees of non-linearity. Each H defined by (F, Q) requires an inductive bias appropriate for learning tasks within that H. For learning to become efficient and effective in any one environment an appropriate bias must be discovered. In particular, we are concerned with the inductive bias provided by the shared use of MTL network internal representations. Baxter [Baxt95] has proven that the number of examples required for learning any one task using an MTL network decreases as a function of the total number of tasks being learned in parallel. Let m be the number of examples required to "probably approximately correctly" (PAC) learn a task T_k to some desired generalization error using a standard inductive learner. Let t be the number of tasks selected from the domain. If the input attributes for the tasks can be compressed to a smaller internal representation, then it can be shown that the number of examples required to PAC learn all T_k using the MTL network is O(m/t + a), where a is a constant such that m ≫ a. Baxter also proves that the common internal representation acquired will facilitate the learning of subsequent tasks sampled from the domain according to the distribution Q. In an MTL network this translates into a common representational component developed within the input to hidden weights for all tasks (see Figure 2). For any particular task the hidden to output weights constitute the task specific component.
Since the number of weights in this section of the network is relatively small, the training of a new task from (F, Q) can be accomplished with relatively few examples and with a smaller amount of effort than single task learning. Functional transfer and positive inductive bias occur in an MTL network due to the pressures of learning several related tasks under the constraint that the majority of the connection weights of each task are shared. Therefore, to ensure the functional transfer of knowledge from several secondary tasks to the primary task, the following must be optimized:

- The MTL network should have a sufficient amount of internal representation (at least as much as that required for the best single task learning (STL) network model; [Caru95] suggests k times as much representation as that required by an STL network); and
- The secondary tasks should be as closely related to the primary task as possible.

2.5 Rehearsal of Task Examples

In [Robi95, Robi96a] Robins discusses the concept of rehearsal and pseudo-rehearsal of task examples as a solution to the problem of catastrophic forgetting [McCl89, Gros87]. Robins considers the problem of learning one set of randomly chosen paired-associate examples by an STL ANN and then subsequently learning another set of paired-associates using the same ANN (potentially, a sequence of paired-associate tasks can be considered). Catastrophic forgetting occurs as the second task interferes with the hidden node representation originally developed for the first task. Psychologists have long considered this a major failing of ANN models of long-term memory. Robins demonstrates that the problem can be solved by rehearsing a subset of previously learned examples for the first paired-associate task while concurrently learning a new task. Given sufficient internal representation (hidden nodes), an appropriate model will develop such that it satisfies the requirements of both tasks to the extent to which the examples do not interfere. This rehearsal method requires that at least some portion of the training examples be retained indefinitely. The pseudo-rehearsal method overcomes the problem of retaining specific training examples of earlier tasks. It does this by first using the existing STL network to generate a random set of virtual examples which Robins calls pseudoitems. Pseudoitems are used as the subset of training examples for the previously learned association task(s). The paper shows that pseudo-rehearsal is nearly as effective as rehearsal of retained examples. Robins goes on to suggest that pseudo-rehearsal is a potential model for long-term memory consolidation in the mammalian neocortex. He relates this to a recent neuroscience paper [McCl94] which discusses the complementary roles of the hippocampus and the neocortex in human learning.
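Pseudoitem generation can be sketched as follows: a previously trained network is probed with random inputs, and its own outputs become the targets of virtual examples. This is a minimal sketch, not Robins' implementation; the network class, its sizes, and the random weights standing in for a trained representation are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class StoredTaskNet:
    """Stand-in for an STL network whose task has already been learned.
    Its weights would normally come from training, not random init."""
    def __init__(self, n_in, n_hid):
        self.W1 = rng.normal(0.0, 1.0, (n_in, n_hid))
        self.W2 = rng.normal(0.0, 1.0, (n_hid, 1))

    def predict(self, X):
        return sigmoid(sigmoid(X @ self.W1) @ self.W2)

def make_pseudoitems(net, n_items, n_in):
    """Generate virtual examples (pseudoitems) from a stored representation:
    random probe inputs paired with the network's own outputs as targets."""
    X = rng.random((n_items, n_in))
    T = net.predict(X)
    return X, T

net = StoredTaskNet(n_in=5, n_hid=8)
X_virtual, T_virtual = make_pseudoitems(net, n_items=20, n_in=5)
```

Rehearsing (X_virtual, T_virtual) alongside a new task's examples is what protects the old mapping without retaining any of its original training data.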

3 Theoretical Foundations

The above background material has led to the development of a hypothesis of functional transfer in the context of ANNs and to the development of supportive theory.

3.1 Hypothesis of Functional Transfer

The knowledge of a task is held within a learning system's representation of that task. Let domain knowledge consist of a collection of previously learned task representations that may or may not be organized into a systematic structure. We define functional knowledge to be virtual examples of previously learned task input/output pairs generated from domain knowledge. The contribution of domain knowledge to subsequent learning is dependent upon the learning system's ability to index into this source of functional knowledge. Knowledge can be transferred at a purely functional level within a neural network by learning a new task in parallel with related tasks from the same domain (as per MTL). Furthermore, a measure of relatedness based on characteristics of the developing hypotheses can be used to utilize virtual examples dynamically from the most appropriate domain knowledge tasks during the learning process. In this way virtual examples serve as an emergent form of knowledge-based inductive bias that constrains the effective hypothesis space of the learning system.

The validity of this hypothesis depends on the validity of the following assumptions:

- There exists a method of sequentially retaining functional task knowledge and transferring it during the training of a new task. Functional task knowledge can be embodied in the form of virtual examples of previously learned tasks.
- A measure of relatedness exists for utilizing virtual examples from those source tasks which are most related to the primary task of interest. The measure can be considered a method of indexing into a functional form of domain knowledge that results in a positive inductive bias and contributes to efficient and effective learning¹.

This paper focuses on the first of these two assumptions: a method of sequential learning using functional transfer. But before moving on to describe such a method, it is necessary to outline the measure of relatedness used in the experiment section. A more general discussion of a measure of relatedness will be the subject of a forthcoming paper. In addition we refer the reader to a recent article that discusses task relatedness and knowledge transfer in the context of a nearest neighbour memory based learning system [Thru95b].

3.2 A Dynamic Measure of Relatedness

Relaxing the MTL Parallel Learning Constraint. Being the common learning rate across all outputs employed by the back-propagation algorithm, η is normally brought outside the summations of the weight update equation. However, this does not have to be the case. A separate learning rate, η_k, for each output node k can be considered and kept inside the backward propagated error signal δ_k. This being the case, a notational modification is in order. The weights, w_jk, affecting an output node k are adjusted according to the equation Δw_jk = -∂E_k/∂w_jk = δ_k o_j, where o_j is as described in Section 2.3; however, δ_k becomes η_k (t_k - o_k) under the cross-entropy cost function, where η_k is the learning rate parameter specific to output node k. Lower layer weights, w_ij, affecting any hidden node j are then modified as per the following: Δw_ij = δ_j o_i, where δ_j is as described in Section 2.3. Thus, by varying η_k it is possible to adjust the amount of weight modification associated with any one output of the network². A separate learning rate, η_k, provides an opportunity to relax the parallel learning constraint of the MTL network paradigm in accord with a measure of relatedness. The η_k can be used as a dynamic control mechanism over the level of influence that each parallel task exerts during the induction process. Consider T_0 as our primary task of interest in an MTL network where we are uncertain of its relatedness to all of the other parallel tasks T_1, ..., T_k, ..., T_t. We require a method of tuning the learning rate η_k automatically for each parallel task, such that η_k reflects a base value η as well as a measure of the relatedness, R_k, between T_k and T_0. Formally:

η_k = f(η, R_k).

¹ Efficiency is defined as follows: given P tasks from a domain, each with some fixed number of training examples sufficient to PAC learn each task to some error tolerance, ε, and one new task T with some fixed number of training examples sufficient for PAC learning to ε, the system should show a reduction in the number of batch training iterations for T as compared to when learned without the domain knowledge (efficiency is based on the number of training iterations and not on training time). Effectiveness is defined as follows: given P tasks from a domain, each with some fixed number of training examples sufficient to PAC learn each task to some error tolerance ε, and one new task T with an insufficient number of training examples for PAC learning to ε, the system should show higher generalization accuracy for T as compared to when learned without the domain knowledge. As the system's knowledge of the task domain increases there should be a resulting decrease in the number of training examples (to some lower bound) required to reach a desired level of generalization accuracy for a new task.
² The use of an adaptive or separate learning rate at the node or weight level is not a new concept. It has been used for various purposes by other authors such as [Jaco88, Naik92, Vogl88].


Through η_k, the relatedness measure should work to tailor the inductive bias from each of the developing parallel tasks such that the most related tasks will have the largest influence on weight modifications. This modified version of the standard back-propagation learning algorithm for MTL will be referred to as the ηMTL method. The success of ηMTL rests on a judicious choice for the function η_k = f(η, R_k) and the ability to measure the relatedness of the parallel tasks to the primary task. Let the learning rate η_0 for the primary task, T_0, be the full value of the base learning rate η, that is, η_0 = η. Then the learning rate for any parallel task, T_k, is defined as:

η_k = f(η, R_k) = η · R_k, where 0 ≤ R_k ≤ 1 for all k = 1, ..., t,

thereby constraining the learning rate for any parallel task to be at most η. Notice that if R_k = 1 for all k = 0, ..., t, then we have MTL learning as per [Baxt95, Caru95]. Alternatively, if R_0 = 1 and R_k = 0 for all k = 1, ..., t, then we have standard single task learning (STL) for the primary function. In this way, the ηMTL framework generalizes both STL and MTL neural network learning. The above formulation of η_k agrees with mathematics resulting from Y. S. Abu-Mostafa's research into the use of hints in inductive learning [AM93, AM95]. This will be discussed at length in a forthcoming paper.
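The η_k = η · R_k rule amounts to folding a per-task learning rate into each output's error signal. The sketch below illustrates this; the base rate and relatedness values are illustrative, not values from the paper.

```python
import numpy as np

# Per-task learning rates eta_k = eta * R_k, folded into the output
# error signal delta_k = eta_k * (t_k - o_k).
eta = 0.1
R = np.array([1.0, 0.9, 0.0, 0.25])   # R[0] = 1.0 is the primary task T0

def output_deltas(t, o, eta, R):
    """Compute delta_k with the relatedness-scaled learning rate inside."""
    assert np.all((0.0 <= R) & (R <= 1.0))
    eta_k = eta * R
    return eta_k * (t - o)

t = np.array([1.0, 0.0, 1.0, 1.0])    # targets, one per parallel task
o = np.array([0.6, 0.3, 0.2, 0.5])    # current network outputs
delta = output_deltas(t, o, eta, R)
```

Setting every R_k to 1 recovers standard MTL, while R_k = 0 for all secondary tasks silences their error signals entirely, recovering STL for the primary task, which is exactly the sense in which ηMTL generalizes both.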

Accuracy/Distance Measure of Relatedness. The following dynamic measure of relatedness, R_k, is based on the representational similarity between the developing hypotheses as well as the functional accuracy of these hypotheses. The measure has been found to be effective on several task domains. An early version of the measure was reported in [Silv96b]. Assume that the weights in the task specific portion of an ηMTL network are initialized such that all hypotheses have an identical random starting position in weight space. As the back-propagation algorithm updates the weights in the network, each hypothesis traces out a trajectory. Over time these trajectories will diverge as the task specific output node weights of the network take on unique values to overcome differences which cannot be resolved in the common portion of the network. The representations for hypotheses associated with similar tasks will stay relatively close together in weight space, whereas the representations for hypotheses associated with dissimilar tasks will move apart. Thus, one can consider the distance between the weight space representations as a relative measure of relatedness between the hypotheses. To tie this measure back to the tasks being modeled, one must consider the relative accuracy (or error) of the developing hypotheses. Formally:

- Let 1/E_k be the accuracy of h_k for task T_k, where E_k is some error measure over the training set for task T_k, such that 0 < 1/E_k < 1.
- Let d_k be the current weight space distance between the primary hypothesis, h_0, and the hypothesis, h_k, such that 0 < d_k < 1.

Given the developing hypotheses h_0 for T_0, and h_l and h_m for two secondary tasks, T_l and T_m, we propose:

- If h_l and h_m are of the same accuracy, 1/E_l = 1/E_m, then the hypothesis which is closest to h_0 in weight space should be considered the more related. That is, R_k ∝ 1/d_k.
- If h_l and h_m are the same distance, d_l = d_m, in weight space from h_0, then the hypothesis with the greater accuracy should be considered the more related. That is, R_k ∝ 1/E_k.

Definition: The Accuracy/Distance measure of relatedness between any secondary hypothesis, h_k, and the primary hypothesis, h_0, is defined to be

R_k = tanh( c / (E_k d_k² + ε) )

where ε is a small constant to prevent division by zero. The tanh function restricts the value of R_k to the range (0,1) since the operand is always positive, and c controls the rate of decay of tanh from an asymptotic value of 1. Under the above definition of R_k, c is the major tuning parameter for the ηMTL learning method. To facilitate the choice of a value for c, E_k should express a mean error (ME) value across all training examples (so that E_k does not vary directly as a function of the number of examples). For the experiments reported in this paper, the mean cross-entropy error measure is used. The best value for c will be a function of the number of hidden nodes in the network since the distance measure, d_k, is a function of the weight vector for h_k. For the domains we have investigated it has not been difficult to choose an appropriate value for c, which has been shown to be reasonably robust to a wide range of values [Silv96a]. The range can be determined from a set of short preliminary runs. Typically, a value between 2 and 100 is selected. The accuracy factor, 1/E_k, and distance factor, d_k, must work together such that highly accurate and representationally similar hypotheses will exert the greatest influence over the development of h_0, whereas less accurate and more distant hypotheses will exert little or no influence. We can view the accuracy factor as modifying the effect of the distance factor as the parallel hypotheses diverge from the weight representation of h_0. A side effect of the back-propagation algorithm is that a simple secondary task will be learned more quickly than a more complex primary task; thus a simple but unrelated task can "pull" the common representation of the ηMTL network toward a suboptimal area of weight space during the early iterations of training. To discourage this from happening, the accuracy factor is dampened so as to prevent the hypothesis for a simple task, T_k, from being learned more quickly than the primary hypothesis [Silv97]. The following logic performs the dampening after every iteration through the training set: if 1/E_k > 1/E_0 then 1/E_k = (1/E_0) · (E_k/E_0). Thus, a bias from a simple and potentially unrelated task is allowed but reduced in its early effect upon the measure of relatedness, R_k. The dynamic portion of the R_k equation, c / (E_k d_k² + ε), has been designed to reflect various inverse square laws observed in natural systems, such as gravitational attraction. The relational "attraction" between the two hypotheses, h_0 and h_k, can be characterized as varying directly as a function of the training accuracy of h_k, and inversely as the square of their distance. A more accurate parallel hypothesis has a greater relation to h_0, as does a "closer" hypothesis. Thus we are left with two extremes: (1) a parallel hypothesis, h_k, that has low training accuracy and is relatively distant from h_0 corresponds to a task totally unrelated to T_0; and (2) a parallel hypothesis, h_k, which has high training accuracy and that occupies the same point in weight space as h_0 corresponds exactly to the primary task, T_0.
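The measure and its dampening step can be sketched as a small function. This is an illustrative reading of the definition above, not the authors' code; the default values of c and ε, and the exact form of the dampening, are assumptions for the sketch.

```python
import math

def relatedness(E_k, d_k, E_0, c=2.0, eps=1e-8):
    """Accuracy/Distance relatedness R_k = tanh(c / (E_k * d_k**2 + eps)).

    E_k, E_0: mean training error of secondary h_k and primary h_0.
    d_k: current weight-space distance between h_k and h_0.
    If h_k is currently more accurate than h_0 (1/E_k > 1/E_0), its
    accuracy factor is dampened so a quickly learned, possibly
    unrelated task cannot dominate early training.
    """
    acc = 1.0 / E_k
    if acc > 1.0 / E_0:
        acc = (1.0 / E_0) * (E_k / E_0)   # dampened accuracy factor
    return math.tanh(c / ((1.0 / acc) * d_k ** 2 + eps))
```

As intended, a closer hypothesis at the same error, or a more accurate hypothesis at the same distance, scores as more related, and the value always lies strictly between 0 and 1.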
Thus, a bias from a simple and potentially unrelated task is allowed but reduced in its early e ect upon the measure of relatedness, Rk . The dynamic portion of the Rk equation, Ek dc2k + , has been designed to re ect various inverse square laws observed in natural systems, such as gravitational attraction. The relational \attraction" between the two hypotheses, h0 and hk , can be characterized as varying directly as a function of the training accuracy of hk , and inversely as the square of their distance. A more accurate parallel hypothesis has a greater relation to h0 as does a \closer" hypothesis. Thus we are left with two extremes: (1) a parallel hypothesis, hk , that has low training accuracy and is relatively distant from h0 corresponds to a task totally unrelated to T0; and (2) a parallel hypothesis, hk , which has high training accuracy and that occupies the same point in weight space as h0 corresponds exactly to the primary task, T0. 11


Figure 3: A model for the Task Rehearsal Method. Domain knowledge is composed of neural network representations of previously learned tasks. These representations are able to generate sufficiently accurate virtual examples which can be used during the learning of a new task. The inductive learning system is an ηMTL network which is capable of transferring related domain knowledge via the virtual examples. As a whole, the system of networks performs as a knowledge based inductive learning system.

3.3 Model for The Task Rehearsal Method

In [Robi95] the method of pseudo-rehearsal of virtual examples was proposed as a solution to the problem of catastrophic forgetting. We apply this method to the problem of sequentially learning a series of tasks. This section presents the task rehearsal method, or TRM. TRM is a knowledge based inductive learning system that relies upon the re-learning or rehearsal of previously learned tasks in an ηMTL network concurrent with the learning of a new task. Task domain knowledge is transferred in a functional manner using virtual examples, which can be considered a form of hint as per [AM95]. Once a new task has been learned to a desired level of generalization accuracy, its representation is retained in domain knowledge for use during future learning. Figure 3 provides a conceptual block diagram of the task rehearsal method. Two sets of feed-forward neural networks interact during two different phases of operation to produce a knowledge based inductive learning system, KBIL.

The Networks. The set of single output feed-forward networks labeled domain knowledge is the long-term storage area for tasks which have been successfully learned. A task is considered successfully learned when a hypothesis for that task has been developed such that it meets a minimum level of generalization error on a validation set. The retention of representations of previously learned tasks eliminates the need to retain specific training examples for each task. Domain knowledge representations also provide a method of generating a complete virtual example for any set of input attributes. This provides a more flexible source of functional knowledge for a new task. Consolidation and transfer of knowledge within the TRM system happens at a functional level, at the time of learning a new task, and it is specific to the training examples for that task. It is the relationship between the function of the various tasks, and not the relationship between their representations, which is important. Subsequently, there is no need to consolidate the representations of the domain knowledge networks as was the case with the consolidation system discussed in [Silv95]. In fact, there is no requirement for any form of representational compliance between the differing domain knowledge networks. The networks can be of differing architectures, composed of various connections and types of nodes, and potentially have additional inputs beyond those used in the current primary task (additional inputs might be fixed at constant values or varied in an organized manner over a range of values). This form of domain knowledge has a degree of freedom which is rarely seen in computational learning. One might envision a task rehearsal system that employs domain knowledge representations resulting from various forms of inference and induction such as logical expressions, regression equations, probabilistic networks, and decision trees.
The fundamental requirement for the domain knowledge component is an ability to store and retrieve the representations of induced hypotheses and to use those representations to generate virtual examples of the original tasks.3 The inductive learning system of the task rehearsal method is the ηMTL back-propagation network. It will have as many outputs as there are source tasks stored in domain knowledge plus one for learning the new classification task. This network can be considered a short-term memory area for the learning of new tasks. The ηMTL network provides the means by which to learn a new task while dynamically consolidating and transferring knowledge from the domain knowledge networks. There is no requirement that the architecture of the ηMTL network be the same for each task which is learned. Although Figure 3 shows a 3-layer network, for another task the network might have 4 layers and utilize different types of active nodes. Similarly, the learning algorithm used need not be the same for each task. Many extensions of the back-propagation algorithm might be used, including line search techniques and weight decay or weight elimination methods. The fundamental requirement of the inductive learning system is an ability to develop a sufficiently accurate hypothesis through the use of real and virtual training examples.

3 This having been said, we recognize that management of domain knowledge in a long-term storage area is of fundamental importance to a life-long learning agent. Such management is likely to include a method of knowledge consolidation that is gradual and off-line to individual task learning. As suggested in [McCl94], one possibility is that task knowledge developed in a working memory area (in humans, the hippocampus) is consolidated within long-term storage (in humans, the neocortex) using a method of interleaved learning very similar to that suggested by [Robi96a].


Phases of Operation. The task rehearsal method has two phases of operation, the training phase and the domain knowledge update phase. One phase is mutually dependent upon the other from the perspective of sequential learning. The training phase concerns the learning of a new task within the ηMTL network. The operation of the network proceeds as if actual training examples were available for all tasks. Each primary task training example provides n input attributes and a target classification which is accepted by the ηMTL network. The n input attributes also feed into the domain knowledge networks, which produce the target values for each source task, Tk. These target values are used to complete the training examples for each Tk used by the ηMTL network. Given that the range and resolution of input values remains constant over a domain, the virtual target values will have the same generalization accuracy as the domain knowledge networks at the time of their induction. It is important to note that the virtual target values will not strictly be class values of 0 or 1. The domain knowledge networks output continuous values and therefore the target values used will be in the range 0 through 1. Although it is a simple matter to convert to a strict 0 or 1 class identifier based on a cut-off such as 0.5, it is beneficial to consider leaving the target as a continuous value. Continuous target values will more accurately convey the function of the domain knowledge networks, and they provide the means by which dichotomous classification tasks may transfer knowledge from related continuous valued tasks and vice versa. The domain knowledge update phase follows the successful learning of a new task. If the hypothesis for the primary task is able to classify a validation set of examples below a specified level of error, then the task is considered successfully learned. The representation composing the primary task within the ηMTL network is used to produce a new feed-forward network within the domain knowledge area.
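The completion of training examples during the training phase can be sketched as follows. This is an illustrative sketch only: the `predict` interface on the stored domain knowledge networks is a hypothetical name, not part of the original software.

```python
def make_virtual_examples(inputs, domain_knowledge, binarize=False):
    """For each input vector of the primary task, query every stored
    domain-knowledge network for a virtual target value, completing the
    training examples for the secondary tasks.  Targets are continuous
    values in [0, 1]; an optional 0/1 conversion uses a 0.5 cut-off."""
    virtual = []
    for x in inputs:
        targets = [net.predict(x) for net in domain_knowledge]
        if binarize:
            targets = [1.0 if t >= 0.5 else 0.0 for t in targets]
        virtual.append(targets)
    return virtual
```

As the text notes, leaving the targets continuous conveys the function of the domain knowledge networks more accurately than binarizing them.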
There are a number of important housekeeping activities that need to be managed within the domain knowledge area. Consider the problem of becoming increasingly better at a task through practice. Should older and less accurate representations of the task be replaced? How does one identify these representations? Then there is the issue of adding or removing input attributes from a task which has been previously learned and stored in domain knowledge. Should we consider this the start of a new task domain or is there a method of utilizing existing domain knowledge? Generally, we consider such housekeeping matters to be part of the larger problem of task knowledge consolidation within a long term memory system. This is outside the scope of our current research.

Benefits of Task Rehearsal. The following list of benefits has been identified for the Task Rehearsal Method. All are related to the use of functional knowledge through the use of virtual training examples.

- Efficient storage of training examples. Under TRM there is no need to explicitly store the training examples for a task. The representations stored in domain knowledge implicitly retain the information contained in the training examples in a compressed form. The compressed representation can result in a dramatic memory saving. Consider the difference in storing the 61 weights of a 10 input, 5 hidden node, and 1 output neural network versus storing 500 training examples.
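The 61-weight figure follows from counting connection weights plus one bias term per active node; a quick check:

```python
def ffnn_weight_count(n_in, n_hidden, n_out):
    """Number of weights (including one bias per hidden and output node)
    in a fully connected single-hidden-layer feed-forward network."""
    return (n_in + 1) * n_hidden + (n_hidden + 1) * n_out

# 10 inputs, 5 hidden nodes, 1 output -> (10+1)*5 + (5+1)*1 = 61 weights,
# versus 500 examples * (10 attributes + 1 target) = 5500 stored values.
assert ffnn_weight_count(10, 5, 1) == 61
```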

- Free choice of training examples. There is no need to pair the training examples of a new task with those of previously learned tasks, since the system will automatically generate paired virtual examples from the domain knowledge networks. Being able to utilize training examples as they are encountered is a natural requirement for any sequential learning system.

- Wealth of virtual examples. The source of inductive bias under TRM is the virtual examples chosen for each of the domain knowledge tasks. There is great potential benefit in being able to generate virtual examples beyond those paired with the real training examples for a new task [Caru97]. Virtual examples can be selected by way of random sampling or by an ordering over the input attribute space. One might propose that the generation of virtual examples vary dynamically as some function during the learning process (this could be seen as an extension of Abu-Mostafa's idea of adaptive minimization [AM95]). Regardless, one must be careful not to overwhelm the information provided in the actual training examples for the primary task; that is to say, induction must be driven by a fair mix of information from the real examples and bias from the virtual examples.

3.4 An Appropriate Test Domain

This section describes the criteria for an appropriate synthetic task domain for testing sequential learning via the TRM. We begin at the task level and then move on to describe the domain of tasks. Let m be the number of examples found sufficient to develop a hypothesis satisfying the desired generalization error, ε. Each task should have:

- Two or more input variables which may have either nominal (categorical), ordinal, or continuous values between 0 and 1 (the number of variables should not be so large as to defy laboratory analysis of the results; however, it should be possible to scale the number of inputs upward);
- One output variable which is binary categorical (dichotomous) in value; each target value is either of class 0 or class 1;
- A set of training examples of size ≥ m;
- A set of test examples of size ≥ 20% of m for tuning the network (also referred to as an early stopping set); and
- A set of validation examples of size ≥ m for validating the network hypothesis.

The task domain should have:

- One or more primary tasks of interest (preferably complex non-linearly separable functions), each with an impoverished set of training examples of size < m, insufficient to develop a model satisfying the desired generalization error, ε;
- Two or more secondary tasks varying in degrees of relatedness to the primary task(s) and varying in degree of functional complexity (the number of secondary tasks should not be so large as to defy laboratory analysis of the results; however, it should be possible to scale the number of tasks upward);

- The majority of the secondary tasks unrelated to the primary task(s), forcing the KBIL to overcome a potentially strong negative inductive bias; and
- The order of task learning chosen randomly.

4 The Prototype TRM System

A prototype TRM system has been developed which emulates the model shown in Figure 3. This system calls an enhanced version of the ANN software first reported in [Silv96b]. This ANN software is capable of either single task learning (STL), multi-task learning (MTL), or ηMTL. The latest version of the ANN software will be reviewed, followed by the details of the TRM software.

4.1 The ANN Software.

The ANN architecture embedded in the software is the standard feed-forward type as shown in Figure 2, composed of an input layer of nodes, one hidden layer, and an output layer with one output node for each task. The number of hidden units chosen for each experiment is a design decision made prior to learning a sequence of tasks. The system employs a batch method of back-propagation of error that utilizes a momentum term in the weight update equation to speed the convergence of the network. At the start of learning, small random initial weight values are selected for all runs such that the multiple network hypotheses start with an identical representation, that is, at the same point in weight space. For ηMTL, using the accuracy/distance measure of relatedness, this means that the initial value of Rk for all k is 1; that is, all secondary tasks, Tk, are initially considered maximally related to the primary task, T0. The ANN software was developed with the intention of using it as part of a task rehearsal method. Subsequently, the requirements of the system included:

- The acceptance of an impoverished set of training examples for the primary task; that is, the primary task may have fewer training examples than the secondary tasks.
- Unused task outputs would have no effect on the back-propagation algorithm as it worked to develop hypotheses for the used task outputs. This would allow a single MTL network architecture with t outputs to be used for learning t tasks in sequence.

Both of these requirements are satisfied by a simple modification to the back-propagation algorithm that allows for examples of unknown target classification. A special target value outside of the standard 0 to 1 range marks the example as unknown. The back-propagation algorithm recognizes this value and automatically considers the error for that example to be zero. Subsequently, during an iteration through the training data, any example with an unknown target value makes zero contribution to weight modifications. If a task output is not to be used, then for each training example the target value for that output is set to the unknown target value. The result is that zero error is backward propagated through the network for that output. It is as if the output was not part of the network. An additional method of preventing an unused task output from having an effect on the development of the network hypotheses is to set its learning rate, η, to zero.
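The unknown-target modification can be sketched as a masked error term. The sentinel value used here is an assumption; the text specifies only that it lies outside the standard 0 to 1 target range.

```python
UNKNOWN = -1.0  # assumed sentinel outside the 0..1 target range

def output_errors(targets, outputs):
    """Per-output error terms for one training example: an unknown
    target contributes zero error, so nothing is back-propagated
    through that output and it behaves as if absent from the network."""
    errs = []
    for t, o in zip(targets, outputs):
        if t == UNKNOWN:
            errs.append(0.0)        # zero contribution to weight updates
        else:
            errs.append(t - o)      # usual (target - output) error signal
    return errs
```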

The ANN software has been previously applied to a small synthetic task domain [Silv96b] and to a more complex medical diagnostic domain [Silv97]. Results from those studies demonstrated:

- The ability of the method to perform to the level of generalization accuracy achieved by a standard MTL network when all parallel tasks are closely related to the primary task;
- The advantage of ηMTL over the standard MTL network in a situation where one of the parallel tasks is unrelated to the primary task;
- The robustness of the c parameter to a reasonably wide range of values for a given problem.

Since this initial research, the prototype system has been enhanced significantly:

- A run-time parameter file is used extensively by the enhanced system for selection of either STL, MTL, or ηMTL learning, choice of measure of relatedness, and the setting of miscellaneous parameters, such as the base η value.
- The accuracy dampening calculation described in the section on a dynamic measure of relatedness has been implemented.
- Mean error (ME) is used in computing the Ek for each task output by the ηMTL network. This simplifies the task of tuning the c parameter of the accuracy/distance dynamic measure of relatedness. The software considers only known examples in computing the ME.
- The system automatically estimates the point of minimum generalization error through the use of a test (or tuning) set of examples. A save-best-weights method is used. Both the number of iterations since the last minimum test error and the increase in test error over the current minimum are considered criteria for early stopping.
- The system saves, summarizes, and reports on a larger number of parameters and statistics, both to the screen and to a file. Various color plots can be used in conjunction with the ANN software to analyze the graphs of training and test error, changes in η values, and functional characterizations of the developing hypotheses.

4.2 The TRM Software.

The TRM prototype system uses a sequence table to control the order in which tasks will be learned. The software moves through the two phases of operation shown in the model of Figure 3 for each task in the table.

Training Phase. Before learning a new primary task, the examples for the primary task are used to generate the virtual examples for all secondary tasks. A domain knowledge table contains the names of previously learned secondary task representations (this table can be populated with names of representations from previous runs of the system if so desired). One after the other, the previously learned task representations are set up as a feed-forward network and the virtual target values are generated and stored. Virtual target values can be left as continuous numbers in the range from 0 to 1 or converted to dichotomous class values using a cut-off of 0.5. The preliminary generation of all virtual examples eliminates repeated computation during the training process. Additional virtual examples for the secondary tasks can be generated by adding training examples with unknown target values. The unknown target examples will have no effect on the primary task but they will generate useful virtual examples for the secondary tasks. Following the generation of the virtual examples, the TRM software calls the ANN software to develop a hypothesis for the primary task. Training proceeds until the point of minimum test error for the primary task has been determined. The weights of the network are restored to the point of minimum test error before returning to the TRM software.

Domain Knowledge Update Phase. Once a minimum test error hypothesis has been developed, the TRM software must determine if the hypothesis is sufficiently accurate to place within domain knowledge. The criterion for an accurate hypothesis is that it meets some minimum level of error on a set of validation data. The validation set is composed of examples that are not in either the training or test sets. If the validation error minimum is met, then the hypothesis is accepted as accurate, its representation is stored, and its name is added to the domain knowledge table. If the validation error minimum is not met, then the hypothesis is rejected, and no representation or record of it is kept. A record of the task's name in the domain knowledge table ensures that it will be considered during the learning of future tasks.
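The accept/reject decision of the update phase can be sketched as follows. The `classify` interface and the table-as-dictionary representation are hypothetical names for illustration.

```python
def update_domain_knowledge(task_name, hypothesis, validation_set,
                            dk_table, max_error_rate=0.25):
    """Accept the minimum-test-error hypothesis into domain knowledge
    only if its misclassification rate on the validation set meets the
    criterion; otherwise reject it and keep nothing."""
    errors = sum(1 for x, t in validation_set
                 if hypothesis.classify(x) != t)
    if errors / len(validation_set) <= max_error_rate:
        dk_table[task_name] = hypothesis   # store the representation
        return True
    return False                           # rejected: no record kept
```

The 0.25 default mirrors the 25% validation criterion used in the Band domain experiments reported later in the paper.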

5 Experiments on the Band Domain

This section reports on the TRM system applied to a domain of seven tasks, some of which have impoverished training sets. The objective of the experiment is to demonstrate that the Task Rehearsal Method is able to sequentially retain and transfer functional task knowledge to the benefit of tasks with impoverished training sets. The experiment compares the performance of network hypotheses developed under STL to hypotheses developed by the TRM system under the MTL and ηMTL learning methods.

5.1 Task Domain.

The domain contains seven synthetic tasks where each task is best described as a band of positive examples across a 2-dimensional input space. Figure 4 characterises the entire domain of tasks. All tasks are non-linearly separable and require the use of at least two hidden nodes in order to form the internal representation required by the output nodes. The domain was chosen since it provides a challenging set of tasks within a 2-dimensional input space. The tasks satisfy all of the criteria for an appropriate test domain for sequential task learning described in section 3.4. Furthermore, the 2-dimensional input space provides an environment that lends itself to visual as well as mathematical analysis. For each task a total of 50 training, 20 test, and 200 validation examples were randomly selected. The training sets for 6 of the tasks were then impoverished by randomly marking a specified number of the training examples as unknown. Figure 5 presents the training sets for each task within their 2-variable input space. The examples with unknown target values are shown as `.'. Table 1 shows the number of unknowns for each task and the order in which the tasks will be learned. The sequence

Figure 4: The Band domain. Each of the 7 tasks is a band of positive examples bordered by negative examples within a 2-variable input space. The tasks are shown in the order in which they will be learned, from Ta through Tg . Attention will be focused on the hypotheses developed for Tf and Tg .

Task   Training Set        Test Set      Validation Set
Name   Pos.  Neg.  Unk.    Pos.  Neg.    Pos.  Neg.
Ta     20    30    0       10    10      134   66
Tb     13    22    15      10    10      130   70
Tc     15    15    20      12    8       116   84
Td     11    14    25      8     12      128   72
Te     4     6     40      5     15      129   71
Tf     10    10    30      10    10      112   88
Tg     5     5     40      8     12      110   90

Table 1: Summary of the training, test, and validation sets of examples for each of the seven tasks of the Band domain. Shown is the mix of positive (pos.), negative (neg.), and unknown (unk.) examples for each set.


of the first four tasks is in accord with an increasing number of unknowns. This curriculum makes sense for the purposes of demonstration. A learner of a new domain must first construct domain knowledge from successfully learned tasks before any transfer can take place. The order of the remaining impoverished tasks is arbitrarily chosen. STL, MTL and ηMTL hypotheses were developed directly from the training sets and evaluated against their validation sets. Accurate hypotheses having generalization error rates less than 15% were developed under STL for tasks Ta through Te. In contrast, STL hypotheses for Tf and Tg performed poorly, with minimum error rates of 29% and 28%, respectively. In comparison, ηMTL networks were able to generate hypotheses for both Tf and Tg with error rates as low as 23%. These are not highly accurate hypotheses; however, in the case of Tg a difference of means t-test (two-tailed, 95% confidence level) showed that the mean error of a set of ηMTL models was statistically lower than that of a set of STL models. Figure 6 shows the classification of the validation set for the Tf task by hypotheses developed under STL, MTL and ηMTL. Figure 7 shows the classification of the validation set for the Tg task by hypotheses developed under STL, MTL and ηMTL. Given the above findings, the analysis of this experiment will focus on the final two tasks of the sequence, Tf and Tg. Observe the training examples for each of these tasks in Figure 5. Tf and Tg are particularly poor training sets for single task learning since they appear to have been produced by linearly separable functions. However, under TRM and multiple task learning the unknown examples (indicated by a `.') will have associated virtual examples with known target classes generated by the domain knowledge networks. These virtual examples are the source of the inductive bias in the TRM system.
The challenge for the TRM system is to overcome the impoverished training sets for these two tasks by transferring knowledge from related tasks that have been previously learned. Based on Figure 4, the most related tasks should be those whose bands of positive examples are most similarly oriented in space to the primary task of interest. Random training samples from similarly oriented band tasks have the highest probability of statistical similarity. A forthcoming paper will discuss aspects of training sample similarity and a measure of relatedness.
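A band task of this kind could be generated as sketched below. This is purely illustrative: the actual band orientations and widths of the seven tasks are not specified in the text, so the line-distance definition and all parameters here are assumptions.

```python
import random

def band_task(slope, intercept, half_width):
    """A hypothetical band classifier over [0,1]^2: points whose
    vertical distance from the line x2 = slope*x1 + intercept is within
    half_width form the band of positive (class 1) examples; all other
    points are negative (class 0)."""
    def classify(x1, x2):
        return 1 if abs(x2 - (slope * x1 + intercept)) <= half_width else 0
    return classify

def sample_examples(classify, n, n_unknown=0, seed=0):
    """Randomly sample n labeled examples, then impoverish the set by
    marking n_unknown of them with an unknown target (None here)."""
    rng = random.Random(seed)
    data = [((x1, x2), classify(x1, x2))
            for x1, x2 in ((rng.random(), rng.random()) for _ in range(n))]
    for i in rng.sample(range(n), n_unknown):
        data[i] = (data[i][0], None)
    return data
```

For example, a Tf-like training set would be 50 samples with 30 of them marked unknown, per Table 1.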

5.2 Method

Network Architecture and Learning Parameters. The architecture of the neural network used in the experiment is similar to that shown in Figure 3. The network is composed of an input layer of 2 nodes, a hidden layer of 14 nodes, and an output layer of 7 nodes, one for each task. The number of hidden nodes is well suited for the MTL method [Caru95]. Two linear discriminant functions, or hyperplanes, are required within the internal representation of the network in order to properly separate the positive and negative examples for any one of the band tasks. One hidden node is required to form each linear discriminant function (each node has a standard sigmoid activation function). Therefore, 14 hidden nodes should provide sufficient internal representation for 7 different band tasks under MTL. Attempts were made to optimize the network for STL learning, since too many hidden nodes have been known to promote overfitting of a neural network model to the training data. For networks of 2, 4, 8, and 14 hidden nodes there was no statistically significant difference found between hypotheses developed for each task. Subsequently, a network of 14 hidden nodes is used for STL learning as well as for MTL and ηMTL. For this experiment the cross-entropy cost function is minimized by the BP algorithm. Henceforth, when we refer to the mean error, Ek, we are referring to the mean cross-entropy over an entire set of examples. The cross-entropy function has the advantage that with target values of 0.1 and 0.9 (for classes 0 and 1), the mean error cannot fall below the value 0.469. This minimum value helps to prevent secondary hypotheses from "hijacking" the development of internal representation based solely on Ek with no consideration for representation distance, dk, to the primary hypothesis. For all runs the base learning rate, η, is set to 0.05 and the momentum term is set to 0.9. Based on several preliminary runs, a value of 100 is chosen for the c tuning parameter of Rk = tanh(c/(Ek dk² + ε)) with ε set to 10⁻⁶. Random initial weight values are selected in the range −0.1 through 0.1 for all runs. The corresponding weights for the 7 output nodes are set to identical random values as explained in section 3. The maximum misclassification rate permitted on a validation data set is 25%, which means that a hypothesis representation will not be saved in domain knowledge unless there are 50 or fewer validation examples misclassified.
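The 0.469 floor is simply the entropy of the target values themselves: with targets of 0.1 and 0.9, the cross-entropy is minimized when the output equals the target, and the floor works out to approximately 0.469 when the logarithm is taken base 2 (an assumption on our part; the text does not state the base). A quick check:

```python
import math

def cross_entropy(t, o):
    """Base-2 cross-entropy of output o against a target t."""
    return -(t * math.log2(o) + (1 - t) * math.log2(1 - o))

# The minimum over o is attained at o = t, leaving the entropy of the
# target itself: H(0.9) = H(0.1) = -(0.9*log2(0.9) + 0.1*log2(0.1)).
floor = cross_entropy(0.9, 0.9)   # approximately 0.469 bits
```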

Evaluation of Performance. The sequence of 7 tasks is learned using the TRM system and each of the inductive learning methods (STL, MTL and ηMTL). For each task a network is developed using the training and test examples and then validated against the 200 validation examples. Training is stopped at the point of minimum test set error for the primary hypothesis. Care must be taken to ensure that each learning method has sufficient opportunity to train on the data to the point of a minimum test error. This is repeated for 5 trials using 5 different random initial weight vectors. To reduce experimental variance, the same set of initial weights is used for a task across the three learning methods. The performance of the TRM system under each learning method is based on the generalization ability of hypotheses developed by each method against validation sets for the tasks. The consistency that each method exhibits in developing accurate hypotheses is of prime importance. The mean number of misclassifications as well as the lowest number of misclassifications by the best hypotheses over the 5 trials will be used as measures of performance. Classification is subject to a cut-off value of 0.5 (any example with a value ≥ 0.5 is considered class 1). A difference of means hypothesis test (2-tailed, 2-sample, unequal variance) based on a t-distribution will determine the statistical significance between the mean numbers of misclassifications to a confidence level of 95%. A difference of proportion hypothesis test (2-tailed) will indicate the statistical significance between the lowest numbers of misclassifications to a confidence level of 95%. Particular attention will be paid to tasks Tf and Tg that have been shown to generate poor hypotheses under STL.
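The 2-tailed, 2-sample, unequal-variance comparison of means is Welch's t-test; its statistic and degrees of freedom can be sketched as follows (an illustrative sketch; the paper's statistical software is not specified).

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic and degrees of freedom for
    unequal-variance samples a and b (e.g. misclassification counts
    over the 5 trials for two learning methods)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

The resulting statistic would then be compared against a t-distribution with df degrees of freedom to obtain the two-tailed p-value at the 95% confidence level.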

5.3 Results.

Table 2 shows the validation results for hypotheses developed by each learning method over the 5 trials. The STL results can be used as a baseline for comparison. Tasks Ta through Td are learned very well under STL. In fact, the mean misclassification performance of the STL models is significantly better for Tb, Td and Te than that of either MTL or ηMTL. The results indicate that domain knowledge can have a detrimental effect on learning when sufficient training information is available. This is most evident in the case of task Te, where the 10 training examples convey sufficient information to develop an accurate hypothesis under STL. Inductive bias from secondary hypotheses results in MTL and ηMTL hypotheses for Te with significantly lower accuracy as compared with the STL models. Inductive bias from secondary hypotheses will always have an effect

Mean (standard error) number of misclassifications

Method   Ta        Tb         Tc         Td         Te          Tf          Tg
STL      5.4(6.0)  16.6(1.9)  16.2(2.6)  12.2(2.2)   8.2( 4.2)  62.8( 6.9)  91.8(20.6)
MTL      5.4(6.0)  21.8(1.8)  16.2(5.1)  19.4(3.1)  43.2(15.9)  59.4(21.7)  61.0( 7.7)
ηMTL     5.4(6.0)  21.0(1.6)  19.0(4.3)  21.6(5.0)  30.6(17.8)  46.0(15.4)  59.6( 8.0)

Lowest number of misclassifications

Method   Ta  Tb  Tc  Td  Te  Tf  Tg
STL      1   14  13  9   3   58  55
MTL      1   19  12  16  21  36  51
ηMTL     1   19  15  14  13  28  48

Table 2: Summary of experimental results for the Band domain. Shown in the top portion of the table are the mean (standard error) number of validation examples misclassified by the network hypotheses developed for each of the 7 tasks. In the bottom portion of the table are the lowest number of misclassifications made by the best hypothesis for each task. Note that each validation set contained 200 examples.

on the internal representation developed within the network. The challenge for a knowledge based inductive learning system is to filter out negative bias for the primary task. The direction for future research is to overcome this problem under ηMTL by improving upon the measure of relatedness. Note that ηMTL hypotheses have substantially lower mean classification error for Te than MTL hypotheses. This is due to an overall reduction in the η values for the hypotheses for Ta through Td, particularly for Ta and Tb, which are the least related tasks. We will now focus on the final two tasks of the learning sequence, Tf and Tg. The training examples for these two tasks present what appear to be linearly separable functions (see Figure 5).

Task Tf. STL learning of Tf produces hypotheses which misclassify between 58 and 75 (> 29%) of the validation examples. MTL learning, which transfers knowledge from as many as 5 previously learned tasks, does somewhat better with between 36 and 90 (> 18%) class errors.4 ηMTL learning produces the most accurate hypotheses, having between 28 and 69 (> 14%) misclassified examples. Difference of means hypothesis tests (two-tailed, 95% confidence level) show that the MTL and ηMTL hypotheses are not statistically different from the STL hypotheses; ηMTL comes closest, with a p-value of 0.071 versus 0.753 for MTL. The best hypotheses produced by MTL and ηMTL have significantly lower numbers of misclassifications (36 and 28, respectively) than the best produced by STL (58). In order for the TRM to transfer functional task knowledge it must first be able to retain that knowledge. Figure 8 shows the 50 virtual examples generated for Ta, Tb, Tc, Td, and Te during MTL learning of Tf during one of the trials. Compare the positive class regions of these virtual training sets with those of Figure 1. It is evident that functional knowledge of the five tasks has been retained in accord with the validation statistics presented in Table 2.

4 One trial under MTL and one trial under ηMTL had only 4 accurate hypotheses saved at the time of learning Tf; all other trials involved 5 domain knowledge networks.

The virtual training

examples for Ta, Tb , Tc , and Td are very accurate, whereas the virtual examples for Te are less accurate, yet in accord with the generalization error criteria (25%) for retention of the task in domain knowledge. Figure 9 shows graphically the validation results from one trial for the STL, MTL and  MTL learning methods, respectively. The number of misclassi cations for the STL, MTL and  MTL hypotheses were 61, 36, and 28, respectively. The e ect of knowledge based inductive learning observed in most of the trials is evident in the gures. The STL hypothesis re ects information from only the Tf training examples which appear to have been generated by a linearly separable function. The MTL and  MTL hypotheses consider the Tf training examples as well as the virtual examples generated by 5 domain knowledge networks. In both cases, the results are more accurate hypotheses. Notice that the upper portion of the positive class region is more accurately de ned for the hypothesis developed by the  MTL method. This was the case in all but one of the trials. We interpret this to be the result of dynamic k values providing inductive bias that favours those hypotheses most related to the primary hypothesis. Figure 6 shows the validation results of using the original training data for the rst 5 tasks to directly generate MTL and  MTL hypotheses for Tf . The initial weight values were the same as those used to develop the hypotheses discussed in Figure 9. The number of misclassi cations for the MTL and  MTL models were 44 and 46, respectively. These values are slightly higher than the number of misclassi cations after sequential learning Tf under TRM, however they are generally in agreement with the mean classi cation error. We conclude that knowledge of the previously learned tasks has been retained in the domain knowledge networks and then transferred to an hypothesis for Tf via task rehearsal. 
Comparing the MTL and ηMTL validation results in Figures 6 and 9, one can observe that the most accurate positive class regions for Tf have developed under TRM sequential learning. We propose that the more accurate hypotheses are due to the stronger positive inductive bias provided by 50 virtual training examples per secondary task (shown in Figure 8) under TRM, as compared with as few as 10 original examples under direct MTL or ηMTL (shown in Figure 5). Many of the original examples have unknown target values, whereas their associated virtual examples are all known. Given that the virtual target values are accurate, they convey a greater amount of information to the learning system.

Task Tg. STL learning of task Tg fails 4 times out of 5 trials to improve upon the 101 validation set misclassifications provided by the random initial weights of the network. One trial did produce a hypothesis that misclassified only 55 of the validation examples. This reminds us that STL without the aid of domain knowledge may by chance find a better hypothesis. MTL hypotheses perform significantly better, with between 51 and 68 (> 25.5%) classification errors on the validation set. ηMTL learning produces hypotheses varying in accuracy between 48 and 67 (> 24%) misclassified examples. There is strong evidence that a positive transfer occurs from domain knowledge to the hypotheses formed for Tg under MTL and ηMTL⁵. Difference of means t-tests (two-tailed, 95% confidence level) show that the mean number of misclassifications by the MTL and ηMTL models are statistically different from that of the STL models, with p-values of 0.025 and 0.021, respectively. Figure 10 shows graphically the validation results from one trial for the STL, MTL, and ηMTL learning methods, respectively. The number of misclassifications by the STL, MTL, and ηMTL hypotheses are 101, 55, and 55, respectively. As with Tf, knowledge based inductive learning is evident in the figures for Tg. The STL hypothesis reflects information derived from only the Tg training examples, which results in a linearly separable hypothesis. The MTL and ηMTL hypotheses consider the Tg training examples as well as the virtual examples generated by 5 and 6 domain knowledge networks, respectively. In either case, the result is a more accurate hypothesis. Figure 7 shows the results of using the original training data for the first 5 tasks to directly generate MTL and ηMTL hypotheses for Tg. The initial weight values were the same as those used to develop the hypotheses discussed in Figure 10. The number of misclassifications by the MTL and ηMTL models are 65 and 46, respectively. This result is generally in agreement with the mean classification error for Tg after sequentially learning Ta through Tf. We conclude that knowledge of the previously learned tasks has been retained in the domain knowledge networks and then transferred to a hypothesis for Tg via task rehearsal.

⁵ Under MTL there were four trials where only 5 accurate hypotheses had been previously saved; under ηMTL there was one trial where only 4 accurate hypotheses had been previously saved. All other trials involved 6 domain knowledge networks.
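The paper's difference-of-means t-tests can be mimicked without any statistics library. The sketch below uses an exact two-tailed permutation test as a stdlib stand-in for the t-test, applied to hypothetical per-trial misclassification counts (these numbers are illustrative, not the paper's data):

```python
from itertools import combinations

def perm_test_two_tailed(a, b):
    """Exact two-tailed permutation test for a difference of means.

    Enumerates every split of the pooled observations into groups of the
    original sizes and counts how often the absolute mean difference is
    at least as large as the observed one.
    """
    pooled = a + b
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    n, total = len(a), sum(pooled)
    splits = list(combinations(range(len(pooled)), n))
    count = 0
    for idx in splits:
        s = sum(pooled[i] for i in idx)
        mean_a = s / n
        mean_b = (total - s) / (len(pooled) - n)
        if abs(mean_a - mean_b) >= observed - 1e-12:
            count += 1
    return count / len(splits)

# Hypothetical misclassification counts over 5 trials (NOT the paper's data):
stl = [101, 98, 101, 55, 101]
mtl = [51, 68, 55, 60, 52]
p = perm_test_two_tailed(stl, mtl)
print(p)
```

With only five trials per method there are C(10, 5) = 252 splits, so the enumeration is exact and instantaneous; a small p-value indicates the group means differ beyond chance relabelling.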

6 Discussion

A number of observations have been made while developing the TRM system and conducting the experiment presented in this paper.

Retention of Functional Knowledge. The experimental results demonstrate the ability of the Task Rehearsal Method to sequentially retain functional task knowledge in the form of neural network weight representations. The virtual examples generated from domain knowledge and used in training task Tf have shown the value of the method. Essentially, the original examples for each task have been compressed into 57 weight values. The TRM system has been shown to automatically select an optimum hypothesis via early stopping based on minimum test set error. Furthermore, the TRM system has been shown to selectively retain only network hypotheses which have met a specified generalization accuracy measure based on classification of a validation set of examples.

Transfer of Functional Knowledge. The experimental results show that the TRM using MTL and ηMTL has some success in the transfer of retained task knowledge to the benefit of tasks with impoverished training sets. The training examples for tasks Tf and Tg present what appear to be linearly separable tasks. Using the MTL and ηMTL learning methods, the TRM system typically develops more accurate hypotheses for task Tf than those developed under STL. ηMTL is particularly successful in developing accurate non-linearly separable hypotheses for Tf. In the case of task Tg, the MTL and ηMTL methods develop hypotheses which are statistically more likely (with 95% confidence) to properly classify future examples. The sequential learning results are very similar to those of hypotheses for Tf and Tg developed under MTL and ηMTL directly from the original training examples. The original-example hypotheses are the result of a positive transfer of functional knowledge from real examples of related tasks to the primary hypothesis. Under TRM this same transfer has been observed; however, the source of the functional knowledge must be the virtual examples generated by related domain knowledge tasks.

Unfortunately, the TRM is not successful on all tasks of the sequence. Under MTL and ηMTL a negative transfer of knowledge occurs during the learning of tasks Tb, Tc, Td, and Te. The resulting hypotheses misclassify larger numbers of validation examples than their respective STL hypotheses. As pointed out in the early portion of the experiment section, the training examples for these four tasks are sufficient for learning an accurate hypothesis under STL. This means that inductive bias is not required. The transfer of knowledge that occurs, regardless of how subtle, manifests itself in the form of less accurate hypotheses. This form of negative inductive bias can never be entirely eliminated if the initial assumption is that all secondary tasks are related to the primary task. The next step in developing a more effective measure of relatedness is to consider a priori aspects of task similarity. Only in this way can a knowledge based inductive learning system provide a tractable method of knowledge transfer from a large and diverse base of domain knowledge.

Limitation of the Accuracy/Distance Measure of Relatedness. On this domain, the accuracy/distance measure of relatedness does not allow the ηMTL method to consistently develop hypotheses that are as accurate as or more accurate than STL hypotheses. There appear to be two reasons for this. The first is having a fixed value for the c parameter over an entire sequence of tasks. A value of c = 100 works well for certain tasks in the Band domain and not so well for others. A lower value for c during the learning of the first five tasks would produce results closer to the STL method. However, a lower value of c would not be beneficial when learning tasks Tf and Tg, which benefit most from the transfer of knowledge from related tasks. Either an automatic method for adjusting the c parameter must be found or a more effective measure of relatedness must be discovered. The second reason for ηMTL hypotheses performing less accurately than STL hypotheses is the inability of the accuracy/distance measure of relatedness to scale up to large sequences of tasks. The measure initially assumes that all secondary tasks of domain knowledge are highly related to the primary task; i.e., all ηk have the same value. Thus, unrelated tasks are guaranteed to have some effect on the development of internal representation. As the number of secondary tasks increases, so does the collective effect of this initial inductive bias, as well as the on-going cumulative effect of many unrelated tasks.
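The interaction between the c parameter and the uniform initial ηk values can be illustrated with a small sketch. The exact accuracy/distance formula is defined in [Silv96b]; the functional form below is purely hypothetical, chosen only to show how a gain parameter c and a uniform starting point behave:

```python
import math

def eta_k(accuracy_k, distance_k, c=100.0, eta_base=0.1):
    """Hypothetical accuracy/distance relatedness -> per-task learning rate.

    accuracy_k : validation accuracy of secondary hypothesis k (0..1)
    distance_k : a distance between secondary and primary hypotheses
    c          : gain parameter; a larger c sharpens the effect of distance
    NOTE: the true measure is defined in [Silv96b]; this form is illustrative.
    """
    relatedness = accuracy_k * math.exp(-c * distance_k)
    return eta_base * relatedness

# At the start of learning nothing yet distinguishes the secondary tasks,
# so all distances are treated as zero and every eta_k gets the same value:
initial = [eta_k(1.0, 0.0) for _ in range(5)]

# As training reveals distances, tasks far from the primary hypothesis are
# damped, but with c fixed the damping is equally aggressive for all tasks:
related, unrelated = eta_k(0.95, 0.001), eta_k(0.95, 0.05)
print(initial, related, unrelated)
```

The sketch makes the scaling problem visible: because every ηk starts at the same value, each of many unrelated tasks contributes the same initial pull on the internal representation, and a single fixed c cannot damp them appropriately across a whole sequence of tasks.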

Complexity of Relatedness. Relatedness, similarity, and analogy are very difficult subjects which have been matters of philosophical debate for over 2500 years (see [Robi96b] for a detailed discussion). Experimentation with TRM suggests a fundamental reason for the complexity surrounding the issue of relatedness. Measuring task relatedness is an activity conducted within a frame of reference that is relative to those tasks which have been previously learned. Task knowledge retained by the TRM affects the measure of relatedness between the primary hypothesis and all other secondary hypotheses, since virtual examples for each task contribute to the development of the primary hypothesis. Thus, once a new task has been learned, the frame of reference for measuring task relatedness changes. In summary, task relatedness is relative to the learner's current state of domain knowledge.

Training Examples and Inductive Bias. Task Tf has 20 training examples compared to task Te's 10 examples. However, the results in Table 2 show us that Te is by far the easier task to learn. The Te results reinforce the importance of using either a large set of training examples or a carefully selected set of training examples when learning under STL. In the case of Tf as well as Tg, it reinforces the importance of using a source of inductive bias whenever possible to mitigate the ill effects of impoverished training sets. The TRM using either MTL or ηMTL provides just such an inductive bias via virtual examples. Furthermore, we have observed the development of a more accurate hypothesis (for task Tf) through the use of supplemental virtual examples beyond those generated from the training examples provided for the primary task.

Relearning of Domain Knowledge Tasks is Easier. Under TRM, we observed that hypotheses for domain knowledge tasks tended to be learned more easily than the primary hypothesis. Typically, two or three of the secondary hypotheses would develop training errors that were slightly less than that of the primary hypothesis. We propose the reason for this is the ease with which the inductive network is able to develop weight values satisfying a function generated by a domain knowledge network of the same architecture. Under this condition, the ηMTL network is guaranteed to have the correct hypothesis space and representational language to develop an equivalent hypothesis for the secondary task. In comparison, developing weight values for the real examples of the new primary task is more difficult. Although the hypothesis space of the inductive network may be sufficient, there is no guarantee that the representational language can generate as accurate an hypothesis. The MTL method has particular difficulty in overcoming rapid learning of secondary tasks. The ηMTL method has greater success. The dampening component of the measure of relatedness prevents secondary tasks from "hijacking" the internal representation of the inductive network while still considering the relative similarity of the hypotheses' representations to the primary hypothesis.

Management of Domain Knowledge Error. A sequential learning agent, such as TRM, is faced with a dilemma. The agent must ensure that only sufficiently accurate hypotheses are stored in domain knowledge; however, it also needs to accumulate domain knowledge so as to facilitate the learning of accurate hypotheses. In the experiment presented we chose a generalization error rate of 25% as the criterion for successful learning. This allows for the rapid development of domain knowledge over our experimental domain. Unfortunately, it also allows for an accumulation of error over the domain that may act in a compound fashion. Observe the virtual examples generated by the domain knowledge network for task Te shown in Figure 8. The error exhibited by these virtual examples will be transferred into future related hypotheses. These hypotheses, once retained in domain knowledge, will have their error transferred into yet other hypotheses. Clearly, managing the build-up of domain knowledge error is of prime concern to a sequential learning agent. Human development of domain knowledge suggests that one method of preventing this accumulation of error is the continued practice of key tasks from the domain concurrent with the acquisition of new task knowledge.
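The compounding of domain knowledge error can be made concrete with a simple worst-case calculation: if each hypothesis is accepted at up to 25% generalization error and later hypotheses are trained on virtual examples labelled by earlier ones, a lower bound on accuracy along a chain of such transfers decays geometrically. This is an illustrative model, not a result from the paper:

```python
def chained_accuracy_lower_bound(per_task_accuracy, chain_length):
    """Worst case: errors never cancel, so accuracies multiply down a
    chain of hypotheses each trained on its predecessor's virtual labels."""
    return per_task_accuracy ** chain_length

# With the 25% error criterion, each retained hypothesis is at least 75%
# accurate with respect to its (possibly already erroneous) teacher:
for k in range(1, 5):
    print(k, chained_accuracy_lower_bound(0.75, k))
```

After only four such transfers the worst-case guaranteed accuracy falls below 32%, which is why continued practice of key tasks against real examples, rather than only against inherited virtual labels, matters for a long-lived agent.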

7 Conclusion

In this paper we have introduced an hypothesis of functional transfer of task knowledge, the validity of which depends upon (1) the existence of a method of sequentially retaining and transferring functional task knowledge through the generation of virtual examples and (2) a measure of relatedness for properly utilizing those virtual examples. The paper then focuses on the first of these two assumptions by describing a theoretical model and a prototype system for sequential task learning called the Task Rehearsal Method (TRM). The TRM is a knowledge based inductive learning method that uses functional domain knowledge as a source of inductive bias. Domain knowledge is retained in the form of neural network representations (weight values) of successfully learned tasks. Using virtual examples generated from the domain knowledge, previously learned tasks are rehearsed, or re-learned, in parallel with each new primary task using either the standard multiple task learning (MTL) or the ηMTL neural network methods. Consolidation and transfer of knowledge within the TRM happens at a functional level, at the time of learning a new task, and it is specific to the training examples for that task. It is the relationship between task functions provided by real and virtual examples which is ultimately important, and not the relationship between task representations. The criteria for an appropriate task domain for testing a sequential learning system are outlined. This is followed by the description of a prototype TRM system developed to learn a series of tasks listed in a sequence table. The TRM system begins by generating virtual examples for all previously learned tasks listed in a domain knowledge table (initially empty). The TRM system then calls ANN software which is capable of performing either STL, MTL, or ηMTL induction.
The ANN software is capable of accepting impoverished training sets that include examples with unknown target values, as well as fewer tasks than there are outputs in the ηMTL network. This allows a single ηMTL network architecture to be used during the course of sequential learning. The ANN software is capable of automatically estimating the point of minimum generalization error by employing a save-best-weights method of early stopping that monitors a test set of examples. If the hypothesis has been learned to a sufficient level of generalization accuracy, the TRM system saves the weight representations in a file and enters the task name into the domain knowledge table.

The results of an experiment conducted with the TRM on a synthetic domain of seven tasks are reported. Functional knowledge of previously learned tasks is shown to have been retained in domain knowledge by comparing graphics of virtual examples generated by domain knowledge networks to the original training examples and the ideal classification regions. Under both MTL and ηMTL, functional knowledge is demonstrated to have been transferred to hypotheses from domain knowledge for difficult tasks with impoverished training sets. Under these circumstances TRM produces more accurate hypotheses than STL. ηMTL develops slightly better hypotheses than MTL through the use of a measure of relatedness that directs the inductive bias from those tasks in domain knowledge most related to the primary task of interest. Limitations of the method are demonstrated for those tasks which have sufficient training examples and for long sequences of tasks. With abundant training information there is no need for inductive bias. In fact, the transfer of knowledge from secondary tasks prevents the development of a more accurate hypothesis; that is, a negative transfer of knowledge occurs. With respect to ηMTL, this indicates the need for an automatic method of tuning the c parameter of the current measure of relatedness.
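The rehearsal mechanism described above, generating virtual examples by labelling random inputs with each stored domain knowledge network, can be sketched minimally as follows. The network sizes, weights, and helper names here are all hypothetical; the real system stores full MTL network representations:

```python
import math
import random

def mlp_forward(weights, x):
    """Forward pass of a tiny 2-input, one-hidden-layer, 1-output sigmoid MLP.
    weights = (W1, b1, w2, b2), with W1 a list of hidden-unit weight pairs."""
    W1, b1, w2, b2 = weights
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    hidden = [sig(w[0] * x[0] + w[1] * x[1] + b) for w, b in zip(W1, b1)]
    return sig(sum(wi * hi for wi, hi in zip(w2, hidden)) + b2)

def generate_virtual_examples(domain_knowledge, n=50, seed=0):
    """For each stored task network, label n random inputs over the unit
    square to obtain a virtual training set for rehearsal of that task."""
    rng = random.Random(seed)
    virtual = {}
    for task, weights in domain_knowledge.items():
        inputs = [(rng.random(), rng.random()) for _ in range(n)]
        virtual[task] = [(x, 1 if mlp_forward(weights, x) > 0.5 else 0)
                         for x in inputs]
    return virtual

# Hypothetical stored weight representations for two prior tasks:
dk = {
    "Ta": ([(4.0, -4.0)], [0.0], [2.0], -1.0),
    "Tb": ([(-3.0, 3.0)], [0.5], [2.5], -1.2),
}
virtual_sets = generate_virtual_examples(dk, n=50)
print({t: len(v) for t, v in virtual_sets.items()})
```

Each virtual set would then be presented at a secondary output of the MTL or ηMTL network while the primary task trains on its real examples, so that prior tasks are relearned in parallel rather than overwritten.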
Long sequences of tasks bring an additional problem for the current measure, which assumes that all tasks are equally related at the start of learning. A more effective measure of relatedness is needed to deal with this issue of scalability. In a forthcoming paper we will investigate the issue of a measure of relatedness more thoroughly. Criteria for an appropriate metric will be presented, followed by research into other dynamic measures as well as a priori static and hybrid (static plus dynamic) measures. There are two areas we would like to address in future research. The first concerns the impact that the selection and number of virtual examples has on the development of an hypothesis under the TRM. The second has to do with continued practice of a task under the TRM while being interrupted by the learning of unrelated tasks.

References

[AM93] Yaser S. Abu-Mostafa, "A method for learning from Hints", Advances in Neural Information Processing Systems 5, Morgan Kaufmann, pp. 73-80, San Mateo, CA, 1993.
[AM95] Yaser S. Abu-Mostafa, "Hints", Neural Computation, Massachusetts Institute of Technology, Vol. 7, pp. 639-671, 1995.
[Baxt95] Jonathan Baxter, "Learning internal representations", Proceedings of the Eighth International Conference on Computational Learning Theory, (to appear) ACM Press, Santa Cruz, CA, 1995.
[Caru95] Richard A. Caruana, "Learning many related tasks at the same time with backpropagation", Advances in Neural Information Processing Systems 7, Morgan Kaufmann, pp. 657-664, San Mateo, CA, 1995.
[Caru97] Richard A. Caruana, "Multitask learning", Machine Learning, Vol. 28, pp. 41-75, 1997.
[Elli65] H. Ellis, Transfer of Learning, MacMillan, New York, NY, 1965.
[Fahl90] S.E. Fahlman and C. Lebiere, "The cascade-correlation learning architecture", Advances in Neural Information Processing Systems 2, Morgan Kaufmann, pp. 524-532, San Mateo, CA, 1990.
[Gros87] Stephen Grossberg, "Competitive learning: From interactive activation to adaptive resonance", Cognitive Science, Vol. 11, pp. 23-64, 1987.
[Hall89] Rogers P. Hall, "Computational approaches to analogical reasoning: A comparative analysis", Artificial Intelligence, Elsevier Science Publishers B.V., Vol. 39, pp. 39-120, North-Holland, 1989.
[Jaco88] R.A. Jacobs, "Increased rates of convergence through learning rate adaptation", Neural Networks, Vol. 1, pp. 295-307, 1988.
[Keho88] E. James Kehoe, "A layered network model of associative learning: Learning to learn and configuration", Psychological Review, Vol. 95, No. 4, pp. 411-433, 1988.
[McCl89] Michael McCloskey and Neal J. Cohen, "Catastrophic interference in connectionist networks: the sequential learning problem", The Psychology of Learning and Motivation, Vol. 24, 1989.

[McCl94] James L. McClelland, Bruce L. McNaughton, and Randall C. O'Reilly, "Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory", Technical Report PDP.CNS.94.1, Department of Psychology, Carnegie Mellon University, Pittsburgh, PA, 1994.
[Mich93] R.S. Michalski, "Learning = Inferencing + Memorizing", Foundations of Knowledge Acquisition: Machine Learning, Kluwer Academic Publishers, pp. 1-41, Boston, MA, 1993.
[Mitc80] Tom M. Mitchell, "The need for biases in learning generalizations", Readings in Machine Learning, Morgan Kaufmann, pp. 184-191, San Mateo, CA, 1980.
[Mitc93] Tom Mitchell and Sebastian Thrun, "Explanation based neural network learning for robot control", Advances in Neural Information Processing Systems 5, Morgan Kaufmann, pp. 287-294, San Mateo, CA, 1993.
[Mitc97] Tom M. Mitchell, Machine Learning, McGraw Hill, New York, NY, 1997.
[Naik92] D.K. Naik, R.J. Mammone, and A. Agarwal, "Meta-Neural Network approach to learning by learning", Intelligence Engineering Systems through Artificial Neural Networks, ASME Press, Vol. 2, pp. 245-252, 1992.
[Naik93] D.K. Naik and Richard J. Mammone, "Learning by learning in neural networks", Artificial Neural Networks for Speech and Vision, ed. Richard J. Mammone, Chapman and Hall, London, 1993.
[Prat93] Lorien Y. Pratt, "Discriminability-based transfer between neural networks", Advances in Neural Information Processing Systems 5, Morgan Kaufmann, pp. 204-211, San Mateo, CA, 1993.
[Prat96] Lorien Pratt and Barbara Jennings, "A Survey of Transfer Between Connectionist Networks", Connection Science Special Issue: Transfer in Inductive Systems, Carfax Publishing Company, Vol. 8, No. 2, pp. 163-184, 1996.
[Ring93] Mark Ring, "Learning sequential tasks by incrementally adding higher orders", Advances in Neural Information Processing Systems 5, Morgan Kaufmann, pp. 115-122, San Mateo, CA, 1993.
[Robi95] Anthony V. Robins, "Catastrophic forgetting, rehearsal, and pseudorehearsal", Connection Science, Carfax Publishing Company, Vol. 7, pp. 123-146, 1995.
[Robi96a] Anthony V. Robins, "Consolidation in neural networks and in the sleeping brain", Connection Science Special Issue: Transfer in Inductive Systems, Carfax Publishing Company, Vol. 8, No. 2, pp. 259-275, 1996.
[Robi96b] Anthony V. Robins, "Transfer in Cognition", Connection Science Special Issue: Transfer in Inductive Systems, Carfax Publishing Company, Vol. 8, No. 2, pp. 185-203, 1996.
[Shar92] Noel E. Sharkey and Amanda J.C. Sharkey, "Adaptive generalization and the transfer of knowledge", Working paper, Centre for Connection Science, University of Exeter, UK, 1992.
[Shav90] Jude W. Shavlik and Geoffrey G. Towell, "An approach to combining explanation-based and neural learning algorithms", Readings in Machine Learning, Morgan Kaufmann, pp. 828-839, San Mateo, CA, 1990.

[Silv95] Daniel L. Silver and Robert E. Mercer, "Toward a model of consolidation: The retention and transfer of neural net task knowledge", Proceedings of the INNS World Congress on Neural Networks, Lawrence Erlbaum Associates, Vol. III, pp. 164-169, July 1995.
[Silv96a] Daniel L. Silver, "Consolidation and Transfer of Neural Network Task Knowledge, A PhD Proposal", Department of Computer Science, University of Western Ontario, London, Ontario, June 1996.
[Silv96b] Daniel L. Silver and Robert E. Mercer, "The parallel transfer of task knowledge using dynamic learning rates based on a measure of relatedness", Connection Science Special Issue: Transfer in Inductive Systems, Carfax Publishing Company, Vol. 8, No. 2, pp. 277-294, 1996.
[Silv97] Daniel L. Silver and Robert E. Mercer, "The functional transfer of knowledge for coronary artery disease diagnosis", Technical Report No. 513, Department of Computer Science, University of Western Ontario, London, Ontario, January 1997.
[Sing92] Satinder P. Singh, "Transfer of learning by composing solutions for elemental sequential tasks", Machine Learning, 1992.
[Sudd90] Steven Suddarth and Y. Kergoisien, "Rule injection hints as a means of improving network performance and learning time", Proceedings of the EURASIP Workshop on Neural Networks, 1990.
[Thru94] Sebastian Thrun and Tom M. Mitchell, "Learning one more thing", Technical Report CMU-CS-94-184, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 1994.
[Thru95a] Sebastian Thrun, "Lifelong Learning: A Case Study", Technical Report CMU-CS-95-208, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, November 1995.
[Thru95b] Sebastian Thrun and J. O'Sullivan, "Clustering learning tasks and the selective cross-task transfer of knowledge", Technical Report CMU-CS-95-209, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, November 1995.
[Towe90] Geoffrey G. Towell, Jude W. Shavlik, and Michiel O. Noordewier, "Refinement of approximate domain theories by knowledge-based neural networks", Proceedings of the Eighth National Conference on Artificial Intelligence (AAAI-90), AAAI Press/MIT Press, Vol. 2, pp. 861-866, Menlo Park, CA, 1990.
[Utgo86] Paul E. Utgoff, Machine Learning of Inductive Bias, Kluwer Academic Publishers, Boston, MA, 1986.
[Vogl88] T.P. Vogl, J.K. Mangis, A.K. Rigler, W.T. Zink, and D.L. Alkon, "Accelerating the convergence of the back-propagation method", Biological Cybernetics, Vol. 59, pp. 257-263, 1988.

[Figure 5 (panels (a) Ta through (g) Tg, plotted over the unit input square): The training sets for the 7 Band tasks Ta through Tg within their 2-variable input space. A `' indicates a positive example, a `o' indicates a negative example, and a `.' indicates an example with an unknown target class.]

[Figure 6 (panels (a) STL, (b) MTL, (c) ηMTL; axes Input 1 vs. Input 2): Classification of a validation set for task Tf by STL, MTL, and ηMTL hypotheses developed directly from the actual training examples for Ta through Te. A `' indicates a true positive classification by the network and a `o' indicates a true negative. A `+' indicates a false positive classification while `.' indicates a false negative.]

[Figure 7 (panels (a) STL, (b) MTL, (c) ηMTL; axes Input 1 vs. Input 2): Classification of a validation set for task Tg by STL, MTL, and ηMTL hypotheses developed directly from the actual training examples for Ta through Te. A `' indicates a true positive classification by the network and a `o' indicates a true negative. A `+' indicates a false positive classification while `.' indicates a false negative.]

[Figure 8 (panels (a) Ta through (e) Te, plotted over the unit input square): The sets of virtual training examples for Ta through Te generated by the domain knowledge networks for learning task Tf under TRM and ηMTL. Each set has a total of 50 examples. A `' indicates a positive example and a `o' indicates a negative example.]

[Figure 9 (panels (a) STL, (b) MTL, (c) ηMTL; axes Input 1 vs. Input 2): Classification of a validation set for task Tf by hypotheses developed by the STL learning method as well as by the MTL and ηMTL learning methods under TRM. A `' indicates a true positive classification by the network and a `o' indicates a true negative. A `+' indicates a false positive classification while `.' indicates a false negative.]

[Figure 10 (panels (a) STL, (b) MTL, (c) ηMTL; axes Input 1 vs. Input 2): Classification of a validation set for task Tg by hypotheses developed by the STL learning method as well as by the MTL and ηMTL learning methods under TRM. A `' indicates a true positive classification by the network and a `o' indicates a true negative. A `+' indicates a false positive classification while `.' indicates a false negative.]
