Learning From Context Without Coding-Tricks

John Case* (University of Delaware)
Sanjay Jain† (National University of Singapore)
Matthias Ott‡ (Universität Karlsruhe)
Arun Sharma§ (University of New South Wales)
Frank Stephan¶ (Universität Heidelberg)
Abstract
Empirical studies of multitask learning provide some evidence that the performance of a learning system on its intended targets improves when additional related tasks, also called contexts, are presented to it as additional input. Angluin, Gasarch, and Smith, as well as Kinber, Smith, Velauthapillai, and Wiehagen, have provided mathematical justification for this phenomenon in the inductive inference framework. However, their proofs rely heavily on self-referential coding tricks; that is, they directly code the solution of the learning problem into the context. In this work we prove, in the inductive inference setting, that multitask learning is extremely powerful even without using obvious coding tricks. Coding tricks are avoided in the powerful sense of Fulk's notion of robust learning. We also study the difficulty of the functional dependence between the intended target tasks and useful associated contexts.
* Department of CIS, University of Delaware, Newark, DE 19716, USA. Email: [email protected].
† Department of Information Systems and Computer Science, National University of Singapore, Singapore 119260, Republic of Singapore. Email: [email protected].
‡ Institut für Logik, Komplexität und Deduktionssysteme, Universität Karlsruhe, 76128 Karlsruhe, Germany. Email: [email protected]. Supported by the Deutsche Forschungsgemeinschaft (DFG) Graduiertenkolleg "Beherrschbarkeit komplexer Systeme" (GRK 209/2-96).
§ School of Computer Science and Engineering, University of New South Wales, Sydney 2052, Australia. Email: [email protected]. Supported by Australian Research Council Grant A49600456.
¶ Mathematisches Institut, Universität Heidelberg, Im Neuenheimer Feld 294, 69120 Heidelberg, Germany. Email: [email protected]. Supported by the Deutsche Forschungsgemeinschaft (DFG) Grant Am 60/9-1.
1 Introduction

There is empirical evidence that the performance of learning systems sometimes improves when they are modified to learn auxiliary, "related" tasks (called contexts) in addition to the primary tasks of interest [7, 8, 28]. For example, an experimental system to predict the value of German Daimler stock performed better when it was modified to simultaneously track the German stock index DAX [4]. The value of the Daimler stock here is the primary or target concept, and the value of the DAX, a related concept, provides useful auxiliary context. The additional task of recognizing road stripes was able to improve empirically the performance of a system learning to steer a car to follow the road [8]. Other examples where multitask learning has successfully been applied to real-world problems appear in [37, 33, 30, 14]. Importantly, this empirical phenomenon of context sensitivity in machine learning [28] is also supported, for example, by mathematical existence theorems for this phenomenon (and variants) in [1].¹ More technical theoretical work appears in [23, 26]. The theoretical papers [1, 23] provide theorems witnessing situations in which learnability absolutely (not just empirically) passes from impossible to possible in the presence of suitable auxiliary contexts to be learned. However, these theorems are proved there by means of self-referential coding tricks, where, in effect, correct hypotheses for the primary tasks are coded into the auxiliary contexts. The use of such coding tricks severely detracts from these otherwise interesting results. A crucial contribution of the present paper is to employ the notion of robust identification, essentially from [19], to show (in Sections 3 and 4 below) that these sorts of existence theorems hold without the use of obvious coding tricks! In Section 5 below we present results about the problem of finding useful auxiliary contexts to enable the robust learning of classes of functions which might not be learnable without such contexts. Since robustness is an important theme of the present paper, we next discuss it and other needed notions more carefully, before discussing, in more detail, the main results of the present paper.

In what follows, partial functions are to be understood to map non-negative integers into same. When they are defined for all non-negative integer arguments, i.e., when they are total, they will be referred to as functions. A machine M Ex-identifies a computable f just in case M, fed the values of f, outputs a sequence of programs eventually converging to a program for f [6, 10]. A class of functions S is Ex-identifiable just in case there is a machine that Ex-identifies each member of S.

¹ In other empirical work (for example, [42, 43, 41, 11, 12, 13, 16, 40, 39]), one pre-trains on a succession of prior tasks to achieve success on a current task. Mastery of previous tasks provides useful context for the next. One sees similar attempts in animal training by shaping desired behavior through a succession of approximations [21, 15] (e.g., to teach a dog to jump through a hoop of fire, first teach it to jump, then to jump through a non-fiery hoop, and then add fire). [9] and [3], in a related vein, show how to improve learnability by providing information in addition to data about the desired task: algorithms already nearly right, and higher-order but correct programs, respectively.
We see, then, situations in which it is useful to learn additional task(s) simultaneously (i.e., in parallel) and situations in which it is useful to have learned (or to be given how to do) some task(s) previously (i.e., in some sequence). [1] also theoretically exhibits cases of sequential dependence as well as the above-mentioned cases of parallel dependence. [38] theoretically exhibits cases of arbitrary dependence: digraph dependence.
Here is a particularly simple example of a self-referential coding trick. Let SD = {computable f | f(0) is the numerical name for a program for f}. Clearly, SD is Ex-identifiable, since a machine, on f ∈ SD, need only wait for the value f(0) and output it.² However, the Ex-identification of SD depends severely on programs for its members being coded into the values (at zero) of those members. In the 1970's, Barzdins was, in effect, concerned, among other things, with how to formulate that an existence result in function learnability was proved to hold without resort to such self-referential coding tricks. He, in effect, reasoned that one should concoct an associated function class S and then show that the desired result holds for any suitable computable transformation of S into a (possibly different) function class S′. The idea was that any suitable such computable transformation of a self-referential class would irretrievably scramble its coding tricks. For example, for SD itself, consider the transformation Θ_L such that, for all partial functions η and ψ and all non-negative integers x, if ψ = Θ_L(η), then ψ(x) = η(x + 1). Θ_L essentially "shifts the partial function to the left." It is easy to see that Θ_L(SD) = R, the class of all computable functions. It is well known that R is not Ex-identifiable; hence, Θ_L transforms the identifiable class SD into the unidentifiable class R by removing the self-referential information from SD that made it identifiable. Hence, the simple scrambling transformation Θ_L will wreak havoc with proofs of existence results about learnability which employ SD. Barzdins conjectured that learnability restricted to the avoidance of coding tricks as above would have a particularly simple form. Fulk [19] formally refuted this conjecture, but, for the present paper, we only borrow variants of some of Fulk's precise definitions.³

Before we introduce robust identification, it is necessary to define precisely the transformations we will consider for scrambling coding tricks. First: a computable operator is (by definition) a mapping from all partial functions to partial functions such that some algorithm transforms arbitrary listings of the graph of any input into a listing of the graph of the output.⁴ For example, Θ_L defined above is clearly a computable operator. So is the self-composition operator Θ_S such that, for all partial functions η, Θ_S(η) = η ∘ η. Of course, we want the scrambling of a class of computable functions to be again a class of computable functions. In the present paper, then, we define our scrambling transformations to be the general computable operators, which are (by definition) the computable operators that happen to map all total functions into total functions. The scrambling transformation Θ_L above is clearly a general computable operator, as is Θ_S.⁵ Based on [19], we define S ⊆ R to be robustly Ex-identifiable just in case, for all general computable operators Θ, Θ(S) is Ex-identifiable. In the present paper we employ variants of this notion to overcome coding tricks in models of multitask learning, and we will show that many of our existence theorems hold robustly.

² And it is a very large class of computable functions. [6] essentially show that such classes contain a finite variant of each computable function!
³ In the present paper, it would take us too far afield to go into more detail about Barzdins' conjecture itself.
⁴ These are called recursive operators in [35].
If, as in most of the present paper, we are interested in feeding such operators total function inputs only, then they are equivalent to the operators computed by Turing machines where the total function argument is treated as an oracle [35].
⁵ Some of our results hold for a strictly more general class of computable operators, i.e., computable operators which are merely required to map computable functions into same [20, 29], but it is still open whether all our results do. Of course, Θ_L and Θ_S also map computable functions into same. To simplify our exposition, though, in the present version of the paper, we will state all our results about computable operators in terms of general computable operators.
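To make the effect of Θ_L on SD concrete, here is a minimal illustrative Python sketch of the above discussion. It is only a toy model: the dictionary phi stands in for a hypothetical acceptable numbering, and sd_learner and theta_L are illustrative names, not the paper's formal machinery.

phi = {}  # hypothetical stand-in for an acceptable numbering: index -> total function

def sd_learner(prefix):
    # Ex-learner for SD: once f(0) is seen, conjecture it forever (0 mind changes),
    # since members of SD carry one of their own programs in their value at 0.
    return prefix[0] if prefix else None

def theta_L(f):
    # The left-shift operator Theta_L: (Theta_L(f))(x) = f(x + 1).
    return lambda x: f(x + 1)

# A self-describing member of SD: f(0) names a program for f itself.
def f(x):
    return 7 if x == 0 else x * x
phi[7] = f   # by assumption, index 7 is a program for f, so f lies in SD

g = theta_L(f)                      # g(x) = f(x + 1) = (x + 1) ** 2
print(sd_learner([f(0), f(1)]))     # 7: a correct program for f
print(sd_learner([g(0), g(1)]))     # 1: not a program for g; the shift has
                                    # scrambled the self-referential coding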
A machine M Bc-identifies a computable f just in case M, fed the values of f, outputs a sequence of programs such that, beyond some point in this sequence, all the programs compute f [5, 10].

In Section 3 we consider essentially the model of Kinber, Smith, Velauthapillai, and Wiehagen [24]. At least implicit in their model is the following definition: a learner M is said to (a, b)Ex-identify a set of b pairwise distinct functions just in case M, fed graphs of the b functions simultaneously, Ex-identifies at least a of them. They showed that, for all a, there is a class of functions S which cannot even be Bc-identified but which is (a, a+1)Ex-identifiable with 0 mind changes. This result may be viewed as a very strong case for learning from context, since presenting a + 1 functions from a class that is very difficult to learn (not even in Bc) allows finite learnability of at least a of them. Unfortunately, their proof of this result uses a coding trick that detracts from its strength. Our Theorem 3.3 below in Section 3 nicely provides the robust version of this result from [24]!

The above model of parallel learning may be viewed as learning from arbitrary context. No distinction is made between which function is the target concept and which function provides the context. Let P ⊆ R × R be given. Intuitively, for (f, g) ∈ P, f is the target function and g is the context. We say the class P is ConEx-identifiable if there exists a machine which, upon being fed graphs of (f, g) ∈ P (suitably marked as target and context), converges in the limit to a program for f. This model is similar to the parallel learning model of Angluin, Gasarch, and Smith [2], where they require the learner to output also a program for the context g. Our Theorem 4.5 in Section 4 is a robust version of their [2, Theorem 6].

The above model of ConEx-identification may be viewed as a bit too strong, since it requires a learner to be successful on each target with every context. In practice, it may be possible to determine a single suitable context for each target. We say that a class of functions S ⊆ R is SelEx-identifiable if there exists a mapping C: S → S such that {(f, C(f)) | f ∈ S} is ConEx-identifiable. Here, C may be viewed as a context mapping for the concept class S. Of course, the freedom to choose any computable context is very powerful since, then, even R can be SelEx-identified with 0 mind changes. To see this, just consider a mapping C that maps each f ∈ R to a computable function g such that g(0) codes a program for f. Then, after reading (f(0), g(0)), a machine need only output g(0). Of course, this may be viewed as cheating, since it resorts to a coding trick. For this reason we consider robust versions of both ConEx-identification and SelEx-identification. Surprisingly, though, we are able to show (Theorem 4.4 in Section 4) that, even if we do not allow obvious coding tricks, the class R can be SelEx-identified, although no longer with 0 mind changes!

Though R is robustly SelEx-learnable, the appropriate context mappings may be uncomputable, with unpleasant Turing complexity. For this reason, we also investigate the nature of the appropriate context mapping, to gain some understanding of the functional dependence between the target (primary task) and the context. In particular, we look for example classes that are SelEx-identifiable or robustly SelEx-identifiable but are not Ex-identifiable and are such that the context mapping may be more feasible. In machine learning presently, one generally selects, by common sense and intuition, auxiliary contexts that may help with some primary learning tasks.
Furthermore, one selects them to be, in some intuitive sense, related to the primary tasks. Then at least some empirically successful cases get reported. It would be good (but likely not easy) to supply advice, based on theorems, as to the selection of useful auxiliary contexts. In Section 5 below we present some of our preliminary results on this difficult problem. There we preliminarily model a possibly
useful context as related to a class S of primary functions if it is in S too. Continuous operators are essentially characterized as those for which an algorithm for passing from the graph of the input partial function to that of the output partial function may or may not have access to some non-computable oracle.⁶ The general continuous operators are, then (by definition), the continuous operators which happen to take all total functions into total functions. Theorem 5.1 in Section 5 below surprisingly says that, for S ⊆ R, where S is finitely invariant, no general continuous operator mapping S to S can be used to find contexts in S enabling us to robustly SelEx-identify S, unless we could robustly Ex-identify S without any such helping context. Let us see what might be salvaged. For some classes of primary tasks (functions) S, we may know some programs for members of the class. If we could find some computable mapping h from programs for members of S into programs for associated useful contexts also in S, we might be able to get some intuition from h on how to find contexts, at least in some cases, given not programs for targets in S but, as may happen in learnability, a general description of S and data about targets in S. Anyhow, we have a surprising but preliminary positive result (Theorem 5.4 in Section 5 below): for S = the (0-1 valued) members of R, some computable mapping h takes us from programs for members of S to useful contexts in S enabling us to robustly SelEx-identify S!

⁶ A formal definition appears in Section 5 below. Further discussion of the equivalence to the standard topological notion of continuity may be found in [35].
2 Preliminaries

The set of natural numbers, i.e., the set of the non-negative integers, is denoted by ω. If A ⊆ ω^n, we write A|_i = {x_i | (x_1, ..., x_i, ..., x_n) ∈ A} for the projection to the i-th component. We use an acceptable programming system φ_0, φ_1, ... for the class of all partial computable functions [34, 36]. MinInd(f) = min{e | φ_e = f} is the minimal index of a partial computable function f with respect to this programming system. R denotes the set of all (total) computable functions; R_{0,1} denotes the class of all {0,1}-valued functions from R. A class S ⊆ R is computably enumerable if it is empty or S = {φ_{h(i)} | i ∈ ω} for some h ∈ R. Seq = ω* is the set of all finite sequences over ω. For strings σ, τ ∈ Seq ∪ ω^ω, σ ⊆ τ means that σ is an initial segment of τ. |a_1 ... a_n| = n denotes the length of a string a_1 ... a_n ∈ Seq. Total functions f: ω → ω are identified with the infinite string f(0)f(1)... ∈ ω^ω. We write f[n] for the initial segment f(0) ... f(n−1) of a total function f. For sets D ⊆ S ⊆ ω^ω, we say that D is a dense subset of S if (∀f ∈ S)(∀n)(∃g ∈ D)[f[n] ⊆ g]. Let P denote the class of all partial functions mapping ω to ω. An operator Θ: P → P is general if Θ(f) is total for all total functions f. Θ_0, Θ_1, ... denotes a (non-effective) listing of all general computable operators. If Θ is a general computable operator and σ is a string, we say that Θ(σ)(x) is defined and equals y (written Θ(σ)(x)↓ = y) if, for all total functions f ⊇ σ, Θ(f)(x) = y and during the computation of Θ(f)(x) only values f(z) with z < |σ| are used. The quantifier (∀^∞ n) abbreviates (∃m)(∀n ≥ m). Learning machines are typically total Turing machines which compute some mapping Seq^m → ω^n (although sometimes we allow them to output a ? symbol). Bc is the class of all subsets S of R such that there exists a learning machine M: Seq → ω with (∀f ∈ S)(∀^∞ n)[φ_{M(f[n])} = f]. S is in Ex if (∀f ∈ S)(∃e)[φ_e = f ∧ (∀^∞ n)[M(f[n]) = e]] for some learning machine M. M makes a mind change at
stage n + 1 on input f if ? ≠ M(f[n]) ≠ M(f[n+1]). A learner M learns S finitely iff, for every f ∈ S, M, fed the graph of f, outputs only one program, and this program is correct for f (besides this program, M might also initially output a special symbol ? to indicate that it has not yet made up its mind on the hypothesis). Fin denotes the collection of all finitely learnable classes S, that is, the classes which are Ex-learnable without any mind changes. It is well known that Fin ⊂ Ex ⊂ Bc and R_{0,1} ∉ Bc (see, e.g., [32]).
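As an illustration of Ex-learning and mind changes, the classical identification-by-enumeration strategy Ex-learns any computably enumerable class of total functions. The following Python sketch is illustrative only; the finite list enum stands in for an enumeration φ_{h(0)}, φ_{h(1)}, ... of total functions, and learn_by_enumeration is not a construction from this paper.

def learn_by_enumeration(enum, prefix):
    # Conjecture the least index i whose function agrees with the data seen so far.
    # For a computably enumerable class of total functions this converges, in the
    # limit, to the least i with enum[i] equal to the target (an Ex-learner).
    for i, g in enumerate(enum):
        if all(g(x) == v for x, v in enumerate(prefix)):
            return i
    return None  # cannot happen in the limit if the target belongs to the class

# Usage: conjectures on growing prefixes of the target stabilize after finitely
# many mind changes.
enum = [lambda x: 0, lambda x: x, lambda x: x * x]
target = enum[2]
for n in range(1, 5):
    print(learn_by_enumeration(enum, [target(x) for x in range(n)]))  # 0, 1, 2, 2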
3 Learning From Arbitrary Contexts

A very restricted form of multitask learning arises when we require that the learning machine be successful with any context from the concept class under consideration. In this case it is most natural (as argued below) to look at the learning problem in a symmetric manner; that is, we do not distinguish between the target function and the context. Instead, we treat each input function with the same importance and try to learn programs for each of them (but may only be successful on some of the input functions). However, in this case we do have to require that the input functions are pairwise different; otherwise, we do not get a different learning notion, since the ordinary Ex-learning problem would reduce to such a learning type. The resulting learning notion, which we formally introduce in the next definition, has essentially already been introduced and studied by Kinber, Smith, Velauthapillai and Wiehagen [24, 25] (see also the work of Kummer and Stephan [26]).

Definition 3.1 S ⊆ R is in (a, b)Ex if there exists a learning machine M such that, for all pairwise distinct f_1, ..., f_b ∈ S:

(∃ i_1 < ... < i_a)(∃ e_1, ..., e_a)(∀ j, 1 ≤ j ≤ a) [φ_{e_j} = f_{i_j} ∧ (∀^∞ n)[M(f_1[n], ..., f_b[n])|_{i_j} = e_j]].

In the literature, so far only the very restrictive finite identification variant of (a, b)Ex has been studied, in which the learner has to correctly infer a out of the b given functions without any mind changes. For this version, it was shown in [24] that, for all a, there is a class S of functions which is not in Bc but is (a, a+1)Ex-learnable without any mind changes. Thus, presenting a + 1 functions of a non-Bc-learnable class in parallel may allow finite learnability of at least a of the a + 1 functions. This result appears to provide a very strong case for the usefulness of parallel learning. However, the proof of this result uses a coding trick. More precisely, each non-empty finite subset F of S contains one function which holds programs for all other functions of F in its values. This coding can easily be destroyed by applying a computable operator to the functions in S. Therefore, the proof given in [24] cannot be applied to handle the following robust version of multitask learning with arbitrary context.

Definition 3.2 S ⊆ R is in (a, b)RobEx if Θ(S) ∈ (a, b)Ex for all general computable operators Θ.

Surprisingly, we are able to show that there are still classes in (a, a+1)RobEx − Bc. Thus, one can establish without using obvious coding tricks that Ex-learning at least a out of a + 1 functions presented in parallel may capture problems which are not even in Bc!⁷

⁷ This result can be "improved" to show that there are classes that are not in Bc, but which can be robustly (a, a+2)-finitely identified (i.e., with 0 mind changes). Furthermore, one can show that there are classes that are not in Bc, but which can be robustly (a, a+1)-finitely identified if one is willing to tolerate a finite number of errors in the output programs. It is open at present whether (a, a+1)RobFin − Bc ≠ ∅, where RobFin is RobEx with 0 mind changes.
Theorem 3.3 (a, a+1)RobEx ⊈ Bc for all a ∈ ω.

Proof. Let M_0, M_1, ... be an enumeration of all learning machines. We inductively define functions g_0, g_1, ... and sequences σ_0, σ_1, ... below. Suppose we have defined g_i, σ_i for i < n. Then define g_n and σ_n as follows:

(1) Choose g_n such that (a) for i ≤ n, M_i does not Bc-infer g_n, and (b) if n > 0, then σ_{n−1} ⊆ g_n.

(2) Choose σ_n ⊆ g_n such that (a) for all m ≤ n and all x ≤ MinInd(g_n), Θ_m(σ_n)(x)↓, and (b) if n > 0, then σ_{n−1} ⊆ σ_n.

Now let S = {g_n | n ∈ ω}. By construction, M_i does not Bc-identify g_n for n ≥ i. Thus, S ∉ Bc.

We now show S ∈ (a, a+1)RobEx for a = 1. The generalization to arbitrary a ∈ ω is straightforward. Suppose an arbitrary general computable operator Θ_k is given. We need to show that {Θ_k(g_n) | n ∈ ω} ∈ (1, 2)Ex. Note that (A) (a, b)Ex is closed under union with finite sets, and (B) R, the class of all the computable functions, can be identified in the Ex sense if one is given an upper bound on the minimal program for the input function in addition to the graph of the input function [18] (see also [22]). Thus, it suffices to construct a machine M such that, for all i and j satisfying i, j ≥ k and Θ_k(g_i) ≠ Θ_k(g_j), M(Θ_k(g_i), Θ_k(g_j))↓ ≥ min{MinInd(Θ_k(g_i)), MinInd(Θ_k(g_j))}. Let e_r be a program, obtained effectively from r, for Θ_k(φ_r). Define M as follows.

M(f_1[n], f_2[n]) = 0,                  if f_1[n] = f_2[n];
M(f_1[n], f_2[n]) = max{e_r | r ≤ y},   if y = min{x | f_1(x) ≠ f_2(x)}.

We claim that, for all i and j such that i, j ≥ k and Θ_k(g_i) ≠ Θ_k(g_j), M(Θ_k(g_i), Θ_k(g_j))↓ ≥ min{MinInd(Θ_k(g_i)), MinInd(Θ_k(g_j))}. To see this, suppose i, j ≥ k, f_1 = Θ_k(g_i), f_2 = Θ_k(g_j), and f_1 ≠ f_2. Let r = min{i, j}. Thus, σ_r ⊆ g_i and σ_r ⊆ g_j. It follows by (2) above that, for all x ≤ MinInd(g_r), f_1(x) = f_2(x). Thus, min{x | f_1(x) ≠ f_2(x)} > MinInd(g_r). It follows that M(f_1, f_2) ≥ max{e_{r′} | r′ ≤ MinInd(g_r)}. Thus, M(f_1, f_2) ≥ min{MinInd(f_1), MinInd(f_2)}. The theorem follows. □
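The heart of the above proof is the machine M, which, on two distinct transformed functions, converges to an upper bound on the minimal index of one of them; fact (B) then turns such a bound into Ex-identification. Below is a minimal Python sketch of this bound estimator. The parameter e is a hypothetical stand-in for the effective map from r to e_r (a program for Θ_k(φ_r)); the composition with the bound-using learner of [18] is not shown.

def bound_estimator(e, prefix1, prefix2):
    # On prefixes of two functions that eventually disagree, converge to
    # max{ e(r) : r <= y }, where y is the least point of disagreement.
    # By the argument above, this value bounds MinInd of one of the two inputs.
    n = min(len(prefix1), len(prefix2))
    for y in range(n):
        if prefix1[y] != prefix2[y]:
            return max(e(r) for r in range(y + 1))
    return 0  # no disagreement seen yet; the placeholder is irrelevant in the limit

# Hypothetical usage with a toy map e:
e = lambda r: 10 * r + 3
print(bound_estimator(e, [1, 2, 3, 9], [1, 2, 3, 4]))  # disagreement at y = 3, prints 33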
In addition to Bc, one can also show, for all other inference types IT which do not contain a cone {f ∈ R | σ ⊆ f} for some σ, that (a, a+1)RobEx contains classes which are not in IT. This is achieved by suitably modifying (1) in the proof just above. For example, for all non-high sets A there exist (a, a+1)RobEx-inferable classes which are not in Ex[A] (see [17, 27]).⁸ However, along the lines of [24] it follows that (b, b)Ex = Ex and, in particular, (b, b)RobEx = RobEx for all b ≥ 1. Thus, it is not possible to improve Theorem 3.3 to (b, b)RobEx-learning. Furthermore, one may wonder whether it is possible to guarantee that an (a, b)Ex-learner always correctly infers, say, the first of the b input functions. This means that we declare the first function as the target function and all other functions as context. However, one can show that this yields exactly the class Ex, by choosing the context functions always from a set
F = {g_1, g_2, ..., g_b} of cardinality b. Then, on any input function f, we simulate the (a, b)Ex-learner on (f, h_{i_1}, ..., h_{i_{b−1}}), where Y = {h_{i_1}, ..., h_{i_{b−1}}} is a subset of F − {f} containing b − 1 functions. Thus, variants of learning with arbitrary context where a target function is designated do not increase the learning power compared to an ordinary Ex-learner.

⁸ A is high iff K′ ≤_T A′.
4 Learning From Selected Contexts

In Section 3 we established that an arbitrary context may be enough to increase learning ability without resorting to any obvious coding tricks, if one is willing to pay the price of not learning at most one of the input functions. Of course, on intuitive grounds, it is to be expected that the learning power can be increased further if the context given to the learner is not arbitrary but carefully selected. In order to formally define such a notion of learning from selected contexts, we first introduce the notion of asymmetric learning from context. This notion is asymmetric since, in contrast to Section 3, here we distinguish between the target function and the context:

Definition 4.1 P ⊆ R × R is in ConEx if there exists a learning machine M such that, for all (f, g) ∈ P:

(∃e)[φ_e = f ∧ (∀^∞ n)[M(f[n], g[n]) = e]].

For (f, g) ∈ P we call f the target and g the context function.

The concept ConEx is related to the notion of parallel learning studied by Angluin, Gasarch and Smith [2], the main difference being that in the learning type from [2] the learning machine is required to infer programs for both input functions, not just the target. In Theorem 4.5 we will also investigate the robust version of parallel learning as defined in [2], and present a "coding-free" proof for one of the results from [2]. The next definition formally introduces the notion of learning in the presence of a selected context:

Definition 4.2 S ⊆ R is in SelEx if there exists a mapping C: S → S such that the class S^C := {(f, C(f)) | f ∈ S} is in ConEx. C is called a context mapping for S.

As was discussed in the introductory section, using a coding trick one can easily see that the freedom of carefully selecting a context yields extreme increases in learning power. Indeed, the entire class R is in SelEx without any mind changes. However, if we do not allow coding tricks, as specified in Definition 4.3 below, we can still show that freely selecting a context makes it possible to learn the class of all computable functions. And this result is far from being obvious.

Definition 4.3 P ⊆ R × R is in RobConEx if the class Θ(P) := {(Θ(f), Θ(g)) | (f, g) ∈ P} is in ConEx for all general computable operators Θ. S ⊆ R is in RobSelEx if there exists a context mapping C: S → S such that the class S^C is in RobConEx.
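For contrast, the coding trick that Definition 4.3 is designed to rule out can be made concrete with a small Python sketch. As in the introduction, the context hides a program for the target in its value at 0; phi is a hypothetical stand-in for an acceptable numbering, and the names below are illustrative only.

phi = {}  # hypothetical numbering: program index -> total function

def coding_trick_context(program_for_f):
    # Context mapping C of the introduction: map f (given via one of its programs)
    # to the constant function g with g(x) = program_for_f, so g(0) codes f.
    return lambda x: program_for_f

def cheating_learner(target_prefix, context_prefix):
    # "ConEx-learner" for the coded pairs: simply read the program off g(0).
    # Zero mind changes, but no genuine inference takes place.
    return context_prefix[0] if context_prefix else None

phi[42] = lambda x: x + 5             # assume index 42 computes x |-> x + 5
g = coding_trick_context(42)
print(cheating_learner([5], [g(0)]))  # 42, a program for the target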
Theorem 4.4 If S ⊆ R contains a dense, computably enumerable subclass, then S ∈ RobSelEx. In particular, R ∈ RobSelEx.
Proof. The proof is based on ideas similar to those in the proof of Theorem 3.3. Let f_0, f_1, ... be a (not necessarily computable) listing of the functions in S, and let {φ_{h(i)}}_{i∈ω}, with h ∈ R, be a dense subset of S. For each n choose a sequence σ_n ⊆ f_n such that
for all m ≤ n and all x ≤ MinInd(f_n): Θ_m(σ_n)(x)↓.     (1)

We define the context mapping C: S → S by C(f_n) = φ_{h(i)} for the least i such that σ_n ⊆ φ_{h(i)}. We want to show that S^C is in RobConEx. Let an arbitrary general computable operator Θ_k be given. We need to show that {(Θ_k(f), Θ_k(C(f))) | f ∈ S} ∈ ConEx. For this, first note that (A) ConEx is closed under union with finite sets, and (B) R, the class of all the computable functions, can be identified in the Ex sense if one is given an upper bound on the minimal program for the input function in addition to the graph of the input function [18] (see also [22]). Thus, it suffices to construct a machine M such that, for all n ≥ k, (i) if Θ_k(f_n) = Θ_k(C(f_n)), then M(Θ_k(f_n), Θ_k(C(f_n)))↓ to a program for Θ_k(f_n) = Θ_k(C(f_n)), and (ii) if Θ_k(f_n) ≠ Θ_k(C(f_n)), then M(Θ_k(f_n), Θ_k(C(f_n)))↓ ≥ MinInd(Θ_k(f_n)). Let e_r be a program, obtained effectively from r, for Θ_k(φ_r). Define M as follows.

M(f[n], g[n]) = e_{h(r)},            if f[n] = g[n] and r = min{r′ | g[n] ⊆ Θ_k(φ_{h(r′)})};
M(f[n], g[n]) = max{e_r | r ≤ y},    if y = min{x | f(x) ≠ g(x)}.
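A rough Python sketch of this machine M follows. Here transformed(r) is a hypothetical stand-in for the total function Θ_k(φ_{h(r)}), e_h and e stand for the effective maps from r to e_{h(r)} and to e_r, and, as a simplification, the search for r in the first clause is truncated at the length of the data, which is harmless for the limiting behaviour.

def M(transformed, e_h, e, prefix_f, prefix_g):
    n = min(len(prefix_f), len(prefix_g))
    if prefix_f[:n] == prefix_g[:n]:
        # First clause: conjecture e_h(r) for the least r (searched up to n here)
        # whose transformed dense-subclass member extends the context seen so far.
        for r in range(n):
            if all(transformed(r)(x) == prefix_g[x] for x in range(n)):
                return e_h(r)
        return None  # no candidate found yet; harmless for the limit behaviour
    # Second clause: past the least disagreement y, output a bound as in Theorem 3.3.
    y = min(x for x in range(n) if prefix_f[x] != prefix_g[x])
    return max(e(r) for r in range(y + 1))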
We claim that, for all n ≥ k, (i) if Θ_k(f_n) = Θ_k(C(f_n)), then M(Θ_k(f_n), Θ_k(C(f_n)))↓ to a program for Θ_k(f_n) = Θ_k(C(f_n)), and (ii) if Θ_k(f_n) ≠ Θ_k(C(f_n)), then M(Θ_k(f_n), Θ_k(C(f_n)))↓ ≥ MinInd(Θ_k(f_n)). To see this, consider any n ≥ k. If Θ_k(f_n) = Θ_k(C(f_n)), then, in particular, we have Θ_k(f_n) ∈ {Θ_k(φ_{h(r′)}) | r′ ∈ ω}. Thus, the first clause in the definition of M ensures that M ConEx-identifies (Θ_k(f_n), Θ_k(C(f_n))). If Θ_k(f_n) ≠ Θ_k(C(f_n)), then by the definition of σ_n and C(f_n) we have σ_n ⊆ f_n and σ_n ⊆ C(f_n). Thus, by (1), min{x | Θ_k(f_n)(x) ≠ Θ_k(C(f_n))(x)} > MinInd(f_n). Thus, by the second clause in the definition of M, it follows that M(Θ_k(f_n), Θ_k(C(f_n))) ≥ MinInd(Θ_k(f_n)). The theorem follows. □

Theorem 4.4 can be improved in several ways. First, one can show that the mapping C: R → R, which provides a context for each function in R, can actually be chosen 1:1 and onto. Furthermore, this 1:1 and onto mapping can be constructed in such a way that not only the target functions but also the context functions can be robustly learned in parallel. In order to state this result, we let ParEx denote the variant of ConEx from Definition 4.1 in which the learning machine is replaced by a machine M: Seq² → ω² such that M converges on each (f, g) ∈ P to a pair of programs (i, j) with φ_i = f and φ_j = g. ParEx coincides exactly with the 2-ary parallel learning type as defined in [2]. Analogously to the other robust variants, we let RobParEx contain all classes P ⊆ R × R such that {(Θ(f), Θ(g)) | (f, g) ∈ P} ∈ ParEx for all general computable operators Θ. Thus, the following theorem provides a surprising robust version of [2, Theorem 6].

Theorem 4.5 There exists a class P ⊆ R × R such that P|_1 = P|_2 = R, but P ∈ RobParEx.

Our results demonstrate that learning with a selected context is a very powerful learning notion; in particular, it renders the entire class of computable functions learnable. However, it should
be noted that, in this particular case, the high learning power also results from the very large function space, namely R, from which a context can be selected. We required in Definition 4.2 that, for each class S ⊆ R, the context which we associate with each f ∈ S is also chosen from the set S. So, if one considers proper subsets S ⊂ R, it may happen that one loses learning power just because the space of possible contexts is reduced. Indeed, one can show that even the non-robust version SelEx of learning from selected contexts does not contain all subsets of R. Due to the result from Theorem 4.4 that every class with a dense, computably enumerable subclass is in RobSelEx, it is actually not easy to construct such a set which is not in SelEx.⁹ Although the result is perhaps not unexpected, it is, nonetheless, our most difficult result to prove.

⁹ Furthermore, this result represents a rather unusual phenomenon, since in inductive inference most learning types are closed with respect to subclasses.
Theorem 4.6 (∃S ⊆ R)[S ∉ SelEx].
5 Measuring The Functional Dependence

In Section 4 we showed that there are very hard learning problems which become learnable when a suitably selected context is supplied to the learner. In this section we analyze the possible functional dependence between the target functions and the contexts in such examples. In particular, we attempt to find examples which are only learnable with a context, but such that the functional dependence between the target function and the context is manageable. The functional dependence is measured from a computability-theoretic point of view; that is, we are looking for examples in RobSelEx − Ex and SelEx − Ex such that the problem of implementing a suitable context mapping has low Turing complexity. We will consider two types of implementations for context mappings: operators, which work on (the values of) the target functions, and program mappings, which work on programs for the target functions.

First we consider context mappings C: S → S which are implementable by operators. Here, one can show that, if the functional dependence between the target function and the context is too manageable, then the multitask problem will not have the desired property; that is, learnability with the help of a selected context will imply the learnability of the target functions without any context. This can easily be seen if one assumes that the context mapping is implemented by a computable operator. Suppose C is a general computable operator mapping S → S. Let C′ be a general computable operator such that (a) for all total f, C′(f) = C(f), and (b) for all σ, {x | C′(σ)(x)↓} is finite, and a canonical index [35] for it can be effectively determined from σ. Note that such a C′ can easily be constructed by "slowing down" C appropriately. Now the function

s(σ) = max{τ | (∀x < |τ|)[C′(σ)(x)↓ = τ(x)]}

is computable. Note that ⋃_{n∈ω} s(f[n]) = C′(f) = C(f) for all f ∈ S. This implies that the machine defined by N(f[n]) = M(f[|s(f[n])|], s(f[n])), where M is a ConEx-learner witnessing S ∈ SelEx via C, Ex-infers every f ∈ S. In the case of robust learning, this observation can even be surprisingly generalized to the result that, for all classes S ⊆ R which are closed under finite variants and robustly learnable from selected contexts, the existence of a general continuous operator implementing a context mapping is enough to guarantee the robust learnability of S itself!
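A minimal Python sketch of the wrapper N just described, under the stated assumptions: C_prime(prefix) plays the role of the slowed-down operator C′ and returns the initial segment s(f[n]) of the context that is already defined after reading f[n], and M is a ConEx-learner for the class of pairs. Both parameters are hypothetical stand-ins, not the paper's formal objects.

def N(M, C_prime, f_prefix):
    # Simulate the ConEx-learner M on (f, C(f)) using data about f alone:
    # N(f[n]) = M(f[|s(f[n])|], s(f[n])).
    s = C_prime(f_prefix)           # s(f[n]): the known initial segment of C(f)
    m = min(len(s), len(f_prefix))  # assume C' never outputs beyond its input length
    return M(f_prefix[:m], s[:m])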
An operator Θ, not necessarily computable, is continuous (by definition)¹⁰ iff it is compact, that is, (∀f)(∀x)(∃σ ⊆ f)[Θ(σ)(x) = Θ(f)(x)], and monotone, that is, (∀σ, τ)(∀x)[σ ⊆ τ ∧ Θ(σ)(x)↓ ⟹ Θ(τ)(x)↓ = Θ(σ)(x)]. As noted in Section 1 above, the general continuous operators are the continuous operators which map all total functions into total functions.

Theorem 5.1 If S ⊆ R is closed under finite variants and RobSelEx-learnable as witnessed by a (not necessarily computable) general continuous operator C: S → S, then S ∈ RobEx.

Since each operator which is general computable relative to some oracle A is already general continuous, we thus cannot hope to find examples in RobSelEx − Ex such that the context mapping can be implemented by a general A-computable operator, no matter how complex the oracle A is! However, for the non-robust version such examples exist for all suitably non-trivial oracles A, i.e., for all A such that Ex[A] − Ex ≠ ∅. An abundance of such A exist by [17, 27].

Theorem 5.2 Let S ∈ Ex[A] − Ex be such that S contains all almost constant functions. Then S is SelEx-learnable as witnessed by some general A-computable operator.

We now turn our attention to context mappings which are implementable by program mappings. One can show that the context mapping C: R → R constructed in Theorem 4.4 in Section 4 above is computable relative to K′; that is, there exists a partial K′-computable function h: ω → ω with

(∀e)[φ_e ∈ R ⟹ [h(e)↓ ∧ φ_{h(e)} = C(φ_e)]].

Thus, K′ provides an upper bound on the Turing degree of context mappings for classes in RobSelEx − Ex. However, if one wants to reduce this upper bound, the problem arises that these program mappings are generally not invariant with respect to different indices of the same function [31].¹¹ And transforming an arbitrary program mapping into an invariant (or extensional) one generally requires an oracle of degree K′. It is convenient, then, for the sequel, to use the following equivalent definition for SelEx and RobSelEx instead of Definition 4.2:

Definition 5.3 S ⊆ R is in SelEx if there exists a class S′ ⊆ S × S in ConEx and a partial program mapping h: ω → ω such that, for every φ_e ∈ S, h(e)↓ and the pair (φ_e, φ_{h(e)}) is in S′. If, furthermore, S′ can be chosen from RobConEx, then we say that S is in RobSelEx.

Recall that there are no classes S ∈ RobSelEx − Ex such that the corresponding context mapping is implementable by any, even noncomputable, general continuous operator. In contrast, the following interesting theorem shows that the class R_{0,1} is in RobSelEx as witnessed by a program mapping h which is computable, i.e., which requires no oracle to compute.

Theorem 5.4 R_{0,1} is RobSelEx-learnable as witnessed by a computable program mapping.

On the other hand, one can also show that there is no upper bound on program mappings implementing context mappings for classes in RobSelEx − Ex. That is, for each oracle A, there is a class S ∈ RobSelEx − Ex such that, for every program mapping h: ω → ω witnessing S ∈ RobSelEx, A is Turing reducible to h.

¹⁰ As we pointed out in Section 1 above, continuous operators are essentially characterized as those for which an algorithm for passing from the graph of the input partial function to that of the output partial function may or may not have access to some non-computable oracle.
¹¹ They are not extensional in the terminology of [35].
References

[1] D. Angluin, W. Gasarch, and C. Smith. Training sequences. Theoretical Computer Science, 66(3):255–272, 1989.
[2] D. Angluin, W. I. Gasarch, and C. H. Smith. Training sequences. Theoretical Computer Science, 66(3):255–272, 1989.
[3] G. Baliga and J. Case. Learning with higher order additional information. In K. Jantke and S. Arikawa, editors, Algorithmic Learning Theory, volume 872 of Lecture Notes in Artificial Intelligence, pages 64–75. Springer-Verlag, Berlin, Reinhardsbrunn Castle, Germany, October 1994.
[4] K. Bartlmae, S. Gutjahr, and G. Nakhaeizadeh. Incorporating prior knowledge about financial markets through neural multitask learning. In Proceedings of the Fifth International Conference on Neural Networks in the Capital Markets, 1997.
[5] J. Barzdin. Two theorems on the limiting synthesis of functions. In Theory of Algorithms and Programs, Latvian State University, Riga, 210:82–88, 1974.
[6] L. Blum and M. Blum. Toward a mathematical theory of inductive inference. Information and Control, 28:125–155, 1975.
[7] R. A. Caruana. Multitask connectionist learning. In Proceedings of the 1993 Connectionist Models Summer School, pages 372–379, 1993.
[8] R. A. Caruana. Algorithms and applications for multitask learning. In Proceedings of the 13th International Conference on Machine Learning, pages 87–95. Morgan Kaufmann, 1996.
[9] J. Case, S. Kaufmann, E. Kinber, and M. Kummer. Learning recursive functions from approximations. In Proceedings of the Second European Conference on Computational Learning Theory (EuroCOLT'95), volume 904 of Lecture Notes in Artificial Intelligence, pages 140–153, March 1995. Journal version to appear in Journal of Computer and System Sciences (Special Issue for EuroCOLT'95).
[10] J. Case and C. Smith. Comparison of identification criteria for machine inductive inference. Theoretical Computer Science, 25:193–220, 1983.
[11] H. de Garis. Genetic programming: Building nanobrains with genetically programmed neural network modules. In IJCNN: International Joint Conference on Neural Networks, volume 3, pages 511–516. IEEE Service Center, Piscataway, New Jersey, June 17–21, 1990.
[12] H. de Garis. Genetic programming: Modular neural evolution for Darwin machines. In M. Caudill, editor, IJCNN-90-WASH DC: International Joint Conference on Neural Networks, volume 1, pages 194–197. Lawrence Erlbaum Associates, Publishers, Hillsdale, New Jersey, January 1990.
[13] H. de Garis. Genetic programming: Building artificial nervous systems with genetically programmed neural network modules. In B. Soucek and T. I. Group, editors, Neural and Intelligent Systems Integration: Fifth and Sixth Generation Integrated Reasoning Information Systems, chapter 8, pages 207–234. John Wiley & Sons, Inc., New York, 1991.
[14] T. G. Dietterich, H. Hild, and G. Bakiri. A comparison of ID3 and backpropagation for English text-to-speech mapping. Machine Learning, 18(1):51–80, 1995.
[15] M. Domjan and B. Burkhard. The Principles of Learning and Behavior. Brooks/Cole Publishing Company, Pacific Grove, California, 2nd edition, 1986.
[16] S. Fahlman. The recurrent cascade-correlation architecture. In R. Lippmann, J. Moody, and D. Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 190–196. Morgan Kaufmann Publishers, Inc., San Mateo, California, 1991.
[17] L. Fortnow, W. Gasarch, S. Jain, E. Kinber, M. Kummer, S. Kurtz, M. Pleszkoch, T. Slaman, R. Solovay, and F. Stephan. Extremes in the degrees of inferability. Annals of Pure and Applied Logic, 66:231–276, 1994.
[18] R. V. Freivalds and R. Wiehagen. Inductive inference with additional information. Elektronische Informationsverarbeitung und Kybernetik, 15:179–185, 1979.
[19] M. Fulk. Robust separations in inductive inference. In Proceedings of the 31st Annual Symposium on Foundations of Computer Science, pages 405–410, St. Louis, Missouri, 1990.
[20] J. Helm. On effectively computable operators. Zeitschrift für Mathematische Logik und Grundlagen der Mathematik, 17:231–244, 1971.
[21] E. Hilgard and G. Bower. Theories of Learning. Prentice-Hall, Englewood Cliffs, NJ, 4th edition, 1975.
[22] S. Jain and A. Sharma. Learning with the knowledge of an upper bound on program size. Information and Computation, 102(1):118–166, January 1993.
[23] E. Kinber, C. Smith, M. Velauthapillai, and R. Wiehagen. On learning multiple concepts in parallel. In Proceedings of the Workshop on Computational Learning Theory, pages 175–181. ACM, NY, 1993.
[24] E. Kinber, C. H. Smith, M. Velauthapillai, and R. Wiehagen. On learning multiple concepts in parallel. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages 175–181, Santa Cruz, CA, USA, 1993. ACM Press.
[25] E. Kinber and R. Wiehagen. Parallel learning - a recursion-theoretic approach. Informatik-Preprint 10, Fachbereich Informatik, Humboldt-Universität, 1991.
[26] M. Kummer and F. Stephan. Inclusion problems in parallel learning and games. Journal of Computer and System Sciences (Special Issue COLT'94), 52(3):403–420, 1996.
[27] M. Kummer and F. Stephan. On the structure of degrees of inferability. Journal of Computer and System Sciences, 52(2):214–238, April 1996.
[28] S. Matwin and M. Kubat. The role of context in concept learning. In M. Kubat and G. Widmer, editors, Proceedings of the ICML-96 Pre-Conference Workshop on Learning in Context-Sensitive Domains, Bari, Italy, pages 1–5, 1996.
[29] A. Meyer and P. Fischer. Computational speed-up by effective operators. Journal of Symbolic Logic, 37:48–68, 1972.
[30] T. Mitchell, R. Caruana, D. Freitag, J. McDermott, and D. Zabowski. Experience with a learning personal assistant. Communications of the ACM, 37:80–91, 1994.
[31] P. Odifreddi. Classical Recursion Theory. North-Holland, Amsterdam, 1989.
[32] D. Osherson, M. Stob, and S. Weinstein. Systems that Learn. MIT Press, Cambridge, Massachusetts, 1986.
[33] L. Pratt, J. Mostow, and C. Kamm. Direct transfer of learned information among neural networks. In Proceedings of the 9th National Conference on Artificial Intelligence (AAAI-91), 1991.
[34] H. Rogers. Gödel numberings of partial recursive functions. Journal of Symbolic Logic, 23:331–341, 1958.
[35] H. Rogers. Theory of Recursive Functions and Effective Computability. McGraw-Hill, New York, 1967. Reprinted, MIT Press, 1987.
[36] J. Royer. A Connotational Theory of Program Structure. Lecture Notes in Computer Science 273. Springer-Verlag, 1987.
[37] T. J. Sejnowski and C. Rosenberg. NETtalk: A parallel network that learns to read aloud. Technical Report JHU-EECS-86-01, Johns Hopkins University, 1986.
[38] C. Smith and H. Tu. Training digraphs. Information Processing Letters, 53:184–192, 1995.
[39] S. Thrun. Is learning the n-th thing any easier than learning the first? In Advances in Neural Information Processing Systems 8. Morgan Kaufmann, 1996.
[40] S. Thrun and J. Sullivan. Discovering structure in multiple learning tasks: The TC algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning (ICML-96), pages 489–497. Morgan Kaufmann, San Francisco, CA, 1996.
[41] F. Tsung and G. Cottrell. A sequential adder using recurrent networks. In IJCNN-89-Washington D.C.: International Joint Conference on Neural Networks, volume 2, pages 133–139. IEEE Service Center, Piscataway, New Jersey, June 18–22, 1989.
[42] A. Waibel. Connectionist glue: Modular design of neural speech systems. In D. Touretzky, G. Hinton, and T. Sejnowski, editors, Proceedings of the 1988 Connectionist Models Summer School, pages 417–425. Morgan Kaufmann Publishers, Inc., San Mateo, California, 1989.
[43] A. Waibel. Consonant recognition by modular construction of large phonemic time-delay neural networks. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 1, pages 215–223. Morgan Kaufmann Publishers, Inc., San Mateo, California, 1989.