Universal Approximation of Mappings on Structured Objects using the Folding Architecture

Barbara Hammer
Department of Mathematics/Computer Science, University of Osnabrück, Albrechtstr. 28, D-49069 Osnabrück
e-mail: [email protected]

Abstract: The folding architecture is a universal mechanism to approximate mappings between trees and real vector spaces with a neural network. The part encoding the trees and the part approximating the mapping are trained simultaneously, so that the encoding fits the specific learning task. In this article we show that this architecture is capable of approximating any mapping arbitrarily well.

Keywords: Neural Networks, Structured Objects, Folding Architecture.

1 Introduction

Neural networks are capable of approximating any function from one finite dimensional real vector space to another arbitrarily well [4]. Nevertheless, in many fields people have to deal with structured objects. Therefore it is desirable that neural networks work directly on these objects rather than on a sometimes artificial coding in terms of real values. Several approaches in the literature show that neural networks are able to handle structured objects like trees, lists, terms, or sentences in a natural way. As an example, Pollack's RAAM [11] suggests a method to encode and decode trees and lists universally in a distributed manner, with a fixed number of neurons. Sperduti generalizes this result to encode labeled trees [13]. Even Smolensky's tensor construction [12] is a method to encode structured data, although it is a more theoretical result that does not suggest a special learning algorithm and requires a rapidly growing number of neurons. Elman's architecture [2] is capable of approximating time sequences, another sort of structured objects. Other approaches dealing with structured objects, especially with

terms and trees, are [3, 10, 15]. In each case the coding is in some sense universal and not fitted to the specific learning task. For an overview see e.g. [14]. Recently a very promising approach was published [7] in which a mapping from rooted labeled directed acyclic graphs (RLDAGs) into a real vector space and the appropriate coding are learned simultaneously. The architecture consists of a kind of standard multilayer perceptron with additional recurrence according to the structure of the RLDAGs; the training algorithm is a modification of standard backpropagation called backpropagation through structure. In this paper we show that this architecture is capable of approximating any mapping from labeled trees to a finite dimensional real vector space arbitrarily well. Because of this universal approximation capability, and because labeled trees are a universal formalism, this architecture is well suited to many fields dealing with structured objects. This paper is organized as follows: First we briefly recall the folding architecture [7]. We establish the definition of approximation used in this context and present the main theorem. This is proven in several steps. Afterwards we discuss the conditions concerning the activation function and a generalization of this approach to RLDAGs. We end with some remarks on the generalization capability of the architecture.

2 The architecture

We consider the set S of labeled trees where each node has not more than k successors. Adding a special node nil, we can assume that every other node has exactly k successors. Each node v of a tree is assigned a label λ(v) that is an element of a finite alphabet Σ with n elements. For each possible label x ∈ Σ we fix a representation c(x) ∈ R^t. First we recall the formal definition of the folding architecture.

Definition: Choose a coding n_nil ∈ R^m for the empty tree. For a function l : R^{t+km} → R^m the induced function l̃ : S → R^m is defined recursively using l:

    l̃(s) = n_nil                               if s = nil,
    l̃(s) = l(c(λ(v)), l̃(s_1), ..., l̃(s_k))     otherwise,

where v is the root of s ∈ S, and s_1, ..., s_k are the k subtrees of v. A function f : S → R^q is computed by a folding architecture if there are functions l : R^{t+km} → R^m and g : R^m → R^{q'} that are computed by multilayer perceptrons, and an affine mapping A : R^{q'} → R^q, such that f = A ∘ g ∘ l̃.
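To make the recursion concrete, the following sketch in Python (not part of the paper; the trained network l is replaced by a hypothetical fixed map l_net consisting of a random affine transformation followed by tanh, and the label coding is a made-up one-hot coding) shows how l̃ folds a tree with k = 2 successors into a single vector in R^m:

    # Sketch of the induced encoding l~ for binary trees (k = 2).
    # l_net stands in for the network l : R^(t+k*m) -> R^m; its weights are
    # random here, whereas in the folding architecture they are trained.
    import numpy as np

    t, m, k = 3, 4, 2                        # label code size, tree code size, successors
    rng = np.random.default_rng(0)
    W = rng.normal(size=(m, t + k * m))      # weights of the hypothetical network l
    b = rng.normal(size=m)
    n_nil = np.zeros(m)                      # chosen coding of the empty tree

    def l_net(label_code, child_codes):
        # One application of l: label code and the k child codes -> code of the tree.
        x = np.concatenate([label_code] + child_codes)
        return np.tanh(W @ x + b)

    def fold(tree, code):
        # The induced map l~: encode a tree, given a label -> R^t coding 'code'.
        if tree is None:                     # the empty tree nil
            return n_nil
        label, children = tree               # children is a list of exactly k subtrees
        return l_net(code[label], [fold(c, code) for c in children])

    # Example: the tree a(b(nil, nil), nil) over the alphabet {a, b}
    code = {'a': np.array([1., 0., 0.]), 'b': np.array([0., 1., 0.])}
    s = ('a', [('b', [None, None]), None])
    print(fold(s, code))                     # a vector in R^m representing the whole tree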

Recall that a multilayer perceptron with input units i_1, ..., i_p and output units j_1, ..., j_r computes the function f : R^p → R^r with f(x_1, ..., x_p) = (o_{j_1}, ..., o_{j_r}). For each unit j the output o_j is defined recursively as

    o_j = x_k                            if j is the input unit i_k,
    o_j = σ(Σ_i w_ji · o_i + θ_j)        otherwise,

where the sum is taken over all units i in the previous layer, w_ji are the weights, and θ_j is the bias of unit j. σ : R → [0, 1] is an appropriate activation function.

So the folding architecture consists of two parts, one for encoding the elements of S in R^m (the function l̃) and one for approximating the induced function from R^m to R^q (the function A ∘ g). To encode a tree, the network function l is applied recursively. For a tree s with root v it maps the code of the label, a vector in R^t, and the k codes of the subtrees, k vectors in R^m, to a new vector in R^m representing the whole tree with root v. So if one wants to encode the entire tree, one has to start at the leaves and encode each leaf in R^m using l and a fixed vector for the empty node nil. Afterwards each node of the next level is encoded using the codes of the leaves, and so on until one reaches the root. l̃ refers to the entire encoding procedure. See figure 1 for an example.
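The unit-level recursion above amounts to an ordinary layer-by-layer forward pass; a minimal Python sketch (logistic σ and a made-up 3-2-1 layout, purely for illustration):

    # Forward pass of a multilayer perceptron:
    # o_j = sigma(sum_i w_ji * o_i + theta_j); input units pass their value through.
    import numpy as np

    def sigma(x):                            # a squashing activation sigma : R -> [0, 1]
        return 1.0 / (1.0 + np.exp(-x))

    def mlp(x, layers):
        # layers is a list of (W, theta) pairs, one pair per non-input layer.
        o = np.asarray(x, dtype=float)       # outputs of the input units
        for W, theta in layers:
            o = sigma(W @ o + theta)         # outputs of the next layer
        return o

    # Hypothetical network with 3 inputs, 2 hidden units, 1 output unit
    layers = [(np.array([[0.5, -1.0, 0.2], [1.5, 0.3, -0.7]]), np.array([0.1, -0.2])),
              (np.array([[1.0, -1.0]]), np.array([0.0]))]
    print(mlp([0.2, 0.4, 0.6], layers))      # a value in [0, 1]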

[Figure 1: encoding a tree. Left: the unfolded architecture, in which the network l is applied once for every node of an example tree with labels a, b, c, d (empty subtrees are fed the code of nil) before A ∘ g computes the output. Right: the folded architecture with implicit recurrence, in which l̃ stands for the entire encoding, followed by A ∘ g.]

The second part A ∘ g maps the encoded tree directly to the output of the resulting function f. Because f is to map onto the entire R^q, the affine transformation A is added to the network function g.

Figure 2 shows the entire folding architecture. The dashed arrow denotes that there will be recurrence: the architecture has to be unfolded according to the structure of the input tree whenever an output is computed.

[Figure 2: folding architecture. The encoding part takes the code of the label in R^t and the codes of the k subtrees in R^{km} and computes the code of the tree in R^m, with implicit recurrence; the approximation part maps this code to the output in R^q.]

In a real learning problem there are examples (s^i, f̄(s^i)), i = 1, ..., r, of an unknown function f̄. The task is to choose weights for the networks and for A so that the difference between the resulting function f implemented by the folding architecture and the function f̄ on the examples s^1, ..., s^r becomes minimal. If the activation function σ is differentiable, a learning algorithm based on a gradient descent method can be applied. The quadratic cost function is the sum E = Σ_{i=1}^r (f(s^i) − f̄(s^i))^2. In each step each weight of the neural network is decreased by an amount proportional to the derivative of E with respect to that weight. The weight updates can be computed similarly to standard backpropagation in a forward and a backward phase. In fact the resulting formulas concerning the weights in g and A are the same as in standard backpropagation. According to the recursive definition of l̃ the architecture is virtually unfolded; therefore the formulas for the weights in l are similar to those in backpropagation through time. More details concerning this backpropagation through structure can be found in [7]. The specific learning algorithm will not be referred to in the rest of the paper. We investigate only whether the architecture is in principle capable of approximating an arbitrary function f̄. If this capability were limited, no learning algorithm would have a chance to solve a concrete learning task in general.
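The following self-contained toy example in Python illustrates the idea of training the encoding and the output mapping simultaneously by minimizing E; it is only a sketch under simplifying assumptions (a crude numerical gradient instead of backpropagation through structure, tanh units without biases, a linear readout in place of A ∘ g, and two made-up training trees):

    # Toy gradient descent on E = sum_i (f(s^i) - target_i)^2 for a tiny folding net.
    # Numerical differentiation stands in for backpropagation through structure.
    import numpy as np

    t, m, k = 2, 3, 2
    rng = np.random.default_rng(1)
    code = {'a': np.array([1., 0.]), 'b': np.array([0., 1.])}
    n_nil = np.zeros(m)

    def unpack(theta):
        # Split one flat parameter vector into the weights of l and of the readout.
        W = theta[:m * (t + k * m)].reshape(m, t + k * m)
        v = theta[m * (t + k * m):]
        return W, v

    def fold(tree, W):
        if tree is None:
            return n_nil
        label, children = tree
        x = np.concatenate([code[label]] + [fold(c, W) for c in children])
        return np.tanh(W @ x)

    def f(tree, theta):
        W, v = unpack(theta)
        return float(v @ fold(tree, W))      # A o g collapsed to a linear readout

    examples = [(('a', [None, None]), 1.0),  # made-up examples (s^i, target value)
                (('b', [('a', [None, None]), None]), -1.0)]

    def E(theta):
        return sum((f(s, theta) - y) ** 2 for s, y in examples)

    theta = rng.normal(scale=0.1, size=m * (t + k * m) + m)
    print(E(theta))                          # cost before training, roughly 2
    eta, eps = 0.1, 1e-5
    for step in range(200):                  # plain gradient descent
        grad = np.array([(E(theta + eps * e) - E(theta - eps * e)) / (2 * eps)
                         for e in np.eye(theta.size)])
        theta -= eta * grad
    print(E(theta))                          # cost after training; it should have decreased clearly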


3 Approximation capability

As above, S is the set of Σ-labeled trees where each node except nil has exactly k successors. f̄ : S → R^q is an arbitrary function. We consider the discrete σ-algebra and an arbitrary probability measure P on S. Let σ : R → [0, 1] be a squashing function with at least one point x_1 in ]0, 1[ so that σ is twice continuously differentiable in a neighborhood of x_1 and σ''(x_1) ≠ 0. Recall that a squashing function is a monotone function with lim_{x→∞} σ(x) = 1 and lim_{x→−∞} σ(x) = 0. The class of functions implementable by a folding architecture is the class F of functions A ∘ g ∘ l̃ given by natural numbers m and t, a function A ∘ g : R^m → R^q realized by a multilayer perceptron with activation function σ and an additional affine transformation at the output, a representation c(x) ∈ R^t for each of the n elements of Σ, a vector n_nil ∈ R^m, and a function l : R^{t+km} → R^m computed by a multilayer perceptron with activation function σ. As defined above, l̃ : S → R^m is the induced function for l.

Theorem: We claim that for any ε > 0 there exists a function f_ε in F so that

    P(|f_ε(s) − f̄(s)| > ε) < ε.

This will be proven in several steps. If Σ = {a_1, ..., a_n} is an enumeration, we may assume that the label a_i is represented by the real number

    c(a_i) = 0.00...010...00 ∈ ]0, 1[,

which has n decimal places, all 0 except for a 1 at the i-th place.

Lemma: We first show that there exists an injective mapping l̃ : S → R^2, provided that linear as well as product units can be used in the network l used to define l̃ recursively. Here we can choose m = 2.

Proof: Each tree s can be represented as a decimal string t_s as follows:

    t_s = 3                                if s = nil,
    t_s = 2 c(λ(v)) t_{s_1} ... t_{s_k}    otherwise,

where v is the root of s, s_1, ..., s_k are the subtrees of v, and c(λ(v)) is the string 0...010...0 of n digits with a 1 at the i-th place if λ(v) = a_i. The idea is to consider this string as a real number. The syntax is unique. Because we concatenate strings in the recursive definition, it will be useful if the suitably scaled length of each string is also available. So we deal with tuples (x_1, x_2) with x_1 = 0.x and x_2 = (0.1)^{length(x)} for a string x. The tuple that represents the concatenation of two strings x and y can be computed as (x_1 + x_2 · y_1, x_2 · y_2).

Setting n_nil = (0.3, 0.1), w = (0.1)^{n+1}, and l : R^{1+2k} → R^2,

    l(x, x^1_1, x^1_2, ..., x^k_1, x^k_2)
      = (0.2 + 0.1·x + w·x^1_1 + w·x^1_2·x^2_1 + w·x^1_2·x^2_2·x^3_1 + ... + w·x^1_2·x^2_2·...·x^{k−1}_2·x^k_1,
         w·x^1_2·...·x^k_2),

we obtain an injective mapping l̃ : S → R^2 that computes for each tree s the tuple corresponding to the string t_s. Note that l can be realized by a multilayer perceptron of depth O(log_2 k) with linear units and units computing the product of not more than two elements. The number of units required in each layer is bounded by O(k). We can assume that the last layer consists of linear units; maybe it will be necessary to add another layer before. □
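The bookkeeping behind the tuples can be checked directly; a short Python sketch (plain floating point arithmetic, so the equality only holds up to rounding; the example strings are arbitrary):

    # The tuple (0.x, 0.1^length(x)) represents a digit string x; concatenating two
    # strings corresponds to (x1 + x2*y1, x2*y2) on the tuples.
    def as_tuple(s):
        return float('0.' + s), 0.1 ** len(s)

    x, y = '2103', '31'
    x1, x2 = as_tuple(x)
    y1, y2 = as_tuple(y)
    print((x1 + x2 * y1, x2 * y2))   # tuple computed from the two parts
    print(as_tuple(x + y))           # tuple of the concatenated string: the same values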

Lemma: For every finite subset S' of S there is a map l̃ : S → R^2 injective on S', where l is a standard multilayer perceptron with activation function σ.

Proof: Because ((x + y)^2 − x^2 − y^2)/2 = x·y, the multiplication units in the function l constructed above can be substituted by linear and square units. If necessary, we add linear units so that the architecture becomes layered. The idea is to approximate the linear and square units with standard units. We use a method of [6]. Because σ''(x_1) ≠ 0, a point x_0 can be found in the neighborhood of x_1 with σ'(x_0) ≠ 0. It is

    lim_{ε→0} (σ(x_0 + ε·x) − σ(x_0)) / (ε·σ'(x_0)) = x

and

    lim_{ε→0} (σ(x_1 + ε·x) + σ(x_1 − ε·x) − 2·σ(x_1)) / (ε^2·σ''(x_1)) = x^2.

The convergence is uniform on compact intervals. Now the activation functions x and x^2 in the network l are substituted by the quotients above. The resulting network l_ε depends on the parameters ε, where the ε can be chosen independently for each unit. We will show that parameters ε exist so that the induced map l̃_ε is injective on S'. Let δ be the minimum distance between two images in {l̃(s) : s ∈ S'}. The map l̃_ε will be injective on S' if the distance |l̃(s) − l̃_ε(s)| can be bounded by δ/2 for each s ∈ S'. For each s ∈ S' the mappings l̃ and l̃_ε are virtually unfolded so that feedforward networks n_s with linear and square units, respectively n_s^ε with the above quotients as activation functions, result. We store the input corresponding to s in the networks n_s and n_s^ε. For one s and the corresponding networks the ε can be chosen recursively so that the outputs of n_s and n_s^ε differ by not more than δ/2. We give an example if the network n_s has the form depicted in figure 3.

[Figure 3: choosing the ε. The unfolded network n_s for an example tree with labels a, b, c: the label codes c(a), c(b) and the coding of nil form the input x^1; the units are numbered 1.1, 1.2, 2.1, 2.2, 3.1, 3.2, 4.1, 4.2 according to their layer, and the outputs of the four layers are x^2, x^3, x^4, x^5.]

Let x^1 = (x^1_1, x^1_2, x^1_3), ..., x^5 = (x^5_1, x^5_2) be the output vectors of the input, first, ..., fourth layer of n_s if the input corresponding to s is stored in the network. The units in the unfolded network n_s are numbered; the first digit refers to the layer. Let u_i be the function computed by unit number i in n_s and u_i^ε the function computed by the corresponding unit in n_s^ε. Each function u_i^ε in n_s^ε depends on a parameter ε. Because n_s^ε is an unfolded network, the same parameter ε can occur in several units. Without loss of generality we consider only positive ε. Starting with δ_5 = δ/(2·√2), we choose recursively for each layer i and unit i.j in this layer the corresponding parameter ε_k and a positive δ_i so that u_{i.j}^ε(x) differs from u_{i.j}(x^i) by not more than δ_{i+1} for all ε ≤ ε_k and all inputs x that differ by not more than δ_i from the input x^i to the unit u_{i.j}. We start at the right side and use the maximum norm. Choose δ_4 such that |u_{4.1}(x) − x^5_1| < δ_5/2 for all x with |x − x^4| < δ_4. Choose ε_1 so that u_{4.1}^ε differs from u_{4.1} by not more than δ_5/2 for all ε ≤ ε_1. Because of the triangle inequality, the images under u_{4.1}^ε of the neighborhood of x^4 with radius δ_4 differ from x^5_1 by not more than δ_5 for all ε ≤ ε_1. In the same manner we obtain ε_2 and another δ_4 so that |u_{4.2}^ε(x) − x^5_2| < δ_5 for all x with |x − x^4| < δ_4 and ε ≤ ε_2. We take the minimum of these two δ_4. Now we can continue and choose ε_3, ε_4, and δ_3 such that |u_{3.1}^ε(x) − x^4_1| < δ_4 for all x with |x − x^3| < δ_3 and ε ≤ ε_3, respectively |u_{3.2}^ε(x) − x^4_2| < δ_4 for all x with |x − x^3| < δ_3 and ε ≤ ε_4. Afterwards we choose δ_2 and the parameters ε_1 and ε_2 of the units in the second layer. Note that these values ε_1 and ε_2 have already been chosen, but perhaps they have to be lowered; this only improves the earlier inequalities. It is obvious how to continue: maybe ε_3 or ε_4 have to be lowered when we consider the units 1.1 and 1.2. The same procedure has to be applied for any s ∈ S' and the corresponding networks n_s and n_s^ε. Because the same parameters ε_i occur in each network, they are already chosen, but possibly they have to be lowered. Note that this method works because S' is finite.

Now the resulting set of parameters guarantees that any output l̃_ε(s), that is, the output of the unfolded network n_s^ε on the input corresponding to s, differs from l̃(s), that is, the output of the unfolded network n_s on the input corresponding to s, by not more than δ/2. Therefore the function l̃_ε is injective on S'. The corresponding network l_ε consists of units with the above quotients as activation function. But these quotients can be substituted by a linear combination of one or two standard units. Except for the last layer, the linear transformation can be considered to be part of the affine transformation of the following units in the next layer. If the linear transformation applied after the standard units in the last layer is stored in the weights of the first layer, the resulting function will differ from the original l̃_ε only at the beginning and at the end: the transformation will be computed additionally when we encode the leaves. But if we simply change the coding of the empty tree nil according to this transformation, we will get the same mapping as before. The transformation is not computed at the output, but here we can simply drop it because we only need an arbitrary injective mapping. Therefore we can assume that the network l_ε consists of standard units only. The number of units and layers we use remains of the same order. □
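As a numeric illustration of the two quotients used in the proof (the logistic σ and the point x_0 = x_1 = 1 are arbitrary choices here; any point with σ'(x_0) ≠ 0 and σ''(x_1) ≠ 0 would do), the following Python snippet shows the convergence to x and x^2 as ε shrinks:

    # Numeric check of the quotients that replace the linear and the square units.
    import numpy as np

    def sigma(x):
        return 1.0 / (1.0 + np.exp(-x))

    x0 = x1 = 1.0                                    # sigma'(1) != 0 and sigma''(1) != 0
    d1 = sigma(x0) * (1 - sigma(x0))                 # sigma'(x0) for the logistic function
    d2 = d1 * (1 - 2 * sigma(x1))                    # sigma''(x1) for the logistic function

    def approx_id(x, eps):
        return (sigma(x0 + eps * x) - sigma(x0)) / (eps * d1)

    def approx_square(x, eps):
        return (sigma(x1 + eps * x) + sigma(x1 - eps * x) - 2 * sigma(x1)) / (eps ** 2 * d2)

    for eps in (1e-1, 1e-2, 1e-3):
        print(eps, approx_id(0.7, eps), approx_square(0.7, eps))
        # as eps shrinks, the first value approaches 0.7 and the second approaches 0.49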

Proof of the main theorem: Let ε > 0 and let S' ⊆ S be a finite subset so that

    P(s ∉ S') < ε/2.

Choose a map l_ε : R^{t+km} → R^m realized by a multilayer perceptron so that the induced mapping l̃_ε : S → R^m is injective on S'. Define the function f̂ : R^m → R^q with

    f̂(x) = f̄(s)    if an s ∈ S' exists such that l̃_ε(s) = x,
    f̂(x) = 0        otherwise.

Define the probability measure P̂ where for any Borel measurable subset B of R^m

    P̂(B) = (Σ_{s ∈ S', l̃_ε(s) ∈ B} P(s)) / P(S').

According to a result of [4] there exists a function A ∘ g_ε : R^m → R^q, realized by a neural network with activation function σ and an affine transformation A, so that

    P̂(|A ∘ g_ε − f̂| > ε) < ε/2.

For the resulting function f_ε = A ∘ g_ε ∘ l̃_ε we have

    P(|f_ε − f̄| > ε) ≤ P(s ∉ S') + P(|f_ε − f̄| > ε | s ∈ S') ≤ ε/2 + P̂(|A ∘ g_ε − f̂| > ε) < ε.

The number of units required is O(log_2 k) layers with O(k) nodes in each layer in the coding part and one hidden layer in the approximation part; m can be chosen less than 3. □

4 General activation function

Note that it is only required in the approximation part that σ is a squashing function. Therefore the theorem remains valid for any activation function for which at least one x_1 exists where σ is twice continuously differentiable with σ''(x_1) ≠ 0 as considered above, and for which the approximation result of Hornik et al. is valid. In particular the theorem holds for radial basis functions [9].

Now we consider the case that the activation function is piecewise constant or linear, so that there is no point x_1 with σ''(x_1) ≠ 0. We first consider the Heaviside function H with H(x) = 1 if x ≥ 0 and H(x) = 0 otherwise. Trees up to the maximal depth h can be encoded directly using O((k^{h−1} + ... + k + 1) · n) =: m' nodes for the representation. As a string t_s we simply write 01110 for the empty tree nil and recursively

    t_s = 0110 0...1...0 t_{s_1} ... t_{s_k},

where the block 0...1...0 is the n-digit string c(λ(v)) with a 1 at the i-th place if λ(v) = a_i, v is the root of s, and the t_{s_i} are the strings representing the k subtrees of v. To compute t_s recursively with a neural network in which each letter is stored in a single unit and the remaining of the in total m' places are filled with 0, we need a method to realize the concatenation of strings. This can be done by brute force if the length of each string is encoded additionally in unary as a string 1...10...0 of m' places, and a gate matching the end of a string, i.e. the pattern 10, is used. In figure 4 we give an example with k = 2 and m' = 3; we drop the part representing the node. Note that each string representing a tree has a length of at least 1.

[Figure 4: example for the concatenation with k = 2 and m' = 3. The inputs are the pairs (code1, length1) and (code2, length2); the output is the pair (code, length) of the concatenated string.]

The filled circles represent the gates matching 10, e.g. as a function H(x − y − 0.5). The dotted circles are AND-gates, as a function H(x + y − 1.5); the other ones are OR-gates, as a function H(x + y − 0.5) respectively H(x − 0.5). Of course the resulting network requires a number of units that is exponential in the depth h. Because this mapping is injective on the set of trees up to depth h, we can continue with the same argumentation as before, showing the approximation capability.

If σ is an arbitrary squashing function, we can approximate the Heaviside function by lim_{ε→0+} σ(x/ε). The convergence is uniform on intervals ]−∞, −a] ∪ [a, ∞[. After changing the biases, if necessary, we can assume that the input to any threshold gate is different from 0 on the finite subset of S considered. Then the threshold gates can be substituted by units with activation function σ. As before, the appropriate ε can be chosen recursively.
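The gates mentioned for figure 4 can be verified on Boolean inputs; a brief Python check (inputs restricted to {0, 1}, which is all the construction uses):

    # The threshold gates used in the concatenation network, on inputs from {0, 1}:
    # H(x - y - 0.5) fires exactly for the pattern x = 1, y = 0 ("matching 10"),
    # H(x + y - 1.5) is an AND-gate, H(x + y - 0.5) an OR-gate, H(x - 0.5) copies x.
    def H(z):
        return 1 if z >= 0 else 0

    for x in (0, 1):
        for y in (0, 1):
            print(x, y,
                  H(x - y - 0.5),    # 1 only for (x, y) = (1, 0)
                  H(x + y - 1.5),    # AND
                  H(x + y - 0.5))    # OR
    print(H(1 - 0.5), H(0 - 0.5))    # the single-input OR-gate acts as the identity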

5 Functions on RLDAGs

RLDAGs are rooted labeled directed acyclic multigraphs. The only difference between trees and RLDAGs is that each node in a tree has not more than one predecessor, whereas in an RLDAG there can be arbitrary arrows from one node to another as long as the multigraph remains acyclic. As before, we consider the set S of RLDAGs where each node has not more than, respectively (after adding nil) exactly, k successors, which are not necessarily different. For each s ∈ S we simply enumerate the nodes and add a component encoding these numbers to the neural network. Now each string that represents a node begins with 3 c(λ(v)) number(v) .... In particular this representation is unique, because if a node v is used twice according to its two (or more) predecessors, the number will be the same; in this case the coding of v has to be computed only once. As before, this architecture is capable of approximating any function from the set S of RLDAGs to a real vector space arbitrarily well.

Note that finite time sequences can be regarded as strings, i.e. labeled trees where each node has exactly one successor. Therefore the architecture can approximate functions on time sequences as well.

6 Generalization capability

A trained network will be able to generalize if it performs well on data not used for training. The experimental results of [7] indicate that generalization on this symbolic domain can be achieved with the architecture. If the network is used for classification tasks, there is a theoretical bound on the number of training patterns required so that the network's error on the training set is representative of the error on the whole input set with high probability. The bound is related to the so-called Vapnik-Chervonenkis dimension (VCdim), which measures the capacity of the system. The VCdim of standard feedforward networks with w weights is bounded by O(w log_2 w) if Heaviside functions are used and O(w^4) if sigmoidal units are used. See for example [8] for an overview.

It is desirable to obtain bounds for the VCdim of a fixed folding architecture, too, but depending on the depth of the input tree, the length of the computed function in terms of neural network units varies and becomes arbitrarily large. Therefore neither the counting argument of [1] nor the method of [5], which considers the implemented function as a Boolean combination of atomic formulas, can be applied here. In fact the VCdim will be infinite even if only a very small architecture is used. Recall that the VCdim of a class F of functions from a set S to {0, 1} is the largest number k so that there exist values x_1, ..., x_k ∈ S that are shattered by F. A set {x_1, ..., x_k} is shattered by F if for every function g : {x_1, ..., x_k} → {0, 1} there exists a function f ∈ F so that f restricted to {x_1, ..., x_k} equals g. We consider a fixed folding architecture (that means that the number of layers and units in each layer is fixed) with output space R and add a Heaviside function at the output so that dichotomies result.

Lemma: The VCdim of a folding architecture with at least three hidden layers in the coding part will be infinite if linear, multiplication, and threshold units can be used.

Proof: We consider trees where each node has only one successor. Let m = 2, t = 0, and n_nil = (2, 0). For w ∈ R with 0 ≤ w < 1 whose decimal representation w = 0.w_1 w_2 w_3 ... uses only the digits 0 and 1, the i-th digit can be computed recursively as w_0 = 0 and

    w_{i+1} = H(0.w_{i+1} w_{i+2} ... − 0.1),
    0.w_{i+2} w_{i+3} ... = 10 · 0.w_{i+1} w_{i+2} ... − w_{i+1}.

The architecture in figure 5 is capable of computing any dichotomy on a set {s_i : i = 1, ..., r} of trees where s_i has the height i.
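The digit extraction can be traced step by step; a short Python sketch (exact rational arithmetic via fractions is used only to avoid floating point rounding, since the construction itself works with exact real numbers, and w is assumed to have decimal digits 0 and 1 only):

    # Extracting the decimal digits of w via w_{i+1} = H(tail - 0.1),
    # new tail = 10 * tail - w_{i+1}.
    from fractions import Fraction

    def H(z):
        return 1 if z >= 0 else 0

    w = Fraction(10110, 100000)      # w = 0.10110; its digits encode a dichotomy
    tail, digits = w, []
    for _ in range(5):
        d = H(tail - Fraction(1, 10))
        tail = 10 * tail - d
        digits.append(d)
    print(digits)                    # [1, 0, 1, 1, 0]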

[Figure 5: architecture with infinite VCdim, built from threshold, multiplication, and linear units (default threshold: 0, default weight: 1) and split into the encoding part and the approximation part.]

The threshold units have been introduced to encode the empty tree as (w, 1). For any other tree of the depth i, only the recursive formula for w_i as above is computed, so the tree s_i is coded as (w_i, 0.w_{i+1} w_{i+2} ...). Now w can be chosen according to the dichotomy required. Because the identity is available, the same function can be implemented by a layered architecture with 3 hidden layers in the coding part. Of course any architecture that uses more units or layers has infinite VCdim, too. □

If the sigmoid function is considered, the same argumentation as before can be used. We fix r trees s_1, ..., s_r where s_i has the height i. For each dichotomy we construct the neural network with threshold, multiplication, and linear units as before. After a possible change of the biases, we can assume that the input to any gate is different from 0. As before, we substitute the multiplication units by linear and square units. Virtually unfolding the architecture, the activation functions are recursively substituted by

    (σ(x_0 + ε·x) − σ(x_0)) / (ε·σ'(x_0)),
    (σ(x_1 + ε·x) + σ(x_1 − ε·x) − 2·σ(x_1)) / (ε^2·σ''(x_1)),
    respectively σ(x/ε),

with appropriate ε, so that the encoding of each tree differs from the original one by not more than 0.1 in the second component. Here x_0 and x_1 are points where σ is twice continuously differentiable and σ'(x_0) ≠ 0 respectively σ''(x_1) ≠ 0. The additional affine transformation in the last layer can be included in the approximation part and in the first layer; the additional affine transformation in the first layer can be included in the coding of the empty node. This argumentation works for any squashing function σ with at least one point x_1 where σ is twice continuously differentiable in a neighborhood and σ''(x_1) ≠ 0.

Therefore the VCdim is not a useful concept to obtain results concerning the generalization capability of the folding architecture. But in a real learning task we can choose a finite subset S' of S so that P(s ∉ S') is low with high probability. On S' we can control the risk because everything is finite; therefore we can control the risk on S, too. Unfortunately, this does not give bounds on the number of training patterns required for accurate generalization in the worst case.

References

[1] E. B. Baum and D. Haussler, What Size Net Gives Valid Generalization?, in Neural Computation, 1, pp. 151-165, 1989.
[2] J. L. Elman, Distributed Representations, Simple Recurrent Networks, and Grammatical Structure, in Machine Learning, 7, pp. 195-225, 1991.
[3] G. E. Hinton, Mapping Part-Whole Hierarchies into Connectionist Networks, in Artificial Intelligence, 46, pp. 47-57, 1990.
[4] K. Hornik, M. Stinchcombe, and H. White, Multilayer Feedforward Networks are Universal Approximators, in Neural Networks, 2, pp. 359-366, 1989.
[5] M. Karpinski and A. Macintyre, Polynomial Bounds for VC-Dimension of Sigmoidal and General Pfaffian Neural Networks, to appear in Journal of Computer and System Sciences, 1996.
[6] V. Y. Kreinovich, Arbitrary Nonlinearity is Sufficient to Represent All Functions by Neural Networks: A Theorem, in Neural Networks, 4, pp. 381-383, 1991.
[7] A. Küchler and C. Goller, Inductive Learning in Symbolic Domains Using Structure-Driven Recurrent Neural Networks, in KI-96: Advances in Artificial Intelligence, Günther Görz, Steffen Hölldobler (Eds.), Springer, Berlin, pp. 183-197, 1996.
[8] W. Maass, Vapnik-Chervonenkis Dimension of Neural Nets, NeuroCOLT Technical Report Series, NC-TR-96-015, 1996.
[9] J. Park and I. W. Sandberg, Universal Approximation Using Radial Basis Function Networks, in Neural Computation, 3, pp. 246-257, 1990.
[10] T. Plate, Holographic Reduced Representations, in IEEE Transactions on Neural Networks, 6(3), pp. 623-641, 1995.
[11] J. B. Pollack, Recursive Distributed Representations, in Artificial Intelligence, 46, pp. 77-105, 1990.
[12] P. Smolensky, Tensor Product Variable Binding and the Representation of Symbolic Structures in Connectionist Systems, in Artificial Intelligence, 46, pp. 159-216, 1990.
[13] A. Sperduti and A. Starita, Dynamical Neural Networks Construction for Processing of Labeled Structures, Technical Report TR-95-1, University of Pisa, Dipartimento di Informatica, 1995.
[14] D. S. Touretzky, Connectionist and Symbolic Representations, in The Handbook of Brain Theory and Neural Networks, Michael A. Arbib (Ed.), MIT, Cambridge, pp. 243-247, 1995.
[15] D. S. Touretzky, BoltzCONS: Dynamic Symbol Structures in a Connectionist Network, in Artificial Intelligence, 46, pp. 5-46, 1990.

