Showing the Equivalence of two Training Algorithms – Part II
I. Fischer, University of Erlangen–Nuremberg, Germany, [email protected]
M. Koch, Technical University of Berlin, Germany, [email protected]
M. R. Berthold, University of California, Berkeley, CA, USA, [email protected]

Abstract: In previous work Graph Transformations have been shown to offer a powerful way to formally specify Neural Networks and their corresponding training algorithms. It has also been shown how to use this formalism to prove properties of the algorithms used. In this paper Graph Transformations are used to show the equivalence of two training algorithms for Recurrent Neural Networks, Backpropagation Through Time and a variant of Real Time Backpropagation. In addition to this proof, a whole class of related training algorithms emerges from the formalism used. In part I of this paper the formalization of the two algorithms is shown; part II then shows how Graph Transformations can be used to prove the equivalence of both algorithms.

Keywords: Graph Transformations, Formalization, Comparison of Training Algorithms
I. Introduction
In [1] a method for the unifying formalization of different Neural Networks and their training algorithms was proposed. The Graph Transformations used offer a visual way to specify the topologies as well as the computation and training rules of the networks. The theory underlying the formalism makes it possible to prove properties of the algorithms; in [3] the termination and convergence of a constructive training algorithm for Probabilistic Neural Networks was shown. In this paper the same framework is used to show the equivalence of two existing training algorithms for Recurrent Neural Networks. In [7] it was shown that Flow Graphs can be used to prove the equivalence of Backpropagation Through Time and a variant of Real Time Backpropagation, by showing that transposing the Flow Graph of one algorithm results in the Flow Graph of the other algorithm. Here we use a more direct strategy: both algorithms are represented using one common set of rules. It is then shown that the entire Graph Transformation System is confluent and terminates, indicating that, no matter which algorithm is applied, the same operations will be performed on the network; that is, both algorithms are equivalent. In addition to this result, the formulated set of rules specifies an entire class of algorithms that can also operate only partly in real time and partly off-line, i.e. through an unfolding of the network.
II. Summary of Graph Rewriting Rules

In part I of this paper [4] a set of graph rewriting rules to model Backpropagation Through Time (BPTT) and Real Time Backpropagation (RTBP) was proposed. The graphs used consist of a set of training patterns, the actual net, a time node indicating which time step is on, and a node labeled T indicating whether the computation phase is finished. The following rules were used to model training on this graph: To start training, a training pattern corresponding to the actual time step has to be inserted into the net (part I, Figure 7). Then the computation can take place, with one rule modeling the computation within one layer and one rule modeling the computation between layers (part I, Figures 3, 4). When the computation is finished, one can switch from forward to backward computation (part I, Figures 8, 9) and the error is calculated between the result of the computation and the wanted target (part I, Figure 10). Then the error can be propagated back through the layers and the neurons (part I, Figures 5, 6) as in usual backpropagation. After extracting the pattern (part I, Figure 11), one can switch from backward to forward computation (part I, Figure 12) and the process starts again. After all training patterns have been used, the weights are updated (part I, Figure 13). This set of rules describes RTBP. Adding two rules to fold and unfold a net (Figures 14, 15) and using all of the old rules leads to BPTT. The main decision is then whether one switches from forward to backward computation (RTBP) or whether a new net is unfolded (BPTT) for a new time step. The main goal of this paper is to show that no matter in which order these rules are applied, the result is always the same. This leads to the equivalence of RTBP and BPTT, as also shown in [7]. In addition, different classes of algorithms with RTBP and BPTT as the two extrema emerge; the main question is how many nets are going to be unfolded.
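The control structure that these rules induce can be sketched in a few lines of ordinary code. The sketch below is only an illustration of the decision the rule set leaves open, not the graph transformation system itself; the rule names in the trace are shorthand for the rules of part I, and the function train_trace is a hypothetical helper introduced here.

# A minimal, runnable sketch (assumed names, not the formalism of part I) of the
# common rule-application loop. Every entry of the returned trace stands for one
# application of a rewriting rule from Section II.

def train_trace(num_patterns, unfold_decision):
    """unfold_decision(t) == True  -> apply the unfolding rule      (BPTT-like)
       unfold_decision(t) == False -> switch forward/backward again (RTBP-like)"""
    trace, unfolded = [], 1                      # number of currently unfolded nets
    for t in range(num_patterns):
        trace += [f"insert pattern {t}", "forward computation"]
        if t + 1 < num_patterns and unfold_decision(t):
            trace.append("unfold net")           # part I, Figure 15
            unfolded += 1
            continue                             # keep computing forward
        trace.append("switch forward -> backward")
        for _ in range(unfolded):                # backpropagate every unfolded net
            trace += ["calculate error", "backward computation", "extract pattern"]
        trace += ["fold nets"] * (unfolded - 1)  # part I, Figure 14
        unfolded = 1
        if t + 1 < num_patterns:
            trace.append("switch backward -> forward")
    trace.append("weight update")                # part I, Figure 13
    return trace

rtbp  = train_trace(3, lambda t: False)   # never unfold: Real Time Backpropagation
bptt  = train_trace(3, lambda t: True)    # always unfold: Backpropagation Through Time
mixed = train_trace(3, lambda t: t == 0)  # one of the algorithms in between

For three patterns the RTBP trace applies the switch from backward to forward twice and never unfolds, while the BPTT trace unfolds twice; every other decision function yields one of the algorithms lying in between, which reappear in Section VIII.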
III. Confluence and Termination of Rewriting Systems

In this section the basic notions of confluence and termination are given; for more details see [2]. These terms are defined for rewrite systems (K, →) consisting of a set of configurations K and a relation → ⊆ K × K, where x → y denotes that configuration y follows configuration x. We denote by x →* y a path x → x₁ → ... → y through the rewrite system. Following this definition, the graph transformation system as shown in [4] is a general rewrite system where the configurations are graphs and G → H if there exists a graph rewriting rule p = (l, r) in the system so that p applied to G results in H. We call G → H a derivation with p. A rewrite system is deterministic if
(∀x ∈ K)(x → y₁ ∧ x → y₂) ⇒ (y₁ = y₂). The graph rewrite system in [4] is not deterministic, as e.g. the computation rule between two layers (part I, Figure 3) can be applied at several embeddings into one net, leading to different results. It is also nondeterministic whether a switch from forward to backward is done or the net is unfolded. If a graph rewrite system is not deterministic, it is interesting to know whether it is at least confluent. Confluence describes the situation where two sequences of derivations starting from the same graph always lead to a common graph.

Definition III.1 (Confluence)
1. A rewrite system is confluent if (∀x : x →* y ∧ x →* z)(∃w : y →* w ∧ z →* w).
2. A rewrite system is locally confluent if (∀x : x → y ∧ x → z)(∃w : y →* w ∧ z →* w). □
In [5] it is shown that local confluence for graph rewrite systems can be checked by considering all critical pairs. Critical pairs H₁ ← G → H₂ are pairs of derivations where G → H₁ and G → H₂ operate on common objects of G that are deleted or changed by one of the derivations. The two derivations can be generated by the same rule with different embeddings or by two different rules. To prove local confluence the following definition is useful, as it describes under which circumstances the order of two rule applications can be switched while leading to the same result:

Definition III.2 (Parallel Independence) Two derivations G₀ ⇒ G₁ resp. G₀ ⇒ G₂ via p₁ = (l₁, r₁) and p₂ = (l₂, r₂) with embeddings m₁, m₂ are parallel independent if there exist morphisms α₁ : L₁ → C₂ and α₂ : L₂ → C₁ with m₁ = l₂ ∘ α₁ and m₂ = l₁ ∘ α₂; here Cᵢ denotes the context graph, Lᵢ the left-hand side, and lᵢ : Cᵢ → G the embedding of the context graph into G, see [4]. □

This means that graph elements are not allowed to be removed or relabeled by rule p₁ if they are needed by rule p₂, and vice versa [6]. Two derivations are locally confluent if they are parallel independent. Please note, however, that two locally confluent derivations are not necessarily parallel independent. Another important question is whether a graph rewriting system terminates.

Definition III.3 A rewrite system (K, →) terminates if there is no infinite chain x₀ → x₁ → x₂ → x₃ → ..., with xᵢ ∈ K and (xᵢ → xᵢ₊₁) ∈ →. □

Proposition III.4 (Termination) A rewrite system terminates if there is a well-founded partial order (M, ≥) and a function φ : K → M so that x → y ⇒ φ(x) > φ(y). □

In the following, the natural numbers ℕ together with ≥ are used as well-founded partial order, and we describe a translation φ from the set of all possible graphs into ℕ.
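As a concrete, purely illustrative instance of Proposition III.4, the following sketch uses an assumed toy rewrite system on strings over {a, b} with the single rule "ba" -> "ab"; the measure phi counts the pairs that are still out of order and strictly decreases with every derivation step, so every derivation terminates. The graph rewriting system of [4] is treated in exactly the same way in Section VII, only with a more elaborate measure.

# Toy instance of Proposition III.4 (an assumed example, not the system of [4]):
# configurations are strings over {a, b}, the single rule rewrites "ba" to "ab".

def successors(s):
    """All configurations reachable from s in one rewriting step."""
    return [s[:i] + "ab" + s[i + 2:] for i in range(len(s) - 1) if s[i:i + 2] == "ba"]

def phi(s):
    """Well-founded measure into the natural numbers: number of pairs b...a."""
    return sum(1 for i in range(len(s)) for j in range(i + 1, len(s))
               if s[i] == "b" and s[j] == "a")

s = "babba"
while successors(s):
    t = successors(s)[0]      # pick any applicable derivation step
    assert phi(s) > phi(t)    # the measure strictly decreases
    s = t
print(s, phi(s))              # normal form "aabbb" with measure 0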
The following definition describes the possible results of a rewrite system.

Definition III.5 (Normal Forms) x ∈ K is called a normal form if there is no configuration y so that x → y. An arbitrary k ∈ K has the normal form x if k →* x and x is a normal form. A rewrite system K has normal forms if all k ∈ K have normal forms. It is called functional if these normal forms are unique. □
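Staying with the toy string system from the previous sketch (an assumed example, not the system of [4]), the normal forms of Definition III.5 can be enumerated directly. For every start configuration the set of reachable normal forms is a singleton, i.e. the toy system is functional; Proposition III.6 below guarantees this for every locally confluent and terminating rewrite system.

# Normal forms in the sense of Definition III.5 for the toy rule "ba" -> "ab".

def successors(s):
    return [s[:i] + "ab" + s[i + 2:] for i in range(len(s) - 1) if s[i:i + 2] == "ba"]

def normal_forms(s):
    """All normal forms reachable from s (terminates because the toy system does)."""
    succ = successors(s)
    if not succ:              # no rule applicable: s itself is a normal form
        return {s}
    return set().union(*(normal_forms(t) for t in succ))

assert normal_forms("babba") == {"aabbb"}   # a unique normal form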
Proposition III.6 (Functional Rewriting) Each locally confluent and terminating rewrite system is functional. □

In the following we show that the graph rewrite system given in [4] terminates and is locally confluent. Together with Proposition III.6 we then know that, no matter in which order the rules are applied, we always get the same result. In the following section some remarks on the graph rewriting system that are necessary for the further proofs are given.

IV. Preliminary Remarks
For the proofs in the next sections the following remarks are important. No proofs are given here, as these remarks are quite obvious from looking at the rules and the start graph. Assume we are at time step t with x nets unfolded, and let m be the complete number of training patterns. Please note that in this situation x ≤ t and t − x training steps are already finished. In this case the graph can be divided into the following parts:

Remark IV.1 (Graph Parts) The graph can be divided into five groups at time step t:
- x subgraphs representing nets together with training patterns, with control nodes from t − x to t (after inserting the corresponding training pattern); the maximal number of nets equals the number of training patterns;
- subgraphs representing patterns the training is already finished for, with control nodes ranging from 0 to t − x − 1 in the forward phase and from 0 to t − x in the backward phase; additionally, these patterns are not allowed to be connected to a net;
- subgraphs representing patterns not yet trained with, labeled with t + 2 to m and having only two input nodes; the pattern labeled with time step t + 1 becomes a used pattern in the t-th time step as its input nodes are extended;
- the node representing the actual time step t and the computation direction b or f;
- possibly a node labeled T. □
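One way to picture such a configuration is as a plain record with one field per group of Remark IV.1. The field names below are assumptions made for this sketch only; in the formalism the configuration is an attributed graph, not a Python object.

# Assumed, simplified view of a configuration at time step t (cf. Remark IV.1).
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Net:
    control: int                    # control node: the time step this net belongs to
    pattern: Optional[int] = None   # training pattern currently inserted into the net

@dataclass
class Configuration:
    time_step: int                  # node carrying the actual time step t
    direction: str                  # control node label "f" (forward) or "b" (backward)
    unfolded: List[Net] = field(default_factory=list)   # nets for time steps t-x .. t
    finished: List[int] = field(default_factory=list)   # patterns training is finished for
    unused: List[int] = field(default_factory=list)     # patterns with time marker > t
    t_node: bool = True             # whether the node labeled T is still present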
All rules defined do not change this structure of the graph; some rules only change one specific part of the system. It is easy to prove from the rules given in Section II that the above statements are true.

Remark IV.2 (Rule Groups) Five classes of graph rewriting rules exist:
- rules applied during forward computation: insert data, computation between layers, computation in a neuron, duplicate the net;
- rules applied during backward computation: extract data, backpropagation between layers, backpropagation in a neuron, fold two nets;
- switching from forward to backward computation;
- switching from backward to forward computation;
- weight update. □

In the following some numbers of nodes and edges are calculated.

Remark IV.3 (Number Abbreviations) The following abbreviations are used in the next proofs:
- i: number of input neurons (in the rules of part I, i is 4);
- o: number of output neurons (in this case 2);
- h: number of hidden neurons (in this case 4);
- t: number of training patterns (in this case 3);
- hence n = i + 2o + 2h is the number of nodes in one net part of the graph, and m = t · n is the maximal number of nodes in all nets;
- l: number of layers (in this case 1);
- h₁, ..., h_l: numbers of hidden neurons in the different layers (in this case one layer with four neurons);
- hence the number of edges in one net graph is e = h + o + i·h₁ + h₁·h₂ + ... + h_{l−1}·h_l, and n̄ = t · e is the maximal number of edges representing connections in the net part. □

In the next section the local confluence and termination of the rules applied during forward and backward computation, as described in Remark IV.2, are shown.

V. Backward and Forward Computation

Proposition V.1 (Local Confluence, Termination) The rules of the forward (backward) computation phase (see Remark IV.2) are locally confluent and their application terminates, i.e. a unique normal form exists (Proposition III.6). □

Proof: In the following, local confluence and termination are shown for the rules applicable during the backward computation:
1. The system terminates. Each graph is associated with the number of C plus the number of N plus the number of training patterns the training is not yet finished for. The application of the rules extract data, backpropagation between layers, backpropagation in a neuron and calculate error decreases the number of C or N. The application of the rule fold nets decreases the number of patterns the training is not yet finished for.
2. The system is locally confluent. This is obvious if the rules are applied to different nets in the graph; then they are parallel independent. If they are applied within one net of the graph the following holds: The calculate error rule always has to be applied first; no other rule can be applied instead, as it introduces the first ? necessary to proceed. The duplicate net rule can only be applied when the backpropagation is done, i.e. no C or N is left. Only the backpropagation rules between two layers or within a neuron can be applied non-deterministically, but it is easy to show that all combinations of these rules with different embeddings are parallel independent and locally confluent. As an example, the local confluence of two different embeddings of the backpropagation between layers rule is shown. Two situations can occur: either L (the left-hand side of the rule) is embedded via m₁, m₂ twice into a subgraph representing one net, or it is embedded into subgraphs representing different nets. In the second case the parallel independence of the two resulting derivation steps is obvious, as there are no overlapping graph parts of the two embeddings. In the first case these overlapping graph parts exist. If there are two different embeddings m₁, m₂ into one net, they must handle neurons in the same layer using the same ?-labeled nodes in both embeddings. But as the labeling of these nodes is not changed during the application of the rule, the mappings α₁, α₂ necessary for parallel independence exist (see Def. III.2 and Proposition III.6). The proof for the forward computation works analogously. ∎

The following proposition deals with the independence of the net representing time step t from the surrounding graph.

Proposition V.2 (Net Independence) Applying the computation rules (backpropagation rules) to the graph part representing the net at time step t always leads to the same weight update Δw_t. □

As the weights of the nets used are always the same, training with the corresponding training pattern always leads to the same weight update, no matter whether the net is unfolded or used again after a switch.

In the next sections local confluence and termination are shown for the whole system of rules.

VI. Local Confluence

Local confluence must be shown for the following rules:
- all pairs of rules of the backward computation phase (see Remark IV.2) and the switch from backward to forward computation;
- all pairs of rules of the forward computation phase and the switch from forward to backward computation;
- the weight update rule with different embeddings, as it can only be applied when the training is finished.

There are four possible cases. First, it might be easy to show that two rules cannot be applied to the same graph. Second, two rules can be applied to the same graph but their embeddings do not overlap; then the derivations are parallel independent, see Def. III.2. Third, the derivations overlap, but the overlapping graph parts are not changed by the rules; here parallel independence applies as well. Fourth, it is not possible to use parallel independence because the overlapping parts are changed by the derivations; then two derivations leading to the same result have to be given explicitly. An example of this last case is given in the next proposition; examples of the other three cases were already described in the proof of Prop. V.1.

Proposition VI.1 (Unfolding vs. Forward to Backward) Applying the unfolding rule p₁ as shown in part I, Figure 15, and the rule p₂ for switching from forward to backward as shown in part I, Figure 9, to a graph G is locally confluent. □
Proof: The main idea of this proof is simple. The additional training pattern that is handled by the newly unfolded net must, in the other possibility, be handled in an additional backward-to-forward step. This cannot be proven with parallel independence, as several rewriting steps are necessary before both lines of derivation meet in a common graph. Assume that the actual time step is t and x nets are unfolded, belonging to time steps t − x to t, and that the computation is finished for each of them. In this situation either a new net x + 1 for time step t + 1 can be unfolded or the switch from forward to backward can be done.

Consider first the situation where the switch is done. The next steps are to calculate all errors and to do the backpropagation for all nets. Here we do not have to restrict ourselves to a certain order of rule applications, as they all lead to the same weight update (see Prop. V.1, V.2). Then the t-th pattern is extracted from the net and the other nets for time steps t − x to t − 1 are folded; the Δwᵢ are added in this rule (the index shows the time step the Δw belongs to). In the resulting graph the nodes carry the activation values of the x-th net, and the edges are labeled with the weights of the x-th net (which are also the weights of all former and all further nets). Δw is calculated as Δw₀ + Δw₁ + ... + Δw_{t−x} + ... + Δw_{t−1} + Δw_t. Now the calculation for the (t+1)-st time step must be done by switching from backward to forward training, inserting the pattern, doing the computation (in whatever order), switching to backward computation, doing the backpropagation and extracting the pattern. After these steps the nodes of the net are labeled with the activations of the (t+1)-st net and the edges still with the same weight w; Δw now consists of Δw₀ + Δw₁ + ... + Δw_{t−x} + ... + Δw_{t−1} + Δw_t + Δw_{t+1}, where Δw_{t+1} was added during the application of the backpropagation between layers rule.

The second possibility is to apply the rule for unfolding a net for the (t+1)-st time step. Then the computation for this net is done, the switch from forward to backward is performed and the errors are calculated. The backpropagation rules are applied to the different nets, leading to the same Δw as above due to Prop. V.2. Finally the nets are folded onto the net for time step t + 1 and the different Δwᵢ are added again, leading to Δw₀ + Δw₁ + ... + Δw_{t−x} + ... + Δw_{t−1} + Δw_t + Δw_{t+1} as the second edge label. ∎

All other pairs for local confluence can be handled similarly.

Corollary VI.2 (Local Confluence of the Rules) The given graph rewriting system is locally confluent. □
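The heart of the argument in Proposition VI.1 is that the only quantity the two derivation sequences could disagree on is the accumulated weight update, and since the shared weights are identical in every unfolded copy (Proposition V.2) both orders produce the same sum. The deliberately simplified numeric sketch below mimics this commutation; delta_w is an arbitrary stand-in for the update one backpropagation pass would compute, not a rule of part I.

# Assumed, highly simplified model of the two derivation orders in Prop. VI.1.

def delta_w(weights, pattern):
    """Stands for the weight update one backpropagation pass would compute;
    any deterministic function of (weights, pattern) will do for the sketch."""
    return [w * 0.01 * pattern for w in weights]

def accumulate(updates):
    return [sum(col) for col in zip(*updates)]

weights  = [0.5, -0.3, 0.8]      # shared by all unfolded nets (Prop. V.2)
patterns = [1, 2, 3, 4]          # time steps t-x .. t+1

# Possibility 1: switch to backward at time step t, fold, then handle pattern t+1.
partial = accumulate([delta_w(weights, p) for p in patterns[:-1]])
total_1 = accumulate([partial, delta_w(weights, patterns[-1])])

# Possibility 2: unfold one more net for time step t+1, then backpropagate all nets.
total_2 = accumulate([delta_w(weights, p) for p in patterns])

assert total_1 == total_2        # Delta w_0 + ... + Delta w_{t+1} either way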
VII. Termination

First a translation of the graphs described in Remark IV.1 into natural numbers is given; then it is shown that for each derivation G → H on the graph level, φ(G) > φ(H) is valid. A graph G is translated as follows: first the numbers of ?-, C- and N-labeled nodes are counted, denoted by |?|, |C| and |N|. Then
- if the control node is set to b, the value b(|?|, |C|, |N|) = |?| + 2(|C| + |N|) + (m + 1) is used;
- if the control node is set to f, the value f(|?|, |C|, |N|) = 2|?| + (|C| + |N|) is used instead;
- the function r(x) = t − x is applied to the number x of training patterns the training is already finished for;
- the function u(x) = (2m + 2) · x takes the number of unused training patterns as input, i.e. patterns with a time marker bigger than the actual time step and only four input nodes;
- the constant T = 2m + 2 is added as long as the node labeled T is still in the graph;
- the function w(x) = n̄ − x takes the number of edges (see Remark IV.3) whose weight has already been updated, i.e. differs from the start graph.
All these numbers are added, leading to a natural number associated with each graph: φ(G) = b + u + r + T + w or φ(G) = f + u + r + T + w, depending on whether an f- or a b-labeled node exists. This is always well defined, as only either a b node or an f node exists.
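The measure φ can be written down directly from this definition. The sketch below does so for a configuration that is summarized by its counts only; the parameter names and the example sizes are assumptions of the sketch, not part of the formalism.

# Assumed, simplified implementation of the termination measure phi defined above.

def phi(direction, q, c, n_cnt, unused, finished, t_node, updated_edges, m, t, n_bar):
    """q, c, n_cnt:    the numbers |?|, |C|, |N| of correspondingly labeled nodes
       unused:         training patterns with a time marker beyond the actual step
       finished:       training patterns the training is already finished for
       m = t * n:      maximal number of nodes in all nets   (Remark IV.3)
       n_bar = t * e:  maximal number of edges in all nets   (Remark IV.3)"""
    b = q + 2 * (c + n_cnt) + (m + 1)     # control node labeled b
    f = 2 * q + (c + n_cnt)               # control node labeled f
    u = (2 * m + 2) * unused
    r = t - finished
    T = (2 * m + 2) if t_node else 0
    w = n_bar - updated_edges
    return (b if direction == "b" else f) + u + r + T + w

# Example: unfolding a net (case 5 of the proof below) adds n '?'-labeled nodes and
# uses up one unused pattern, so phi strictly decreases since 2m + 2 = 2tn + 2 > 2n.
t, n, e = 3, 16, 22                       # sizes roughly as in Remark IV.3 (assumed)
m, n_bar = t * n, t * e
before = phi("f", q=5, c=0, n_cnt=0, unused=2, finished=0, t_node=True,
             updated_edges=0, m=m, t=t, n_bar=n_bar)
after  = phi("f", q=5 + n, c=0, n_cnt=0, unused=1, finished=0, t_node=True,
             updated_edges=0, m=m, t=t, n_bar=n_bar)
assert before > after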
Lemma VII.1 (Termination) When applying a rule p as explained in Section II to a graph G, leading to a graph H, φ(G) > φ(H) holds. □

Proof:
1. Computation between layers: when applying this production, the number of ? is decreased and the number of C is increased.
   f(x, y, v) + u(w) + r(z) + T + w(0) > f(x−1, y+1, v) + u(w) + r(z) + T + w(0)
   ⇔ 2x + y + v > 2(x−1) + y + 1 + v
   ⇔ 1 > 0
2. The same holds for the computation within a neuron, the backpropagation rules, insert data, error calculation and fold a net.
3. Extract pattern: the number of finished training patterns is increased.
   b(x, y, v) + u(w) + r(z) + T + w(0) > b(x, y, v) + u(w) + r(z+1) + T + w(0)
   ⇔ t − z > t − (z+1)
   ⇔ 1 > 0
4. Similarly for the weight update.
5. Unfold a net: the number of ? is increased by n and the number of unused training patterns is decreased by one.
   f(x, y, v) + u(w) + r(z) + T + w(0) > f(x+n, y, v) + u(w−1) + r(z) + T + w(0)
   ⇔ 2x + y + v + w(2m+2) > 2(x+n) + y + v + (w−1)(2m+2)
   ⇔ 2x + y + v + 2mw + 2w > 2x + 2n + y + v + 2mw + 2w − 2m − 2
   ⇔ 2m + 2 > 2n
   ⇔ 2tn + 2 > 2n
   As there is at least one training pattern (t ≥ 1), this always holds.
6. Switch from backward to forward training:
   b(x, y, v) + u(w) + r(z) + T + w(0) > f(x, y, v) + u(w) + r(z) + T + w(0)
   ⇔ x + 2(y + v) + (m + 1) > 2x + y + v
   Worst case y = 0, v = 0, x = m: m + (m + 1) > 2m.
7. Switch from forward to backward training: either the T node is deleted during the application,
   f(x, y, v) + u(w) + r(z) + T + w(0) > b(x, y, v) + u(w) + r(z) + w(0)
   ⇔ 2x + y + v + 2m + 2 > x + 2(y + v) + (m + 1)
   Worst case y + v = m, x = 0: m + 2m + 2 > 2m + m + 1,
   or the number of unused data is decreased, where the decrease of the corresponding natural number can be shown similarly. ∎

From this lemma it follows that the graph rewriting system terminates.
VIII. Normal Forms

Taking Propositions III.1, III.6 and III.4 together, we know that the graph rewriting system has exactly one normal form.

Proposition VIII.1 (Normal Form) The normal form of the rules of part I has the form given in Figure 1. □
Fig. 1. Normal Form of the Rules
Proof: From Propositions III.4, III.6 and III.1 we know that the system has unique normal forms. By giving one possible derivation that ends in the graph of Figure 1 we know that all possible derivations lead to the same result. One possible derivation is the application of BPTT, which is not given explicitly here. ∎

Corollary VIII.2 (RTBP and BPTT are equal) Real Time Backpropagation and Backpropagation Through Time lead to the same result. □

The set of rules, however, contains not only RTBP and BPTT but also a whole range of algorithms lying between these two extrema. During the application of RTBP the net is never unfolded, but the rule switching from backward to forward computation is applied t − 1 times, with t being the number of training patterns. For BPTT, on the other hand, the rule to unfold the net is applied t − 1 times and the rule switching from backward to forward computation is never applied. Classes of algorithms can therefore be defined on the basis of the number of applications of these two rules.

Proposition VIII.3 (Number of Nets) The total number of applications of the rule to unfold the net and of the rule to switch from backward to forward computation is always t − 1. □

Proposition VIII.4 (Classes of Derivations) An x-y derivation with x + y = t − 1 contains x applications of the rule to unfold a network and y applications of the rule switching from backward to forward computation. Each class contains (t−1 choose y) single derivations. □

Please note that nothing is said about the order in which the rules are applied. Within one of these classes there are algorithms where first all nets are unfolded and then all switches are done, or vice versa; every combination of switches and net unfoldings is possible. For RTBP and BPTT, where the number of net unfoldings resp. the number of switches is 0, the corresponding classes contain just one algorithm.
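The counting statements of Propositions VIII.3 and VIII.4 are easy to check mechanically: with t training patterns there are t − 1 decision points, each resolved either by unfolding the net or by switching from backward to forward computation. The snippet below, with an assumed t = 4, enumerates all decision sequences and groups them into the x-y classes.

# Enumerate the x-y derivation classes of Proposition VIII.4 for an assumed t = 4.
from itertools import product
from math import comb

t = 4                                          # number of training patterns
seqs = list(product("us", repeat=t - 1))       # u = unfold net, s = switch b -> f
for x in range(t):                             # x unfoldings, y = t - 1 - x switches
    cls = [seq for seq in seqs if seq.count("u") == x]
    assert len(cls) == comb(t - 1, t - 1 - x)  # each class has (t-1 choose y) members

# The two one-element classes are the extrema: ("s", "s", "s") is RTBP and
# ("u", "u", "u") is BPTT; all other sequences are algorithms in between.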
IX. Conclusion

In part I and part II of the paper "Showing the Equivalence of two Training Algorithms" the equivalence of a special variant of Real Time Backpropagation and Backpropagation Through Time was shown with the help of graph transformations. The basic idea came from [7], which proves the same result with signal-flow graphs. In part I graph rewriting rules modeling the two algorithms were given; in part II the local confluence and termination of the rules were shown, from which the equivalence of the two algorithms can be concluded. Additionally, different classes of algorithms can be described with the rules, having RTBP and BPTT as two of these classes.
References

[1] Michael R. Berthold and Ingrid Fischer: Modeling Neural Networks with Graph Transformation Systems, Proceedings of the IEEE International Joint Conference on Neural Networks, Houston, Texas, June 1997.
[2] N. Dershowitz and J.-P. Jouannaud: Rewrite Systems, Handbook of Theoretical Computer Science, Chapter 15, 1990.
[3] Ingrid Fischer, Manuel Koch, and Michael R. Berthold: Proving Properties of Neural Networks with Graph Transformation, Proceedings of the 1998 IEEE International Joint Conference on Neural Networks, Anchorage, Alaska, May 4-9, 1998.
[4] Manuel Koch, Ingrid Fischer, and Michael R. Berthold: Showing the Equivalence of two Training Algorithms – Part I, Proceedings of the 1998 IEEE International Joint Conference on Neural Networks, Anchorage, Alaska, May 4-9, 1998.
[5] D. Plump: Hypergraph Rewriting: Critical Pairs and Undecidability of Confluence, in Term Graph Rewriting: Theory and Practice, Wiley, 1993.
[6] G. Rozenberg (ed.): Handbook on Graph Grammars: Foundations, Vol. 1, World Scientific, 1996.
[7] Eric A. Wan and Francoise Beaufays: Relating Real-Time Backpropagation and Backpropagation-Through-Time: An Application of Flow Graph Interreciprocity, Neural Computation, Vol. 6, No. 2, 1994.