Deriving the appropriate gradient descent algorithm for a new network architecture or system con guration normally involves brute force derivative calculations.
Submitted to Neural Computation, March 1994.
Diagrammatic Derivation of Gradient Algorithms for Neural Networks Eric A. Wan and Francoise Beaufaysy
Abstract Deriving gradient algorithms for time-dependent neural network structures typically requires numerous chain rule expansions, diligent bookkeeping, and careful manipulation of terms. In this paper, we show how to use the principle of Network Reciprocity to derive such algorithms via a set of simple block diagram manipulation rules. The approach provides a common framework to derive popular algorithms including backpropagation and backpropagation-through-time without a single chain rule expansion. Additional examples are provided for a variety of complicated architectures to illustrate both the generality and the simplicity of the approach.
1 Introduction Deriving the appropriate gradient descent algorithm for a new network architecture or system con guration normally involves brute force derivative calculations. For example, the celebrated backpropagation algorithm for training feedforward neural networks was derived by repeatedly applying chain rule expansions backward through the network (Rumelhart et al. 1986; Werbos 1974; Parker 1982). The actual implementation of backpropagation may be viewed as a simple reversal of signal ow through the network. Another popular algorithm, backpropagation-through-time for recurrent networks, can be derived by Euler Lagrange or ordered derivative methods, and involves both a signal ow reversal and time reversal (Werbos 1992; Nguyen and Widrow 1989). For both these algorithms, there is a reciprocal nature to the forward propagation of states and the backward propagation of gradient terms. Furthermore, both algorithms are ecient in the sense that calculations are order N, where N is the number of variable weights in the network. These properties are often attributed to the clever manner in which the algorithms were derived for a speci c network architecture. We will show, however, that these properties are universal to all network architectures and that the associated gradient algorithm may be formulated directly with virtually no eort. In section 2, we review the basic framework for gradient descent adaptation and error propagation. In section 3, we show how to transform a neural network architecture into a reciprocal network using a set of
Department of Electrical Engineering and Applied Physics, Oregon Graduate Institute of Science & Technology, P.O. Box 91000, Portland, OR 97291. y Department of Electrical Engineering, Stanford University, Stanford, CA 94305-4055.
1
simple block diagram manipulation rules. The reciprocal network directly speci es the adaptive algorithm, providing a formal derivation that requires no explicit algebra. Whereas the original network corresponds to a nonlinear time-independent system (assuming the weights are xed), the reciprocal network is a linear timedependent system. Several examples are provided in section 4 to illustrate the simplicity of the approach. Algorithms are derived for a variety of structures, including feedforward and feedback systems. A formal proof of the method is given in Appendix A.1 The concepts detailed in this papers were developed in Wan (1993) and later presented in Wan (1994).
2 Network Adaptation and Error Gradient Propagation In supervised learning, a neural network is provided with a training set of input vectors and associated desired output vectors [fx(0); d(0)g; :::fx(K); d(K)g]. Let us de ne W as the set of variable weights which parameterizes the neural network. Training amounts to nding the speci c set of weights that minimizes the cost function: J=
K X
k=1
Lk (d(k); y(k));
(1)
where k is used to specify a discrete time index (the actual order of presentation may be random or sequential), y(k) is the output of the network associated with input x(k), and Lk is a generic error metric that may contain additional weight regularization terms. For illustrative purposes, we will work with the squared error metric, Lk = e(k)T e(k), where e(k) is the error vector. In most problems, a desired output is speci ed at each time step. However, in some problems (e.g., terminal control), the desired output is de ned only at nal time k = K. Therefore, we de ne the error vector e(k) as the dierence between the desired and the actual output vectors when a desired output is available, and as zero otherwise. According to gradient descent, the contribution to the weight update at each time step is @J ; (2) W(k) = ? @W(k) where controls the learning rate. Note that we evaluate @J=@W(k) rather than the instantaneous gradient @(eT (k)e(k))=@W(k). This is essential for the desired Network Reciprocity result. There are several possibilities for the actual adaptation procedure depending on the speci c nature of the application. The weights may be iteratively updated at each time step according to W(k), or the gradients may be accumulated for an entire pass through the training sequence followed by a single batch update.2 With a nite length training sequence, adaptation involves repeatedly cycling through the training sequence for time k = 1 through K. If at the begin of each pass through the training sequence, a new set of initial input values x(0) are applied, we eectively minimize E[J], where the expectation is taken over the initial values of the inputs. The method presented here is similar in spirit to Automatic Dierentiation (Rall, 1981; Griewank and Corliss, 1991). Automatic Dierentiation is a simple method for nding derivative of functions and algorithms that can be represented by acyclic graphs. Our approach, however, applies to discrete-time systems with the possibility of feedback. In addition, we are concerned with diagrammatic derivations rather than computational rule based implementations. 2 The gradient may also be accumulated for use in second-order optimization methods. 1
2
At the architectural level, a variable weight wij may be isolated between two points in a network with corresponding signals ai (k) and aj (k) (i.e., aj (k) = wij ai (k)). Using the chain rule, we get @J @aj (k) @J @J (3) @wij (k) = @aj (k) @wij (k) = @aj (k) ai (k); and the weight update becomes 3 wij (k) = ? j (k) ai (k); (4) where we de ne the error gradient 4 @J : (5) j (k)= @a (k) j
The error gradient j (k) depends on the entire topology of the network. Specifying the gradient descent rule necessitates nding an explicit formula for calculating the delta term. Backpropagation, for example, is nothing more than an algorithm for generating these terms in a feedforward network. In the next section, we develop a simple non-algebraic method for deriving the delta terms associated with any network architecture.
3 Network Representation and Reciprocal Construction Rules An arbitrary neural network can be represented as a block diagram whose building blocks are: summing junctions, branching points, univariate functions, multivariate functions, and time-delay operators. Only discrete-time systems are considered. A signal located within the network is labeled as aj (k). A synaptic weight, for example, may be thought of as a linear transmittance, which is a special case of a univariate function. The basic neuron is simply a sum of linear transmittances followed by a univariate sigmoid function. Networks can then be constructed from individual neurons, and may include additional functional blocks and time-delay operators that allow for buering of signals and internal memory. This block diagram representation is really nothing more than the typical pictorial description of a neural network with a bit of added formalism. Directly from the block diagram we may construct the reciprocal network by reversing the ow direction in the original network, labeling all resulting signals i (k), and performing the following operations: 1. Summing junctions are replaced with branching points. δ
ai
δ
aj δ
al
2. Branching points are replaced with summing junctions. δi
a δj
a
δl
a 3
In the general case of a variable parameter, we have
aj (k )
@aj (k ) ?j (k) @w , where the partial term depends on the form of f . ij (k )
3
= f (wij ; ai (k)), and Equation 4 becomes wij (k) =
3. Univariate functions are replaced with their derivatives. ai (k)
δi (k)
aj (k)
f( )
δj (k)
f ’(ai (k))
Explicitly, the scalar continuous function aj (k) = f(ai (k)) is replaced by i (k) = f 0 (ai (k)) j (k), where f 0 (ai (k))4=@aj (k)=@ai (k). Note this rule replaces a nonlinear function by a linear time-dependent transmittance. Special cases are:
Weights: aj = wij ai, in which case i = wij j . wij
ai
wij
δi
aj
δj
Activation functions: an (k) = tanh(aj (k)). In this case, f 0 (aj (k)) = 1 ? a2n(k). aj (k)
δj (k)
an (k)
tanh( )
4. Multivariate functions are replaced with their Jacobians.
a (k) in
{
ai
an
aj
ao
F( ) am
ap
{
{
δ (k)
aout (k)
in
δn(k)
2
1 - an(k)
δi δj
δn δo
F’(ain(k))
δm
δp
{
δout (k)
A multivariate function maps a vector of input signals into a vector of output signals, aout = F(ain). In the transformed network, we have in (k) = F 0(ain (k)) out (k), where F 0(ain (k))4=@ aout (k)=@ ain (k) corresponds to a matrix of partial derivatives. For shorthand, F 0(ain (k)) will be written simply as F 0(k). Clearly both summing junctions and univariate functions are special cases of multivariate functions. Other important cases include:
Product junctions: aj (k) = ai (k) al (k), in which case F 0 = [al (k) ai(k)]T . δi (k)
ai (k)
a( l
aj (k) δl (k)
al (k)
k)
k) a i(
δj (k)
Layered networks. A multivariate function may itself represent a multi-layer network. In this case, the product F 0(ain (k)) out (k) is found directly by backpropagating out through the network. ai
an
δi
δn
aj
ao
δj
δo
4
5. Delay operators are replaced with advance operators. a (k) i
q-1
δ (k) = δ (k+1) i
aj (k) = ai (k-1)
j
q+1
δ (k) j
A delay operator q?1 performs a unit time delay on its argument: aj (k) = q?1 ai (k) = ai (k ? 1). In the reciprocal system, we form a unit time advance: i (k) = q+1 j (k) = j (k + 1). The resulting system is thus noncausal. Actual implementation of the reciprocal network in a causal manner is addressed in speci c examples. 6. Outputs become inputs. ai aj
original network
an = yn
δi
ao = yo
δj
reciprocal network
δ n = − 2e n δ o = − 2e o
By reversing the signal ow, output nodes an (k) = yn (k) in the original network become input nodes in the reciprocal network. These inputs are then set at each time step to ?2en (k). (For cost functions other than squared error, the input should be set to @Lk =@yn (k).) These 6 rules allow direct construction of the reciprocal network from the original network. Note that there is a topological equivalence between the two networks. The order of computations in the reciprocal network is thus identical to the order of computations in the forward network. The signals j (k) that propagate through the reciprocal network correspond to the terms @J=@aj (k) necessary for gradient adaptation. Exact equations may then be \read-out" directly from the reciprocal network, completing the derivation. A formal proof of the validity and generality of this method is presented in Appendix A.
4 Examples 4.1 Backpropagation We start by rederiving standard backpropagation [3] using the principles of Network Reciprocity. Figure 1 shows a hidden neuron feeding other neurons and an output neuron in a multilayer network. For consistency with traditional notation, we have labeled the summing junction signal sli rather than ai , and added superscripts to denote the layer. In addition, since multilayer networks are static structures, we omit the time index k. The reciprocal network shown in Figure 2 is found by applying the construction rules of the previous section. From this gure, we may immediately write down the equations for calculating the delta terms: 8 > >
0 l X l+1 l+1 j wij > : f (si ) j
l=L 0 l L ? 1:
(6)
By Equation 4, the weight update is formulated as wpil = ? il alp?1 : 5
(7)
+
apl-1
l
+
si
+
f( )
snL
yn f( )
+ wij
+
s jl+1
Figure 1: Block diagram construction of a multilayer network. +
δi
δnL
l
f ’(si l)
f ’(snL)
+
-2en
+
wij δjl+1
+
Figure 2: Reciprocal multilayer network. These are precisely the equations describing standard backpropagation. In this case, there are no delay operators and j = j (k) 4= @J=@sj (k) = (@ eT (k)e(k))=@sj (k) corresponds to an instantaneous gradient. Readers familiar with neural networks have undoubtedly seen these diagrams before. What is new is the concept that the diagrams themselves may be used directly, completely circumventing all intermediate steps involving tedious algebra. It should further be emphasized that this approach still constitutes a formal derivation, as we prove in the appendix.
4.2 Backpropagation-Through-Time For the next example, consider a network with output feedback (see Figure 3) described by
y(k) = N (x(k); y(k ? 1));
(8)
where x(k) are external inputs, and y(k) represents the vector of outputs that form feedback connections. N is a multilayer neural network. If N has only one layer of neurons, every neuron output has a feedback connection to the input of every other neuron and the structure is referred to as a fully recurrent network [9]. Typically, only a select set of the outputs have an actual desired response. The remaining outputs have no desired response (error equals zero) and are used for internal computation. Direct calculation of gradient terms using chain rule expansions is extremely complicated. A weight perturbation at a speci ed time step aects not only the output at future time steps, but future inputs as 6
y(k)
x(k) y(k-1) N (k)
q-1
δx(k)
δ(k)
δy(k)
-2e(k) δy(k+1)
N’(k)
q
Figure 3: Recurrent network and backpropagation-through-time. well. However, applying the Network Reciprocity rules (see Figure 3) we nd immediately: (k)
= y (k + 1) ? 2e(k) = N 0(k + 1)(k + 1) ? 2e(k):
(9)
These are precisely the equations describing backpropagation-through-time, which have been derived in the past using either ordered derivatives [8] or Euler-Lagrange techniques [2]. Network Reciprocity is by far the simplest and most direct approach. Note that the causality constraints require these equations to be run backward in time. This implies a forward sweep of the system to generate the output states and internal activation values. These values must be stored in memory to allow implementation of the time-varying reciprocal system in a backward sweep. Also from rule 4 in the previous section, the product N 0 (k) (k) may be calculated directly by a standard backpropagation of (k + 1) through the network at time k.4
4.3 Cascaded Neural Networks Let us now turn to the more complicated example of two cascaded neural networks as illustrated in Figure 4. The inputs to the rst network are samples from a time sequence x(k). Delayed outputs of the rst network are fed to the second network. Often, the last network represents the model of some physical system, and the rst network is used to prewarp or equalize the driving signal. The cascaded networks are de ned as u(k) = N1 (W1 ; x(k); x(k ? 1); x(k ? 2)); y(k) = N2 (W2 ; u(k); u(k ? 1); u(k ? 2));
(10) (11)
Backpropagation-through-time is viewed as an o-line gradient descent algorithm in which weight updates are made after each presentationof an entire training sequence. An on-line version in which adaptationoccurs at each time step is possible using an algorithm called real-time-backpropagation (Williams and Zipser 1989). The algorithm, however, is far more computationally expensive. The authors have presented a method based on ow graph interreciprocity to directly relate the two algorithms (Beaufays and Wan 1994a). 4
7
x(k)
-1
u(k)
-1
q
q
N1
-1
-1
q
q
y(k)
N2
Figure 4: Cascaded neural network lters. where W1 and W2 represent the weights parameterizing the networks, x(k) is the input, y(k) the output, and u(k) the intermediate signal. Given a desired response for the output y of the second network, it is a straightforward procedure to use backpropagation for adapting the second network. It is not obvious, however, what the eective error should be for the rst network. In this case, the chain rule is simple enough to apply directly to nd the instantaneous error gradient: @e2 (k) @W1
where we de ne
=
(k ) ?2e(k) @y @W
=
?2e(k)
=
1 (k)
1
?
?
@y(k) @u(k) @y(k) @u(k 1) @y(k) @u(k 2) + + @u(k) @W1 @u(k 1) @W1 @u(k 2) @W1
?
?
?
@u(k) @u(k 1) @u(k 2) + 2 (k ) + 3 (k ) ; @W1 @W1 @W1
4 ? 2e(k) @y(k) i (k)= @u(k ? i)
?
(12)
(13)
i = 1; 2; 3:
The i terms are found simultaneously by a single backpropagation of the error through the second network. Each product i+1 (k)(@u(k ? i)=@W1 ) is then found by backpropagation applied to the rst network with i+1 (k) acting as an error. However, since the derivatives used in backpropagation are time-dependent, separate backpropagations are necessary for each i+1 (k). These equations, in fact, imply backpropagation through an unfolded structure as illustrated in Figure 5, and is equivalent to weight sharing. (Le Cun et al., 1989). In situations where there may be hundreds of taps in the second network, this approach leads to a very inecient adaptation algorithm. A more ecient algorithm for nding the delta terms may be arrived at by returning to the method of Network Reciprocity. The original cascaded networks are transformed into the reciprocal structure shown in Figure 6. Simply by labeling the desired signals, gradient relations may be written down directly: u (k) = 1 (k) + 2 (k + 1) + 3 (k + 2); with
(14)
[1 (k) 2 (k) 3 (k)] = ?2e(k) N20 (u(k)); (15) i.e., each i (k) is found by backpropagation through the output network, and the i 's (after appropriate advance operations) are summed together. The weight update is given by @u(k) ; (16) W1(k) = ? u (k) @W 1 (k) in which the product term is found by a single backpropagation with u (k) acting as the error to rst network. Equations can be made causal by simply delaying the weight update for a few time steps. Clearly, 8
x(k)
q-1
x(k-1)
q-1
N1
q-1
x(k-2)
q-1
N1
u(k)
q-1
q-1
N1
u(k-2)
u(k-1)
y(k)
N2
Figure 5: Cascaded neural network lters unfolded-in-time. extrapolating to an arbitrary number of taps is also straightforward. This new algorithm is far more ecient than the earlier direct gradient calculation method: we completely avoided backpropagation through a redundant unfolded network. δ (k) x
q
δu(k)
q
q δ (k) 1
N’ 1 (x(k))
δ (k) 2
q δ (k) 3
N’ 2 (x(k))
-2e(k)
Figure 6: Reciprocal network for cascaded neural lters.
4.4 Temporal Backpropagation For xed synaptic weights, a feedforward network forms a static mapping from input to output. A more complicated model can be constructed by replacing all scalar weights with discrete time linear lters to provide dynamic interconnectivity between neurons. The lters represents a more accurate model of axonal transport, synaptic modulation, and membrane charge dissipation (Koch and Segev 1989; MacGregor 1987). Mathematically, a neuron i in layer l may be speci ed as: sli (k) =
X
j
Wijl (q?1 ) alj?1 (k) + wbias
ali (k) = f(sli (k));
(17) (18)
where a(k) are neuron output values, s(k) are summing junctions, f() are sigmoid functions, and W(q?1 ) 9
l-1
spi (k)
+
s il (k)
+
ail (k) f()
+
wij
wij ail (k)
q-1
q-1
q-1
wij(1)
wij(2)
l+1
wij(0)
sij (k)
wij(M)
l+1
sj (k)
+ l+1
sij (k)
+
l
δi (k) f ’(si l (k))
+ l
δij (k) +
wij
wij
l
δij (k)
q wij(0)
wij(1)
q
q wij(2)
wij(M)
l+1
+
δj (k)
l+1
δj (k)
Figure 7: Block diagram construction of an FIR network and corresponding reciprocal structure. are synaptic lters.5 Three possible forms for W(q?1 ) are: 8 > w Case I > > > W(q?1) =
> > > > > < > > > > > > > > > :
M X
w(m)q?m
m=0 PM ?m m=0 a(m)q P 1? M m=1 b(m)q?m
Case II
(19)
Case III
In Case I, the lter reduces to a scalar weight and we have the standard de nition of a neuron for feedforward networks. Case II corresponds to a Finite Impulse Response (FIR) lter in which the synapse forms a weighted sum of past values of its input. The resulting network forms a spatial and temporal distributed system. Case III represents the more general In nite Impulse Response (IIR) lter, in which feedback is permitted. In all cases, coecients are assumed to be adaptive. Figure 7 illustrates a network composed of FIR lter synapses realized as tap-delay lines. The scalar multiplication in the traditional synapse model has been replaced by a convolution. These networks have been utilized for a number of time-series and system identi cation problems (Wan 1993a,b,c).
5 The time domain operator q?1 is used instead of the more common z-domain variable z?1 . The z notation would imply an actual transfer function which does not apply in nonlinear systems.
10
Deriving the gradient descent rule for adapting lter coecients is quite formidable if we use a direct chain rule approach. However, using the construction rules described earlier, we may trivially form the reciprocal network also shown in Figure 7. By inspection we have il (k) = f 0 (sli (k))
X
j
ijl (k)
l+1 X MX 0 l = f (si (k)) wijl+1 (n)jl+1 (k + n) j n=0 X 0 l = f (s (k)) W l+1 (q+1 ) l+1 (k):
i
ij
j
(20)
j
Consideration of an output neuron at layer L yields
jL (k) = ?2ej (k)f 0 (sLj (k)): These equation de ne the algorithm known as temporal backpropagation (Wan 1993a,b). The algorithm may be viewed as a temporal generalization of backpropagation in which error gradients are propagated not by simply taking weighted sums, but by backward ltering. Note that in the reciprocal network, backpropagation is achieved through the reciprocal lters W(q+1 ). Since this is a noncausal lter, a delay for a few time steps in the actual weight update is necessary to implement the on-line adaptation. δy (k)
y(k) x(k)
b(1)
b(2)
b(3)
q-1
q-1
q-1
a(1)
a(2)
a(3)
b(1)
δx (k)
q
b(2)
q
a(1)
a(2)
b(3)
q a(3)
Figure 8: IIR lter (controller canonical form) and reciprocal IIR lter (observer canonical form). In the IIR case, each adaptive lter performs the following operation: y(k) =
M X
a(m)y(k ? m) +
M X
b(m)x(k ? m) =
PM ?m m=0 b(m)q PM ?m x(k)
(21) 1 ? m=1 a(m)q where x(k) corresponds to some internal activation value, and y(k) is the output of the synapse feeding a summing junction. A controller canonical realization (Kailath 1980) for the IIR lter is drawn in Figure 8. Using direct chain rule methods, Back and Tsoi (1991) have derived complicated algorithms for the IIR networks involving order N 2 computations. Using the reciprocal construction rules, however, we can easily determine how to propagate the delta terms through the IIR lter in only order N computations. By inspection from Figure 8 we have: PM M M +m X X m=0 b(m)q y (k) (22) a(m)x (k + m) + b(m)y (k + m) = P x (k) = M 1 ? m=1 a(m)q+m m=0 m=1 Note that the realization of the reciprocal lter corresponds to a noncausal observer canonical form (Kailath 1980). For the entire IIR network, we simply propagate error terms backward through the network in a m=1
m=0
11
manner symmetric to the forward propagation of lter terms. As with backpropagation-through-time, the network must be trained using a forward and backward sweep necessitating storage of all activation values at each step in time. Equation 20 for temporal backpropagation still applies with W(q+1 ) representing a noncausal IIR lter. y(k) κ1
x(k)
κ2
κ3
κ1
κ2
κ3
-1
-1
-1
q
κ1
q
κ2
κ1 δx(k)
q
κ2
q
x(k)
δx(k)
κ1
-1
-1
q
q
κ3 κ3
κ1
κ2
κ3
q
(b)
κ2
κ3
κ3
q
(a)
δy(k)
κ3
-1
q
y(k)
(c)
κ2
κ1
κ2
κ1
q
q
q
δy(k)
(d)
Figure 9: Lattice lters: (a) FIR, (b) reciprocal FIR, (c) IIR, (d) reciprocal IIR. In the case of IIR lters, stability of the system becomes an issue. Note that the poles of the forward IIR lter are reciprocal to the poles of the reciprocal lter. Stability monitoring can be made easier if we consider other realizations of IIR lters. Figure 9 shows an all-pole lattice IIR lter and its corresponding reciprocal structure. Also shown is a lattice implementation of an FIR lter6 . Stability is guaranteed for the IIR lattice if the magnitude of each coecient is less than 1 (Haykin 1991). Regardless of the choice of the lter realization, network reciprocity provides a simple uni ed approach for deriving a learning algorithm. The above examples allow us to extrapolate the following additional construction rule: Any linear subsystem H(q?1 ) in the original network is transformed to H(q+1 ) in the reciprocal system. This is just the generalization of replacing time delays q?1 with time advances q+1 .
4.5 Time-Delay Neural Networks Similar to FIR networks are time-delay neural networks (TDNN) illustrated in Figure 10. (Each circle in the gure represents a neuron. A line represents a layer of weights.) A TDNN consists of a feedforward layered network in which the outputs of a layer are buered several time steps and then fed fully connected to the next layer. Published methods for adapting the networks involve unfolding the network to a static equivalent and then using a batch update for a given training sequence (Waibel 1989; Waibel et al. 1989). The reciprocal network for the TDNN is shown in Figure 10. It implies a new more ecient on-line algorithm equivalent to 6 In related work (Beaufays and Wan 1994b), Network Reciprocity was used to derive an algorithm to minimize the output power at each stage of an FIR lattice lter. This provides an adaptive lattice predictor used as a decorrelating preprocessor to a second adaptive lter. The new algorithm is more eective than the Griths algorithm (Griths 1977).
12
y(k)
-2e(k)
-1
q
q
-1
q
-1
-1
q
q
-1
q
x(k)
δ(k)
q
q
q
q
Figure 10: TDNN and reciprocal network. temporal backpropagation for FIR networks. (The replacement of summing junctions by branching points and sigmoids by their derivatives is not detailed in the gure.) To completely specify the algorithm we need only add formal notation.
4.6 Neural Control Architectures u(k)
r(k) x(k-1)
x(k)
Plant
x(k-1)
P(k)
N(k)
q-1
δr (k)
δu (k)
δx1 (k)
δx2 (k) N’(k)
Plant Jacobian Px’(k), Pu’(k)
δp (k)
-2e(k) δx3 (k+1)
q
Figure 11: Neural network control architecture and reciprocal counterpart. For the nal set of examples, we consider feedback systems for neural control problems. A generic nonlinear plant may be described by x(k) = P (u(k); x(k ? 1)); (23) 13
where x(k) is the state vector for the system and u(k) is the control signal. The goal is to drive some of the states to desired values using a multi-layer neural network to generate the control signal: u(k) = N (r(k); x(k ? 1)); (24) where r(k) may constitute some external reference signal. It is assumed that full state information is available for feedback. This common neural control con guration is illustrated in Figure 11. The desired response for the system may be acquired using a model reference (e.g., a second order linear plant), or using a linear quadratic regulator type approach (Narendra and Parthasarathy 1990; Landau 1979). For terminal control problems such as robotics, the desired response may exist only at the nal time step, and the necessary trajectory of controlled states must be generated by the network (Plumer 1993; Bryson and Ho 1975). In order to adapt the weights of the controller, it is necessary to nd the gradient terms u (k) = @J=@u(k), which constitutes an eective error for the neural network. Directly from Figure 11 we have: p (k) = N 0(k + 1)u (k + 1) + Px0 (k + 1)p (k + 1) ? 2e(k) u (k) = Pu0 (k)p (k):
(25) (26)
These coupled equations are well known formulas for backpropagation-through-time applied to a controller structure (Nguyen and Widrow 1989; Werbos 1992). Px0 (k) = [@ x(k)=@ x(k ? 1)]T and Pu0 (k) = [@ x(k)=@u(k)]T are Jacobian matrices for the plant. Hence we assumed the availability of either a mathematical model for the plant or possibly a neural network model of it. If a neural network model exists, the product P 0 represents a backpropagation through the network model. u(k)
r(k) q-1
Controller
q-1
q-1
q-1
q-1
q-1
q-1
q-1
N(k)
Plant Model y(k)
P(k)
y(k-1)
q-1
Figure 12: Neural network control using nonlinear ARMA models. A limitation of this system is the need for full-state information x(k). More often only some observation y(k) of the plant output is available. Figure 12 illustrates a possible solution in which we have assumed a nonlinear Autoregressive Moving Average (ARMA) model for both the plant and controller. Alternatively, an FIR or IIR network could be used to replace the standard multi-layer network. Applying the principles of Network Reciprocity, we arrive at the system for gradient propagation illustrated in Figure 13. Equations can be written down simply by adding further notation.
14
δr (k)
δu (k) q
Controller
q
Plant Model
q q
q
δp (k)
-2e(k)
q
q
q
’ N(k)
’ P(k) q
Figure 13: Reciprocal network for control using nonlinear ARMA models.
5 Summary
The previous examples served to illustrate the ease in which algorithms may be derived using Network Reciprocity. One starts with a diagrammatic representation of the network of interest. A reciprocal network is then constructed by simply swapping summing junctions for branching points, continuous functions with derivative transmittances, and time delays with time advances (a linear subsystem H(q?1) is replaced with H(q+1 )). The nal algorithm is read directly o the reciprocal network. No messy chain rules are needed. The approach provides a uni ed framework for formally deriving gradient algorithm for arbitrary network architectures, network con gurations, and systems.
A Proof of Network Reciprocity We need to show that the method of Network Reciprocity constitutes a formal derivation for arbitrary network architectures. The outline of the proof is as follows: 1. A perturbation is applied to some speci c node a (k) in the network. It is shown that the perturbation propagates through a derivative network that is topologically equivalent to the original network. The derivative network is a time-dependent structure. 2. The derivative network is systematically unraveled in time to produce a linear time independent ow graph. Normalizing the ow graph signals by the perturbation, and taking the limit as the perturbation approaches zero, provides the output gradient @J=@a (k)4= (k). 3. Next, the principle of ow graph interreciprocity is evoked to reverse the signal ow through the unraveled network. Summing junctions are swapped with branching points, and inputs with outputs. 4. The reverse ow graph is raveled back in time to produce the reciprocal network. The input, originally corresponding to a perturbation, becomes an output providing (k). 5. By symmetry, it is argued that all signals in the reciprocal network correspond to @J=@ai (k)4=i (k). We now proceed with the full proof of the principle of Network Reciprocity: 15
1. We will initially assume that only univariate functions exist within the network. This is by no means restrictive. It has been shown (Hornik et. al 1989; Cybenko 1989; Irie and Miyake 1988) that a feedforward network with two or more layers and a sucient number of internal neurons can approximate any uniformly continuous multivariate function to an arbitrary accuracy. A feedforward network is, of course, composed of simple univariate functions and summing junctions. Thus any desired multivariate function in the overall network architecture is assumed to be well approximated using a univariate composition. We may completely specify the topology of a network by the set of equations aj (k) =
X
i
Tij ai (k)
8j
T 2 ff( ); q?1 g;
(27) (28)
where aj (k) is the signal corresponding to the node aj at time k. The sum is taken over all signals ai (k) which connect to aj (k), and Tij is a transmittance operator corresponding to either a univariate function (e.g., sigmoid function, constant multiplicative weight), or a delay operator. (The symbol is used to remind us that T is an operator whose argument is a.) The signals aj (k) may correspond to inputs (aj 4=xj ), outputs (aj 4=yj ), or internal signals to the network. Feedback of signals is permitted. The goal is to nd @J=@aj (k) for each node in the network. We cannot proceed, however, by simply taking derivatives of Equation 27 (the derivative is not de ned for a delay operator q?1). Therefore, we add to a speci c node a a perturbation a (k) at time k. The perturbation propagates through the network resulting in eective perturbations aj (k) for all nodes in the network. We would like to know how the perturbations are related through the network. Through a continuous univariate function in which aj (k) = f(ai (k)) we have, to rst order: j (k) a (k) = f 0 (a (k))a (k); (29) aj (k) = @a j i @ai (k) i where it must be clearly understood that aj (k) and ai (k) are the perturbations directly resulting from the external perturbation a (k). Through a delay operator, aj (k) = q?1 ai(k) = ai (k ? 1), we have: aj (k) = ai (k ? 1) = q?1ai (k):
(30)
Combining these two results with Equation 27 gives aj (k) =
X
i
Tij0 ai (k)
where we de ne
8j;
(31)
T 0 2 ff 0 (ai (k)); q?1g: Note that f 0 (ai (k)) is a linear time-dependent transmittance. Equation 31 de nes the derivative network which is topologically identical to the original network (i.e., one-to-one correspondence between signals and connections). Functions are simply replaced by their derivatives. This is a rather obvious result, and simply states that a perturbation propagates through the same connections and in the same direction as would normal signals. 16
∆a* (k)
∆a* (k)
q-1
q-1
k
k
∆y(k)
∆y(k)
Figure 14: a) Time dependent input/output system for the derivative network. b) Same system with all delays drawn externally. 2. The derivative network may be considered a time dependent system with input a (k) and outputs y(k) as illustrated in Figure 14a. Imagine now redrawing the network such that all delay operators q?1 are dragged outside the functional block (Figure 14b). Equation 31 still applies. Neither the de nition nor the topology of the network has been changed. However, we may now remove the delay operators by cascading copies of the derivative network as illustrated in Figure 15. Each stage has a dierent set of transmittance values corresponding to the time step. The unraveling process stops at the nal time K (K is allowed to approach 1). Additionally, the outputs y(n) at each stage are multiplied by ?2e(n)T and then summed P over all stages to produce a single output J 4= Kn=k ?2e(n)T y(n). k
k+1
k+2
K
∆a* (k)
∆y(k)
∆y(k+1) T
-2e (k)
T
-2e (k+1)
∆y(k+2) T
-2e (k+2)
∆y(K) T
-2e (K) ∆J
Figure 15: Flow graph corresponding to unraveled derivative network. Summarizing, we have formed a structure with input a (k) and output J. By removing the delay operators, the time index k may now be treated as simply a labeling index. The unraveled structure is in fact a time independent linear ow graph. By linearity, all signals in the ow graph can be divided by a (k) so that the input is now 1, and the output is J=a (k). In the limit of small a (k) K X (k) = T y(n) lim J=a lim ? 2 e (n) a (k) a (k)!0 a (k)!0 n=k K X @ y(n) ?2e(n)T @a = (k) n=k
17
(32) (33)
k
k+1
k+2
K
δ (k) *
-2e (k)
-2e (k+1)
-2e (k+2)
-2e (K) 1.0
Figure 16: Transposed ow graph corresponding to unraveled derivative network. K @ e(n)T e(n) X : (34) = n=k @a (k) Since the system is causal the partial of the error at time n with respect to a (k) is zero for n < k. Thus K @ e(n)T e(n) K @ e(n)T e(n) X X = (35) n=1 @a (k) n=k @a (k) = @a@J (36) (k) 4 = (k): (37) The term (k) is precisely what we were interested in nding. Unfortunately, we have at this point a rather unappealing structure. Furthermore, calculating all the i (k) terms would require separately propagating signals through the unraveled network with an input of 1 at each location associated with ai(k). The entire process would then have to be repeated at every time step. 3. The next step in the proof is to take the unraveled network (i.e., ow graph) and form its transpose. This is accomplished by reversing the signal ow direction, transposing the branch gains, replacing summing junctions by branching points and vice versa, and interchanging input and output nodes. The new ow graph is represented in Figure 16. From the work by Tellegen (1952) and Bordewijk (1956), we know that transposed ow graphs are a particular case of interreciprocal graphs. This means that the output obtained in one graph, when exciting the input with a given signal, is the same as the output value of the transposed graph, when exciting its input by the same signal. In other words, the two graphs have identical transfer functions. This basic property, which was rst presented in the context of electrical circuits analysis (Pen eld et al. 1970), nds applications in a wide variety of engineering disciplines, such as the reciprocity of emitting and receiving antennas in electromagnetism (Ramo et al. 1984), and the duality between decimation in time and decimation in frequency formulations of the FFT algorithm in signal processing (Oppenheim and Schafer 1989). Flow graph interreciprocity was rst applied to neural networks to relate real-time-backpropagation and backpropagation-through-time by Beaufays and Wan (1994a). Returning to the problem at hand, if an input of 1:0 is distributed along the lower horizontal branch of the transposed graph, the nal output will equal (k). This (k) is identical to the output of our original
ow graph. 18
4. The transposed graph can now be raveled back in time to produce the reciprocal network: δ (k) *
q
q
k
-2e(k)
Since the direction of signal ow has been reversed, delay operators q?1 become advance operators q+1 . The node a (k) which was the original source of an input perturbation, is now the output (k) as desired. The outputs of the original network become inputs with value ?2e(k). Summarizing the steps involved in nding the reciprocal network, we start with the original network, form the derivative network, unravel in time, transpose, and then ravel back up in time. These steps are accomplished directly by starting with the original network and simply swapping branching points and summing junctions (rules 1 and 2), functions f for derivatives f 0 (rule 3), and q?1's for q+1 's (rule 5). 5. Finally we note that the selection of the speci c node a (k) is totally arbitrary. Had we started with any node ai (k), we would still have arrived at the same result. In all cases, the input to the reciprocal network would still be ?2e(k). Thus by symmetry, every signal in the reciprocal network provides i (k) = @J=@ai (k). This is exactly what we set out to prove for the signal ow in the reciprocal network. To tie up loose ends, consider isolating the set of signals ain (k) corresponding to the inputs of some desired multivariate function, and the set of signals aout (k) corresponding to outputs. Thus aout(k) = F(ain(k))), where by our earlier statements, it was assumed the function was explicitly represented by a composition of summing junctions and univariate functions. F is restricted to be memoryless and thus can not be composed of any delay operators. In the reciprocal network, ain (k) becomes in(k) and aout (k) becomes out(k). But
@ aout (k) T @J = F 0(a (k)) (k): = (38) in out @ ain (k) @ ain (k) @ aout (k) Thus any section within a network that contains no delays may be replaced by the multivariate function F( ), and the corresponding section is the reciprocal network is replaced with the Jacobian F 0(ain (k)). This veri es rule 4. QED. 4 @J in(k)=
Acknowledgements This work was funded in part by EPRI under contract RP801013 and by NSF under grant IRI 91-12531.
19
References Back, A., and Tsoi, A. 1991. FIR and IIR synapses, a new neural networks architecture for time series modeling. Neural Computation, vol. 3, no. 3, pages 375-85. Beaufays, F., and Wan, E. 1994a. Relating real-time backpropagation and backpropagation-through-time: an application of ow graph interreciprocity. Neural Computation, vol. 6, no. 2, 1994a. Beaufays, F., and Wan, E. 1994b. An Ecient First-Order Stochastic Algorithm for Lattice Filters, to appear in ICANN'94, Salerno, Italy. Bordewijk, J. 1956. Inter-reciprocity applied to electrical networks. Appl. Sci. Res. 6B, pages 1-74. Bryson, A., Ho, Y. 1975. Applied Optimal Control. Hemisphere Publishing Corp. NY. Cybenko, G. 1989. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, vol. 2, no. 4. Griewank, A., and Coliss, G., Editors. 1991. Automatic Dierentiation of Algorithms: Theory, Implementation, and Application. Proceedings of the rst SIAM workshop on automatic dierentiation, Brekenridge, Colorado. Griths, L. 1977. A continuously adaptive lter implemented as a lattice structure. In Proc. ICASSP, Hartford, Conn., pages 683-686. Haykin, S. 1991. Adaptive Filter Theory. Prentice Hall, Inc. Englewood Clis, New Jersey 07632. Hornik, K., Stinchombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks, vol. 2, pages 359-366. Irie, B., and Miyake, S. 1988. Capabilities of three-layered perceptrons. In Proceedings of the IEEE Second International Conference on Neural Networks, vol. I, San Diego, CA, pages 641-647, July. Kailath, T. 1980. Linear Systems. Prentice-Hall., Englewood Clis, N.J. Koch, C., and Segev, I. editors. 1989. Methods in Neuronal Modeling: From Synapses to Networks, MIT Press. Landau, I. 1979. Adaptive Control: The Model Reference Approach. Marcel Dekker, New York. LeCun, Y., Boser, B., et al. 1989. Backpropagation applied to handwritten zip code recognition. Neural Computation, vol. 1, pages 541-551, winter. 20
MacGregor, R. 1987. Neural and Brain Modeling. Academic Press, New York. Narendra, K, and Parthasarathy, K. 1990. Identi cation and control of dynamic systems using neural networks. IEEE Trans. on Neural Networks, vol. 1, no. 1, pages 4-27. Nguyen, D., and Widrow, B. 1989. The truck backer-upper: an example of self-learning in neural networks, In Proceedings of the International Joint Conference on Neural Networks, II, Washington, DC, pages 357-363. Oppenheim, A., and Schafer, R. 1989. Digital Signal Processing. Prentice-Hall, Englewood Clis, NJ. Parker, D., 1982. Learning-logic. Invention Report S81-64, File 1, Oce of Technology Licensing, Stanford University, October. Pen eld, P., Spence, R., and Duiker. S. 1970 Tellegen's Theorem and Electrical Networks, MIT Press, Cambridge, Mass. Plumer, E. 1993. Optimal Terminal Control Using Feedforward Neural Networks. Ph.D. dissertation. Stanford University. Rall, B. 1981. Automatic Dierentiation: Techniques and Applications, Lecture Notes in Computer Science, Springer-Verlag. Ramo, S., Whinnery, J.R., and Van Duzer. T. 1984. Fields and Waves in Communication Electronics, Second Edition. John Wiley & Sons. Rumelhart, D.E., McClelland, J.L., and the PDP Research Group. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1. MIT Press, Cambridge, MA. Tellegen, D. 1952. A general network theorem, with applications. Philips Res. Rep. 7. pages 259-269. Waibel, A. 1989a. Modular construction of time-delay neural networks for speech recognition. Neural Computation, vol. 1, no.1, pages 39-46, Spring. Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K. 1989b. Phoneme recognition using timedelay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 3, pages 328-339, March. Wan, E. 1993a. Finite Impulse Response Neural Networks with Applications in Time Series Prediction. Ph.D. dissertation. Stanford University.
21
Wan, E. 1993b. Time series prediction using a connectionist network with internal delay lines. In A. Weigend and N. Gershenfeld, editors, Time Series Prediction: Forecasting the Future and Understanding the Past, Addison-Wesley. Wan, E. 1993c. Modeling nonlinear dynamics with neural networks: examples in time series prediction. In Proceedings of the Fifth Workshop on Neural Networks: Academic/Industrial/NASA/Defense, WNN93/FNN93, San Francisco, pages 327-232, November. Wan, E., and Beaufays, F. 1994. Network Reciprocity: A Simple Approach to Derive Gradient Algorithms for Arbitrary Neural Network Structures. To appear in WCNN'94, San Diego, CA. Werbos, P. 1974. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Ph.D. thesis, Harvard University, Cambridge, MA. Werbos, P. 1992. Neurocontrol and supervised learning: An overview and evaluation. In D. White and D. Sofge, editors, Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, chapter 3. Van Nostrand Reinhold, New York. Werbos, P. 1990. Backpropagation through time: what it does and how to do it. Proc. IEEE, Special Issue on Neural Networks, vol. 2, pages 1550-1560. Williams, R., and Zipser D. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computations, vol.1, pages 270-280.
22