An EM Approach to Learning Sequential Behavior

Yoshua Bengio
Dept. Informatique et Recherche Operationnelle
Universite de Montreal, Montreal, Qc H3C-3J7
[email protected]

Paolo Frasconi
Dipartimento di Sistemi e Informatica
Universita di Firenze (Italy)
[email protected]

Tech. Report DSI 11/94, Universita di Firenze, May 23, 1994


Abstract

We consider problems of sequence processing and we propose a solution based on a discrete state model. We introduce a recurrent architecture having a modular structure that allocates subnetworks to discrete states. Different subnetworks model the dynamics (state transitions) and the output of the model, conditional on the previous state and an external input. The model has a statistical interpretation and can be trained by the EM or GEM algorithms, considering state trajectories as missing data. This allows temporal credit assignment to be decoupled from the actual parameter estimation. The model presents similarities to hidden Markov models, but allows one to map input sequences to output sequences, using the same processing style as recurrent networks. For this reason we call it an Input/Output HMM (IOHMM). Another remarkable difference is that IOHMMs are trained using a supervised learning paradigm (while potentially taking advantage of the EM algorithm), whereas standard HMMs are trained by an unsupervised EM algorithm (or a supervised criterion with gradient ascent). We also study the problem of learning long-term dependencies with Markovian systems, making comparisons to recurrent networks trained by gradient descent. The analysis reported in this paper shows that Markovian models generally suffer from a problem of diffusion of temporal credit for long-term dependencies and fully connected transition graphs. However, while recurrent networks exhibit a conflict between long-term information storing and trainability, these two requirements are either both satisfied or both not satisfied in Markovian models. Finally, we demonstrate that EM supervised learning is well suited for solving grammatical inference problems. Experimental results are presented for the seven Tomita grammars, showing that these adaptive models can attain excellent generalization.


1 Introduction

For many learning problems, the data of interest have a significant sequential structure. Problems of this kind arise in a variety of applications, ranging from written or spoken language processing to the production of actuator signals in control tasks. Feedforward networks are inadequate in many of these cases because of the absence of a memory mechanism that can retain past information in a flexible way. Even if these models include delays in their connections (Waibel, Hanazawa, Hinton, Shikano, & Lang, 1989), the duration of the temporal contingencies that can be captured is fixed a priori by the architecture rather than being inferred from data. Recurrent networks, on the other hand, can model arbitrary dynamical systems and can store and retrieve contextual information in a very flexible way, i.e., for durations that are not fixed a priori and that can vary from one sequence to another. Up until the present time, research efforts on supervised learning for recurrent networks have been almost exclusively focused on gradient descent methods. In these cases, supervision defines an error function and the weights of the network are iteratively updated by following the gradient of the error function. Numerous algorithms are available for computing the gradient. For example, the back-propagation through time algorithm (Rumelhart, Hinton, & Williams, 1986; Pearlmutter, 1989) is a straightforward generalization of back-propagation that allows the gradient to be computed in fully recurrent networks. The real time recurrent learning algorithm (Kuhn, 1987; Robinson & Fallside, 1988; Williams & Zipser, 1989) is local in time and produces a partial gradient after each time step, thus allowing on-line weight updating. Another algorithm was proposed for training local feedback recurrent networks (Gori, Bengio, & De Mori, 1989; Mozer, 1989). It is also local in time, but requires computation only proportional to the number of weights, like back-propagation through time. Local feedback recurrent networks are suitable for implementing short-term memories but they have limited representational power for dealing with general sequences (Frasconi, Gori, & Soda, 1992).

Despite their appealing features, training recurrent networks may not be an easy task. Practical difficulties have been reported in training recurrent neural networks to perform tasks in which the temporal contingencies present in the input/output sequences span long intervals (Bengio, Simard, & Frasconi, 1994; Mozer, 1992). In fact, it can be proved that any parametric dynamical system, such as a recurrent neural network, will be increasingly difficult to train with gradient descent as the duration of the dependencies to be captured increases (Bengio et al., 1994). A mathematical analysis of the problem shows that one of two conditions arises in such systems. In the first case, the dynamics of the network allow it to reliably store bits of information for arbitrary durations, with bounded input noise; however, gradients with respect to an error at a given time step vanish exponentially fast as one propagates them backward in time. In the second case, the gradients can flow backward but the system is locally unstable and cannot reliably store bits of information. Previous work on alternative training algorithms (Bengio, Frasconi, & Simard, 1993; Bengio et al., 1994) suggests that the root of the problem lies in the essentially discrete nature of the process of storing information for an indefinite amount of time.
A potential solution to this problem is to propagate, backward in time, targets in state space rather than differential error information. In order to gain some intuition about target propagation, suppose that an

oracle is available that provides targets for each internal state variable and for each time step. In this case learning would be reduced to a static learning problem, namely the problem of learning the next-state and output mappings that define the behavior of the dynamical system using the state space representation. Of course, such an oracle is not available in general. It essentially supposes prior knowledge of an appropriate state representation. However, we can conceive an iterative approach based on two repeated steps: a first step approximates the oracle by providing pseudo-targets, and a second step fits the parameters to the pseudo-target state trajectory and to the actual targets derived from supervision. One of the first related approaches is probably the moving target algorithm by Rohwer (1990), although in that case learning takes place in the joint space of targets and parameters, simultaneously.

Extending previous work (Bengio & Frasconi, 1994), in this paper we propose a statistical approach to target propagation, based on the EM algorithm. We consider a parametric dynamical system with discrete states and we introduce a modular architecture, with subnetworks associated to discrete states. The architecture can be interpreted as a statistical model and can be trained by the EM or generalized EM (GEM) algorithms of Dempster, Laird, & Rubin (1977), by considering the internal state trajectories as missing data. In this way learning is factored into a temporal credit assignment subproblem and a static learning subproblem that consists in fitting parameters to the next-state and output mappings defined by the estimated trajectories. In order to iteratively tune parameters with the EM or GEM algorithms, the system propagates forward and backward a discrete distribution over the n states, resulting in a procedure similar to the Baum-Welch algorithm used to train standard hidden Markov models (HMMs) (Baum, Petrie, Soules, & Weiss, 1970; Levinson, Rabiner, & Sondhi, 1983). HMMs however adjust their parameters using unsupervised learning, whereas we use EM in a supervised fashion. Furthermore, the model presented here could be called an Input/Output HMM, or IOHMM, because it can be used to learn to map input sequences to output sequences (unlike standard HMMs, which learn the output sequence distribution). Unlike HMMs, which represent a distribution P(y_1^t) over a sequence y_1^t = y_1 ⋯ y_t, IOHMMs represent a distribution P(y_1^t | x_1^t) over an output sequence y_1^t when an input sequence x_1^t is given. Thus IOHMMs can perform sequence regression (y continuous) or classification (y discrete). Experiments on artificial tasks (Bengio & Frasconi, 1994) have shown that EM recurrent learning can deal with long-term dependencies more effectively than back-propagation through time and other alternative algorithms. However, the model used in (Bengio & Frasconi, 1994) has very limited representational capabilities and can only map an input sequence to a final discrete state. In the present paper we describe an extended architecture that makes it possible to fully exploit both the input and output portions of the data, as required by the supervised learning paradigm. In this way, general sequence processing tasks can be addressed, such as production, classification, or prediction.

The paper is organized as follows. Section 2 is devoted to a circuit description of the architecture and its statistical interpretation. In Section 3 we derive the equations for EM recurrent learning.
In particular, we present an EM version of the learning algorithm that can be used for quantized inputs and linear subnetworks, and a GEM version for general multilayered subnetworks. The proposed model has interesting connections to standard HMMs

and to a recurrent version of the Mixture of Experts architecture by Jacobs, Jordan, Nowlan & Hinton (1991). In Section 4 we discuss in some detail the relationships among these models. In Section 5 we analyze from a theoretical point of view the learning capabilities of the model in the presence of long-term dependencies, arguing that improvements can be achieved by adding an extra term to the likelihood function and/or by constraining the state transitions of the model. Finally, we report experimental results for a classical benchmark study in grammatical inference (Tomita's grammars). The results demonstrate that the model can achieve very good generalization using few training examples.

2 The Proposed Architecture

We consider a discrete state dynamical system based on the following state space description:

    q_t = f(q_{t-1}, u_t)
    y_t = g(q_t, u_t)        (1)

where u_t ∈ R^m is the input vector at time t, y_t ∈ R^r is the output vector, and q_t ∈ Q = {1, 2, …, n} is a discrete state. These equations define a generalized Mealy finite state machine, in which inputs and outputs may take on continuous values. f(·) is referred to as the state transition function and g(·) is the output function. In this paper, we consider a probabilistic version of these dynamics, where the current inputs and the current state distribution are used to estimate the state distribution and the output distribution for the next time step.

Admissible state transitions will be specified by a transition graph G = {Q, E}, whose vertices correspond to the model's states. A transition from state j to state i is admissible if and only if there exists an edge e_ij ∈ E. We define the set of successors for each state j as S_j = {i ∈ Q : ∃ e_ij ∈ E}.

The system defined by equations (1) can be modeled by the recurrent architecture depicted in Figure 1.

[Figure 1: The proposed recurrent architecture for a number of states n = 2.]

The architecture is composed of a set of state networks N_j, j = 1…n, and a set of output networks O_j, j = 1…n. Each one of the state and output networks is uniquely associated with one of the states in Q, and all networks share the same input u_t. Each state network N_j has the task of predicting the next state representation, based on the current input and given that q_{t-1} = j. Similarly, each output network O_j predicts the output of the system, given the current state and input. All the subnetworks are assumed to be static, i.e., they are defined by means of algebraic mappings N_j(u_t, θ_j) and O_j(u_t, ϑ_j), where θ_j and ϑ_j are vectors of adjustable parameters (e.g., connection weights). We assume that these functions are differentiable with respect to the parameters. The ranges of the functions N_j(·) may be constrained in order to account for the underlying transition graph. Each output φ_{ij,t} of the state subnetwork N_j is associated with one of the successors i of state j. Thus the last layer of N_j has as many units as the cardinality of S_j. For convenience of notation, we suppose that the φ_{ij,t} are defined for each i, j = 1, …, n and we impose the condition φ_{ij,t} = 0 for each i not belonging to S_j. The softmax function (Bridle, 1989a) is used in the last layer:

    φ_{ij,t} = exp(a_{ij,t}) / Σ_{ℓ ∈ S_j} exp(a_{ℓj,t}),     j = 1, …, n,  i ∈ S_j        (2)

where a_{ij,t} are intermediate variables that can be thought of as the activations of the output units of subnetwork N_j. In this way

    Σ_{i=1}^{n} φ_{ij,t} = 1    ∀ j, t.        (3)

The vector ζ_t ∈ R^n represents the internal state of the model and it is computed as a linear combination of the outputs of the state networks, gated by the previously computed internal state:

    ζ_t = Σ_{j=1}^{n} ζ_{j,t-1} φ_{j,t}        (4)

where φ_{j,t} = [φ_{1j,t}, …, φ_{nj,t}]′.

The output networks compete to predict the global output of the system η_t ∈ R^r:

    η_t = Σ_{j=1}^{n} ζ_{j,t} η_{j,t}        (5)

where η_{j,t} ∈ R^r is the output of subnetwork O_j. At this level, we do not need to further specify the internal architecture of the state and output subnetworks. Depending on the task, the designer may decide whether to include hidden layers and what activation rule to use for the hidden units.
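To make equations (2)-(5) concrete, the following is a minimal NumPy sketch of one forward sweep of the architecture. It is an illustration only: it assumes single-layer (linear) state and output subnetworks, a fully connected transition graph (S_j = Q for every j), and arbitrary parameter shapes chosen for the example; none of these names or choices are prescribed by the paper.

import numpy as np

def softmax(a):
    # Equation (2): normalize activations into transition probabilities.
    e = np.exp(a - a.max())
    return e / e.sum()

def iohmm_forward(U, V_state, W_out, zeta0):
    """One forward sweep of the architecture of Figure 1.

    U       : (T, m) input sequence u_1 ... u_T
    V_state : (n, n, m) weights of the n state networks; V_state[j] maps u_t
              to the pre-softmax activations of subnetwork N_j
    W_out   : (n, r, m) weights of the n output networks O_j
    zeta0   : (n,) initial state probabilities (sums to 1)
    """
    T, m = U.shape
    n = zeta0.shape[0]
    zeta = zeta0.copy()
    etas = []
    for t in range(T):
        u = U[t]
        # phi[:, j] is the next-state distribution predicted by N_j (eq. 2).
        phi = np.column_stack([softmax(V_state[j] @ u) for j in range(n)])
        # Equation (4): convex combination of the columns, gated by zeta_{t-1}.
        zeta = phi @ zeta
        # Equation (5): expert outputs eta_{j,t}, mixed by the state distribution.
        eta_j = np.stack([W_out[j] @ u for j in range(n)])   # shape (n, r)
        etas.append(zeta @ eta_j)                            # eta_t, shape (r,)
    return np.array(etas)

# Example with n = 2 states, m = 3 inputs, r = 1 output, random parameters.
rng = np.random.default_rng(0)
n, m, r, T = 2, 3, 1, 5
out = iohmm_forward(rng.normal(size=(T, m)),
                    rng.normal(size=(n, n, m)),
                    rng.normal(size=(n, r, m)),
                    np.array([0.5, 0.5]))
print(out.shape)   # (5, 1)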

2.1 A Probabilistic Model

The above architecture can be easily interpreted as a probability model. Let us define the vector of indicator variables z_t as:

    z_{it} = 1 if q_t = i, and z_{it} = 0 otherwise.        (6)

Assuming a multinomial distribution for the states we have

    P(z_t) = Π_{i=1}^{n} ζ_{it}^{z_{it}}        (7)

so that ζ_{it} = P(q_t = i | u_1^t), having denoted with u_1^t the subsequence of inputs from time 1 to t, inclusively. (To simplify the notation, we write the probability P(X = x) that a discrete random variable X takes on the value x as P(x), unless this would introduce ambiguity; similarly, if X is a continuous variable we use P(x) to denote its probability density.) The vector ζ_0 is chosen to be a vector of initial state probabilities, i.e., Σ_i ζ_{i,0} = 1. From equation (4) one recognizes that the subnetworks N_j compute transition probabilities conditioned on the input sequence u_1^t: φ_{ij,t} = P(q_t = i | q_{t-1} = j, u_1^t). As in neural networks trained to minimize the output mean squared error (MSE), the output η_t of this architecture can be interpreted as an expected "position parameter" for the probability distribution of the output y_t. However, in addition to being conditional on an input u_1^t, this expectation is also conditional on the state q_t:

    η_t = E[y_t | q_t, u_1^t].        (8)

The actual form of the output density, denoted f_Y(y_t; η_t), will be chosen according to the task. For example, a multinomial distribution is suitable for sequence classification, or for symbolic mutually exclusive outputs. Instead, a Gaussian distribution is adequate for producing continuous outputs. In the first case we use a softmax function at the output of the subnetworks O_j; in the second case we use linear output units for the subnetworks O_j.
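As a small illustration of these two choices of f_Y, the helper functions below (hypothetical names, not taken from the paper) evaluate the log-density of a target under a multinomial (softmax) output and under a spherical Gaussian output with a fixed, assumed variance.

import numpy as np

def log_fY_multinomial(y_onehot, eta):
    """Multinomial output density: eta is the softmax output of an output network
    (entries sum to 1); y_onehot selects the observed symbol. Classification case."""
    return float(np.log(eta[np.argmax(y_onehot)]))

def log_fY_gaussian(y, eta, sigma2=1.0):
    """Gaussian output density with mean eta (equation (8)) and fixed variance
    sigma2. Continuous-output (regression) case."""
    d = y - eta
    return float(-0.5 * (d @ d) / sigma2 - 0.5 * len(y) * np.log(2 * np.pi * sigma2))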

2.2 Conditional Dependencies

The random variables (for input, state, and output) involved in the probabilistic interpretation of the proposed architecture have a joint probability P(u_1^T, q_1^T, y_1^T). Without conditional independency assumptions, the amount of computation necessary to estimate the probabilistic relationships among these variables can quickly become intractable. Thus we introduce an independency model over this set of random variables. For any three random variables X, Y and Z, we say that X is conditionally independent of Y given Z, written I(X, Z, Y), if

    P(X = x | Y = y, Z = z) = P(X = x | Z = z)

for each pair (x, z) such that P(X = x, Z = z) > 0. A dependency model M is a mapping that assigns truth values to the predicate I(X, Z, Y). Rather than listing a set of conditional independency assumptions, we prefer to express dependencies using a graphical representation.

[Figure 2: Bayesian network expressing conditional dependencies among the random variables involved in the probabilistic interpretation of the recurrent architecture.]

A dependency model can be represented by means of a directed acyclic graph (DAG) according to the following definitions (Pearl, 1988):

Definition 1 (d-separation). Let X, Y, Z be three disjoint subsets of nodes in a DAG D = (S, E). Then Z is said to d-separate X from Y, denoted

    ⟨X | Z | Y⟩_D

if along every path between a node in X and a node in Y there is a node w satisfying one of the following two conditions:
1. w has converging arrows and none of w or its descendants are in Z;
2. w does not have converging arrows and w is in Z.

Definition 2 (Independency maps). A DAG D = (S, E) is said to be an independency map (I-map) of a dependency model M if for every three disjoint sets of vertices X, Y, Z we have

    ⟨X | Z | Y⟩_D  ⟹  I(X, Z, Y)_M.        (9)

D is said to be a minimal I-map of M if none of its arrows can be deleted without destroying its I-mapness. A minimal I-map of M is called a Bayesian network for the dependency model M.

In practice a Bayesian network is constructed by allocating one node for each variable in S and by creating one edge X → Y for each variable X that is believed to have a direct causal impact on Y. We shall use a Bayesian network to characterize the probabilistic dependencies among the variables involved in the probabilistic interpretation of the architecture:

Assumption 1. We suppose that the DAG G depicted in Figure 2 is a Bayesian network for the dependency model M associated to the variables u_1^T, q_1^T, y_1^T.

One of the most evident consequences of this independency model is that only the previous state and the current input are relevant for determining the next state. This one-step memory property is analogous to the Markov assumption in hidden Markov models (HMMs). In fact, the Bayesian network for HMMs can be obtained by simply removing the u_t nodes (and the arcs leaving them) from Figure 2. However, there are basic differences between this architecture and standard HMMs, both in terms of computing style and learning. These differences will be further discussed in Section 4.1.

3 A Supervised Learning Algorithm

The learning algorithm for the proposed architecture is derived from the maximum likelihood principle. The training data are a set of P pairs of input/output sequences:

    D = { (u_1^{T_p}(p), y_1^{T_p}(p));  p = 1 … P }.

Let Θ denote the vector of parameters obtained by collecting all the parameters θ_j and ϑ_j of the architecture. The likelihood function is then given by

    L(Θ; D) = Π_{p=1}^{P} P(y_1^{T_p}(p) | u_1^{T_p}(p); Θ).        (10)

(In the following, in order to simplify the notation, the sequence index p may be omitted.)

The output values (used here as targets) may also be specified intermittently. For example, in sequence classification tasks, one is only interested in the output y_T at the end of each sequence. The modification of the likelihood to account for intermittent targets is straightforward. According to the maximum likelihood principle, the optimal parameters are obtained by maximizing (10). The optimization problem arising from this formulation of learning can be effectively solved in the framework of parameter estimation with missing data, where the missing variables are the state indicator variables z_t (describing a path in state space). Let us first briefly describe the EM algorithm.

3.1 The EM Algorithm

EM (estimation-maximization) is an iterative approach to maximum likelihood estimation (MLE). Each iteration is composed of two steps: an estimation (E) step and a maximization (M) step. The aim is to maximize the log-likelihood function l(Θ; D) = log L(Θ; D), where Θ are the parameters of the model and D are the data. Suppose that this optimization problem could be simplified by knowledge of additional variables Z, known as missing or hidden data. The set D_c = D ∪ Z is referred to as the complete data set (in the same context D is referred to as the incomplete data set). Correspondingly, the log-likelihood function l_c(Θ; D_c) is referred to as the complete data likelihood. The function l_c(Θ; D_c) would be easily maximized if Z were known. However, since Z is not observable, l_c is a random variable and cannot be maximized directly. Thus, the EM algorithm relies on integrating over the distribution of Z, with the auxiliary function

    Q(Θ; Θ̂) = E[ l_c(Θ; D_c) | D, Θ̂ ]        (11)

which is the expected value of the complete data log-likelihood, given the observed data D and the parameters Θ̂ computed at the end of the previous iteration. Intuitively, computing Q corresponds to filling in the missing data using the knowledge of the observed data and of the previous parameters. This function is deterministic and can be maximized. An EM algorithm then iterates the following two steps, for k = 1, 2, …, until a local maximum of the likelihood is found:

    Estimation:    compute Q(Θ; Θ^(k)) = E[ l_c(Θ; D_c) | D, Θ^(k) ]        (12)
    Maximization:  update the parameters as Θ^(k+1) = arg max_Θ Q(Θ; Θ^(k))

In some cases it is difficult to analytically maximize Q(Θ; Θ^(k-1)), as required by the M step of the above algorithm, and we are only able to compute a new value Θ^(k) that produces an increase of Q. In this case we have a so-called generalized EM (GEM) algorithm: update the parameters as Θ^(k) = M(Θ^(k-1)), where M(·) is such that

    Q(M(Θ^(k-1)); Θ^(k-1)) ≥ Q(Θ^(k-1); Θ^(k-1)).        (13)

The following theorem guarantees the convergence of EM and GEM algorithms to a (local) maximum of the incomplete data likelihood:

Theorem 1. For each GEM algorithm

    L(M(Θ); D) ≥ L(Θ; D)        (14)

where the equality holds if and only if

    Q(M(Θ); Θ) = Q(Θ; Θ).        (15)

Proof. See (Dempster et al., 1977). □
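The following pseudocode-style Python skeleton summarizes the EM/GEM iteration described above. The functions e_step, m_step and log_likelihood are placeholders standing for any concrete model, not routines defined in the paper; replacing the exact maximizer in m_step by any update that merely increases Q turns the loop into a GEM algorithm.

def em_fit(data, theta, e_step, m_step, log_likelihood,
           max_iter=100, tol=1e-5):
    """Generic EM / GEM loop (a sketch, under the assumptions stated above).

    e_step(theta, data)          -> expected sufficient statistics (fills in Z)
    m_step(stats, theta, data)   -> new parameters; an exact maximizer of Q gives EM,
                                    any update that increases Q gives a GEM algorithm
    log_likelihood(theta, data)  -> incomplete-data log-likelihood l(theta; data)
    """
    prev = log_likelihood(theta, data)
    for _ in range(max_iter):
        stats = e_step(theta, data)          # Estimation: expectations needed for Q
        theta = m_step(stats, theta, data)   # Maximization (or mere increase) of Q
        curr = log_likelihood(theta, data)
        # Theorem 1 guarantees curr >= prev; stop near a (local) maximum.
        if (curr - prev) / abs(curr) < tol:
            break
        prev = curr
    return theta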

3.2 EM for Training the Recurrent Architecture

In order to apply EM to our case we begin by noting that the variables Z defined in equation (6) are not observed. Knowledge of these variables (i.e., the model's state trajectories) would allow one to decompose the temporal learning problem into 2n static learning subproblems. Indeed, if z_t were known (i.e., deterministic), the probabilities ζ_{it} would be either 0 or 1 and it would be possible to train each subnetwork separately, without taking into account any temporal dependency. This observation allows us to link EM learning to the target propagation

approach discussed in the introduction. Note that if we used a Viterbi-like approximation (i.e., considering only the most likely path), we would indeed have 2n static learning problems at each epoch. Actually, we can think of the E step of EM as an approximation of the oracle that provides target states for each time t, based on averaging over all values of Z, where the distribution of Z is conditioned on the values of the parameters at the previous epoch. The M step then fits the parameters for the next epoch to the estimated trajectories. In the sequel of this section we derive the learning equations for our architecture. Let us define the complete data as

    D_c = { (u_1^{T_p}(p), y_1^{T_p}(p), z_1^{T_p}(p));  p = 1 … P }.

The corresponding complete data likelihood is

    L_c(Θ; D_c) = Π_{p=1}^{P} P(y_1^{T_p}(p), z_1^{T_p}(p) | u_1^{T_p}(p); Θ)        (16)

that can be decomposed as (again, we omit the sequence index p)

    P(y_1^T, z_1^T | u_1^T; Θ) = P(y_T, z_T | y_1^{T-1}, z_1^{T-1}, u_1^T; Θ) P(y_1^{T-1}, z_1^{T-1} | u_1^T; Θ)
                               = P(y_T, z_T | z_{T-1}, u_T; Θ) P(y_1^{T-1}, z_1^{T-1} | u_1^{T-1}; Θ).        (17)

The last equality follows from the conditional independency model that we have assumed. Iterating the decomposition we have:

    L_c(Θ; D_c) = Π_{p=1}^{P} Π_{t=1}^{T} P(y_t, z_t | z_{t-1}, u_t; Θ)

that we can rewrite as

    Π_{p=1}^{P} Π_{t=1}^{T} P(y_t | z_t, u_t; Θ) P(z_t | z_{t-1}, u_t; Θ)
        = Π_{p=1}^{P} Π_{t=1}^{T} Π_{i=1}^{n} Π_{j=1}^{n} P(y_t | q_t = i, u_t; Θ)^{z_{it}} P(q_t = i | q_{t-1} = j, u_t; Θ)^{z_{it} z_{j,t-1}}.

Taking the logarithm we obtain the following expression for the complete data log-likelihood:

    l_c(Θ; D_c) = log L_c(Θ; D_c)
                = Σ_{p=1}^{P} Σ_{t=1}^{T} Σ_{i=1}^{n} z_{it} log P(y_t | q_t = i, u_t; Θ)
                  + Σ_{p=1}^{P} Σ_{t=1}^{T} Σ_{i=1}^{n} Σ_{j=1}^{n} z_{it} z_{j,t-1} log P(q_t = i | q_{t-1} = j, u_t; Θ).        (18)

Since l_c(Θ; D_c) depends on the hidden state indicator variables, we cannot maximize (18) directly. If the z_{it} were known, we would have solved the temporal credit assignment problem. To complete learning it would suffice to extract from the data the static mappings that produce the output and the next state. In this situation EM helps to decouple static and temporal learning.

3.2.1 The Estimation Step

We can compute the expected value of l_c(Θ; D_c), conditional on the data D and the "old" parameters Θ̂, as follows:

    Q(Θ; Θ̂) = Σ_{p=1}^{P} Σ_{t=1}^{T} Σ_{i=1}^{n} [ ζ̂_{it} log P(y_t | q_t = i, u_t; Θ)
              + Σ_{j=1}^{n} ĥ_{ij,t} log P(q_t = i | q_{t-1} = j, u_t; Θ) ]        (19)

where the weighting terms ĥ_{ij,t} = E[z_{it} z_{j,t-1} | u_1^T, y_1^T, Θ̂] are the elements of the autocorrelation matrix for consecutive state pairs. The hat in ζ̂_{it} and ĥ_{ij,t} means that these variables are computed using the old parameters Θ̂.

In order to compute ĥ_{ij,t} we introduce the following probabilities, borrowing the notation from the HMM literature:

    α_{it} = P(y_1^t, q_t = i | u_1^t)        (20)
    β_{it} = P(y_t^T | q_t = i, u_t^T).        (21)

We have:

    P(y_1^t, q_t = i | u_1^t)
        = Σ_ℓ P(y_1^{t-1} | y_t, q_{t-1} = ℓ, q_t = i, u_1^t) P(y_t | q_{t-1} = ℓ, q_t = i, u_1^t) P(q_{t-1} = ℓ, q_t = i | u_1^t)
        = Σ_ℓ P(y_1^{t-1}, q_{t-1} = ℓ | u_1^{t-1}) P(y_t | q_t = i, u_t) P(q_t = i | q_{t-1} = ℓ, u_t)        (22)

and thus

    α_{it} = f_Y(y_t; η_{it}) Σ_ℓ φ_{iℓ}(u_t) α_{ℓ,t-1}.        (23)

Moreover

    P(y_t^T | q_t = i, u_t^T)
        = Σ_ℓ P(y_{t+1}^T | y_t, q_{t+1} = ℓ, q_t = i, u_t^T) P(y_t | q_{t+1} = ℓ, q_t = i, u_t^T) P(q_{t+1} = ℓ | q_t = i, u_t^T)
        = Σ_ℓ P(y_{t+1}^T | q_{t+1} = ℓ, u_{t+1}^T) P(y_t | q_t = i, u_t) P(q_{t+1} = ℓ | q_t = i, u_{t+1})        (24)

and thus

    β_{it} = f_Y(y_t; η_{it}) Σ_ℓ φ_{ℓi}(u_{t+1}) β_{ℓ,t+1}.        (25)

Then we can express h_{ij,t} as:

    h_{ij,t} = P(q_t = i, q_{t-1} = j | y_1^T, u_1^T)
             = P(y_1^T | q_t = i, q_{t-1} = j, u_1^T) P(q_t = i | q_{t-1} = j, u_1^T) P(q_{t-1} = j | u_1^T) / P(y_1^T | u_1^T)     (Bayes)
             = P(y_1^T | q_t = i, q_{t-1} = j, u_1^T) φ_{ij}(u_t) ζ_{j,t-1} / L(Θ; D)        (26)

where

    P(y_1^T | q_t = i, q_{t-1} = j, u_1^T)
        = P(y_t^T | y_1^{t-1}, q_t = i, q_{t-1} = j, u_1^T) P(y_1^{t-1} | q_t = i, q_{t-1} = j, u_1^T)
        = P(y_t^T | q_t = i, u_t^T) P(y_1^{t-1}, q_{t-1} = j | u_1^{t-1}) / P(q_{t-1} = j | u_1^{t-1})
        = β_{it} α_{j,t-1} / ζ_{j,t-1}        (27)

and finally:

    h_{ij,t} = β_{it} α_{j,t-1} φ_{ij}(u_t) / Σ_i α_{iT},        (28)

having computed the likelihood as

    L(Θ; D) = P(y_1^T | u_1^T) = Σ_i P(y_1^T, q_T = i | u_1^T) = Σ_i α_{iT}.        (29)
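The E step can thus be implemented with a forward-backward sweep analogous to the one used for HMMs. The sketch below follows equations (23), (25), (28) and (29) for a single sequence. It assumes that the transition probabilities phi[t, i, j] = φ_ij(u_t) and the output likelihoods b[t, i] = f_Y(y_t; η_{i,t}) have already been computed by the state and output subnetworks; that calling convention is our own choice for the example, not something fixed by the paper, and no rescaling against numerical underflow is included.

import numpy as np

def e_step_posteriors(phi, b, zeta0):
    """Forward-backward recursions for one sequence.

    phi   : (T, n, n) with phi[t, i, j] = P(q_t = i | q_{t-1} = j, u_t)
    b     : (T, n)    with b[t, i]      = f_Y(y_t; eta_{i,t})
    zeta0 : (n,) initial state probabilities
    Returns alpha, beta, the posterior transition terms h, and the likelihood L.
    """
    T, n, _ = phi.shape
    alpha = np.zeros((T, n))
    beta = np.zeros((T, n))
    # Forward pass, eq. (23); zeta0 plays the role of alpha at time 0.
    prev = zeta0
    for t in range(T):
        alpha[t] = b[t] * (phi[t] @ prev)
        prev = alpha[t]
    # Backward pass, eq. (25); at the last step beta reduces to the output likelihood.
    beta[T - 1] = b[T - 1]
    for t in range(T - 2, -1, -1):
        beta[t] = b[t] * (phi[t + 1].T @ beta[t + 1])
    L = alpha[T - 1].sum()                      # eq. (29)
    # Posterior transition probabilities, eq. (28).
    h = np.zeros((T, n, n))
    prev = zeta0
    for t in range(T):
        h[t] = beta[t][:, None] * phi[t] * prev[None, :] / L
        prev = alpha[t]
    return alpha, beta, h, L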

3.2.2 The Maximization Step

Each iteration of the EM algorithm requires maximizing Q(Θ; Θ^(k)). As explained below, if the subnetworks are linear this can be done analytically. In general, however, if the subnetworks have hidden sigmoidal units, or use a softmax function to constrain their outputs to sum to one, the maximum of Q cannot be found analytically. In these cases we can resort to a GEM algorithm, which simply produces an increase in Q, for example by gradient ascent. Although Theorem 1 guarantees the convergence of GEM algorithms to a local maximum of the likelihood, their convergence may be significantly slower compared to EM. In several experiments we noticed that convergence can be accelerated with stochastic gradient ascent, i.e., updating the parameters after each sequence, rather than accumulating the gradients of the auxiliary function over the whole training set.

3.2.3 EM for Linear Subnetworks and Symbolic Data

We now describe a procedure that applies a true EM algorithm when the inputs are quantized and the subnetworks behave like lookup tables addressed by the input symbols. For simplicity,

we restrict the analysis to classification tasks and we suppose that targets are specified as desired final states for each sequence. Thus, no output subnetworks need to be used in this particular application of the algorithm. The likelihood function then has the form

    L(Θ; D) = Π_{p=1}^{P} P(q*_{T_p}(p) | u_1^{T_p}(p); Θ)        (30)

where q*_{T_p}(p) is the final target state for the p-th sequence. Symbolic inputs can be encoded using index vectors. In particular, if A = {1, …, m} is the input alphabet, the symbol k is encoded by the vector u having 1 in the k-th position and 0 elsewhere. Let us suppose that each state network has a single linear layer. Then it is immediate to verify that the weights of subnetwork N_j have the meaning of probabilities of transition from state j, conditional on the input symbol, i.e.

    w_{ijk} = P(q_t = i | q_{t-1} = j, σ_t = k)        (31)

where σ_t is the symbol encoded by the input vector u_t. In order to preserve consistency with the probabilistic interpretation of the model, such weights must be nonnegative and

    Σ_{i ∈ S_j} w_{ijk} = 1    ∀ j = 1, …, n,  ∀ k = 1, …, m.        (32)

This constraint can easily be incorporated in the maximization of the likelihood by introducing the new function

    J(Θ; Θ̂, λ) = Q(Θ; Θ̂) + Σ_{j=1}^{n} Σ_{k=1}^{m} λ_{jk} ( 1 − Σ_{i ∈ S_j} w_{ijk} )        (33)

where the λ_{jk} are Lagrange multipliers. Taking the derivatives of this function with respect to the weights we find

    ∂J/∂w_{ijk} = Σ_{p=1}^{P} Σ_{t: σ_t = k} ĥ_{ij,t} (1 / w_{ijk}) − λ_{jk}        (34)

where the second sum extends over all the time steps at which the input symbol σ_t takes on the value k. The above expression is zero if

    w_{ijk} = ( Σ_{p=1}^{P} Σ_{t: σ_t = k} ĥ_{ij,t} ) / λ_{jk}        (35)

and then, imposing the constraint Σ_{i ∈ S_j} w_{ijk} = 1, we obtain the reestimation formulae

    w_{ijk} = ( Σ_{p=1}^{P} Σ_{t: σ_t = k} ĥ_{ij,t} ) / ( Σ_{i' ∈ S_j} Σ_{p=1}^{P} Σ_{t: σ_t = k} ĥ_{i'j,t} ).        (36)
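For this quantized-input case, the M step therefore reduces to normalized counts of the expected transitions, accumulated separately for each input symbol. The sketch below follows equation (36); it assumes that the expected transition indicators ĥ have already been produced by some E-step routine (an assumption about the surrounding code, not a routine defined in the paper).

import numpy as np

def reestimate_lookup_tables(h_list, sym_list, n, m, eps=1e-12):
    """EM reestimation of w[i, j, k] = P(q_t = i | q_{t-1} = j, symbol k), eq. (36).

    h_list   : list of arrays h[t, i, j] (one per training sequence), the expected
               transition indicators produced by the E step
    sym_list : list of integer arrays sigma[t] giving the input symbol at each step
    n, m     : number of states and number of input symbols
    """
    num = np.zeros((n, n, m))
    for h, sigma in zip(h_list, sym_list):
        for t, k in enumerate(sigma):
            num[:, :, k] += h[t]                 # accumulate expected counts per symbol
    den = num.sum(axis=0, keepdims=True)         # sum over the successor state i
    return num / (den + eps)                     # columns sum to 1 for every (j, k)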

3.2.4 GEM for Nonlinear Subnetworks

We now consider the general case of nonlinear (for example, multilayered) subnetworks. Since direct optimization of Q is not possible, a GEM algorithm must be used. A very general way of producing an increase in Q is to use gradient ascent. The derivatives of Q with respect to the parameters can be easily computed as follows. Let θ_{jk} be a generic weight in the state subnetwork N_j. From equation (19) we have

    ∂Q(Θ; Θ̂)/∂θ_{jk} = Σ_{p=1}^{P} Σ_{t=1}^{T} Σ_{i ∈ S_j} ĥ_{ij,t} (1 / φ_{ij,t}) ∂φ_{ij,t}/∂θ_{jk}        (37)

where the partial derivatives ∂φ_{ij,t}/∂θ_{jk} can be computed using back-propagation. Similarly, denoting with ϑ_{ik} a generic weight of the output subnetwork O_i, we have:

    ∂Q(Θ; Θ̂)/∂ϑ_{ik} = Σ_{p=1}^{P} Σ_{t=1}^{T} Σ_{ℓ=1}^{r} ζ̂_{it} ( ∂ log f_Y(y_t; η_{it}) / ∂η_{iℓ,t} ) ∂η_{iℓ,t}/∂ϑ_{ik}        (38)

where the ∂η_{iℓ,t}/∂ϑ_{ik} are also computed using back-propagation. Intuitively, the parameters are updated as if the estimation step of EM had provided soft targets for the outputs of the 2n subnetworks, for each time t.
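For the common case in which the state subnetworks end in a softmax layer, the factor ĥ_{ij,t} (1/φ_{ij,t}) ∂φ_{ij,t} of equation (37) collapses into the familiar "soft target" form when the gradient is taken with respect to the pre-softmax activations. The snippet below shows only that step, as a derivation aid; the remaining factor (the gradient of the activations with respect to the weights) would be obtained by ordinary back-propagation, and nothing here is claimed to be the paper's implementation.

import numpy as np

def dQ_dactivations(h_t, phi_t):
    """Gradient of sum_i h[i, j] * log(phi[i, j]) with respect to the pre-softmax
    activations a[i, j], for every state subnetwork j at one time step.

    h_t, phi_t : (n, n) arrays indexed [i, j] (successor i, predecessor j).
    Returns entries h[i, j] - phi[i, j] * sum_i' h[i', j], i.e. soft target minus
    prediction scaled by the total credit assigned to predecessor j.
    """
    return h_t - phi_t * h_t.sum(axis=0, keepdims=True)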

4 Comparisons

4.1 Standard Hidden Markov Models

It appears natural to find similarities between the recurrent architecture described so far and standard HMMs (Baum et al., 1970; Levinson et al., 1983). A standard HMM is a parametric model of a stochastic process generated by an underlying finite state Markov chain. An output distribution is associated to each state or to each state transition. In standard HMMs there are no inputs. The dynamic behavior (i.e., state transitions) is controlled by a stochastic matrix A ∈ R^{n×n} that is constant over time, with a_ij = P(q_t = i | q_{t-1} = j). At each time step, the model emits an observation y_t with a probability P(y_t | q_t = i) (state emitting models) or P(y_t | q_t = i, q_{t-1} = j) (transition emitting models). The most typical applications of standard HMMs are in automatic speech recognition problems (Levinson et al., 1983; Rabiner, 1989). In these cases each lexical unit is associated to one model M_i. During recognition each model computes the probability P(y_1^T | M_i) of having generated the observed acoustic sequence y_1^T. During training the parameters are adjusted to maximize the probability that the correct model M_i generates the acoustic observations associated to instances of the i-th lexical unit. Training data are thus used in an unsupervised way on a model that is constrained by knowledge of the correct classification. The likelihood is maximized using the Baum-Welch algorithm, which is actually an EM algorithm. Dynamic programming techniques may be used to decode the most likely sequence of states. This most likely state sequence can also be used during training (Viterbi algorithm) to approximate the estimation step of EM.

The architecture proposed in this paper differs from standard HMMs in two respects: computing style and learning. In our case sequences are processed exactly as in recurrent networks, i.e., an input sequence is synchronously transformed into an output sequence. This computing style is real-time and predictions of the outputs are available while the input sequence is being processed. This architecture thus allows one to implement all three fundamental sequence processing tasks: production, prediction, and classification. Furthermore, transition probabilities in standard HMMs are stationary, i.e., the states form a homogeneous Markov chain. In our case transition probabilities are conditional on the input and thus depend on time, resulting in an inhomogeneous Markov chain.

The other fundamental difference is in the learning procedure. While interesting for their capabilities of modeling sequential phenomena, a major weakness of standard HMMs is their poor discrimination power, due to unsupervised learning. An approach that has been found useful to improve discrimination in HMMs is based on maximum mutual information (MMI) training (Brown, 1987). When using MMI, the parameters for a given model are adjusted taking into account the predictions yielded by all the models and not only the prediction yielded by the model for the correct class (as with the MLE criterion). It has been pointed out that supervised learning and discriminant learning criteria like MMI are actually strictly related (Bridle, 1989b). Although the parameter adjusting procedure we have defined is based on MLE, y_1^T is used as a desired output in response to the input u_1^T, resulting in discriminant supervised learning.

Finally, it is worth mentioning that a number of hybrid approaches have been proposed to integrate connectionist approaches into the HMM framework. For example, in (Bengio, De Mori, Flammia, & Kompe, 1992) the observations used by the HMM are generated by a recurrent neural network. In (Bourlard & Wellekens, 1990; Bourlard & Morgan, 1993) a feedforward network is used to estimate state probabilities, conditional on the acoustic sequence. A common feature of these algorithms and the one proposed in this paper is that neural networks are used to extract temporally local information whereas a Markovian system integrates long-term constraints.

4.2 Second Order Recurrent Networks

There are some analogies between our architecture and second order recurrent networks that encode discrete states (Giles & Omlin, 1992). A second order network with n state units and m inputs evolves according to the equation

    x_{it} = f( Σ_{j=1}^{n} Σ_{k=1}^{m} w_{ijk} x_{j,t-1} u_{k,t} ).        (39)

Suppose that this network encodes, using the continuous vector x_t, a set of n discrete states, using the coding function x_{it} = 1 if q_t = i and x_{it} = 0 if q_t ≠ i. For example, such an encoding criterion is used in (Giles & Omlin, 1992). If the weights and the inputs are such that the state x_{t-1} is actually a valid code for the discrete state at time t-1, then we have

    x_t = f(W_j u_t)        (40)

where W_j is the n by m matrix {w_{ijk}; i = 1…n, k = 1…m} and j = q_{t-1} is the state encoded in x_{t-1}. Under these encoding assumptions, the previous state selects the weight matrix W_j to be used to predict the next state, given the input. Thus the second order network behaves like a modular architecture, similar to the one we have described, in which distinct subnetworks are activated at time t depending on the discrete state at time t-1. Moreover, assuming that our architecture uses single-layered state subnetworks, the analogy also affects the number of parameters and their meaning: the larger Σ_k w_{ijk} u_{kt} is, the stronger is the prediction of being in state i at time t given state j at time t-1 and the input u_t. This analysis, however, is only valid in the limit condition of saturated state units. If this condition is not met then the two models tend to exhibit a different behavior: the state representation in the second order net tends to become more distributed, while the IOHMM recurrent architecture preserves modularity.
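The following short sketch illustrates the point made above: when x_{t-1} is a one-hot code for the discrete state j, the second order update of equation (39) reduces to applying the single weight matrix W_j selected by that state, as in equation (40). The sigmoid nonlinearity and the random weights are placeholders chosen for the example, not values from the paper.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def second_order_step(W, x_prev, u):
    """Equation (39): x_{i,t} = f( sum_{j,k} w_{ijk} x_{j,t-1} u_{k,t} ).
    W has shape (n, n, m), x_prev shape (n,), u shape (m,)."""
    return sigmoid(np.einsum('ijk,j,k->i', W, x_prev, u))

rng = np.random.default_rng(1)
n, m = 4, 3
W = rng.normal(size=(n, n, m))
u = rng.normal(size=m)
x_prev = np.eye(n)[2]                       # one-hot code for discrete state j = 2
full = second_order_step(W, x_prev, u)
selected = sigmoid(W[:, 2, :] @ u)          # equation (40): x_t = f(W_j u_t)
print(np.allclose(full, selected))          # True: the previous state selects W_j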

4.3 Adaptive Mixtures of Experts

Adaptive mixtures of experts (ME) (Jacobs et al., 1991) and hierarchical mixtures of experts (HME) (Jordan & Jacobs, 1992) have been introduced as a divide and conquer approach to supervised learning in static connectionist models. A (hierarchical) mixture of experts is composed of a modular set of subnetworks (experts) that compete to gain responsibility for modeling outputs in a given region of the input space. The system output y is obtained as a linear combination of the experts' outputs y_j, y = Σ_j g_j y_j, where the weights g_j are computed as a parametric function of the inputs by a separate subnetwork (gating network) that assigns responsibility to different experts for different regions of the input space.

[Figure 3: The mixture of controllers (MC) architecture (Cacciatore & Nowlan, 1994).]

Recently, Cacciatore & Nowlan (1994) have proposed a recurrent extension to the ME architecture, called the mixture of controllers (MC), in which the gating network has feedback connections, thus allowing it to take temporal context into account. The MC architecture is shown in Figure 3. Our IOHMM architecture (i.e., Figure 1) can clearly be interpreted as a special case of the MC architecture, in which the gating network has a modular structure and

second order connections. In practice, Cacciatore & Nowlan used a one-layer first-order gating net, resulting in a model with weaker capacity. However, the more significant difference lies in the organization of processing. In the MC architecture modularity is exploited at the output prediction level, whereas in our architecture modularity is exploited at the state prediction level as well. Another potentially important difference lies in the presence of a saturating nonlinearity in the recurrence loop of the MC architecture. On the other hand, the recurrence loop of IOHMMs is purely linear. The nonlinearity in the loop makes the learning of long-term context very difficult (Bengio et al., 1994) (see the next section for a discussion of learning long-term dependencies in IOHMMs). Learning in the MC architecture uses approximated gradient ascent to optimize the likelihood, in contrast to the EM supervised learning proposed by Jordan & Jacobs (1994) for the HME. The approximation of the gradient is based on one-step truncated back-propagation through time (somewhat similar to Elman's approach (1990)) and allows on-line updating for continually running sequences, which is useful for a control problem. As shown earlier in this paper, the interpretation of the state sequences of our architecture as missing data allows the likelihood to be optimized using a GEM algorithm.

[Figure 4: One more conceivable recurrent version of the ME architecture by Jacobs, Jordan, Nowlan & Hinton (1991).]

It is possible to conceive yet another recurrent version of the mixture of experts, as shown in Figure 4. In this case the state prediction at time t is computed by a gating network that processes the current input u_t and a state representation ξ_t. Two sets of expert subnetworks are allocated to predict the next state and the output. Such experts might in turn have a modular structure, yielding a hierarchical recurrent mixture. In the architecture we have proposed the gating network is replaced by an identity function, i.e., ξ_t = ζ_t. A potential advantage of the architecture shown in Figure 4 is that it may exploit dimensionality reduction in the state representation, so that n discrete states may be encoded in a vector ξ_t ∈ R^s with s < n. The gating network would become a kind of state decoder. This organization is more parsimonious in terms of number of parameters, without reducing the number of underlying

discrete states.

5 Learning Long-term Dependencies with Markovian Models: Theoretical Analysis

In the following two sections we address more closely the issue of temporal credit assignment in Markovian models such as IOHMMs and HMMs. Generally speaking, the data presented to a supervised learning problem exhibit long-term dependencies if the desired output at a given time t is significantly affected by inputs at a much earlier time τ ≪ t. Accurate modeling of these data is typically difficult. For example, in the case of IOHMMs, the learning procedure should be able to translate supervision information into a sequence of posterior distributions for the discrete states, which should reflect as closely as possible the next-state mapping of the dynamical system we are modeling.

In previous work (Bengio et al., 1994) we found theoretical reasons for the difficulty in training parametric non-linear dynamical systems to capture long-term dependencies. For such systems, the dynamical evolution is controlled by an iterated map a_t = M(a_{t-1}, u_t), with a_t a continuous state vector. Systems described by such an equation include recurrent neural network architectures. The main result stated that either long-term storing or gradient propagation would be harmed, depending on whether ‖M′‖, the norm of the Jacobian of the state transition function, is less than or greater than one. If ‖M′‖ < 1 then the system is endowed with robust dynamical attractors that can be used to reliably store pieces of information for arbitrary durations. However, gradients will vanish exponentially as they are propagated backward in time. If ‖M′‖ > 1 then gradients do not vanish, but information about past inputs is gradually lost and can easily be deleted by noisy events.

In the present paper we introduced a connectionist model whose dynamical behavior is controlled by the equation ζ_t = Φ(u_t) ζ_{t-1}, where ζ_t can be interpreted as a probability distribution defined over a set of discrete states. Systems described by such an equation include HMMs, although the matrix Φ is constant in standard HMMs. In this case, the norm of the Jacobian of the state transition function is constrained to be exactly one because Φ(u_t) is a stochastic matrix. The analysis that follows indicates that, as in recurrent networks, learning in non-deterministic Markovian models generally becomes increasingly difficult as the span of the temporal dependencies increases. However, a very important qualitative difference is that long-term storing and temporal credit assignment are not necessarily incompatible in Markovian models: they either both occur or are both impractical. The sad news is that they both occur only in the very special case of an essentially deterministic model. If most of the eigenvalues of Φ(u_t) are significantly less than one, then the probability distribution ζ_t tends to become independent of the initial condition and information cannot be retained in the long term. At the same time, credit assignment is harmed by a problem of diffusion during the backward phase. On the contrary, if most of the eigenvalues of Φ(u_t) are close to one (i.e., the model tends toward a deterministic behavior), then both storing and credit assignment are more effective.

In the remainder of this section we propose a theoretical framework for studying long-term

learning in Markovian models. The next section is devoted to experimental results. Before introducing the formal analysis, let us provide some intuition about the practical consequences of long-term dependencies when training Markovian models. To simplify the description, let us consider a sequence classification problem with a discrete target variable y accessible at the last step of each training sequence. If the sequences are long and contain relevant class information at their beginning, then the task clearly exhibits long-term dependencies. Key variables in the temporal credit assignment problem are the posterior probabilities of each state j at time t, given the observed target y. These variables are computed during the estimation step of EM. They actually correspond to the gradient of the likelihood with respect to the state probabilities ζ_{jt}, i.e., they indicate how the probabilities associated to each state at each time step should increase in order to increase the total likelihood. In some early experiments we noticed that these posteriors tend to become independent of j for t very far away from the supervision. Clearly, this reflects a situation of maximum uncertainty. If they are all equal for a given t, then all the states at this time step are equally responsible for the final likelihood and no change of parameters will take place to change the state probabilities. This represents a serious difficulty for propagating effective temporal credit information backwards in time, and makes learning in the presence of long-term dependencies very difficult.

5.1 Mathematical Preliminaries

5.1.1 Definitions

A matrix A is said to be non-negative, written A ≥ 0, if A_ij ≥ 0 ∀i, j. Positive matrices are defined similarly. A non-negative square matrix A ∈ R^{n×n} is called row stochastic (or simply stochastic in the following) if Σ_{j=1}^{n} A_ij = 1 ∀i = 1…n. A non-negative matrix is said to be row [column] allowable if every row [column] sum is positive. An allowable matrix is both row and column allowable. Stochastic matrices intervene in the forward and backward dynamics of standard HMMs and of the IOHMM architecture we have introduced in this paper. The generic element A_ij corresponds to the probability that q_t = j given q_{t-1} = i (also conditional on the external input, for IOHMMs). A non-negative matrix can be associated to the directed transition graph G that constrains the Markov chain. Let us define the incidence matrix corresponding to a given non-negative matrix A as the 0-1 matrix Ã that replaces all the positive entries of A by ones. The incidence matrix of A thus corresponds to the directed graph G = {Q, E}: Ã_{ji} = 1 iff e_{ij} ∈ E. Some interesting properties of non-negative matrices can be equivalently expressed algebraically or in terms of the topology of G.

Definition 3 (Irreducible matrix). A non-negative n × n matrix A is said to be irreducible if for every pair i, j of indices there exists a positive integer m = m(i, j) s.t. (A^m)_{ij} > 0.

A matrix A is irreducible if and only if the associated graph is strongly connected (i.e., there exists a path between any pair of states i, j). If ∃k s.t. (A^k)_{ii} > 0, then d(i) is called the period of index i if it is the greatest common divisor (g.c.d.) of those k for which (A^k)_{ii} > 0. In an irreducible matrix all the indices have the same period d, which is called the period of the matrix. The period of a matrix is the g.c.d. of the lengths of all cycles in the associated transition graph.

Definition 4 (Primitive matrix). A non-negative matrix A is said to be primitive (or aperiodic) if there exists a positive integer k s.t. A^k > 0. An irreducible matrix is either aperiodic (primitive, i.e., of period 1) or periodic (imprimitive). A primitive stochastic matrix is necessarily allowable.

5.1.2 The Perron-Frobenius Theorem

Theorem 2 (See (Seneta, 1981), Theorem 1.1). Suppose A is an n × n non-negative primitive matrix. Then there exists an eigenvalue r such that:
1. r is real and positive;
2. with r can be associated strictly positive left and right eigenvectors;
3. r > |λ| for any eigenvalue λ ≠ r;
4. the eigenvectors associated with r are unique up to constant multiples;
5. if 0 ≤ B ≤ A and β is an eigenvalue of B, then |β| ≤ r; moreover, |β| = r implies B = A;
6. r is a simple root of the characteristic equation of A.

A simple consequence of the theorem for stochastic matrices is the following:

Corollary 1. Suppose A is a primitive stochastic matrix. Then its largest eigenvalue is 1 and there is only one corresponding right eigenvector, which is 1 = [1, 1, …, 1]′. Furthermore, all other eigenvalues have modulus < 1.
Proof. A1 = 1 by definition of stochastic matrices. This eigenvector is unique and all other eigenvalues have modulus < 1 by the Perron-Frobenius Theorem.

If A is stochastic but imprimitive with period d, then A has d eigenvalues of modulus 1, which are the d complex d-th roots of 1.

5.2 Training Stationary HMMs

Theorem 3 (See (Seneta, 1981), Theorem 4.2). If A is a primitive stochastic matrix, then as t → ∞, A^t → 1v′, where v′ is the unique stationary distribution of the Markov chain. The rate of approach is geometric.

Thus if A is primitive, A^t converges as t → ∞ to a matrix whose eigenvalues are all 0 except for λ = 1 (with eigenvector 1), i.e., the rank of this product converges to 1, i.e., its rows are equal. A consequence of Theorem 3 is that it is very difficult to train ordinary hidden Markov models, with a primitive transition matrix, to model long-term dependencies in observed sequences. The reason is that the distribution over the states at time t > t_0 becomes gradually independent of the distribution over the states at time t_0 as t increases. It means that states at time t_0 become equally responsible for increasing the likelihood of an observation at time t. This corresponds, in the backward phase of the EM algorithm for training HMMs, to a diffusion of credit over all the states. In practice we train HMMs with finite sequences. However, training will become more and more numerically ill-conditioned as one considers longer term dependencies. Consider two events e_0 (occurring at t_0) and e_t (occurring at t), and suppose there are also "interesting" events occurring in between. Let us consider the overall influence of states at times τ < t upon the likelihood of the observations at time t. Because of the phenomenon of diffusion of credit, and because gradients are added together, the influence of intervening events (especially those occurring shortly before t) will be much stronger than the influence of e_0. Furthermore, this problem gets geometrically worse as t increases. Clearly a positive matrix is primitive. Thus, in order to learn long-term dependencies, we would like to have zeros in the matrix of transition probabilities. Unfortunately, this generally supposes prior knowledge of an appropriate connectivity graph.

5.3 Coefficients of Ergodicity

To study products of non-negative matrices and the loss of information about the initial state in Markov chains (particularly in the non-stationary case), we introduce the projective distance between vectors x and y:

    d(x′, y′) = max_{i,j} ln( (x_i y_j) / (x_j y_i) ).

Clearly, some contraction takes place when d(x′A, y′A) ≤ d(x′, y′).

Definition 5. Birkhoff's contraction coefficient τ_B(A), for a non-negative column-allowable matrix A, is defined in terms of the projective distance:

    τ_B(A) = sup_{x,y > 0, x ≠ y} d(x′A, y′A) / d(x′, y′).

Dobrushin's coefficient τ_1(A), for a stochastic matrix A, is defined as follows:

    τ_1(A) = (1/2) sup_{i,j} Σ_k |a_{ik} − a_{jk}|.

Both are proper ergodicity coefficients, satisfying 0 ≤ τ(A) ≤ 1, and they are zero if and only if A has identical rows. Furthermore, it can be shown (Seneta, 1981) that τ(A_1 A_2) ≤ τ(A_1) τ(A_2).

5.4 Products of Stochastic Matrices

Let A^{(1,t)} = A_1 A_2 ⋯ A_{t-1} A_t denote a forward product of stochastic matrices A_1, A_2, …, A_t. From the properties of τ_B and τ_1, if τ(A_t) < 1 ∀t > 0 then lim_{t→∞} τ(A^{(1,t)}) = 0, i.e., A^{(1,t)} tends to a matrix of rank 1 with identical rows. Weak ergodicity is then defined in terms of a proper ergodicity coefficient τ such as τ_B and τ_1:

Definition 6 (Weak ergodicity). The products of stochastic matrices A^{(p,r)} are weakly ergodic if and only if, for all t_0 ≥ 0, τ(A^{(t_0,t)}) → 0 as t → ∞.

Theorem 4 (See (Seneta, 1981), Lemmas 3.3 and 3.4). Let A^{(1,t)} be a forward product of non-negative and allowable matrices. Then the products A^{(1,t)} are weakly ergodic if and only if the following conditions both hold:
1. ∃ t_0 s.t. A^{(t_0,t)} > 0 ∀ t ≥ t_0;
2. A^{(t_0,t)}_{ik} / A^{(t_0,t)}_{jk} → W_{ij}(t) > 0 as t → ∞, i.e., the rows of A^{(t_0,t)} tend to proportionality.

In the case of stochastic matrices, row-proportionality is equivalent to row-equality since the rows sum to 1. Note that lim_{t→∞} A^{(t_0,t)} does not need to exist in order to have weak ergodicity. If such a limit exists and it is a matrix with all rows equal, then the product is said to be strongly ergodic.

5.5 Canonical Decomposition

Any non-negative matrix A can be rewritten, by relabeling its indices, in the following canonical decomposition (Seneta, 1981), with diagonal blocks B_i, C_i and Q:

    A = | B_1                                    0 |
        |      ⋱                                   |
        |           B_s                            |
        |                C_{s+1}                   |        (41)
        |                         ⋱                |
        |                              C_r       0 |
        | L_1   ⋯   L_s  L_{s+1}   ⋯   L_r       Q |

where the B_i and C_i are irreducible, the B_i are primitive and the C_i are periodic. Let us define the corresponding sets of states as S_{B_i}, S_{C_i}, S_Q. Note that Q might be reducible, but the groups of states in S_Q leak into the B or C blocks, i.e., S_Q represents the transient part of the state space.

This decomposition is illustrated with an example in Figure 5(a). In the case of stationary Markov models, and in the case of non-stationary Markov models in which all the A_t share the same incidence matrix Ã_t, because P(q_t ∈ S_Q | q_{t-1} ∈ S_Q) < 1, lim_{t→∞} P(q_t ∈ S_Q | q_0 ∈ S_Q) = 0. Other observations can be made for these two general types of matrices (i.e., fixed topology). Because the B_i are primitive, we can apply Theorem 2, and starting from a state in S_{B_i}, all information about an initial state at t_0 is gradually lost.

[Figure 5: (a) Transition graph corresponding to the canonical decomposition, showing the primitive sets (S_{B_1}, S_{B_2}), the periodic sets (S_{C_3}, S_{C_4}), and the transient set (S_Q). (b) The periodic graph G_1 becomes aperiodic (G_2) when adding a loop with states 4, 5.]

5.6 Periodic Graphs

A more difficult case is the one of (A^{(t_0,t)})_{jk} with initial state j ∈ S_{C_i} (a periodic block). Let d_i be the period of the i-th periodic block C_i. It can be shown (Seneta, 1981) that taking d products of periodic matrices with the same incidence matrix and period d yields a block-diagonal matrix whose d blocks are primitive (aperiodic). Thus C^{(t_0,t)} retains information about the initial block in which q_{t_0} was. However, for every such block of size > 1, information will be gradually lost about the exact identity of the state within that block. This is best demonstrated through a simple example. Consider the incidence matrix represented by the graph G_1 of Figure 5(b). It has period 3 and the only non-deterministic transition is from state 1, which can lead into either one of two loops. When many stochastic matrices with this graph are multiplied together, information about the loop in which the initial state was is gradually lost (i.e., if the initial state was 2 or 3, this information is gradually lost). What is retained is the phase information, i.e., in which block ({0}, {1}, or {2,3}) of a cyclic chain the initial state was. This suggests that it will be easy to learn about the type of observations associated to each block of a cyclic chain, but it will be hard to learn anything else. It is also interesting to remark the following concerning the previous graph. Suppose the sequences to be modeled are slightly more complicated, requiring an extra loop of period 4 instead of 3, e.g., with the graph G_2 of Figure 5(b). In that case the matrix turns out to be primitive, i.e., all information about the initial state will be gradually lost.

5.7 Learning Long-Term Dependencies: a Discrete Problem?

We might wonder if, starting from a positive stochastic matrix, the learning algorithm could learn the topology, i.e., replace some transition probabilities by zeroes. Let us consider the update rule for transition probabilities in the EM algorithm:

    A_{ij} ← A_{ij} (∂L/∂A_{ij}) / Σ_j A_{ij} (∂L/∂A_{ij}).        (42)

Starting from A_{ij} > 0, we can obtain a new A_{ij} = 0 only if ∂L/∂A_{ij} = 0. But this is not possible, thus the EM training algorithm will never exactly obtain zero probabilities. Transition probabilities might however approach 0. Again, this suggests that prior knowledge, rather than learning, is responsible for the important elements of the topology, and for establishing the long-term relations between elements of the observed sequences. It is also interesting to ask under which conditions we are guaranteed that there will not be any diffusion (of influence in the forward phase, and of credit in the backward phase of training). It requires that some of the eigenvalues other than λ_1 = 1 have a norm that is also 1. This can be achieved with periodic matrices C (of period d), which have d eigenvalues that are the d roots of 1 on the complex unit circle. To avoid any loss of information also requires that C^d = I be the identity, since any diagonal block of C^d with size more than 1 will yield a loss of information (because of diffusion in primitive matrices). This can be generalized to reducible matrices whose canonical form is composed of periodic blocks C_i with C_i^d = I. The condition we are describing actually corresponds to a matrix with only 1's and 0's. If Ã_t is fixed, it would mean that the Markov chain is also stationary. It appears that many interesting computations cannot be achieved with such constraints (i.e., only allowing one or more cycles of the same period and a purely deterministic and stationary Markov chain). Furthermore, if the parameters of the system are the transition probabilities themselves (as in ordinary HMMs), such solutions correspond to a subset of the corners of the 0-1 hypercube in parameter space. Away from those solutions, learning is mostly influenced by short-term dependencies, because of diffusion of credit. Furthermore, as seen in equation (42), algorithms like EM will tend to stay near a corner once it is approached. This suggests that discrete optimization algorithms, rather than continuous local algorithms, may be more appropriate to explore the (legal) corners of this hypercube.

6 Learning Long-term Dependencies with Markovian Models: Experimental Results

6.1 Diffusion: Numerical Simulations

Firstly, we wanted to measure how (and if) different kinds of products of stochastic matrices converged, for example to a matrix of equal rows. We ran four simulations, each with an 8-state non-stationary Markov chain but with different constraints on the transition graph: 1) G fully connected; 2) G is a left-to-right model (i.e., the transition matrix is upper triangular); 3) G is left-to-right but only one-state skips are allowed (i.e., the transition matrix is upper bidiagonal); 4) the A_t are purely periodic with period 4. The elements of the matrices A_t were generated randomly and the results averaged over 10 trials. Results shown in Figure 6(a) confirm the convergence towards zero of the ergodicity coefficient (except for the experiments with periodic matrices, as expected), at a rate that depends on the graph topology. In Figure 6(b), we represent visually the convergence of fully connected matrices, in only 4 time steps, towards equal columns.
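A measurement of this kind can be sketched as follows (illustrative code, not the original experimental setup; we assume a row-stochastic convention, so that a left-to-right model has an upper-triangular incidence matrix as described above, and we build the period-4 structure from four blocks of two states).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8   # number of states, as in the simulations above

def dobrushin(P):
    """Dobrushin's ergodicity coefficient of a row-stochastic matrix:
    tau(P) = 0.5 * max_{i,k} sum_j |P[i, j] - P[k, j]| (0 iff all rows are equal)."""
    diffs = np.abs(P[:, None, :] - P[None, :, :]).sum(axis=2)
    return 0.5 * diffs.max()

def random_stochastic(incidence):
    P = incidence * rng.uniform(size=incidence.shape)
    return P / P.sum(axis=1, keepdims=True)

# The four transition-graph constraints (entry (i, j) nonzero if i -> j is allowed).
full = np.ones((n, n))
ltr = np.triu(np.ones((n, n)))                                     # left-to-right, upper triangular
ltr_skip = np.triu(np.ones((n, n))) - np.triu(np.ones((n, n)), 2)  # upper bidiagonal
periodic = np.zeros((n, n))
blocks = [[0, 1], [2, 3], [4, 5], [6, 7]]                          # period 4, two states per block
for k, b in enumerate(blocks):
    periodic[np.ix_(b, blocks[(k + 1) % 4])] = 1.0

for name, inc in [("fully connected", full), ("left-to-right", ltr),
                  ("left-to-right, bidiagonal", ltr_skip), ("periodic", periodic)]:
    prod = np.eye(n)
    for _ in range(30):
        prod = prod @ random_stochastic(inc)
    print(f"{name:26s} tau after 30 steps: {dobrushin(prod):.2e}")
```

With these constraints the periodic product keeps tau at 1 while the other topologies decay at rates depending on the graph, mirroring the behaviour reported in Figure 6(a).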


Figure 6: (a) Convergence of Dobrushin's coefficient (log tau versus T) for the four transition-graph topologies. (b) Evolution of the products A^{(1,t)}, t = 1, ..., 4, for the fully connected graph; matrix elements are visualized with gray levels.

6.2 Training Experiments

To evaluate how diffusion impairs training, a set of controlled experiments was performed, in which the training sequences were generated by a simple stationary HMM with long-term dependencies, depicted in Figure 7(a). Two branches generate similar sequences, except for the first and last symbol. The extent of the long-term context is controlled by the self-transition probabilities of states 2 and 5, λ, defined as P(q_t = 2 | q_{t-1} = 2) = P(q_t = 5 | q_{t-1} = 5). The span, or "half-life", is log(0.5)/log(λ), i.e., λ^span = 0.5. Following (Bengio et al., 1994), data was generated with varying span of long-term dependencies. For each series of experiments, 20 different training trials were run per span value, with 100 training sequences (this appeared sufficient, since the likelihood of the generating HMM did not improve much when trained on this data). Training was stopped either after a maximum number of epochs (200), or after the likelihood did not improve significantly, i.e., (L(t) - L(t-1))/|L(t)| < 10^-5, where L(t) is the logarithm of the likelihood of the training set at epoch t. If the HMM is fully connected (except for the final absorbing state) and has just the right number of states, trials almost never converge to a good solution (1 in 160 did). Increasing the number of states and randomly putting zeroes helps. The randomly connected HMMs had 3 times more states than the generating HMM, and random connections were created with 20% probability. Figure 7(b) shows the average number of converged trials for these different types of HMM topology.
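For concreteness, the span/half-life relation and the stopping rule just described can be written as small helpers (a sketch; the function names are ours).

```python
import math

def self_transition_for_span(span):
    """lambda such that lambda**span = 0.5, i.e. span = log(0.5)/log(lambda)."""
    return 0.5 ** (1.0 / span)

def span_for_self_transition(lam):
    return math.log(0.5) / math.log(lam)

for span in (1, 10, 100, 1000):
    lam = self_transition_for_span(span)
    print(f"span {span:5d} -> lambda = {lam:.6f} (check: {span_for_self_transition(lam):.1f})")

def should_stop(log_liks, epoch, max_epochs=200, tol=1e-5):
    """Stopping rule used in the experiments: stop after max_epochs, or when the
    relative improvement of the training log-likelihood falls below tol."""
    if epoch >= max_epochs:
        return True
    if epoch > 0:
        L_now, L_prev = log_liks[epoch], log_liks[epoch - 1]
        return (L_now - L_prev) / abs(L_now) < tol
    return False
```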


Figure 7: (a) Generating HMM. Numbers outside the state circles denote output symbols. (b) Percentage of convergence to a good solution (over 20 trials) for various series of experiments as the span of dependencies is increased. The curves correspond to: correct topology given; randomly connected, 24 states; and fully connected with 16, 24, and 40 states.

A trial is considered successful when it yields a likelihood almost as good as, or better than, the likelihood of the generating HMM on the same data. In all cases the number of successful trials rapidly drops to zero beyond some value of the span. An interesting observation is that in many cases the training curve goes through one or more very flat plateaus. Such plateaus could be explained by the diffusion problem: the relative gradient with respect to some parameters is very small (thus the algorithm appears to be at a fixed point). These plateaus can become a very serious problem when their slope approaches numerical precision or their length becomes unacceptable.

6.3 Reducing Credit Diffusion by Penalized Likelihood

The theory outlined in Section 5 suggests that the undesired diffusion of temporal credit depends on the fast convergence of the rank of A^{(1,t)} towards 1 as t increases. The rate of rank loss can be reduced by controlling the moduli of the eigenvalues of the transition matrices A_t. As explained in Section 5.7, the ideal condition for credit assignment is a 0-1 transition matrix, whose eigenvalues lie on the unit circle in the complex plane. Since the determinant of a matrix equals the product of its eigenvalues, a simple way to reduce the diffusion effect is to add a penalty term to the log likelihood, as follows:

    l(\theta; D) + \alpha \sum_{p=1}^{P} \sum_{t=1}^{T} |\det A_t|     (43)

where the constant α weights the influence of the penalty term. In this case, the maximization step of EM will require gradient ascent (i.e., we need to use a GEM algorithm). The contribution of the penalty term to the gradient can be computed using the relationship

    \frac{\partial |\det A_t|}{\partial \varphi_{ij,t}} = |\det A_t| \, (A_t^{-1})_{ji}.     (44)
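A sketch of the penalty of equation (43) and its gradient (44) for a single transition matrix might look as follows (not the authors' implementation; α stands for the unnamed weighting constant, and in the GEM M-step this gradient would be combined, via the chain rule through the state networks, with the likelihood gradient).

```python
import numpy as np

def det_penalty_and_grad(A_t, alpha=0.1):
    """Penalty alpha * |det A_t| added to the log-likelihood (eq. 43) and its
    gradient with respect to the entries of A_t, using eq. (44):
    d|det A_t| / dA_t[i, j] = |det A_t| * (A_t^{-1})[j, i]."""
    det = np.linalg.det(A_t)
    penalty = alpha * abs(det)
    grad = alpha * abs(det) * np.linalg.inv(A_t).T
    return penalty, grad

# Quick finite-difference check of eq. (44) on a random stochastic matrix.
rng = np.random.default_rng(0)
A = rng.uniform(0.1, 1.0, size=(4, 4))
A /= A.sum(axis=0, keepdims=True)

_, g = det_penalty_and_grad(A, alpha=1.0)
eps = 1e-6
i, j = 1, 2
Ap = A.copy()
Ap[i, j] += eps
numeric = (abs(np.linalg.det(Ap)) - abs(np.linalg.det(A))) / eps
print(g[i, j], numeric)    # the two numbers should agree to roughly 1e-5
```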


Figure 8: (a) Transition graph used in the two-sequences problem. (b) The corresponding recurrent architecture.

6.4 Comparisons with Recurrent Networks


We present here results on two other problems for which one can control the span of input/output dependencies: the 2-sequence problem and the parity problem. These two simple benchmarks were used in (Bengio et al., 1994) to compare the long-term learning capabilities of recurrent networks trained by back-propagation and five other alternative algorithms. The 2-sequence problem is the following: classify an input sequence, at the end of the sequence, into one of two types, when only the first N elements (N = 3 in our experiments) of this sequence carry information about the sequence class. Uniform noise is added to the sequence. For the first six methods (see Tables 1-4) we used a fully connected recurrent network with 5 units (with 25 free parameters). For the EM algorithm, we used a 7-state system with a sparse connectivity matrix (an initial state, and two separate left-to-right submodels of three states each to model the two types of sequences, as shown in Figure 8a). No output subnetworks are required in this case, and supervision may be expressed in terms of the desired final state, as done in Section 3.2.3. The resulting architecture is shown in Figure 8b. Other experiments with a full transition matrix yielded much worse results, although improvements were obtained by introducing the penalty term (43). The parity problem consists in producing the parity of an input sequence of 1's and -1's (i.e., a 1 should be produced at the final output if and only if the number of 1's in the input is odd).

Table 1: 2-sequences problem: final classification error with respect to the maximum sequence length.

                          5     10     20     50    100
  back-prop              58     56     43     53     50
  p-n                     2      3     10     25     29
  time-weighted p-n       0      0      9     34     14
  multigrid               2      6      1      3      6
  discrete err. prop.     6     16     29     23     22
  simulated anneal.       6      0      7      4     11
  EM                      0      0      0      0      0

Table 2: 2-sequences problem: # sequence presentations with respect to the maximum sequence length.

                          5       10      20      50      100
  back-prop              2.9e3   3.0e3   2.9e3   3.0e3   2.8e3
  p-n                    5.1e2   1.1e3   1.9e3   2.6e3   2.5e3
  time-weighted p-n      5.4e2   4.3e2   2.4e3   2.9e3   2.7e3
  multigrid              4.1e3   5.8e3   2.5e3   3.9e3   6.4e3
  discrete err. prop.    6.6e2   1.3e3   2.1e3   2.1e3   2.1e3
  simulated anneal.      2.0e5   3.9e4   8.2e4   7.7e4   4.3e4
  EM                     3.2e3   4.0e3   2.9e3   3.2e3   2.9e3

Table 3: Parity problem: final classification error; each method's results are listed in order of increasing maximum sequence length (3, 5, 10, 20, 50, 100, 500); not every method was evaluated at every length.

  back-prop              2, 20, 41, 38, 43
  p-n                    3, 25, 41, 44, 40, 47
  time-weighted p-n      26, 39, 43, 44
  multigrid              15, 44, 45
  discrete err. prop.    0, 0, 0, 5
  simulated anneal.      3, 10, 0
  EM                     0, 6, 0, 14, 0, 12

Table 4: Parity problem: # sequence presentations; each method's results are listed in order of increasing maximum sequence length (3, 5, 10, 20, 50, 100, 500); not every method was evaluated at every length.

  back-prop              3.6e3, 5.5e3, 8.7e3, 1.6e4, 1.1e4
  p-n                    2.5e2, 8.9e3, 8.9e3, 7.7e4, 1.1e4, 1.1e5
  time-weighted p-n      4.5e4, 7.0e4, 3.4e4, 8.1e4
  multigrid              4.2e3, 1.5e4, 3.1e4
  discrete err. prop.    5.0e3, 7.9e3, 1.5e4, 5.4e4
  simulated anneal.      5.1e5, 1.2e6, 8.1e5
  EM                     2.3e3, 1.5e3, 1.3e3, 3.2e3, 2.6e3, 3.4e3

The target is only given at the end of the sequence. For the first six methods we used a minimal size network (1 input, 1 hidden, 1 output, 7 free parameters). For the EM algorithm, we used a 2-state system with a full connectivity matrix. In this task we found that performance could be drastically improved by using stochastic gradient ascent in a way that helps the training algorithm escape local optima: the learning rate is decreased when the likelihood improves, but it is increased when the likelihood remains flat (the system being stuck in a plateau or local optimum). Initial parameters were chosen randomly for each trial. Noise added to the input sequence was also uniformly distributed and chosen independently for each training sequence. We considered two criteria: (1) the average classification error at the end of training, i.e., after a stopping criterion has been met (when either some allowed number of sequence presentations has been performed or the task has been learned); (2) the average number of function evaluations needed to reach the stopping criterion. In the tables, "p-n" stands for pseudo-Newton (Becker & Le Cun, 1989). The time-weighted pseudo-Newton method is a variation in which derivatives with respect to the instantiation of a parameter at a particular time step are weighted by the inverse of the corresponding second derivatives (Bengio et al., 1994). Multigrid is similar to simulated annealing with constant temperature 0. The discrete error propagation algorithm (Bengio et al., 1994) attempts to propagate discrete error information backwards in a recurrent network with discrete units. Each column of the tables corresponds to a value of the maximum sequence length T for a given set of trials. The sequence length for a particular training sequence was picked randomly between T/2 and T. Numbers reported are averages over 20 or more trials. The results in Tables 1-4 clearly show that EM recurrent learning can achieve better performance than that obtained with the other algorithms.
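The learning-rate heuristic used here can be sketched as follows (a simplified illustration, not the authors' code; the adjustment factors and the flatness threshold are hypothetical choices).

```python
def adapt_learning_rate(lr, L_now, L_prev, up=1.5, down=0.8, flat_tol=1e-4):
    """Decrease the rate when the likelihood improves, increase it when the
    likelihood stays flat (suggesting a plateau or local optimum)."""
    if L_now > L_prev + flat_tol * abs(L_prev):
        return lr * down     # improving: be more conservative
    return lr * up           # flat (or worse): try to escape the plateau

# Hypothetical usage inside a stochastic GEM loop:
# for epoch in range(max_epochs):
#     L_now = gem_epoch(model, data, lr)   # one pass of stochastic gradient ascent
#     lr = adapt_learning_rate(lr, L_now, L_prev)
#     L_prev = L_now
```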

7 Regular Grammar Inference

In this section we describe an application of our architecture to the problem of grammatical inference. In this task the learner is presented with a set of labeled strings and is requested to infer a set of rules that define a formal language. It can be considered as a prototype for more complex language processing problems. However, even in the "simplest" case, i.e., regular grammars, the task can be proved to be NP-complete (Angluin & Smith, 1983). Many researchers (Giles, Miller, Chen, Sun, Chen, & Lee, 1992; Pollack, 1991; Watrous & Kuhn, 1992) have approached grammatical inference with recurrent networks. These studies demonstrate that second-order networks can be trained to approximate the behavior of finite state automata (FSA). However, memories learned in this way appear to lack robustness, and noisy dynamics become dominant for long input strings. This has motivated research on extracting automaton rules from the trained network (Giles et al., 1992; Watrous & Kuhn, 1992). In many cases, it has been shown that the extracted automaton outperforms the trained network. Although FSA extraction procedures are relatively easy to devise for symbolic inputs, they may be more difficult to apply in tasks involving a subsymbolic or continuous input space, such as speech recognition. Moreover, the complexity of the discrete state space produced by the FSA extraction procedure may grow intolerably if the continuous network has learned a representation involving chaotic attractors.

Table 5: Definitions of the seven Tomita grammars.

  Grammar   Definition
  1         1*
  2         (10)*
  3         string does not contain 1^{2n+1}0^{2m+1} as a substring
  4         string does not contain 000 as a substring
  5         string contains an even number of 01's and 10's
  6         number of 0's minus number of 1's is a multiple of 3
  7         0*1*0*1*

Other researchers have attempted to encourage a finite-state representation via regularization (Gori, Maggini, & Soda, 1993) or by integrating clustering techniques in the training procedure (Das & Mozer, 1994). We report experimental results on a set of regular grammars introduced by Tomita (1982) and afterwards used by other researchers to measure the accuracy of inference methods based on recurrent networks (Das & Mozer, 1994; Giles et al., 1992; Gori et al., 1993; Pollack, 1991; Watrous & Kuhn, 1992). The grammars use the binary alphabet {0, 1} and are reported in Table 5. For each grammar, Tomita also defined a small set of labeled strings to be used as training data. One of the difficulties of the task is to infer the proper rules (i.e., to attain perfect generalization) using these impoverished data. Since the task is to classify each sequence into two classes (namely, accepted or rejected strings), we used a scalar output and we put supervision at the last time step T. The final output y_T was modeled as a Bernoulli variable, i.e., f_Y(y_T; η_T) = η_T^{y_T} (1 - η_T)^{1 - y_T}, with y_T = 0 if the string is rejected and y_T = 1 if it is accepted. During testing we adopted the criterion of accepting the string if η_T > 0.5. It is worth mentioning that final states cannot be used directly as targets (as done in Section 6.4), since there can be more than one accepting or rejecting state. In (Giles et al., 1992) this problem is circumvented by appending a special "end" symbol to each string; however, in our case this would increase the number of parameters. In this application we did not apply external inputs to the output networks. This corresponds to modeling a Moore finite-state machine. Thus each output network reduces to one unit fed by a bias input: each output network computes a constant function of the last state reached by the model. The system output is a combination of these, weighted by the state distribution at time t. Given the absence of prior knowledge about plausible state paths, we used an ergodic transition graph in which every state transition is allowed (i.e., S_j = Q for all j). Each state network was composed of a single layer of n neurons with a softmax function at their outputs. Input symbols were encoded by two-dimensional index vectors (i.e., u_t = [1, 0]' for the symbol 0 and u_t = [0, 1]' for the symbol 1). The total number of free parameters is thus 2n^2 + n. In the experiments we measured convergence and generalization performance using different numbers of states n in the model.
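A minimal forward pass of this grammar-inference architecture might look as follows (a sketch under stated assumptions, not the authors' implementation: the uniform initial state distribution, the sigmoid output unit, and all names are ours; the parameter count matches the 2n^2 + n figure above).

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def iohmm_accepts(string, W, b):
    """Forward pass of the grammar-inference model sketched above.

    string : sequence of symbols '0'/'1'
    W      : array (n, n, 2); W[j] is the weight matrix of the state network
             attached to state j (softmax over the n next states, 2 inputs)
    b      : array (n,); bias of each (constant, Moore-style) output unit
    Returns the acceptance probability eta_T; the string is accepted if eta_T > 0.5.
    """
    n = b.shape[0]
    zeta = np.full(n, 1.0 / n)                 # state distribution (uniform start assumed)
    encode = {'0': np.array([1.0, 0.0]), '1': np.array([0.0, 1.0])}
    for sym in string:
        u = encode[sym]
        # Column j of phi is the softmax output of state network j:
        # the distribution over next states given current state j and input u.
        phi = np.stack([softmax(W[j] @ u) for j in range(n)], axis=1)
        zeta = phi @ zeta                      # updated state distribution
    eta = float(sigmoid(b) @ zeta)             # per-state constant outputs, mixed by zeta
    return eta

# Tiny usage example with random (untrained) parameters, n = 4 states:
rng = np.random.default_rng(0)
n = 4
W = rng.normal(scale=0.5, size=(n, n, 2))      # 2*n*n state-network weights
b = rng.normal(scale=0.5, size=n)              # n output biases -> 2n^2 + n parameters
print(iohmm_accepts("1010", W, b))
```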


Figure 9: Convergence and generalization attained by varying the number of discrete states n in the model (panels for Grammars #1 through #4). Results are averaged over 20 trials. Two markers show, respectively, the frequency of convergence to 0 classification errors and to k ≤ 2 classification errors on the training set; vertical bars show the corresponding accuracies on the test data. Triangles denote the generalization accuracy for the best and the worst trial. (continued...)

[Figure 9, continued: panels for Grammars #5, #6, and #7, with the legend: Watrous & Kuhn best; accuracy of the best trial; accuracy of the worst trial; convergence with e = 0 errors; convergence with e ≤ 2 errors.]
