Chapter 1

DYNAMICAL RECURRENT NETWORKS

John F. Kolen and Stefan C. Kremer

1.1 INTRODUCTION

How do you handle a stream of input patterns whose interpretation may depend on the patterns that preceded them? Many researchers have been interested in patterns that extend over time and in connectionist mechanisms, also known as neural networks, that store salient information between input presentations. Problems in control, language processing, and time-series prediction begged for connectionist solutions. One solution to the problem of temporal processing is an approach utilizing dynamical recurrent networks (DRNs). DRNs are connectionist networks whose computational units assume activations based on the activation history of the network. The distinguishing characteristic of DRNs is their ability to map sequences of input vectors distributed across time into output vector sequences. In this sense, DRNs can be viewed as vector-sequence transducers. Although feedforward networks can also generate such sequences, DRNs differ in that they can generate different output vectors for the same input pattern, depending on the history-dependent context established by earlier inputs.

Just as there are many ways for connectionist techniques to classify, restore, or cluster patterns, there are many DRN approaches for solving temporal problems. One may process moving windows in time, as Sejnowski and Rosenberg did with NETtalk (Sejnowski and Rosenberg, 1987). The units in the network might simply have self-connections to provide decay of activation (Jordan, 1986a). This process could be extended to larger units of recurrent information passing through multiple layers of processing (Elman, 1990). The search space for DRN solutions is enormous because of the many choices of how temporal information is handled and how that information is processed. This search space contains enough varieties of DRNs to require some sort of catalog to guide the novice through all the combinations of mechanisms and applications. This book, or more appropriately, this field guide, fills this need.

Why a field guide? Field guides, such as those employed by bird watchers and other admirers of nature, generally catalog various species of animals and their habitats. They also help their readers gain better insight into the relationships among the various species and their environments by drawing attention to the features that allow each species to thrive in its niche. The DRN literature has recently witnessed a dramatic proliferation of species and behaviors, with every conference and journal spawning new architectures and algorithms. At the same time, we have seen a migration of DRNs across an ever increasing range of problem domains. While many specialized papers have been published on architectures, issues, and applications, novice DRN watchers (and even those who have been working in the field for some time) have dreamed of a more general and comprehensive reference to overview the field. For this reason, we were compelled to document our forays into artificial intelligence, control theory, and connectionism in search of DRNs.

Like other field guides, this book begins by teaching the reader to be a better observer. This lesson is supplemented by examining the challenges that DRNs face.

In addition to the architectures and approaches, we emphasize the issues driving the development of this class of network structures. Knowing these issues outfits the reader with tools to understand and evaluate the merits of the architectures described in the book. These tools will also help the reader analyze and exploit other architectures, both current and future.

This field guide also describes the most common species a researcher or developer is likely to encounter during typical DRN-watching trips to the forest of scientific and engineering literature. We will identify salient features of the DRNs and their various instantiations that help one better understand these networks and the relations between the various species. Toward this end, we have attempted to present all relevant details in consistent notation and terminology.

Although it is important to know the architectural and algorithmic details of the different species of DRNs, it is probably more important to know their natural habitats. This information is especially critical to those trying to implement real systems based on these networks. We will, therefore, address the critical issue of how to apply these systems to real problems. Understanding the demands of various applications and how they can be met will permit the user to judiciously select and tune DRNs for new applications.

We have enlisted a diverse collection of researchers and practitioners in the area of DRNs to write various chapters of the guide. Their years of experience and differing viewpoints provide a more rounded presentation of the material than would be possible in a book with only one or two authors. Since there are almost as many reasons for researching DRNs as there are researchers, it is essential that a guide to DRNs reflect these differing objectives. Although we could try to paraphrase the reasons for their interest in DRNs, we believe that only the practitioners themselves can do justice to such an explanation.

1.2 DYNAMICAL RECURRENT NETWORKS

Our world is full of time-dependent processes. The daily cycle of light and dark, the changing of seasons, and the periodic beats of our hearts are all examples of systems that change over time. Almost any real system you can think of has some notion of temporal change, although we often do not think of it that way. Consider a light switch. Although there is a minuscule delay between closing the switch and the light turning on, we can ignore the signal propagation delay owing to the speed of light and consider the system to be time independent. Such simplifications allow us to reason about complex systems and devices without getting bogged down in detail.

In some systems, however, the time-dependent behavior is clearly important. A vending machine must release both your beverage and your change after you deposit a sufficient number of coins. Since you cannot deposit more than one coin at a time, the machine must remember how much money has been deposited since the last sale. The behavior is context dependent, since the machine will react differently to a coin depending on the previous coinage. A simple finite-state automaton can be used in this situation to store the current total coinage. Such a machine will have several distinct states, one for each possible increment of coinage it may receive. The representation of the coinage is the state of the finite-state machine. It will accept nickels, dimes, and quarters as "input" until the price of the beverage is reached. If more than this amount has been fed to the machine, it will automatically release enough coinage to maintain the correct amount of money for the beverage. Once this state has been achieved, the user selects an item (via the Select input), the beverage is released, and the machine returns to its initial state.

The light switch and vending machine represent two disjoint classes of machines defined by the time dependence of their behavior. In digital circuit theory, this distinction also underlies the division between combinatorial and sequential circuits.
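To make the vending machine's context-dependent behavior concrete, here is a minimal finite-state sketch in Python. The nickel, dime, and quarter inputs follow the description above, while the 35-cent price is an arbitrary illustrative assumption rather than a detail taken from the text.

```python
# Minimal finite-state sketch of the vending machine described above.
# Assumption for illustration only: the beverage costs 35 cents.
PRICE = 35
COINS = {"nickel": 5, "dime": 10, "quarter": 25}

def step(state, event):
    """Map (current coinage, input event) to (next coinage, outputs)."""
    if event in COINS:
        state += COINS[event]
        change = max(0, state - PRICE)     # excess coinage is returned at once
        return min(state, PRICE), {"change": change, "beverage": False}
    if event == "select" and state == PRICE:
        return 0, {"change": 0, "beverage": True}    # release beverage, reset
    return state, {"change": 0, "beverage": False}   # ignore a premature Select

state = 0
for event in ["quarter", "dime", "nickel", "select"]:
    state, out = step(state, event)
    print(event, "->", state, out)
```

Note that the same input (a nickel, say) produces different outputs depending on the coinage already deposited, which is exactly the context dependence the finite-state machine exists to capture.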


In combinatorial circuits, the behavior of the mechanism is totally determined by the current input. A NAND gate, for instance, should respond only to the voltage levels present at response time. Sequential circuits, on the other hand, are sensitive to the temporal history of their inputs. The prime example of a sequential circuit is the fundamental memory unit, the flip-flop. In order to perform to its specification, the flip-flop contains an internal state that changes according to the inputs it receives. Internal state allows input signals to affect future output behavior.

Unfortunately, this definition is not broad enough; one could argue that the physical implementation of a NAND gate has an internal state. The electric signals cannot travel faster than the speed of light, so any output behavior must lie in the future of the input event. The same can be said for multilayered feedforward neural networks: the activations of the hidden units constitute the internal state of the network. If one looks at the behavior in the limit (i.e., the action of the device if the inputs were to remain constant for a very long period of time), we do see a difference. The NAND gate will experience short-lived transients and will settle on a stable output. The network will also produce its own transients due to the pipelined nature of its processing, but once the input exits the pipeline, the network will generate the same behavior over time. Usually, the transients are considered a side effect of the implementation and are not part of the functional specification. Systems with very short transients (relative to the time course of interest) can, like the NAND gate, be considered to lack a temporal component.

Like digital circuits, connectionist models consist of a number of processing units, each connected with one or more other processing units. A processing unit displays an activity level, o, that depends on the activity of the processing units connected to it. The connections can be either excitatory or inhibitory. That is, if a processing unit has an excitatory connection with another highly activated unit, it too will raise its activity level. Inhibitory connections, on the other hand, decrease the activity level of the processing unit when the other unit is active. Connections are weighted; that is, they have a numerical value associated with them that scales the input arriving on that connection. The signs of these weights indicate excitatory or inhibitory connections. The quantity w_ij refers to the connection strength from unit j to unit i. The collective weighted input to a unit is the net input, net_i = sum_j w_ij o_j. The activation of a unit is computed from the net input using an activation function. Common activation functions include the sigmoid, g(x) = sigmoid(x) = 1/(1 + e^(-x)), and the hyperbolic tangent, g(x) = tanh(x).
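As a concrete reading of the definitions above, the following sketch computes the net input sum_j w_ij o_j of a single unit and passes it through the sigmoid and hyperbolic tangent activation functions. The weights and incoming activities are made-up values chosen only for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Made-up activities o_j of three units feeding unit i, and the weights w_ij.
o = [0.9, 0.2, 0.7]
w = [1.5, -2.0, 0.5]          # positive = excitatory, negative = inhibitory

net = sum(w_ij * o_j for w_ij, o_j in zip(w, o))   # net input: sum_j w_ij o_j
print("net input          :", net)
print("sigmoid activation :", sigmoid(net))        # squashed into (0, 1)
print("tanh activation    :", math.tanh(net))      # squashed into (-1, 1)
```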
The most common connectionist architecture is the feedforward network. These networks consist of one or more layers of processing units. The layers are arranged linearly, with weighted connections between adjacent layers. All connections point in the same direction: a previous layer feeds into the current layer, and that layer feeds into the next. Information flows from the first, or input, layer to the last, or output, layer. The input layer is distinguished by its lack of incoming connections; similarly, the output layer lacks outgoing connections. The remaining layers possess both types of connectivity and are referred to as hidden layers.

Feedforward networks share many properties with combinatorial digital circuits. First, the fundamental discrete logic gates, such as AND, OR, NOT, and NAND, are implementable as linear combinations subjected to a high-gain transfer function (e.g., g(x) = sigmoid(50x)). More importantly, both display directed acyclic connectivity graphs. Cyclic connectivity is the magic behind the digital memory device, the flip-flop; since feedforward networks and combinatorial circuits lack cycles, neither mechanism can store state information from one input presentation to the next. Thus, feedforward networks merely transform representations. The real power of such networks comes from selecting vector representations that embody the desired topological relationships.
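The claim that the basic logic gates are implementable as linear combinations passed through a high-gain transfer function can be checked directly. The sketch below realizes a NAND gate as sigmoid(50 * (1.5 - x1 - x2)); the particular bias and weights are one workable choice among many, not a prescription from the text.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def nand(x1, x2, gain=50.0):
    # Weighted sum with bias 1.5: positive unless both inputs are 1.
    return sigmoid(gain * (1.5 - x1 - x2))

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", round(nand(x1, x2)))   # 1 1 -> 0, otherwise 1
```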


The success of an application based on feedforward neural networks rests on the designer knowing these representational constraints ahead of time. The designer, for instance, should select representation vectors which ensure that all "noun" vectors occupy a small region of word vector space. Constraints like this one have guided researchers in constructing representations and are one of the many programming aspects of neural networks.

The problems to which feedforward networks have been applied have one constraint in common. These tasks are temporally independent: the "what" of the current input unambiguously determines the current output, independent of "when" it occurs. As demonstrated above, many problems are context dependent and thus demand connectionist architectures capable of encoding, storing, and processing context for later use. A class of connectionist models known as dynamical recurrent networks (DRNs) is often brought to bear in these situations. In DRNs, the current activation of the network can depend on the input history of the system and not just the current input.

Unlike their feedforward brethren, DRNs offer a diverse set of processing-unit connectivity patterns. Units can connect to themselves. Layers of units can project connections back to previous layers of a feedforward network. DRNs can even exhibit arbitrary connectivity. All of these patterns share one thing in common: somewhere in the connectivity graph lies a cycle. Some DRNs, however, have acyclic graphs. These networks use time delays to spread the effective input across time, so their output is determined by a moving window of previous inputs. Hence, the current output of the network still depends on the network's input history, and these networks are considered DRNs as well.

The networks with cycles have the potential to dynamically encode, store, and retrieve information much like sequential circuits. As such, DRNs can be thought of as neural state machines. The recurrent connectivity of the network produces cycles of activation within the network. This connectivity gives the network a short-term memory of previous experiences, as these experiences may affect the cycling activation and may therefore influence the processing of the network well after the stimuli have passed. This idea, known as reverberation, can be traced to the work of McCulloch and Pitts (1943), which also prompted the development of sequential circuits and finite automata.

The memory component of DRNs suggests that they would serve as excellent candidates for solving problems involving temporal processing. The connectionist techniques used to solve these problems are diverse, and the DRN implementor faces several difficult questions. How will the processing units be connected? Is the connectivity pattern sufficient to solve my problem? How do I find the connection weights? By addressing the architectures, capabilities, algorithms, limitations, and applications of DRNs, we hope the reader will be able to find answers to these, and other, questions regarding this fascinating mechanism.
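As a minimal illustration of the neural state machine idea, the following sketch updates a recurrent state from the current input and the previous state, so the same input vector can produce different activations depending on history. The weights are random placeholders rather than a trained network.

```python
import math
import random

random.seed(0)
n_in, n_state = 2, 3

# Random placeholder weights; a real DRN would learn these.
W_in  = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_state)]
W_rec = [[random.uniform(-1, 1) for _ in range(n_state)] for _ in range(n_state)]

def step(state, x):
    """state_t = tanh(W_in x_t + W_rec state_{t-1}): a cycle in the connectivity graph."""
    return [math.tanh(sum(W_in[i][j] * x[j] for j in range(n_in)) +
                      sum(W_rec[i][j] * state[j] for j in range(n_state)))
            for i in range(n_state)]

state = [0.0] * n_state
for t, x in enumerate([[1, 0], [0, 1], [1, 0]]):    # the input [1, 0] appears twice...
    state = step(state, x)
    print(t, x, [round(s, 3) for s in state])       # ...but yields different states
```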

1.3 OVERVIEW

The book is divided into five parts. The first part presents an overview of DRN architectures. Once the reader has become familiar with the range of speciation in the DRN phylum, we turn to exploring their computational capabilities in Part II. The third part introduces algorithms for inserting knowledge into and extracting knowledge from DRNs. Part IV then discusses the computational and representational limitations of these sequence transducers. We conclude the book with several field studies of DRNs in their natural habitats—applications.

1.3.1 Architectures

DRNs are the connectionist approach to the processing of spatio-temporal data. For instance, we may need to predict the next element of a time series, decide on an action, or clean a noisy signal, given historical information. To address such problems, DRNs must be sensitive to the sequential properties of their inputs.


Internal memory, capable of storing salient information about the previous input vectors, is a necessary feature of DRNs. As shown in Part I, this memory can take on many forms. From shift registers to recurrent connections, the form of the memory plays an important part in separating the variety of DRN species.

In the first chapter of our architectural explorations, we introduce mechanisms that can change how their internal states are computed during the training process. Examples of these architectures include simple recurrent neural networks (Elman, 1990), Jordan networks (Jordan, 1986a), and second-order recurrent networks (e.g., Giles et al., 1990).

Another approach to processing time-varying signals is to store a fixed-length input history, that is, a buffer, and then apply a conventional technique (e.g., a linear transformation or a multilayered perceptron) to the contents of the buffer. The famous NETtalk experiment (Sejnowski and Rosenberg, 1987) employed this technique for mapping sequences of text to phoneme sequences. Since that time, a number of enhancements have improved on the approach. For instance, the buffers can be moved into the network, as in the time delay neural network (TDNN) approach (Lang et al., 1990). Likewise, the network's output can be buffered and fed back, as in Nonlinear AutoRegressive with eXogenous inputs (NARX) networks (Lin et al., 1995). Our first exposure to architectures will observe members from the delay buffer genus.

In Chapter 4, the delay buffer notion is generalized to the memory kernel (de Vries and Principe, 1992). A memory kernel consists of a short-term memory subsystem and a generic predictor. The latter is assumed to be a static feedforward mechanism. The short-term memory, however, is a sequence of one or more memory modules governed by some transfer function between modules. For instance, the standard shift register is realized by a sequence of modules with the identity function as their transfer function. Other mechanisms can be constructed as well: moving averages, gamma filters, Laguerre filters, and standard IIR/FIR filters. Given this diversity, we will spend some time identifying and cataloging the memory kernel species.
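A rough sketch of the contrast between a plain delay buffer and a more general memory kernel, under the simplifying assumption that each memory module is a leaky integrator fed by its predecessor: with mu = 1 the cascade behaves as an ordinary shift register, while 0 < mu < 1 smears the input across modules in the spirit of the gamma memory. The update rule is an illustrative form, not necessarily the one used in Chapter 4.

```python
def update_memory(modules, x, mu=1.0):
    """One tick of a cascade of memory modules driven by input x.

    new_m[k] = (1 - mu) * m[k] + mu * (previous stage's old value)
    mu = 1      -> plain shift register (tapped delay line)
    0 < mu < 1  -> leaky, gamma-style memory
    Illustrative form only.
    """
    new, prev = [], x
    for m in modules:
        new.append((1.0 - mu) * m + mu * prev)
        prev = m
    return new

impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
shift, gamma = [0.0] * 3, [0.0] * 3
for x in impulse:
    shift = update_memory(shift, x, mu=1.0)   # the impulse marches down the line
    gamma = update_memory(gamma, x, mu=0.5)   # the impulse is smeared over modules
    print([round(v, 2) for v in shift], [round(v, 3) for v in gamma])
```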
1.3.2 Capabilities

After presenting the various DRN species, we turn to exploring their computational capabilities. Understanding how DRNs process information is very important when deciding which architecture to use. For instance, certain DRNs have difficulties learning context-free languages owing to the nature of their state transition functions.

From our description above, DRNs are computational systems with time-varying responses. Mechanisms such as these are most generally described as dynamical systems. Chapter 5 will introduce definitions crucial to understanding DRNs as dynamical systems. Three key notions will be elaborated: a time/state-space taxonomy, attractor behavior, and iterated function systems. The first notion classifies dynamical systems by their treatment of time and space. The second recognizes that, while dynamical systems can exhibit a wide variety of behaviors, these behaviors can be classified qualitatively as fixed point, periodic, quasiperiodic, or chaotic. Finally, the similarity between iterated function systems and DRNs is identified.

The next two chapters address the issue of how dynamical recurrent networks can represent discrete states, implement state transitions, and thus act as finite-state automata or finite-state machines. The authors of Chapter 6 describe how to represent finite-state automata in networks with hard-limiting discriminant functions. The concept of, and necessity for, state splitting in single-layer first-order architectures is explained, and second-order alternatives are discussed. The representational capabilities of several other architectures, such as the recurrent cascade correlation network (Fahlman, 1991), are examined as well.

Although the previous chapter dealt with networks that used step functions to calculate activations, the next chapter concerns stable state encodings in DRNs with continuous activation functions. Such encodings are crucial to the implementation of finite automata in DRNs.

Without stability, the network could enter an infinite set of computational states (a situation exploited in Chapter 8). The authors of Chapter 7 use a bounding argument to define stable encodings of different automata in a variety of network architectures (Carrasco et al., 2000). It is then possible to directly encode finite-state machines into a variety of commonly used DRNs. These encodings rely on large weights: large weights saturate the processing units, effectively producing units that behave like digital logic gates.

Chapter 8 in our exploration of the capabilities of DRNs steps to the next level in the Chomsky hierarchy: pushdown automata and context-free languages. The DRNs capable of finite-state behavior rely on state clustering—neatly partitioned neighborhoods of activation vectors representing the state of the automaton. The DRNs of this chapter, however, do not partition nicely. In fact, these networks take advantage of the sensitivity to initial conditions of the state transfer function to encode an infinite set of states capable of supporting a pushdown automaton (Rodriguez et al., 1999). Several experiments that demonstrate the DRN's ability to generalize toward the context-free interpretation of a finite set of training examples are described as well (Wiles and Elman, 1995). The importance of these capabilities becomes evident when DRNs are applied to natural language processing problems (see Chapter 17).

In the final chapter on the representational power of DRNs, we see how analog DRNs can implement computational mechanisms equivalent to Turing machines (Siegelmann and Sontag, 1991). Although analog DRNs cannot branch on values (e.g., if x is greater than 0, go to another part of the program), do not have a classical memory structure (e.g., tapes or registers), and have only a fixed number of processing units, they can still compute the range of functions that a digital computer can. The trick is to fractally encode the Turing machine "tape" in the continuous activation of a processing unit. If the DRN has real-valued weights, the network can actually compute functions beyond those realized by Turing machines and recursive functions (Siegelmann, 1999).
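The flavor of such fractal encodings can be conveyed with a standard textbook-style construction, offered here as an illustration rather than the exact encoding used in the cited work: a binary stack (one half of a Turing machine tape) is stored as a single number in [0, 1), and push and pop become affine maps of the sort a saturating unit can approximate.

```python
# Illustrative fractal encoding of a binary stack in one activation value.
# Each stack symbol occupies a base-4 digit, so distinct stacks stay separated.

def push(s, bit):
    return (s + 2 * bit + 1) / 4.0       # prepend digit 1 (bit 0) or 3 (bit 1)

def top(s):
    return 1 if s >= 0.5 else 0          # leading digit 3 means the top bit is 1

def pop(s):
    return 4.0 * s - (2 * top(s) + 1)    # strip the leading digit

s = 0.0                                   # the empty stack
for bit in [1, 0, 1]:
    s = push(s, bit)
print("encoded stack:", s)                # one number in [0, 1)

while s > 0:
    print("popped:", top(s))
    s = pop(s)                            # bits come back in last-in, first-out order
```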
1.3.3 Algorithms

The previous parts of the field guide provide structural and functional descriptions of many DRNs. In order to use these mechanisms to solve problems, we must employ one or more algorithms to assist in the insertion and/or extraction of knowledge stored in the networks.

We begin this part with a survey of learning algorithms for DRNs with hidden units, placing the various techniques into a common framework. We discuss fixed-point learning algorithms, such as recurrent back propagation (Williams and Zipser, 1989), and non-fixed-point algorithms, namely, back propagation through time (Rumelhart et al., 1986), Elman's history cutoff (Elman, 1990), and Jordan's output feedback architecture (Jordan, 1986a). Forward propagation, an online technique that uses adjoining equations, is also discussed. We discuss the advantages and disadvantages of temporally continuous neural networks in contrast to clocked ones, and we continue with some "tricks of the trade"[1] for training, using, and simulating continuous-time and recurrent neural networks. After examining some simulation results, we address the issues of computational complexity and learning speed.

In Chapter 10, we focus on methods for injecting prior knowledge into DRNs (Giles and Omlin, 1993). Typically, this knowledge is in the form of a nondeterministic finite-state machine. Using the encodings described in Chapters 6 and 7, one can seed the training of a DRN with a good guess of the target machine. This insertion will result in improved generalization and reduced training times. Insertion algorithms for several architectures are described, including first-order, second-order, and radial basis function (RBF)-based recurrent neural nets. In addition, the authors consider the implementation of fuzzy finite automata in DRNs (Omlin et al., 1998).

[1] Additional connectionist model tricks can be found in Orr and Müller, 1998.


While Chapter 11 focuses on the issue of representing and inserting a desired behavior pattern into a DRN, Chapter 12 tackles the inverse problem: extracting the behavior from a DRN. Since DRNs can represent finite-state automata, we begin by describing a technique for extracting automata from DRNs (Giles et al., 1992b) that approximates the network's functional (input/output) behavior. This technique is based on the tendency of the state vectors of DRNs to cluster. The clusters are identified by quantizing the state space. Transitions between clusters are generated by exploring the state dynamics of the DRN: feeding the network input sequences and recording the state transitions. We go on to examine the limitations of this approach and the degree to which extraction is noise dependent. Finally, we examine an application of automata extraction in the context of financial forecasting (Lawrence et al., 1997). This application suggests a novel approach to trading strategies in financial markets.

1.3.4 Limitations

In this part, we pause to reflect on the limitations of DRN architectures and of the algorithms for manipulating those architectures. First, we learn to evaluate potential benchmark problems. Difficulties with gradient-based learning algorithms are then explored. Finally, we address representational and processing limitations due to noisy computation.

Evaluating DRN training algorithms is a difficult task. Although there are several benchmark sets (e.g., the Tomita languages—see Tomita, 1982), one difficulty lies in interpreting the results. Did the network find the solution because of a good training algorithm, or do satisfactory solutions litter weight space? Chapter 13 addresses this question by looking at the naive learning algorithm called random weight guessing (RG) (Hochreiter and Schmidhuber, 1997c). Although RG cannot be viewed as a reasonable learning algorithm, it often outperforms more complex methods on widely used benchmark problems. One reason for RG's success is that the solutions to many of these benchmarks are dense in weight space. Thus, a potential benchmark set should contain problems on which RG fails; otherwise we could not distinguish a good algorithm from one that merely thrashes around in weight space. This test should serve to improve the quality of benchmarks for evaluating novel DRN training methods.

Numerous papers have focused on the difficulty of training standard DRNs to learn behaviors involving long-term dependencies. Chapter 14 focuses on the dynamics of error signals during gradient-based training of DRNs. In feedforward networks, the backward flow of error information is either amplified or reduced, depending on the eigenvalues of the weight matrices. This is not much of a problem when the error flows through only two or three weight matrices. DRN error signals, however, repeatedly pass through the weights determining the next-state calculation. As a result, the path integral over all error signals either decays or explodes exponentially as the signals are repeatedly propagated through the state weights. This condition can prevent a network from learning behaviors with significant time lags (five to ten time steps). For instance, DRNs have difficulties learning to latch information for later use. This chapter provides an analysis of this important problem and discusses possible remedies.
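A back-of-the-envelope illustration of this effect: if an error signal is repeatedly multiplied by the same recurrent weight matrix, its magnitude shrinks or grows exponentially with the number of time steps. The two matrices below are made-up examples with eigenvalues inside and outside the unit circle.

```python
# Toy illustration: an error signal pushed back through T copies of the same
# recurrent weight matrix scales roughly like (largest eigenvalue) ** T.

def mat_vec(W, v):
    return [sum(W[i][j] * v[j] for j in range(len(v))) for i in range(len(W))]

def norm(v):
    return sum(x * x for x in v) ** 0.5

W_decay = [[0.5, 0.1], [0.0, 0.5]]   # eigenvalues inside the unit circle
W_grow  = [[1.3, 0.1], [0.0, 1.3]]   # eigenvalues outside the unit circle

err_decay = [1.0, 1.0]
err_grow  = [1.0, 1.0]
for t in range(1, 11):
    err_decay = mat_vec(W_decay, err_decay)
    err_grow  = mat_vec(W_grow, err_grow)
    print(t, round(norm(err_decay), 6), round(norm(err_grow), 2))
# After ten steps the first signal has all but vanished; the second has exploded.
```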
The final chapter looks at two important factors that limit the utility of DRNs: the output function and computational stability. First, the output function of the DRN determines how the state space is carved up to classify input vectors. A single perceptron cuts the space with a hyperplane, while an RBF unit uses circular decision boundaries. By examining the VC dimension of the output function, one can establish a computational hierarchy for DRNs similar to those found in other limited computational systems.


A DRN with a quadratic decision boundary, for instance, will be able to perform language recognition tasks that those with linear decision boundaries cannot. Second, this chapter examines two types of computational stability in the presence of noisy state transitions. The first assumes that the DRN must always perform perfectly in the presence of noise (Casey, 1996). Under this assumption, states are no longer points but clouds of points. If the dimensionality of such a cloud is the same as that of the state space itself, the network can only perform finite-state calculations. The second stability assumption places a bound on the error rate of the DRN (Maass and Orponen, 1998). Again, the DRN is limited to finite-state behavior. The number of states available to DRNs under this second limitation, however, grows much faster than under the first as the amplitude of the noise is decreased. Given that DRN implementations, both hardware and software, are affected by noise, it is important to understand these limitations before applying DRN solutions to a problem.

1.3.5 Applications

We have devoted the final part of this field guide to studying the DRN in its natural habitat. Observing DRNs applied to real problems can help the reader understand the benefits and limitations of selecting DRNs as a solution tool. Our application areas range from financial portfolio management to modeling linguistic behaviors. Throughout these habitats, we will encounter a wide range of DRN architectures and training methods. They will be controlling physical systems, performing cognitive tasks, predicting time series, and storing data structures.

The first problem domain we visit is the application of DRNs to control. In this application, the DRN observes the current state of a system, or plant, along with the plant's input, and adjusts the behavior of the plant to achieve some goal. For instance, a DRN controller could maintain the plant's output at a constant level despite variations in the input. Chapter 16 describes an approach to training two classes of DRNs used in the control field: time-lagged recurrent neural networks and recurrent multilayered perceptrons. The methods described in this chapter are applied to two realistic examples. The first involves controlling a multiple-input multiple-output plant so that its outputs track two independent reference signals. The second addresses the problem of financial portfolio optimization from the control perspective.

The next habitat we find for DRNs is the area of cognitive science. This field has been an incubator for many connectionist ideas, most notably parallel distributed processing (Rumelhart and McClelland, 1986). One would be hard pressed to find an area of cognitive science devoid of connectionist influences, and one especially bountiful area has been natural language processing. Many DRN advances have been driven by attempts to build connectionist models in the cognitive domain, despite vocal opposition (e.g., Fodor and Pylyshyn, 1988). In this chapter, we examine the role DRNs can play in modeling several phenomena of sentence processing—frequency sensitivity, garden-path effects, and phrase structure and memory. These models are then compared with more traditional linguistic models of the same phenomena.

From linguistics we turn to the general problem of modeling dynamical systems. The next two chapters present the modeling of time series, such as those encountered in economic forecasting, by finite unfolding in time.
The first of these chapters addresses such topics as domain-specific data preprocessing (e.g., Refenes, 1994), outlier suppression, square-augmented processing units (Flake, 1998), estimating time-series derivatives (Weigend and Zimmermann, 1998), and combining estimates (Perrone, 1993). These methods are combined into both a feedforward network and a simple recurrent network. While feedforward networks have been used extensively in forecasting, DRNs are appropriate when the dynamical system in question possesses a significant autonomous component.

Chapter 18 continues the dynamical system modeling discussion. Specifically, it addresses the question of how to combine state-space reconstruction (Packard et al., 1980) and forecasting in a neural network framework. Starting from Takens' Theorem (Takens, 1981), the networks search for an optimal state-space reconstruction such that the forecasts become more accurate. The network solves this problem by unfolding the data stream in both space and time. The theory provides a novel view on the detection of invariants, on the nonlinearity-noise dilemma, and on the concept of intrinsic time.
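The state-space reconstruction underlying Chapter 18 can be sketched with a plain delay embedding: each scalar observation is replaced by a vector of lagged values that can then serve as network input. The embedding dimension, lag, and toy signal below are arbitrary illustrative choices, not parameters from the chapter.

```python
import math

def delay_embed(series, dim=3, lag=2):
    """Takens-style delay vectors [x_t, x_{t-lag}, ..., x_{t-(dim-1)*lag}]."""
    start = (dim - 1) * lag
    return [[series[t - k * lag] for k in range(dim)]
            for t in range(start, len(series))]

# A made-up scalar observable from some underlying dynamical system.
series = [math.sin(0.3 * t) + 0.5 * math.sin(0.7 * t) for t in range(20)]

for vec in delay_embed(series, dim=3, lag=2)[:5]:
    print([round(v, 3) for v in vec])
# Each reconstructed state vector could now be fed to a network that predicts
# the next observation.
```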


While most of this book deals with the problem of transducing or classifying inputs consisting of ordered sequences of vectors, more sophisticated, higher-order structures can also be processed. Chapter 19 examines the application of DRNs to the processing of labeled graphs (Frasconi et al., 1997). This capability is important because the natural encodings of some problems are graph-like. For instance, the network could encode the parse tree of a sentence; the resulting vector representation could then be stored, classified, or transformed. The chapter finishes with applications in structural pattern recognition, computational chemistry, and search control for deduction systems.

We conclude with a synopsis of the book's major themes and highlights. This final chapter discusses open problems that remain to be addressed and identifies some of the emerging directions that the field will follow in the future.

1.4 CONCLUSION

This chapter provides an outline of the contents of the DRN field guide. It sets the stage for the remainder of the book by contrasting feedforward networks with DRNs.