Optimization via Efficient Learning in CNNs: Cognitively-Motivated Temporal Discount Functions in SRNNs

Robert Kozma
Department of Mathematical Sciences, The University of Memphis, Memphis, TN 38152, USA
[email protected]

Roman Ilin*
Department of Mathematical Sciences, The University of Memphis, Memphis, TN 38152, USA
[email protected]

*Present affiliation: Air Force Research Laboratory, Sensors Directorate, Hanscom AFB, MA 01731, USA. Partial support to complete this work has been provided by a FedEx Institute of Technology Research Grant.

Abstract—Cellular Neural Networks (CNNs) are universal computing machines embodying basic computational principles of cortical tissues. Simultaneous Recurrent Neural Networks (SRNNs) have shown clear advantages in solving complex optimization and decision making problems. Based on biological intuition, we introduce temporal discount functions in training SRNNs as a generalization of the adaptive learning rate concept. The proposed procedure results in a drastic, 3-5-fold acceleration of learning, demonstrated through the maze navigation problem.

Keywords—Simultaneous Recurrent Neural Network (SRNN), Back-Propagation Through Time (BPTT), Adaptive Learning Rate, Temporal Discount Function, Maze Navigation.

I. INTRODUCTION

Cellular Simultaneous Recurrent Neural Networks (SRNNs) have been suggested to be more powerful function approximators than Multi-Layer Perceptrons (MLPs), in particular when solving approximate dynamic programming problems. The modern approach to reinforcement learning builds on the concept of dynamic programming, which allows planning of the best course of action in a multistage decision problem [1]. Given a Markovian decision process with N possible states and the immediate expected cost of transition between any two states i and j denoted by c(i,j), the optimal cost-to-go function for each state satisfies the following Bellman optimality equation:

J^*(i) = \min_{\mu} \left\{ c(i, \mu(i)) + \gamma \sum_{j} p_{ij}(\mu) J^*(j) \right\}    (1)

J(i) is the total expected cost from the initial state i and γ is the discount factor. The cost J depends on the policy μ, which is the mapping between the states and the actions causing state transitions; p_ij(μ) is the probability of a transition from state i to state j under that policy. The optimal expected cost results from the optimal policy μ*. Finding such a policy directly from Eq. 1 is possible using recursive techniques, but it becomes computationally expensive as the number of states of the problem grows. The concept of approximate dynamic programming (ADP) refers to techniques used to estimate the exact solution of Bellman's optimality equation, usually by means of neural networks. Such networks can be utilized in intelligent control, see for example [2]. It has been asserted that feedforward networks are not powerful enough [3], and that recurrent networks are necessary to achieve greater power as function approximators.

A special kind of recurrent network, called the cellular simultaneous recurrent neural network (cellular SRNN), has been designed to solve the 2D maze navigation problem. On one hand, this is an easy problem, which can be solved by direct application of the Bellman equation. On the other hand, this problem can be solved using a carefully designed cellular SRNN, as shown in [4]. The real issue is how to train the network for this task. In [3], training by backpropagation through time took on the order of 10,000 training epochs to learn 6-8 mazes. In an alternative approach, the Extended Kalman Filter has been used for efficient training of SRNNs [9, 10]. In the present approach, we apply temporal discount functions in combination with Back-Propagation Through Time (BPTT). Temporal discount functions correspond to the cognitively grounded intuition that the impact of distant time instances on present behavior should inevitably diminish beyond a finite time horizon. Our results support this intuition. Moreover, the speed of convergence has been improved 3-5-fold in comparison with earlier results.

In this work we first describe the applied mathematical apparatus involving cellular simultaneous recurrent neural networks trained by BPTT. We then introduce the novel training method based on the temporal discount function approach. We define the 2D simulated maze navigation problem and describe the results of training using discount functions. Finally, we compare the results obtained by various training methods and conclude with the advantages of the proposed learning approach.

II. FORMULATION OF THE OPTIMIZATION PROBLEM

A. 2D Maze Navigation Problem

The generalized maze navigation problem consists of finding the optimal path from any initial position to the goal in a 2D grid world. An example of such a world is given in Fig. 1. Each cell of the grid can be either a clear cell or an obstacle. The path can be easily obtained if we compute the cost-to-go function J using Bellman's equation. Here, the immediate cost c(i,j) is always 1, and the probabilities p_ij can only take values of 0 or 1. Various 2D mazes have been generated randomly for training purposes. Examples of computed J functions for 5x5 square mazes are given in Fig. 2. The obstacles are marked with a high cell value of 25.
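As an illustration, the following is a minimal sketch of how such a target J function can be computed; it is a schematic, not the implementation used in our experiments, and the function name, array layout, and boundary handling are illustrative assumptions. With unit step costs and deterministic moves, Eq. 1 reduces to J*(i) = 1 + min over the four neighboring cells, i.e., simple value iteration on the grid.

import numpy as np

def compute_cost_to_go(maze, goal, obstacle_value=25.0, max_sweeps=100):
    # maze: 2D boolean array, True where a cell is an obstacle
    # goal: (row, col) index of the target cell
    rows, cols = maze.shape
    J = np.full((rows, cols), obstacle_value)   # obstacles keep the high value 25
    J[goal] = 0.0                               # zero cost-to-go at the target
    for _ in range(max_sweeps):
        J_prev = J.copy()
        for r in range(rows):
            for c in range(cols):
                if maze[r, c] or (r, c) == goal:
                    continue                    # obstacles and the goal are fixed
                best = min(J_prev[nr, nc]
                           for nr, nc in ((r-1, c), (r+1, c), (r, c-1), (r, c+1))
                           if 0 <= nr < rows and 0 <= nc < cols)
                J[r, c] = min(1.0 + best, obstacle_value)   # Bellman backup with unit step cost
        if np.array_equal(J, J_prev):
            break                               # converged: no cell value changed
    return J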

Figure 1. Illustration of the 2D maze environment. The obstacles are marked by black squares, the open fields are white. A wall has been added along the perimeter of the 5x5 maze to prevent escaping from the maze. The target is marked by x.

Figure 2. Representation of the maze in numerical form. 0 stands for the target, and every cell value gives the distance from the target. The large value of 25 indicates obstacles. In actual calculations, we add a ring of cells with value of 25 for the circular wall around the maze.

Figure 3. Illustration of a recurrent NN using a Generalized Multi-Layer Perceptron (GMLP) core function [6]. Superscript t refers to the current training or testing epoch; subscript n shows the current iteration of the SRN. The network input x^t is applied over many network iterations. The output y_n^t gradually converges to a steady value, which is taken to be the output of the network. An example of the y_n^t sequence is given in the graph below.

B. Simultaneous Recurrent Neural Networks

SRNNs can be used for static functional mapping, similarly to MLPs. They differ from the more widely known time-lagged recurrent networks (TLRNs) in that the input is applied over many time steps and the output is read after the initial transients have died out and the network is in an equilibrium state. The concept is illustrated in Fig. 3. The SRNN is expected to be a more powerful function approximator than the feedforward MLP [3] due to the presence of massive recurrent connections, in the style of brains. Recurrent connections are needed for the complex dynamic behaviors which make the brain such a powerful computational device [5].

Technically, the core of the SRNN does not have to be a network. It can be any differentiable nonlinear function. This point is very important because the computer code for a cellular SRNN can be made extremely flexible by allowing any functional form to be plugged in, as long as the feedback function for propagating the derivatives back through the system is specified. We envision such computer code to be available in the near future.
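The relaxation scheme can be summarized by the following minimal sketch, which assumes a simple tanh core and illustrative names rather than the GMLP core of Fig. 3: the same input is presented at every internal iteration, and the output is read only once the state has settled.

import numpy as np

def srnn_relax(x, W_in, W_rec, n_iters=50, tol=1e-6):
    # x: input vector; W_in, W_rec: input and recurrent weight matrices
    y = np.zeros(W_rec.shape[0])                 # initial state of the recurrent layer
    for _ in range(n_iters):
        y_new = np.tanh(W_in @ x + W_rec @ y)    # the same input x is applied at every iteration
        if np.max(np.abs(y_new - y)) < tol:      # read the output only after it stops changing
            return y_new
        y = y_new
    return y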

C. Cellular Network

The goal is to build an SRNN that learns the functional mappings given in Fig. 2 and generalizes this knowledge to other mazes. The input to the network is the set of obstacle and goal locations. The idea of the cellular network is to utilize the symmetry of the problem. If we transform the maze problem to a torus and allow the maze to wrap around its edges, the problem becomes symmetric with respect to shifts up or down, right or left. Therefore, the network must also obey this symmetry. The network is designed in such a way that the same basic architecture is repeated for each cell of the maze. In the case of a 5x5 maze, we have 49 inputs, counting the walls around the maze. We have 49 identical networks, each of which takes the type of its corresponding maze cell (clear, goal, or obstacle) as its input x^t and the previous values of its own output and the outputs of the 4 neighboring cells as its recurrent inputs y^t. The cells of the network communicate by means of the neighbor links; only the immediate neighbors are connected. The cellular architecture drastically decreases the number of adjustable weights in the network. The design also seems appropriate because its computation resembles the recursive nature of the Bellman equation.
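Schematically, one synchronous relaxation step of such a cellular network applies the same shared core, with identical weights, at every cell, feeding it that cell's input and the previous outputs of its four neighbors. The sketch below uses illustrative names, and its constant boundary padding is a simplifying assumption, not the wall encoding used in our experiments.

import numpy as np

def cellular_step(cell_inputs, y_prev, core_fn):
    # cell_inputs: (rows, cols, d) array encoding clear/goal/obstacle for every cell
    # y_prev: (rows, cols) previous output of every cell
    # core_fn: the shared core, applied with identical weights at each cell
    rows, cols = y_prev.shape
    padded = np.pad(y_prev, 1)        # constant padding stands in for the boundary ring (a simplification)
    y_next = np.zeros_like(y_prev)
    for r in range(rows):
        for c in range(cols):
            neighbors = np.array([padded[r, c+1],      # up
                                  padded[r+2, c+1],    # down
                                  padded[r+1, c],      # left
                                  padded[r+1, c+2],    # right
                                  padded[r+1, c+1]])   # the cell's own previous output
            y_next[r, c] = core_fn(cell_inputs[r, c], neighbors)
    return y_next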

III. LEARNING THROUGH BPTT

A. Regular BPTT Algorithm

Training of recurrent networks can be done using backpropagation through time (BPTT). BPTT extends classical backpropagation by "unfolding" the recurrent network. Imagine that instead of recurring back to themselves, the recurrent links of the network feed forward into a copy of the same network. Let us keep making such copies for 10 or 20 iterations. If the network comes to a steady state, the outputs will stop changing after a finite number of iterations. In that case we can stop replicating the network and say that our unfolded multilayer feedforward network is equivalent to the original recurrent network. This unfolded network can now be trained using regular backpropagation. The only problem is that the weights in each "layer" must stay the same; we cannot adjust each weight independently as we would in an MLP. Usually, weight adjustment is done by summing up all the derivatives and making one change corresponding to the sum. In the case of the cellular SRNN, the derivatives also have to be summed over each cell of the maze. Such summations impair the efficiency of learning. As a result, BPTT can be successfully applied to the maze navigation problem, but the learning is slow [6].

B. Training using Adaptive Learning Rate

Denote the network weights by the vector W. A particular weight is W(α), where α is the weight's index. The operation of the recurrent net includes a fixed number of forward iterations, p. The BPTT algorithm computes the gradient of the error function with respect to the weights in each iteration. Denote the current iteration by q. The component of the gradient for weight α at iteration q is denoted by F_W(α,q). Training consists of gradient descent with some learning rate LR. The weight is adapted by the sum of derivatives from all iterations as follows:

W(\alpha) = W(\alpha) + LR(t) \sum_{q=1}^{p} F_W(\alpha, q)
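In code, this summed update takes roughly the following form; it is an illustrative sketch, and the array shapes and names are assumptions rather than the layout of our implementation.

import numpy as np

def bptt_weight_update(W, F_W, lr):
    # W: 1D array of shared weights W(alpha)
    # F_W: array of shape (p, n_cells, n_weights) holding the derivative for every
    #      forward iteration q and every maze cell
    # lr: learning rate LR(t)
    total = F_W.sum(axis=(0, 1))     # sum the derivatives over iterations and over cells
    return W + lr * total            # a single change per shared weight, as in the rule above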

The learning rate LR(t) is adapted at each training epoch t according to the following algorithm.

Algorithm A
1. S1 = F_W(t) · F_W(t-1)
2. S2 = F_W(t-1) · F_W(t-1)
3. LR(t) = LR(t-1)(a + b) if S1 > S2
   LR(t) = LR(t-1)(a - b) otherwise

Here F_W(t) denotes the gradient vector at epoch t, the dot denotes the scalar product, and a and b are constant parameters of the adaptation rule.
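A minimal sketch of this adaptation rule follows; the default values of a and b are illustrative placeholders, not values prescribed above.

import numpy as np

def adapt_learning_rate(lr_prev, grad_t, grad_prev, a=1.0, b=0.1):
    # grad_t, grad_prev: gradient vectors F_W at epochs t and t-1
    # a, b: constants of the adaptation rule (illustrative defaults)
    s1 = np.dot(grad_t, grad_prev)     # S1: alignment of the current and previous gradients
    s2 = np.dot(grad_prev, grad_prev)  # S2: squared norm of the previous gradient
    if s1 > s2:
        return lr_prev * (a + b)       # step 3, first case: increase the rate
    return lr_prev * (a - b)           # step 3, second case: decrease the rate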