Recurrent Neural Network
This is a pure numpy implementation of text generation, character by character, using an RNN.
We're going to have our network learn how to predict the next characters in a given paragraph. This requires a recurrent architecture, since the network has to remember a sequence of characters: the order matters. After about 1000 iterations we'll have pronounceable English, and the longer the training time the better. You can feed it any text sequence (words, Python, HTML, etc.).
What is a Recurrent Network?
Feedforward networks are great for learning a pattern between a set of inputs and outputs, for example:
temperature & location
height & weight
car speed & brand
But what if the ordering of the data matters?
Alphabet, lyrics of a song. These are stored using conditional memory: you can only access an element if you have access to the previous elements (like a linked list).
Enter recurrent networks. We feed the hidden state from the previous time step back into the network at the next time step.
So instead of the data flow operation happening like this:
input -> hidden -> output
it happens like this:
(input + prev_hidden) -> hidden -> output
Wait, why not this?
(input + prev_input) -> hidden -> output
Hidden recurrence learns what to remember, whereas input recurrence is hard-wired to just remember the immediately previous datapoint.
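As a tiny numpy sketch of the difference (the sizes and weight names here are illustrative; the real ones are defined later in the notebook):

import numpy as np

# illustrative sizes only; the real hyperparameters come later
input_size, hidden_size = 8, 4
Wxh = np.random.randn(hidden_size, input_size) * 0.01   # input  -> hidden
Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden -> hidden

x = np.random.randn(input_size, 1)    # current input vector
h_prev = np.zeros((hidden_size, 1))   # hidden state from the previous time step

# feedforward: the new hidden state depends only on the current input
h_feedforward = np.tanh(np.dot(Wxh, x))

# recurrent: the previous hidden state is fed back in together with the input
h_recurrent = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h_prev))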
RNN Formula
It basically says that the current hidden state h(t) is a function f of the previous hidden state h(t-1) and the current input x(t). The thetas are the parameters of the function f. The network typically learns to use h(t) as a kind of lossy summary of the task-relevant aspects of the past sequence of inputs up to t.
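Written out, with the concrete choice of f that we'll use later in the code (a tanh layer, so theta stands for the weights Whh, Wxh and the bias bh):

$$h^{(t)} = f\big(h^{(t-1)}, x^{(t)}; \theta\big) \;=\; \tanh\big(W_{hh}\, h^{(t-1)} + W_{xh}\, x^{(t)} + b_h\big)$$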
Loss function
The total loss for a given sequence of x values paired with a sequence of y values is then just the sum of the losses over all the time steps. For example, if L(t) is the negative log-likelihood of y(t) given x(1), ..., x(t), then summing them up gives the loss for the whole sequence.
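In symbols, with tau the length of the sequence:

$$L\big(\{x^{(1)},\dots,x^{(\tau)}\},\{y^{(1)},\dots,y^{(\tau)}\}\big) \;=\; \sum_{t=1}^{\tau} L^{(t)} \;=\; -\sum_{t=1}^{\tau} \log p_{\text{model}}\big(y^{(t)} \mid x^{(1)},\dots,x^{(t)}\big)$$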
Our steps
1. Initialize weights randomly
2. Give the model a char pair (input char & target char; the target char is the char the network should guess, i.e. the next char in our sequence)
3. Forward pass (we calculate the probability of every possible next char according to the state of the model, using the parameters)
4. Measure the error (the distance between the predicted probabilities and the target char)
5. Calculate gradients for each of our parameters to see the impact they have on the loss (backpropagation through time)
6. Update all parameters in the direction given by the gradients, to minimise the loss
7. Repeat! Until our loss is small AF
What are some use cases?
Time series prediction (weather forecasting, stock prices, traffic volume, etc.)
Sequential data generation (music, video, audio, etc.)
Other examples: https://github.com/anujdutt9/RecurrentNeuralNetwork (binary addition)
What's next?
1. LSTM networks
2. Bidirectional networks
3. Recursive networks
The code contains 4 parts:
1. Load the training data (encode chars into vectors)
2. Define the recurrent network and its loss function (forward pass, loss, backward pass)
3. Define a function to create sentences from the model
4. Train the network (feed the network, calculate gradients and update the model parameters, output text to see the progress of the training)
Load the training data
The network needs a big .txt file as input. The content of the file will be used to train the network.
I use Metamorphosis by Kafka (public domain), because Kafka was one weird dude. I like it.

In [3]:
data = open('kafka.txt', 'r').read()
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print 'data has %d chars, %d unique' % (data_size, vocab_size)

data has 137629 chars, 81 unique
Encode/Decode char/vector
Neural networks operate on vectors (a vector is an array of floats), so we need a way to encode and decode a char as a vector. We'll count the number of unique chars (vocab_size); that will be the size of the vector. The vector contains only zeros, except at the position of the char, where the value is 1. vocab_size was already calculated above when we loaded the data. Next we create 2 dictionaries to encode and decode a char to/from an int:

In [5]:
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }
print char_to_ix
print ix_to_char

{'\n': 0, 'C': 31, '!': 3, ' ': 4, '"': 5, '%': 6, '$': 7, "'": 8, ')': 9, '(': 10, '*': 11, '-': 12, ',': 13, '/': 2, '.': 15, '1': 16, '0': 17, '3': 18, '2': 19, '5': 20, '4': 21, '7': 22, '6': 23, '9': 24, '8': 25, ';': 26, ':': 27, '?': 28, 'A': 29, '@': 30, '\xc3': 1, 'B': 32, 'E': 33, 'D': 34, 'G': 35, 'F': 36, 'I': 37, 'H': 38, 'K': 39, 'J': 40, 'M': 41, 'L': 42, 'O': 43, 'N': 44, 'Q': 45, 'P': 46, 'S': 47, 'R': 48, 'U': 49, 'T': 50, 'W': 51, 'V': 52, 'Y': 53, 'X': 54, 'd': 59, 'a': 55, 'c': 56, 'b': 57, 'e': 58, '\xa7': 14, 'g': 60, 'f': 61, 'i': 62, 'h': 63, 'k': 64, 'j': 65, 'm': 66, 'l': 67, 'o': 68, 'n': 69, 'q': 70, 'p': 71, 's': 72, 'r': 73, 'u': 74, 't': 75, 'w': 76, 'v': 77, 'y': 78, 'x': 79, 'z': 80}

Finally we create a vector from a char like this:
The dictionaries defined above allow us to create a vector of size vocab_size (81 here) instead of 256. Here is an example for the char 'a'. The vector contains only zeros, except at position char_to_ix['a'], where we put a 1.

In [7]:
import numpy as np
vector_for_char_a = np.zeros((vocab_size, 1))
vector_for_char_a[char_to_ix['a']] = 1
print vector_for_char_a.ravel()

The printed vector has 81 entries, all 0. except for a single 1. at index char_to_ix['a'] = 55.
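Going the other way, decoding a one-hot vector back into a char is just a matter of locating the 1, for example:

# find the position of the 1 and look it up in ix_to_char
ix = int(np.argmax(vector_for_char_a))
print ix_to_char[ix]   # prints 'a'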
Definition of the network
The neural network is made of 3 layers:
an input layer
a hidden layer
an output layer
All layers are fully connected to the next one: each node of a layer is connected to all nodes of the next layer. The hidden layer is connected to the output layer and to itself: the values from one iteration are used at the next one. To centralise the values that matter for the training (the hyperparameters), we also define the sequence length and the learning rate.

In [12]:
#hyperparameters
hidden_size = 100
seq_length = 25
learning_rate = 1e-1

#model parameters
Wxh = np.random.randn(hidden_size, vocab_size) * 0.01   # input to hidden
Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden to hidden
Why = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden to output
bh = np.zeros((hidden_size, 1))  # hidden bias
by = np.zeros((vocab_size, 1))   # output bias
The model parameters are adjusted during the training.
Wxh are the parameters that connect the input vector to the hidden layer.
Whh are the parameters that connect the hidden layer to itself. This is the key of the RNN: the recursion is done by injecting the previous output of the hidden layer back into itself at the next iteration.
Why are the parameters that connect the hidden layer to the output.
bh contains the hidden bias.
by contains the output bias.
You'll see in the next sections how these parameters are used to create a sentence.
Define the loss function
The loss is a key concept in the training of all neural networks. It is a value that describes how good our model is: the smaller the loss, the better the model. (A good model is a model whose predicted output is close to the training output.) During the training phase we want to minimize the loss.
The loss function calculates the loss but also the gradients (see the backward pass):
It performs a forward pass: calculate the next char given a char from the training set.
It calculates the loss by comparing the predicted char to the target char. (The target char is the char following the input char in the training set.)
It performs a backward pass to calculate the gradients.
This function takes as input:
a list of input chars
a list of target chars
the previous hidden state
This function outputs:
the loss
the gradients for each set of parameters between layers
the last hidden state
Forward pass
The forward pass uses the parameters of the model (Wxh, Whh, Why, bh, by) to calculate the next char given a char from the training set.
xs[t] is the vector that encodes the char at position t
ps[t] is the vector of probabilities for the next char

hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars

or, in dirty pseudocode, for each char:
hs = input*Wxh + last_value_of_hidden_state*Whh + bh
ys = hs*Why + by
ps = normalized(ys)
Backward pass
The naive way to calculate all gradients would be to recalculate the loss for a small variation of each parameter. This is possible but would be time-consuming. There is a technique to calculate all the gradients for all the parameters at once: backpropagation. Gradients are calculated in the opposite order of the forward pass, using simple rules.
The goal is to calculate gradients for the forward formula:
hs = input*Wxh + last_value_of_hidden_state*Whh + bh
ys = hs*Why + by
The loss for one datapoint:
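For one datapoint the loss is the softmax cross-entropy that the code computes as loss += -np.log(ps[t][targets[t],0]):

$$p^{(t)} = \operatorname{softmax}\big(y^{(t)}\big) = \frac{e^{y^{(t)}}}{\sum_k e^{y^{(t)}_k}}, \qquad L^{(t)} = -\log p^{(t)}_{\text{target}_t}$$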
How should the computed scores inside f change to decrease the loss? We'll need to derive a gradient to figure that out. Since all output units contribute to the error of each hidden unit, we sum up all the gradients calculated at each time step in the sequence and use that sum to update the parameters. So our parameter gradients become:
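For each parameter the gradient is just the sum of the per-time-step gradients, which is what the += accumulation in the backward pass below implements:

$$\frac{\partial L}{\partial W} \;=\; \sum_{t=1}^{\tau} \frac{\partial L^{(t)}}{\partial W} \qquad \text{for } W \in \{W_{xh},\, W_{hh},\, W_{hy},\, b_h,\, b_y\}$$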
This is the first gradient of our loss. We'll backpropagate it via the chain rule.
The chain rule is a method for finding the derivative of composite functions, i.e. functions that are made by composing two or more functions.
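Applied to our softmax + cross-entropy loss, the chain rule gives a famously simple first gradient with respect to the unnormalized scores y(t):

$$\frac{\partial L^{(t)}}{\partial y^{(t)}_k} \;=\; p^{(t)}_k - \mathbb{1}\big[k = \text{target}_t\big]$$

This is exactly the dy = np.copy(ps[t]); dy[targets[t]] -= 1 step in the code below.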
In [14]:
def lossFun(inputs, targets, hprev):
  """
  inputs, targets are both lists of integers.
  hprev is a Hx1 array holding the initial hidden state.
  Returns the loss, the gradients on the model parameters, and the last hidden state.
  """
  # Store our inputs, hidden states, outputs and probability values.
  # Each of these is a dict with one entry per time step (seq_length = 25 entries):
  #   xs stores the one-hot encoded input char for each time step (vocab_size x 1 each)
  #   hs stores the hidden state output for each time step (hidden_size x 1 each),
  #      plus a -1 indexed initial state used to calculate the hidden state at t = 0
  #   ys stores the unnormalized output scores (log probabilities) for each time step
  #   ps takes the ys and converts them to normalized probabilities for the chars
  # We use dicts rather than lists because we need an entry at index -1 for the
  # initial hidden state (a -1 list index would wrap around to the final element).
  xs, hs, ys, ps = {}, {}, {}, {}
  # Init with the previous hidden state. np.copy creates a whole separate copy
  # rather than a reference: we don't want hs[-1] to change if hprev is changed.
  hs[-1] = np.copy(hprev)
  # init loss as 0
  loss = 0
  # forward pass
  for t in xrange(len(inputs)):
    xs[t] = np.zeros((vocab_size, 1)) # encode in 1-of-k representation
    xs[t][inputs[t]] = 1              # set a 1 at the position given by the integer in "inputs"
    hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
    ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
    ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
    loss += -np.log(ps[t][targets[t], 0]) # softmax (cross-entropy loss)
  # backward pass: compute gradients going backwards
  # initialize matrices for the gradient values of each set of weights
  dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
  dbh, dby = np.zeros_like(bh), np.zeros_like(by)
  dhnext = np.zeros_like(hs[0])
  for t in reversed(xrange(len(inputs))):
    # output probabilities
    dy = np.copy(ps[t])
    # derive our first gradient (softmax + cross-entropy)
    dy[targets[t]] -= 1 # backprop into y
    # Output gradient: output error times transposed hidden state.
    # Applying the transposed weight matrix can be thought of, intuitively,
    # as moving the error backward through the network, giving a measure of
    # the error at the output of the previous layer.
    dWhy += np.dot(dy, hs[t].T)
    # derivative of the output bias
    dby += dy
    # backpropagate!
    dh = np.dot(Why.T, dy) + dhnext # backprop into h
    dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
    dbh += dhraw # derivative of the hidden bias
    dWxh += np.dot(dhraw, xs[t].T) # derivative of the input-to-hidden weights
    dWhh += np.dot(dhraw, hs[t-1].T) # derivative of the hidden-to-hidden weights
    dhnext = np.dot(Whh.T, dhraw)
  for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
    np.clip(dparam, -5, 5, out=dparam) # clip to mitigate exploding gradients
  return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs)-1]
Create a sentence from the model

In [17]:
# prediction: one full forward pass
def sample(h, seed_ix, n):
  """
  sample a sequence of integers from the model.
  h is the memory state, seed_ix is the seed letter for the first time step,
  n is how many characters to predict.
  """
  # create an input vector and customize it for our seed char
  x = np.zeros((vocab_size, 1))
  x[seed_ix] = 1
  # list to store the generated char indexes
  ixes = []
  # for as many characters as we want to generate
  for t in xrange(n):
    # The hidden state at a given time step is a function of the input at the
    # same time step (modified by the input-to-hidden weight matrix), added to
    # the hidden state of the previous time step (modified by the
    # hidden-to-hidden weight matrix).
    h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)
    # compute the (unnormalised) output
    y = np.dot(Why, h) + by
    # probabilities for the next chars
    p = np.exp(y) / np.sum(np.exp(y))
    # sample an index from the probability distribution
    ix = np.random.choice(range(vocab_size), p=p.ravel())
    # create a new input vector and customize it for the predicted char
    x = np.zeros((vocab_size, 1))
    x[ix] = 1
    # add it to the list
    ixes.append(ix)
  txt = ''.join(ix_to_char[ix] for ix in ixes)
  print '----\n %s \n----' % (txt, )

hprev = np.zeros((hidden_size,1)) # reset RNN memory
# predict the 200 next characters given 'a'
sample(hprev, char_to_ix['a'], 200)

----
 HD')JsuLNgDw!4ejckLaA6zAt*(d'�5O;p5DoglJ!u-RN'v/:XaByA1Q"BWcO*Tip:AO/O%Wl/o)�Q1 (DC.�3ztLPh/9RJ' ?W"p hpXoCIQgN3!j0-@49z9Hit5Kc$mn'L8 :T.T;2.bKd(GS$7 ID8@?iitb2"OSubD7wJyh3@(:5hEmKX4T?i;;Ceq,BmY2yoPHx 
----
Training
This last part of the code is the main training loop:
Feed the network with a portion of the file (the size of each chunk is seq_length).
Use the loss function to do a forward pass (calculate the loss for a given input/target pair) and a backward pass (calculate all the gradients).
Print a sentence generated from a random seed, from time to time, using the current parameters of the network.
Update the model using the adaptive gradient technique, Adagrad.
Feed the loss function with inputs and targets
We create two arrays of chars from the data file; the targets array is shifted by one compared to the inputs array. For each char in the inputs array, the targets array gives the char that follows.

In [15]:
p = 0
inputs = [char_to_ix[ch] for ch in data[p:p+seq_length]]
print "inputs", inputs
targets = [char_to_ix[ch] for ch in data[p+1:p+seq_length+1]]
print "targets", targets

inputs [43, 69, 58, 4, 66, 68, 73, 69, 62, 69, 60, 13, 4, 76, 63, 58, 69, 4, 35, 73, 58, 60, 68, 73, 4]
targets [69, 58, 4, 66, 68, 73, 69, 62, 69, 60, 13, 4, 76, 63, 58, 69, 4, 35, 73, 58, 60, 68, 73, 4, 47]
Adagrad to update the parameters
This is a type of gradient descent strategy, where the step size plays the role of the learning rate.
The simplest way to update the parameters of the model would be:
param += -step_size * dparam
Adagrad is a more efficient technique where the step sizes get smaller during the training. It uses a memory variable that grows over time:
mem += dparam * dparam
and uses it to scale the step size:
step_size = learning_rate / np.sqrt(mem + 1e-8)
In short:
mem += dparam * dparam
param += -learning_rate * dparam / np.sqrt(mem + 1e-8) # adagrad update
Smooth_loss
smooth_loss doesn't play any role in the training. It is just a low-pass filtered version of the loss:
smooth_loss = smooth_loss * 0.999 + loss * 0.001
It is a way to average the loss over the last iterations to better track the progress.
So finally
Here is the main loop that does both the training and, from time to time, generates some text:
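A sketch of that loop, assembled from the functions and update rules defined above (the Adagrad memory names mWxh, mWhh, ... and the print interval of 1000 iterations are illustrative choices):

n, p = 0, 0
# memory variables for Adagrad, one per parameter matrix
mWxh, mWhh, mWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
mbh, mby = np.zeros_like(bh), np.zeros_like(by)
smooth_loss = -np.log(1.0/vocab_size) * seq_length  # loss at iteration 0

while True:
    # reset the hidden state and go back to the start of the data if needed
    if p + seq_length + 1 >= len(data) or n == 0:
        hprev = np.zeros((hidden_size, 1))  # reset RNN memory
        p = 0
    # build the input/target index lists (targets are shifted by one char)
    inputs = [char_to_ix[ch] for ch in data[p:p+seq_length]]
    targets = [char_to_ix[ch] for ch in data[p+1:p+seq_length+1]]

    # print a sample from the model now and then to track progress
    if n % 1000 == 0:
        sample(hprev, inputs[0], 200)

    # forward + backward pass through the loss function
    loss, dWxh, dWhh, dWhy, dbh, dby, hprev = lossFun(inputs, targets, hprev)
    smooth_loss = smooth_loss * 0.999 + loss * 0.001
    if n % 1000 == 0:
        print 'iter %d, loss: %f' % (n, smooth_loss)

    # Adagrad parameter update, applied to every set of weights and biases
    for param, dparam, mem in zip([Wxh, Whh, Why, bh, by],
                                  [dWxh, dWhh, dWhy, dbh, dby],
                                  [mWxh, mWhh, mWhy, mbh, mby]):
        mem += dparam * dparam
        param += -learning_rate * dparam / np.sqrt(mem + 1e-8)

    p += seq_length  # move the data pointer
    n += 1           # iteration counter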