DEEP LEARNING FOR NLP - TIPS AND TECHNIQUES Dr. Moloy De
NLP The goal of NLP (Natural Language Processing) is to design algorithms that allow computers to "understand" natural language in order to perform some task.
1
NLP Tasks
1. Spell Checking (Easy)
2. Keyword Search (Easy)
3. Finding Synonyms (Easy)
4. Parsing information from websites, documents, etc. (Medium)
5. Machine Translation (e.g. translate Chinese text to English) (Hard)
6. Semantic Analysis (What is the meaning of a query statement?) (Hard)
7. Coreference (e.g. What does 'he' or 'it' refer to in a given document?) (Hard)
8. Question Answering (e.g. Answering Jeopardy questions) (Hard)
2
Input to NLP Model The first and arguably most important common denominator across all NLP tasks is how we represent words as input to any and all of our models. Much of the earlier NLP work that we will not cover treats words as atomic symbols. To perform well on most NLP tasks we first need to have some notion of similarity and difference between words. With word vectors, we can quite easily encode this ability in the vectors themselves (using distance measures such as Jaccard, Cosine, Euclidean, etc). 3
Basic Word Vector There are roughly 13 million words (tokens) in English. Let us consider a vector space of dimension |V|, with |V| = 13 million, where the i-th word in the sorted vocabulary is represented by the unit vector (a vector having one entry equal to '1' and all other entries equal to '0') that has '1' in the i-th place. Each such vector is also called the one-hot vector of that word. This is just to show the feasibility of representing the collection of words as a vector space. However, the dimension of the vector space is too large to work with. 4
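A minimal NumPy sketch of such one-hot vectors, using a tiny illustrative vocabulary rather than the full 13-million-word list:

```python
import numpy as np

# Toy vocabulary; the real English vocabulary would have roughly 13 million entries.
vocab = sorted(["cat", "dog", "mat", "sat", "the"])
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the |V|-dimensional unit vector with a 1 in the word's position."""
    v = np.zeros(len(vocab))
    v[word_to_index[word]] = 1.0
    return v

print(one_hot("dog"))   # [0. 1. 0. 0. 0.] for this toy vocabulary
```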
Word-Word Co-occurrence Matrix Consider a big enough collection of English documents and the collection of lines in all those documents. Consider the sorted list of all distinct English words appearing in those documents, which will again be of size roughly 13 million (|V|). The Word-Word Co-occurrence Matrix X of size |V| × |V| has the (i, j)-th entry defined as x_{i,j} = count of lines containing both the i-th and the j-th words. We will use SVD (Singular Value Decomposition) to reduce the dimension of the X matrix. 5
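The following sketch builds such a co-occurrence matrix for a handful of made-up lines; the line contents and vocabulary are purely illustrative:

```python
import numpy as np

lines = ["the cat sat on the mat", "the dog sat", "a dog and a cat"]
vocab = sorted({w for line in lines for w in line.split()})
idx = {w: i for i, w in enumerate(vocab)}

# X[i, j] = number of lines containing both the i-th and the j-th word.
X = np.zeros((len(vocab), len(vocab)), dtype=int)
for line in lines:
    words = set(line.split())          # distinct words in this line
    for wi in words:
        for wj in words:
            X[idx[wi], idx[wj]] += 1

print(X[idx["cat"], idx["dog"]])       # 1: only "a dog and a cat" contains both
```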
Singular Value Decomposition The singular value decomposition of an m × n matrix X is a factorization of the form X = P Σ Q^T where P is an m × m orthogonal matrix (P P^T = P^T P = I), Σ is an m × n rectangular diagonal matrix with nonnegative real numbers sorted in decreasing order on the diagonal, and Q is an n × n orthogonal matrix.
6
SVD Performed on X Observe the singular values (the diagonal entries of the resulting Σ matrix) and cut them off at some index k based on the desired percentage of variance captured:

(Σ_{i=1}^{k} σ_{ii}) / (Σ_{i=1}^{|V|} σ_{ii})
We then only consider the updated matrix X obtained by remultiplying the correspondingly truncated factors of P, Σ and Q to be our word embedding matrix. This gives us a k-dimensional (much smaller than |V|) representation of every word in the vocabulary. 7
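A rough NumPy sketch of this truncation, using a random stand-in for X; here the rows of P_k Σ_k are taken directly as the k-dimensional word vectors, which is one common way of reading off the embedding, and the 90% variance threshold is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 100))               # stand-in for the |V| x |V| co-occurrence matrix

P, sigma, Qt = np.linalg.svd(X, full_matrices=False)

variance_captured = np.cumsum(sigma) / np.sum(sigma)
k = int(np.searchsorted(variance_captured, 0.90)) + 1   # smallest k capturing 90%

word_vectors = P[:, :k] * sigma[:k]      # one k-dimensional vector per word
print(k, word_vectors.shape)
```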
Pros and Cons This method gives us word vectors that are more than sufficient to encode semantic and syntactic (part of speech) information, but it is associated with several problems: the matrix dimensions change whenever new words are added, the matrix is extremely sparse and very high dimensional, and performing the SVD is computationally expensive. Iteration based methods solve many of these issues in a far more elegant manner.
8
Iteration Based Methods Let us step back and try a new approach. Instead of computing and storing global information about some huge dataset (which might be billions of sentences), we can try to create a model that will be able to learn one iteration at a time and eventually be able to encode the probability of a word given its context.
9
Language Models
Unigram Model:

P(w^{(1)}, w^{(2)}, · · · , w^{(n)}) = ∏_{i=1}^{n} P(w^{(i)})

Bigram Model:

P(w^{(1)}, w^{(2)}, · · · , w^{(n)}) = ∏_{i=2}^{n} P(w^{(i)} | w^{(i−1)})

General Model for context size C:

P(w^{(1)}, w^{(2)}, · · · , w^{(n)}) = ∏_{i} P(w^{(i)} | w^{(i−C)}, · · · , w^{(i−1)}, w^{(i+1)}, · · · , w^{(i+C)})

10
Softmax Function The softmax function, or normalized exponential, is a generalization of the logistic function that 'squashes' a K-dimensional vector z of arbitrary real values to a K-dimensional vector σ(z) of real values in the range (0, 1). The function is given by

σ(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k},   1 ≤ j ≤ K
Since the components of the vector σ(z) sum to one and are all strictly between zero and one, they represent a categorical probability distribution. 11
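A small, numerically stable NumPy version of the softmax function (subtracting max(z) leaves the result unchanged):

```python
import numpy as np

def softmax(z):
    """sigma(z)_j = exp(z_j) / sum_k exp(z_k), computed stably."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p, p.sum())    # components in (0, 1), summing to 1
```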
Gradient Descent Gradient descent is based on the observation that if the multivariable function F(x) is defined and differentiable in a neighborhood of a point a, then F(x) decreases fastest if one goes from a in the direction of the negative gradient of F at a, which is −∇F(a). So we update

b = a − γ ∇F(a)

for a small enough step size γ > 0.
12
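A minimal sketch of the update rule on a one-dimensional toy objective; the step size and iteration count are arbitrary choices:

```python
import numpy as np

def gradient_descent(grad_F, a, step=0.1, iters=100):
    """Repeatedly move against the gradient: a <- a - step * grad_F(a)."""
    for _ in range(iters):
        a = a - step * grad_F(a)
    return a

# Example: F(x) = (x - 3)^2 has gradient 2 * (x - 3) and minimum at x = 3.
print(gradient_descent(lambda x: 2.0 * (x - 3.0), a=0.0))   # approaches 3.0
```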
Sigmoid Function A sigmoid function is a mathematical function having an 'S' shape (sigmoid curve). Often, sigmoid function refers to the special case of the logistic function defined by the formula

σ(t) = 1 / (1 + e^{−t})

Clearly σ'(t) = σ(t) (1 − σ(t))
13
Continuous Bag of Words Model (CBOW) The CBOW Model is used to predict a center word from the surrounding context, i.e. given the context (w^{(i−C)}, · · · , w^{(i−1)}, w^{(i+1)}, · · · , w^{(i+C)}), to find the word w^{(i)} that maximises the probability

P(w^{(i)} | w^{(i−C)}, · · · , w^{(i−1)}, w^{(i+1)}, · · · , w^{(i+C)})
14
Skip-Gram Model The Skip-Gram Model is used to predict the surrounding words from the center word, i.e. given the center word w^{(i)}, to find the words (w^{(i−C)}, · · · , w^{(i−1)}, w^{(i+1)}, · · · , w^{(i+C)}) that maximise the probability

P(w^{(i−C)}, · · · , w^{(i−1)}, w^{(i+1)}, · · · , w^{(i+C)} | w^{(i)})
15
Negative Sampling Consider a pair (w, c) of word and context. Denote by P(D = 1 | w, c) the probability that (w, c) came from the corpus data; then P(D = 0 | w, c) is the probability that (w, c) did not come from the corpus data. Let us model P(D = 1 | w, c) with the sigmoid function as

P(D = 1 | w, c) = 1 / (1 + e^{−v_w · v_c})
16
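A sketch of this probability with randomly initialised word and context vectors (in practice v_w and v_c would be learned):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
v_w = rng.normal(size=50)                # word vector (illustrative, untrained)
v_c = rng.normal(size=50)                # context vector (illustrative, untrained)

p_from_corpus = sigmoid(v_w @ v_c)       # P(D = 1 | w, c)
p_not_from_corpus = 1.0 - p_from_corpus  # P(D = 0 | w, c)
print(p_from_corpus, p_not_from_corpus)
```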
Intrinsic Evaluation (1) Evaluation on a specific, intermediate task (2) Fast to compute performance (3) Helps understand subsystem (4) Needs positive correlation with real task to determine usefulness
17
Word Vector Analogies A popular choice for intrinsic evaluation of word vectors is their performance in completing word vector analogies. In a word vector analogy, we are given an incomplete analogy of the form:

a : b :: c : ?

The intrinsic evaluation system then identifies the word vector which maximizes the cosine similarity:

d = argmax_x (w_b − w_a + w_c)^T w_x / ||w_b − w_a + w_c||

18
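A sketch of the analogy completion, assuming `embeddings` is a dictionary mapping words to vectors; with the random vectors used here the answer is meaningless, but trained vectors would behave as in the analogy examples on the following slides:

```python
import numpy as np

def complete_analogy(wa, wb, wc, embeddings):
    """Return the word x maximising the cosine similarity with w_b - w_a + w_c."""
    target = embeddings[wb] - embeddings[wa] + embeddings[wc]
    target = target / np.linalg.norm(target)
    best_word, best_score = None, -np.inf
    for word, vec in embeddings.items():
        if word in (wa, wb, wc):                 # exclude the query words themselves
            continue
        score = target @ (vec / np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=10) for w in ["king", "queen", "man", "woman", "apple"]}
print(complete_analogy("man", "king", "woman", embeddings))
```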
Semantic Word Vector Analogies (the last term in each line is the result produced)
Chicago : Illinois :: Houston : Texas
Chicago : Illinois :: Philadelphia : Pennsylvania
Chicago : Illinois :: Phoenix : Arizona
Abuja : Nigeria :: Accra : Ghana
Abuja : Nigeria :: Algiers : Algeria
Abuja : Nigeria :: Amman : Jordan
19
Syntactic Word Vector Analogies (the last term in each line is the result produced)
bad : worst :: big : biggest
bad : worst :: bright : brightest
bad : worst :: cold : coldest
dancing : danced :: decreasing : decreased
dancing : danced :: describing : described
dancing : danced :: enhancing : enhanced
20
Extrinsic Evaluation (1) Evaluation on a real task (2) Can be slow to compute performance (3) Unclear whether the subsystem is the problem, other subsystems, or internal interactions (4) If replacing the subsystem improves performance, the change is likely good
21
Classification - An Extrinsic Task Most NLP extrinsic tasks can be formulated as classification tasks. For instance, given a sentence, we can classify the sentence to have positive, negative or neutral sentiment. In named-entity recognition (NER), given a context and a central word, we want to classify the central word to be one of many classes. For the input, ‘Jim bought 300 shares of Acme Corp. in 2006’, we would like a classified output ‘Jim [Person] bought 300 shares of Acme Corp. [Organization] in 2006 [Time].’ 22
Softmax Classification Let the word vector space V be of dimension d. Let us try to classify the words into C classes. Let y_c, 1 ≤ c ≤ C, denote the representatives (unit vectors) of the classes. Consider the matrix W of size C × d whose j-th row is y_j. Then the Softmax Classification Probability that the word vector x belongs to class j is given by

P(j | x) = e^{W_j · x} / Σ_c e^{W_c · x}

23
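A direct NumPy transcription of this probability, with a small made-up weight matrix; the max-subtraction is only for numerical stability:

```python
import numpy as np

def class_probabilities(W, x):
    """P(j | x) = exp(W_j . x) / sum_c exp(W_c . x) for a C x d weight matrix W."""
    scores = W @ x                  # one score per class
    scores = scores - scores.max()  # numerical stability; does not change the result
    e = np.exp(scores)
    return e / e.sum()

W = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])   # C = 3 classes, d = 2
x = np.array([0.2, 0.8])
print(class_probabilities(W, x))
```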
Cross-Entropy Loss Function The cross entropy for the distributions p and q over a given set is defined as follows:

H(p, q) = E_p[− log q] = H(p) + D_{KL}(p || q)

where H(p) is the entropy of p, and D_{KL}(p || q) is the Kullback-Leibler divergence of q from p. For discrete p and q this means

H(p, q) = − Σ_x p(x) log q(x)

The situation for continuous distributions is analogous:

H(p, q) = − ∫_X p(x) log q(x) dx

24
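A small sketch of the discrete case, with a one-hot 'true' distribution as used in the classification slides that follow:

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) log q(x); eps guards against log(0)."""
    p, q = np.asarray(p), np.asarray(q)
    return -np.sum(p * np.log(q + eps))

p = np.array([1.0, 0.0, 0.0])    # true (one-hot) distribution
q = np.array([0.7, 0.2, 0.1])    # predicted distribution
print(cross_entropy(p, q))       # -log(0.7), roughly 0.357
```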
Single Word Classification Using the cross-entropy loss function, we calculate the loss of such a training example as:

− Σ_{j=1}^{C} y_j log P(j | x) = − Σ_{j=1}^{C} y_j log ( e^{W_j · x} / Σ_c e^{W_c · x} )
25
Single Word Classification Of course, the above summation will be a sum over (C − 1) zero values since y_j is 1 only at a single index, implying that x belongs to only one correct class. Let us define k to be the index of the correct class. We can then simplify our loss to:

− log ( e^{W_k · x} / Σ_c e^{W_c · x} )
26
Single Word Classification We can then extend the above loss to a dataset of N points:

− Σ_{i=1}^{N} log ( e^{W_{k(i)} · x^{(i)}} / Σ_c e^{W_c · x^{(i)}} )

The only difference above is that k(i) is now a function that returns the correct class index for example x^{(i)}.
27
Single Word Classification Now the parameter θ to be trained or learned looks like

θ = (W_{·1}^T, W_{·2}^T, · · · , W_{·d}^T, x^{(1)T}, x^{(2)T}, · · · , x^{(|V|)T})^T

which is of length Cd + |V|d, where C denotes the number of classes, d denotes the length of a word vector and |V| denotes the size of the vocabulary.
28
Regularization To reduce the risk of overfitting, we introduce a regularization term which poses the Bayesian belief that the parameter θ should be small in magnitude, i.e. close to zero:

− Σ_{i=1}^{N} log ( e^{W_{k(i)} · x^{(i)}} / Σ_c e^{W_c · x^{(i)}} ) + λ Σ_{m=1}^{Cd+|V|d} θ_m^2
Minimizing the above cost function reduces the likelihood of the parameters taking on extremely large values just to fit the training set well and may improve generalization if the relative objective weight λ is tuned well. 29
Window Classification The meaning of a word depends very much on its context. Hence, instead of considering a single word, we consider a window of size two on both of its sides. So x now gets replaced by x^{(i)}_{window} where

x^{(i)}_{window} = (x^{(i−2)T}, x^{(i−1)T}, x^{(i)T}, x^{(i+1)T}, x^{(i+2)T})^T

and θ changes accordingly. We use four special tokens for the two positions before the beginning of a sentence and the two positions after its end. 30
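A sketch of building x_window for the first word of a short sentence; zero vectors stand in here for the special boundary tokens mentioned above:

```python
import numpy as np

def window_vector(word_vectors, i, pad, C=2):
    """Concatenate the vectors of x^(i-C), ..., x^(i+C), padding at sentence edges."""
    padded = [pad] * C + list(word_vectors) + [pad] * C
    return np.concatenate(padded[i:i + 2 * C + 1])

d = 4
rng = np.random.default_rng(0)
sentence = [rng.normal(size=d) for _ in range(3)]   # three word vectors
pad = np.zeros(d)                                   # stand-in for the boundary tokens

x_window = window_vector(sentence, i=0, pad=pad)
print(x_window.shape)                               # (5 * d,) = (20,)
```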
Non-linear Classifiers - Neural Networks A neuron is a generic computational unit that takes n inputs and produces a single output. What differentiates the outputs of different neurons is their parameters, also referred to as their weights. One of the most popular choices for neurons is the 'sigmoid' or 'binary logistic regression' unit. This unit takes an n-dimensional input vector x and produces the scalar activation output a. This neuron is also associated with an n-dimensional weight vector, w, and a bias scalar, b. The output of this neuron is then:

a = 1 / (1 + e^{−(w^T x + b)})

31
A Single Layer of Neurons We extend the idea above to multiple neurons by considering the case where the input x_{n×1} is fed as an input to multiple such neurons. If we refer to the different neurons' weights as W_{m×n} and the biases as b_{m×1}, we can say the respective activations are a_{m×1}:

a = σ(z) = σ(W x + b)
32
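A sketch of one such layer with made-up sizes and random parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m, n = 3, 5                        # m neurons, n-dimensional input
rng = np.random.default_rng(0)
W = rng.normal(size=(m, n))        # one weight row per neuron
b = rng.normal(size=m)             # one bias per neuron
x = rng.normal(size=n)

a = sigmoid(W @ x + b)             # a = sigma(W x + b), shape (m,)
print(a)
```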
Scoring Non-linear decisions often cannot be captured by inputs fed directly to a Softmax function but instead require the scoring of an intermediate layer. We can thus use another matrix U_{m×1} to generate an unnormalized score for a classification task from the activations:

s = U^T a = U^T σ(W x + b)
33
Maximum Margin Objective Function We will discuss a popular error metric known as the Maximum Margin Objective. The idea behind using this objective is to ensure that the score computed for 'true' labeled data points is higher than the score computed for 'false' labeled data points. We want to minimise

J = max(1 − s + s_c, 0)

where s = U^T σ(W x + b) is the score of the true window and s_c = U^T σ(W x_c + b) is the score of the corrupted window. Clearly, whenever the loss is positive,

∂J/∂s = −∂J/∂s_c = −1

34
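A minimal sketch of the objective and its behaviour on either side of the margin, with the scores supplied directly:

```python
def max_margin_loss(s, s_c):
    """J = max(1 - s + s_c, 0): penalise when the true window does not beat
    the corrupted window by a margin of at least 1."""
    return max(1.0 - s + s_c, 0.0)

print(max_margin_loss(s=2.0, s_c=0.5))   # 0.0 -> margin satisfied, no gradient
print(max_margin_loss(s=0.4, s_c=0.3))   # 0.9 -> dJ/ds = -1, dJ/ds_c = +1
```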
Regularization Like most classifiers, neural networks are also prone to overfitting, which causes their validation and test performances to be suboptimal. We can implement regularization such that the loss with regularization, J_R, is calculated to be:

J_R = J + λ Σ_i ||W^{(i)}||
35
Gradient Descent Since we typically update parameters using gradient descent, or a variant such as SGD, we need the gradient information for any parameter, as required in the update equation:

θ^{(t+1)} = θ^{(t)} − α ∇_{θ^{(t)}} J

where α is the learning rate, and we continue till θ^{(t)} stabilises.
36
Multilayer Neural Network x is an input to the neural network and s is its output. The j-th neuron of layer k receives the input z_j^{(k)} and produces the scalar activation output a_j^{(k)} using the weight matrix W^{(k)}:

a^{(k)} = σ(z^{(k)}) = σ(W^{(k)} a^{(k−1)} + b^{(k)})

W^{(k)} is the transfer matrix that maps the output from the k-th layer to the input of the (k + 1)-th. 37
Backpropagation So the parameters in the model are (W^{(k)}, b^{(k)}), k = 1, 2, · · · Now,

∇_{W^{(k)}} = δ^{(k+1)} a^{(k)T}

where the Error Vector δ^{(k)} is given by

δ^{(k)} = σ'(z^{(k)}) ◦ (W^{(k)T} δ^{(k+1)})

where ◦ denotes the elementwise product. And,

∇_{b^{(k)}} = δ^{(k+1)}

38
Sigmoid Neuron Unit This is the default choice and the activation is given by

σ(z) = 1 / (1 + e^{−z})

where σ(z) ∈ (0, 1) and σ'(z) = σ(z)(1 − σ(z))
39
Tanh Neuron Unit The tanh function is an alternative to the sigmoid function that is often found to converge faster in practice.

tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z}) = 2σ(2z) − 1

where tanh(z) ∈ (−1, 1) and tanh'(z) = 1 − tanh^2(z)
40
Hard Tanh Neuron Unit The hard tanh function is sometimes preferred over the tanh function since it is computationally cheaper. It does however saturate for magnitudes of z greater than 1.

hardtanh(z) = −1 if z < −1;  z if −1 ≤ z ≤ 1;  1 if z > 1

where hardtanh(z) ∈ [−1, 1] and

hardtanh'(z) = 1 if −1 ≤ z ≤ 1;  0 otherwise
41
Soft Sign Neuron Unit The soft sign function is another nonlinearity which can be considered an alternative to tanh since it too does not saturate as easily as hard-clipped functions.

softsign(z) = z / (1 + |z|)

where softsign(z) ∈ (−1, 1) and

softsign'(z) = 1 / (1 + |z|)^2
42
Rectified Linear Neuron Unit The ReLU (Rectified Linear Unit) function is a popular choice of activation since it does not saturate even for larger values of z and has found much success in computer vision applications.

rect(z) = max(0, z)

and

rect'(z) = 1 if z > 0;  0 otherwise
43
Leaky Rectified Linear Neuron Unit Traditional ReLU units by design do not propagate any error for non-positive z; the leaky ReLU modifies this such that a small error is allowed to propagate backwards even when z is negative.

leaky(z) = max(z, k·z) where 0 < k < 1

and

leaky'(z) = 1 if z > 0;  k otherwise
44
AdaGrad Training Rate AdaGrad is an implementation of standard stochastic gradient descent (SGD) with one key difference: the learning rate can vary for each parameter. The learning rate for each parameter depends on the history of gradient updates of that parameter, in such a way that parameters with a scarce history of updates are updated faster using a larger learning rate. In other words, parameters that have not been updated much in the past are likelier to have higher learning rates now.

θ^{(t)} = θ^{(t−1)} − ∇_θ J^{(t)} / √( Σ_{τ=1}^{t} (∇_θ J^{(τ)})^2 )

In this technique, we see that if the RMS of the history of gradients is extremely low, the learning rate is very high. 45
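A sketch of the per-parameter scaling on a toy two-parameter objective; the global base rate and the small eps term are added here for usability and are not part of the formula above:

```python
import numpy as np

def adagrad(grad_J, theta, base_lr=0.1, iters=200, eps=1e-8):
    """Scale each parameter's step by the root of its summed squared gradients."""
    hist = np.zeros_like(theta)
    for _ in range(iters):
        g = grad_J(theta)
        hist += g ** 2
        theta = theta - base_lr * g / (np.sqrt(hist) + eps)
    return theta

# Toy objective J(theta) = theta_0^2 + 10 * theta_1^2, minimised at (0, 0).
grad = lambda t: np.array([2.0 * t[0], 20.0 * t[1]])
print(adagrad(grad, np.array([1.0, 1.0])))
```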
Language Models Language models compute the probability of occurrence of a number of words in a particular sequence. The probability of a sequence of T words {w_1, w_2, · · · , w_T} is denoted as P(w_1, w_2, · · · , w_T). Since the number of words coming before a word, w_i, varies depending on its location in the input document, P(w_1, w_2, · · · , w_T) is usually conditioned on a window of n previous words rather than all previous words:

P(w_1, w_2, · · · , w_T) = ∏_{i=1}^{T} P(w_i | w_1, w_2, · · · , w_{i−1}) ≈ ∏_{i=1}^{T} P(w_i | w_{i−(n−1)}, · · · , w_{i−1})

46
n-gram Models
Bigram Model:

P(w_2 | w_1) = count(w_1, w_2) / count(w_1)

Trigram Model:

P(w_3 | w_1, w_2) = count(w_1, w_2, w_3) / count(w_1, w_2)

These models are not practical as the memory requirement explodes with n. 47
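A tiny counting sketch of the bigram estimate on a made-up corpus:

```python
from collections import Counter

# Bigram MLE: P(w2 | w1) = count(w1, w2) / count(w1).
corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1, w2):
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigram_prob("the", "cat"))   # 2 occurrences of "the cat" / 3 of "the"
```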
Recurrent Neural Networks (RNN) Unlike the conventional language models, where only a finite window of previous words would be considered for conditioning the language model, Recurrent Neural Networks (RNN) are capable of conditioning the model on all previous words in the corpus.
48
Recurrent Neural Networks (RNN) Below we introduce the RNN architecture, where the hidden layer at time-step t holds a number of neurons, each of which performs a linear matrix operation on its inputs followed by a non-linear operation (e.g. tanh()). At each time-step, the output of the previous step, h_{t−1}, along with the next word vector in the document, x_t, are inputs to the hidden layer, which produces the output features h_t and a prediction output ŷ_t:

h_t = σ(W^{(hh)} h_{t−1} + W^{(hx)} x_t)

ŷ_t = softmax(W^{(S)} h_t)

49
Recurrent Neural Networks (RNN)
x_1, · · · , x_{t−1}, x_t, x_{t+1}, · · · , x_T : the word vectors corresponding to a corpus with T words.
h_t = σ(W^{(hh)} h_{t−1} + W^{(hx)} x_t): the relationship used to compute the hidden layer output features at each time-step t.
x_t ∈ R^d: input word vector at time t.
W^{(hx)} ∈ R^{D_h × d}: weights matrix used to condition the input word vector x_t.
W^{(hh)} ∈ R^{D_h × D_h}: weights matrix used to condition the output of the previous time-step, h_{t−1}.
50
Recurrent Neural Networks (RNN)
h_{t−1} ∈ R^{D_h}: output of the non-linear function at the previous time-step, t − 1. h_0 ∈ R^{D_h} is an initialization vector for the hidden layer at time-step t = 0.
σ(): the non-linearity function (sigmoid here).
ŷ_t = softmax(W^{(S)} h_t): the output probability distribution over the vocabulary at each time-step t. Essentially, ŷ_t is the next predicted word given the document context score so far (i.e. h_{t−1}) and the last observed word vector x_t. Here, W^{(S)} ∈ R^{|V| × D_h} and ŷ_t ∈ R^{|V|}, where V is the vocabulary.
51
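A single forward time-step with made-up dimensions and random weights, following the two equations above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d, Dh, V = 10, 8, 20                        # word-vector size, hidden size, vocabulary size
rng = np.random.default_rng(0)
W_hh = rng.normal(size=(Dh, Dh))
W_hx = rng.normal(size=(Dh, d))
W_S = rng.normal(size=(V, Dh))

h_prev = np.zeros(Dh)                       # h_0, the initial hidden state
x_t = rng.normal(size=d)                    # current input word vector
h_t = sigmoid(W_hh @ h_prev + W_hx @ x_t)   # hidden features at time t
y_hat = softmax(W_S @ h_t)                  # distribution over the vocabulary
print(y_hat.shape, y_hat.sum())
```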
Recurrent Neural Networks (RNN) The loss function used in RNNs is often the cross-entropy error, computed as a sum over the entire vocabulary at time-step t:

J^{(t)}(θ) = − Σ_{j=1}^{|V|} y_{t,j} log(ŷ_{t,j})

The cross-entropy error over a corpus of size T is:

J = Σ_{t=1}^{T} J^{(t)}(θ) = − Σ_{t=1}^{T} Σ_{j=1}^{|V|} y_{t,j} log(ŷ_{t,j})
52
Recurrent Neural Networks (RNN) Perplexity is a measure of confusion, where lower values imply more confidence in predicting the next word in the sequence compared to the ground truth outcome.

Perplexity = 2^J
53
Deep Bidirectional RNNs It is possible to make predictions based on future words by having the RNN model read through the corpus backwards. Irsoy et al. show a bi-directional deep neural network; at each time-step t, this network maintains two hidden layers, one for the left-to-right propagation and another for the right-to-left propagation. To maintain two hidden layers at any time, this network consumes twice as much memory space for its weight and bias parameters. The final classification result, ŷ_t, is generated by combining the score results produced by both RNN hidden layers. 54
RNN Language Translation Model Traditional translation models are quite complex; they consist of numerous machine learning algorithms applied to different stages of the language translation pipeline. RNNs can be adopted as a replacement for traditional translation modules.

Encoder stage: h_t = φ(h_{t−1}, x_t) = f(W^{(hh)} h_{t−1} + W^{(hx)} x_t)

Decoder stage: h_t = φ(h_{t−1}) = f(W^{(hh)} h_{t−1}),  y_t = softmax(W^{(S)} h_t)

55
RNN Language Translation Model One may naively assume that this RNN model, along with the cross-entropy loss function, can produce high-accuracy translation results when trained to maximise

(1/N) Σ_{n=1}^{N} log p_θ(y^{(n)} | x^{(n)})

over θ.
56
RNN Language Translation Model Extension I: Train different RNN weights for encoding and decoding. This decouples the two units and allows for more accurate predictions from each of the two RNN modules.
57
RNN Language Translation Model Extension II: Compute every hidden state in the decoder using three different inputs: (1) the previous hidden state (standard), (2) the last hidden layer of the encoder (c = h_T), and (3) the previous predicted output word, ŷ_{t−1}. Combining the above three inputs transforms the φ function in the decoder to the one in the equation below:

h_t = φ(h_{t−1}, c, ŷ_{t−1})
58
RNN Language Translation Model Extension III: Train deep recurrent neural networks using multiple RNN layers as discussed earlier. Deeper layers often improve prediction accuracy due to their higher learning capacity. Of course, this implies that a large training corpus must be used to train the model.
59
RNN Language Translation Model Extension IV: Train bi-directional encoders to improve accuracy similar to what was discussed earlier.
60
RNN Language Translation Model Extension V: Given a word sequence ABC in German whose translation is XY in English, instead of training the RNN using ABC → XY, train it using CBA → XY. The intuition behind this technique is that A is more likely to be translated to X. Thus, given the vanishing gradient problem discussed earlier, reversing the order of the input words can help reduce the error rate in generating the output phrase.
61
Gated Recurrent Units (GRU) Beyond the extensions discussed so far, RNNs have been found to perform better with the use of more complex units for activation. So far, we have discussed methods that transition from hidden state h(t−1) to h(t) using an affine transformation and a point-wise nonlinearity.
62
Gated Recurrent Units (GRU) Here, we discuss the use of a gated activation function thereby modifying the RNN architecture. Although RNNs can theoretically capture long-term dependencies, they are very hard to actually train to do this. Gated recurrent units are designed in a manner to have more persistent memory thereby making it easier for RNNs to capture long-term dependencies.
63
Gated Recurrent Units (GRU) Mathematically, this is how a GRU uses h^{(t−1)} and x^{(t)} to generate the next hidden state h^{(t)}:

(Update Gate)  z^{(t)} = σ(W^{(z)} x^{(t)} + U^{(z)} h^{(t−1)})
(Reset Gate)   r^{(t)} = σ(W^{(r)} x^{(t)} + U^{(r)} h^{(t−1)})
(New Memory)   h̃^{(t)} = tanh(r^{(t)} ◦ U h^{(t−1)} + W x^{(t)})
(Hidden State) h^{(t)} = (1 − z^{(t)}) ◦ h̃^{(t)} + z^{(t)} ◦ h^{(t−1)}

64
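A direct sketch of one GRU step following these four equations, with random weights and made-up sizes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    z = sigmoid(Wz @ x_t + Uz @ h_prev)               # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)               # reset gate
    h_tilde = np.tanh(r * (U @ h_prev) + W @ x_t)     # new memory
    return (1.0 - z) * h_tilde + z * h_prev           # hidden state

d, Dh = 6, 4
rng = np.random.default_rng(0)
Wz, Wr, W = (rng.normal(size=(Dh, d)) for _ in range(3))
Uz, Ur, U = (rng.normal(size=(Dh, Dh)) for _ in range(3))

h_t = gru_step(rng.normal(size=d), np.zeros(Dh), Wz, Uz, Wr, Ur, W, U)
print(h_t.shape)
```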
Gated Recurrent Units (GRU) New memory generation: A new memory h̃^{(t)} is the consolidation of a new input word x^{(t)} with the past hidden state h^{(t−1)}. Anthropomorphically, this stage is the one that knows the recipe for combining a newly observed word with the past hidden state h^{(t−1)}, summarizing the new word in light of the contextual past as the vector h̃^{(t)}.
65
Gated Recurrent Units (GRU) Reset Gate: The reset signal r^{(t)} is responsible for determining how important h^{(t−1)} is to the summarization h̃^{(t)}. The reset gate has the ability to completely diminish the past hidden state if it finds that h^{(t−1)} is irrelevant to the computation of the new memory.
66
Gated Recurrent Units (GRU) Update Gate: The update signal z^{(t)} is responsible for determining how much of h^{(t−1)} should be carried forward to the next state. For instance, if z^{(t)} ≈ 1, then h^{(t−1)} is almost entirely copied out to h^{(t)}. Conversely, if z^{(t)} ≈ 0, then mostly the new memory h̃^{(t)} is forwarded to the next hidden state.
67
Gated Recurrent Units (GRU) Hidden state: The hidden state h^{(t)} is finally generated using the past hidden input h^{(t−1)} and the newly generated memory h̃^{(t)}, with the advice of the update gate.
68
Long Short-Term Memories (LSTM) Long Short-Term Memories are another type of complex activation unit that differ a little from GRUs. The motivation for using them is similar to that for GRUs; however, the architecture of such units does differ.
69
Long Short-Term Memories (LSTM)
(Input Gate)            i^{(t)} = σ(W^{(i)} x^{(t)} + U^{(i)} h^{(t−1)})
(Forget Gate)           f^{(t)} = σ(W^{(f)} x^{(t)} + U^{(f)} h^{(t−1)})
(Output/Exposure Gate)  o^{(t)} = σ(W^{(o)} x^{(t)} + U^{(o)} h^{(t−1)})
70
Long Short-Term Memories (LSTM)
(New Memory Cell)    c̃^{(t)} = tanh(W^{(c)} x^{(t)} + U^{(c)} h^{(t−1)})
(Final Memory Cell)  c^{(t)} = f^{(t)} ◦ c^{(t−1)} + i^{(t)} ◦ c̃^{(t)}
(Hidden State)       h^{(t)} = o^{(t)} ◦ tanh(c^{(t)})
71
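A matching sketch of one LSTM step combining the gate and memory equations above (with the hidden state taken as o^{(t)} ◦ tanh(c^{(t)})):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wi, Ui, Wf, Uf, Wo, Uo, Wc, Uc):
    i = sigmoid(Wi @ x_t + Ui @ h_prev)        # input gate
    f = sigmoid(Wf @ x_t + Uf @ h_prev)        # forget gate
    o = sigmoid(Wo @ x_t + Uo @ h_prev)        # output/exposure gate
    c_tilde = np.tanh(Wc @ x_t + Uc @ h_prev)  # new memory cell
    c = f * c_prev + i * c_tilde               # final memory cell
    h = o * np.tanh(c)                         # hidden state
    return h, c

d, Dh = 6, 4
rng = np.random.default_rng(0)
Wi, Wf, Wo, Wc = (rng.normal(size=(Dh, d)) for _ in range(4))
Ui, Uf, Uo, Uc = (rng.normal(size=(Dh, Dh)) for _ in range(4))

h, c = lstm_step(rng.normal(size=d), np.zeros(Dh), np.zeros(Dh),
                 Wi, Ui, Wf, Uf, Wo, Uo, Wc, Uc)
print(h.shape, c.shape)
```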
Long Short-Term Memories (LSTM) New memory generation: This stage is analogous to the new memory generation stage we saw in GRUs. We essentially use the input word x^{(t)} and the past hidden state h^{(t−1)} to generate a new memory c̃^{(t)} which includes aspects of the new word x^{(t)}.
72
Long Short-Term Memories (LSTM) Input Gate: We see that the new memory generation stage doesn't check whether the new word is even important before generating the new memory; this is exactly the input gate's function. The input gate uses the input word and the past hidden state to determine whether or not the input is worth preserving, and is thus used to gate the new memory. It produces i^{(t)} as an indicator of this information.
73
Long Short-Term Memories (LSTM) Forget Gate: This gate is similar to the input gate except that it does not make a determination of the usefulness of the input word; instead, it makes an assessment of whether the past memory cell is useful for the computation of the current memory cell. Thus, the forget gate looks at the input word and the past hidden state and produces f^{(t)}.
74
Long Short-Term Memories (LSTM) Final memory generation: This stage first takes the advice of the forget gate f^{(t)} and accordingly forgets the past memory c^{(t−1)}. Similarly, it takes the advice of the input gate i^{(t)} and accordingly gates the new memory c̃^{(t)}. It then sums these two results to produce the final memory c^{(t)}.
75
Long Short-Term Memories (LSTM) Output/Exposure Gate: This is a gate that does not explicitly exist in GRUs. Its purpose is to separate the final memory from the hidden state. The final memory c^{(t)} contains a lot of information that is not necessarily required to be saved in the hidden state. Hidden states are used in every single gate of an LSTM, and thus this gate makes the assessment regarding which parts of the memory c^{(t)} need to be exposed/present in the hidden state h^{(t)}. The signal it produces to indicate this is o^{(t)}, and it is used to gate the pointwise tanh of the memory. 76
Advanced Topics (1) Recursive Neural Networks (2) Convolutional Neural Networks
77
Reference Deep Learning for Natural Language Processing http://cs224d.stanford.edu/syllabus.html
78