Learning Gaussian Word Representations with Neural Networks Eduardo Alfredo Brito Chacón
Thesis submitted for the degree of Master of Science in Artificial Intelligence, option Engineering and Computer Science Thesis supervisor: Prof. dr. Marie-Francine Moens Assessor: Prof. dr. ir. Johan Suykens Mentor: Geert Heyman
Academic year 2015 – 2016
© Copyright KU Leuven
Without written permission of the thesis supervisor and the author it is forbidden to reproduce or adapt in any form or by any means any part of this publication. Requests for obtaining the right to reproduce or utilize parts of this publication should be addressed to the Departement Computerwetenschappen, Celestijnenlaan 200A bus 2402, B-3001 Heverlee, +32-16-327700 or by email
[email protected]. A written permission of the thesis supervisor is also required to use the methods, products, schematics and programs described in this work for industrial or commercial use, and for submitting this publication in scientific contests.
Preface

This thesis concludes my Master of Science in Artificial Intelligence at KU Leuven. Through the work on this thesis, I learned how to build and evaluate a neural network model designed for NLP-related tasks that was published in a recent research paper. Despite my initial lack of experience with neural networks, I could develop the network myself starting from the source code of a similar model. Thanks to this process, I could directly observe the importance of each component of the proposed model by analyzing the impact of each newly implemented feature. During this period, I also learned to cope with the unexpected difficulties that reproducing published research work may involve, especially when not all the necessary information is available. I would like to thank my daily supervisor Geert Heyman for always providing valuable feedback when I needed it to push this thesis forward (even by answering emails during the weekend before the submission deadline). I would also like to thank my thesis promotor Professor Sien Moens for making it possible for me to work for a year on such an interesting topic. And last but not least, I would like to express my gratitude to my family and friends for keeping my motivation high in a very intensive year, specifically to my parents who, remotely from Spain, never stopped supporting me during this time.

Eduardo Alfredo Brito Chacón
Contents

Preface                                            i
Abstract                                           iii
List of Figures                                    iv
List of Tables                                     v
List of Abbreviations and Symbols                  vi
1 Introduction                                     1
2 Background                                       3
  2.1 Statistical Language Modeling                3
  2.2 Neural Networks                              4
  2.3 The Continuous Skip-gram model               8
  2.4 Energy-based Learning for Word Embeddings    10
  2.5 Dimensionality Reduction for Visualization   10
  2.6 Related Work                                 12
3 Approach                                         13
  3.1 Learning Model                               13
  3.2 Implementation Details                       18
4 Experiments                                      21
  4.1 Training Corpus                              21
  4.2 Preprocessing                                21
  4.3 Subsampling of Frequent Words                22
  4.4 Word Similarity                              23
  4.5 Hyper-parameter Tuning                       24
  4.6 Specificity and Uncertainty                  26
  4.7 Visualization                                27
5 Results                                          29
  5.1 Similarity Evaluation                        29
  5.2 Specificity and Uncertainty                  30
  5.3 Visualization                                32
6 Conclusion                                       35
Bibliography                                       37
Abstract

Word embeddings map each word type to a point in a vector space, enabling operations on words as simple linear combinations of vectors. Thanks to recent neural network based models that can train them efficiently, they have been very successful in various NLP tasks in the past years due to the rich linguistic properties that these distributed word representations can capture. However, they cannot express any uncertainty about the represented words, nor any asymmetric relationship between words such as entailment. Hence, a model was proposed that learns multivariate Gaussian distributions instead of just vectors. We reproduce this model and evaluate the word similarity and word specificity that these alternative embeddings can achieve, obtaining less impressive results than those claimed by the authors, and at the cost of a significant increase in complexity compared to previous models. Additionally, we show how to visualize these word representations, together with all the details necessary to implement such a model, whose lack of extensive documentation complicates the work considerably.
List of Figures

1.1 Example of Gaussian embeddings from which entailment can be inferred (extracted from [46]).   2
2.1 Perceptron: ANN with one single neuron.   5
2.2 Multilayer perceptron with one hidden layer.   5
2.3 The skip-gram model architecture (extracted from [31]).   8
4.1 Instability of the quality of the embeddings trained on the validation set with respect to the hyper-parameter C. For each value of C, the other hyper-parameters were constant as presented in table 5.2. A higher correlation value shows a better performance in the similarity test.   25
5.1 Visualizations of the words “madrid”, “spain”, “paris” and “france” after dimensionality reduction with kernel PCA with RBF kernel parameter σ² = 50. The means of the words are at the left of each label. The plotted lines are level lines where the density function is valued 0.05.   32
5.2 Visualizations of the words “bach”, “composer”, “classical”, “baroque”, “famous”, “composer” and “man” after dimensionality reduction with kernel PCA with RBF kernel parameter σ² = 100. The means of the words are at the left of each label. The plotted lines are level lines where the density function is valued 0.021.   33
5.3 Visualizations of the words “baudouin”, “king”, “elizabeth” and “queen” after dimensionality reduction with kernel PCA with RBF kernel parameter σ² = 1. The means of the words are at the left of each label. The plotted lines are level lines where the density function is valued 0.15.   34
List of Tables

4.1 Similarity test results from wordvectors.org of embeddings trained on the validation set taking the original subsampling formula (equation 4.1) and the one implemented in word2vec (equation 4.2).   23
4.2 Hyper-parameters from our model with their respective default values in word2vec (when defined) and the values used in [46] for evaluation. The values marked with '*' were obtained after contacting the authors. The values between brackets were assumed to be the same as in the word2vec paper [33], in line with the evaluation description from [46].   24
4.3 Results of our similarity evaluation compared with the values published in [46] applying the hyper-parameters suggested by the authors with the two available energy functions (EL and KL) and two different learning rates η. A higher correlation means better performance in the similarity task.   26
5.1 Similarity results of our approach (Brito) compared with the values published in [46] for the skip-gram model with vector sizes 50 (SG-50) and 100 (SG-100), and with the Gaussian embeddings model. The learned word embeddings were trained using the shown hyper-parameters. The values of the other hyper-parameters are identical to those displayed in table 4.3.   30
5.2 Some words found within the 100 nearest words according to the cosine distance, sorted by their variance. The learned word embeddings were trained using the shown hyper-parameters. The values of the other hyper-parameters are identical to those displayed in table 4.3.   31
List of Abbreviations and Symbols

Abbreviations

ANN    Artificial Neural Network
EL     Expected Likelihood
KL     Kullback-Leibler
MLP    Multilayer Perceptron
n.a.   Not available
NLP    Natural Language Processing
PCA    Principal Component Analysis
RBF    Radial Basis Function
SGD    Stochastic Gradient Descent

Symbols

µi            Mean of the multivariate Gaussian distribution i
Σi            Covariance matrix of the multivariate Gaussian distribution i
Σ̂x            Estimated covariance matrix of the vectors x ∈ R^d
∂E/∂θ         Gradient of function E with respect to parameter θ
Aᵀ            Transpose of the matrix A
C             Regularization hyper-parameter: maximum ℓ2-norm of the mean vector
E             Energy function
DKL           KL divergence
LmL           Max-margin loss function parametrized with margin mL
M             Regularization hyper-parameter: maximum covariance
m             Regularization hyper-parameter: minimum covariance
N(x; µ, Σ)    Multivariate normal probability density function with mean µ and covariance matrix Σ evaluated at point x
Chapter 1
Introduction

“You shall know a word by the company it keeps” [16]. In line with this idea, the distributional hypothesis states that words that occur in similar contexts tend to have similar meanings [17]. Under this assumption, several models have been developed that exploit word co-occurrence to find word representations. Among them, word embeddings are distributed vector representations, which are dense, low-dimensional, real-valued and can capture latent features of the word [45]. All the dimensions of the vector participate in the representation of a specific word, but each dimension also participates in the representation of all words of the defined vocabulary. Interestingly, the latent features encode syntactic and semantic information so that some natural language processing (NLP) tasks can be solved by simple linear vector operations due to the distributed nature of the word representations. For example, we can answer analogy questions just with additions and subtractions: if we subtract from a vector representing the word “Madrid” the vector corresponding to “Spain” and we add the vector “France”, the resulting vector should be very close to the vector “Paris” [33]. These vector word representations also have the advantage that they can be trained efficiently by simple (shallow) neural network architectures such as the continuous skip-gram model [31]. Word embeddings have been very successful in various NLP tasks such as machine translation [23, 32], semantic and syntactic language modeling [33], and named entity recognition [37], among many others. Some researchers even explicitly recommend switching from more traditional models that rely on counting co-occurrences to word embedding models, which are set up to predict word contexts [8]. Despite the high expressivity of vector representations, mapping a word to a point in a vector space has two main limitations: the lack of an uncertainty measure about the word embedding and the impossibility of asymmetric comparisons [46]. The latter is especially relevant for meaning entailment. For example, since Bach was a composer (and thus also a man), we would expect the words “Bach”, “composer” and “man” to be close to each other in the vector space. However, we can entail neither “composer” from “Bach” nor “man” from “composer” just by comparing their respective vectors with distance measures, since any well-defined distance (like the Euclidean distance or the cosine distance) has to be symmetric.
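Both the analogy arithmetic and this symmetry issue can be illustrated in a few lines of Python. The sketch below is not part of the thesis implementation and uses made-up three-dimensional vectors purely for demonstration.

    # Illustrative sketch with made-up 3-dimensional vectors (not trained embeddings).
    import numpy as np

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    vec = {
        "madrid": np.array([0.9, 0.1, 0.3]),
        "spain":  np.array([0.8, 0.0, 0.4]),
        "france": np.array([0.1, 0.8, 0.4]),
        "paris":  np.array([0.2, 0.9, 0.3]),
    }

    # Analogy: madrid - spain + france should land near paris.
    query = vec["madrid"] - vec["spain"] + vec["france"]
    print(cosine(query, vec["paris"]))

    # Cosine similarity is symmetric, so it cannot express a directed relation
    # such as entailment between two words.
    assert np.isclose(cosine(vec["paris"], vec["france"]),
                      cosine(vec["france"], vec["paris"]))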
Figure 1.1: Example of Gaussian embeddings from which entailment can be inferred (extracted from [46]).
The model proposed by Vilnis and McCallum [46] overcomes these limitations by using Gaussian distributions. This way, we can imagine words not just as points in space but as ellipsoids whose center is the mean of the respective Gaussian distribution. Recalling the previous example, the embedding for “man” should encompass the embedding of “composer”, since the latter is a specific case of the former. Although Gaussian embeddings have some advantages such as those already mentioned, they come at a cost in model complexity which, in the end, may prevent the model from finding good word representations, as we detail later in this text. In fact, if we consider only the tasks that word vectors already perform, they do not show any significant improvement. The main aim of this work is to reproduce the model proposed in [46] and evaluate it in two NLP tasks: word similarity and word specificity. Due to the omission of some important information in the published paper, and since the authors' source code was not released, our work involves developing our own code based on word2vec and completing the missing details, such as the derivation of the error gradients needed to adapt the neural network. Furthermore, this text serves as a detailed explanation of the presented models and techniques, whose interpretation may be tricky due to the somewhat obscure and sometimes incomplete available literature, especially for readers who are not familiar with distributed word representations based on neural network models. Chapter 2 introduces the concepts that are necessary for a full understanding of the text, focusing on the basics of artificial neural networks and on the continuous skip-gram model as an example of a state-of-the-art word embedding model. We continue in chapter 3 by explaining the model proposed in [46] and completing the missing details that are necessary to implement it. The presented approach is applied in the experiments described in chapter 4, whose results are displayed in chapter 5. Finally, we end with the conclusions of our work.
Chapter 2
Background

We learn our word representations through unsupervised energy-based learning on an artificial neural network architecture. This chapter provides some theoretical insight into the applied methods in the scope of training word embeddings, which can be helpful for a better understanding of the approach presented in chapter 3. Additionally, section 2.6 provides an overview of the related work our model is based on.
2.1 Statistical Language Modeling

When we build a language model, we usually distinguish between word types, which define the vocabulary of the model, and word tokens, which are occurrences of the defined word types in a specific text. We can thus define a training corpus as a sequence of tokens (each one belonging to a word type). We define the context of a word token (which we usually refer to as the central word in this scope) as the set of tokens surrounding the central word. Since rule-based language models stopped being used in the beginning of the 1990s, language modeling has been dominated by probabilistic models such as N-gram models. An N-gram consists of a sequence of N consecutive word tokens taken from a document. For example, from the sentence “Bach was a German composer and musician of the Baroque period”, “Bach was” and “was a” are 2-grams (also more commonly named bigrams); and “a German composer” and “German composer and” are 3-grams (also more frequently called trigrams). N-gram models predict the next word from the previous N − 1 tokens working with estimated N-gram probabilities [22], i.e. they assume that the occurrence of a word depends exclusively on the N − 1 preceding tokens. This is equivalent to modeling language as a Markov process in which each state is defined by the last N − 1 encountered tokens. Thanks to this Markov assumption, we can simplify the problem of calculating the probability of the next word given the previous words of the document, P(wn | w1 … wn−1), to just counting the occurrences of the N − 1 preceding words as a sequence, C(wn−N+1 … wn−1), and the occurrences of this sequence followed by the word to be predicted, C(wn−N+1 … wn−1 wn):

P(wn | w1 … wn−1) ≈ P(wn | wn−N+1 … wn−1) = C(wn−N+1 … wn−1 wn) / C(wn−N+1 … wn−1) = C(wn−N+1 … wn−1 wn) / Σw C(wn−N+1 … wn−1 w)    (2.1)
For example, we can estimate the probability of “composer and musician” by counting its occurrences and the occurrences of “composer and”:

P(“composer and musician”) ≈ C(“composer and musician”) / C(“composer and”)    (2.2)
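As a small illustration of the count-based estimate in equation 2.1, the following Python sketch builds trigram statistics (N = 3) from a toy corpus; the corpus and the estimate are only for demonstration and are not part of the thesis experiments.

    from collections import Counter

    tokens = "bach was a german composer and musician of the baroque period".split()

    history_counts = Counter(tuple(tokens[i:i+2]) for i in range(len(tokens) - 1))
    trigram_counts = Counter(tuple(tokens[i:i+3]) for i in range(len(tokens) - 2))

    def p_next(w1, w2, w3):
        # P(w3 | w1 w2) ~ C(w1 w2 w3) / C(w1 w2), as in equation 2.1 with N = 3
        if history_counts[(w1, w2)] == 0:
            return 0.0
        return trigram_counts[(w1, w2, w3)] / history_counts[(w1, w2)]

    print(p_next("composer", "and", "musician"))   # 1.0 in this toy corpus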
2.2 Neural Networks

Artificial neural networks (ANN), also simply called neural networks, are mathematical models inspired by the structure of biological neural networks. An ANN consists of a set of interconnected neurons that models an underlying function. By transforming an input vector through the neurons, the output approximates the modeled function. We will see that, under certain conditions, we can train neural network models that can approximate any continuous function arbitrarily well.
2.2.1 Perceptrons
According to the McCulloch-Pitts model of a neuron [29], a neuron can be defined by a non-linear activation function (also called transfer function) f(x): R → R and a weight vector w ∈ R^N, where N is the dimension of the input vector. The dot product of its weight vector with a presented input vector plus a bias term b ∈ R is used as input to the activation function, whose result is the final neuron output. Putting all these pieces together, a neuron calculates the following function:

y(x) = f(w xᵀ + b) = f(Σ_{i=1}^{N} wi xi + b)    (2.3)
An ANN model using only one neuron is the perceptron, displayed in figure 2.1.
2.2.2 Multilayer Perceptrons
Most of the interest in neural networks decayed in the 1970s after it was proven that perceptrons cannot learn certain functions, such as the XOR Boolean function [35]. However, we can construct a more expressive model by interconnecting layers of perceptrons like in figure 2.2. Such a configuration is called a multilayer perceptron (MLP). Following the matrix notation already used in equation 2.3, a standard MLP with one hidden layer of Nh neurons and a linear output layer activation function computes the following function:

y(x) = W f(V x + b)    (2.4)

where x ∈ R^{Ni} is the input vector, V ∈ R^{Nh×Ni} is a matrix containing all weight vectors of the hidden layer, b ∈ R^{Nh} is the bias vector, W ∈ R^{No×Nh} is the output layer weight matrix and f(x): R^{Nh} → R^{No} is the vector resulting from evaluating the non-linear activation function at the outputs of the hidden layer neurons. The relationship among these elements can be visualized in figure 2.2.

Figure 2.1: Perceptron: ANN with one single neuron.

Figure 2.2: Multilayer perceptron with one hidden layer.

Research on ANNs had to wait until the 1980s to boom again, after the proof that MLPs can be universal approximators [18, 20] and that they can be trained with the backpropagation method [40]. More specifically, “a standard multilayer feedforward neural network with locally bounded piecewise continuous activation function can approximate any continuous function to any degree of accuracy if and only if the network’s activation function is not a polynomial” (Leshno theorem [26]). These results entail that an MLP with just one hidden layer can approximate a function as precisely as we want by providing enough neurons in the hidden layer.
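A minimal sketch of the forward pass in equation 2.4 is given below; it assumes a tanh activation in the hidden layer and a linear output layer, with arbitrary sizes and random weights (none of this corresponds to the network trained later in the thesis).

    import numpy as np

    rng = np.random.default_rng(0)
    N_i, N_h, N_o = 4, 8, 2            # input, hidden and output dimensions
    V = rng.normal(size=(N_h, N_i))    # hidden-layer weight matrix
    b = rng.normal(size=N_h)           # hidden-layer bias vector
    W = rng.normal(size=(N_o, N_h))    # output-layer weight matrix

    def mlp(x):
        # y(x) = W f(V x + b), with f = tanh applied element-wise
        return W @ np.tanh(V @ x + b)

    print(mlp(rng.normal(size=N_i)))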
2.2.3 Learning
Before an ANN can perform any task successfully, it needs to be trained to learn the target function. The following points from the training phase are relevant to understand our approach presented in chapter 3.
Loss Function

In order to evaluate “how good” (or “how bad”) a neural network is at modeling the underlying function, we need to define a loss function (also called cost function or objective function). The learning process of an ANN consists in minimizing this function (or maximizing it, depending on the sign convention). A loss function measures how much the network output deviates from the target function. Examples of loss functions are the mean squared error, which is a very common loss function for regression, or the cross-entropy for classification tasks [12].
Stochastic Gradient Descent

Probably the most popular algorithm to train ANNs is Stochastic Gradient Descent (SGD). In a first step, a training example is presented to the network (as a vector) and the output of the network is computed. In a second step, the output is evaluated with a loss function L to calculate the error of the network output compared with the target function. Then, this error is propagated from the output layer backwards so that the weights can be updated to reduce the error when the next training example is presented. This is the so-called backpropagation step [40]. More precisely, after each training example x is presented, each weight wt at time t is updated to a new weight wt+1 as follows:

wt+1 = wt − η ∂L/∂wt (x)    (2.5)

where ∂L/∂wt (x) is an approximated gradient of the loss function L with respect to the weight wt evaluated at the presented training example x, and η ∈ R+ is the so-called learning rate. The learning rate determines “how fast” we want to adapt the network weights. In the simplest SGD formulation, it is constant during the whole training phase. Nonetheless, it is usually progressively reduced along the training time. We take the minus sign because we want to minimize the loss function. If we wanted to maximize it instead, the sign would be the opposite. Normally, the training set presented to the network needs to be processed several times so that the network can achieve satisfactory results. We call each of these iterations a training epoch. We can either fix the number of training epochs a priori or continue training until a stop criterion is reached. For example, we can stop learning when the loss function does not decrease on a validation set any more.
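The update rule of equation 2.5 can be sketched for a single weight as follows; the one-dimensional model, data point and learning rate are made up for illustration only.

    # One weight w of the model y = w * x, squared-error loss L = (w*x - target)^2
    w, eta = 0.0, 0.1
    x, target = 2.0, 3.0

    for epoch in range(10):
        grad = 2.0 * (w * x - target) * x   # dL/dw evaluated at the training example
        w = w - eta * grad                  # equation 2.5
    print(w)   # approaches target / x = 1.5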
AdaGrad

The Adaptive Gradient Algorithm (AdaGrad) [13] is an extension of SGD that adapts the learning rate to each weight locally. Whereas SGD requires a fixed learning rate η (or one decreasing over time, but always common to all neuron weights), AdaGrad keeps the sum of squares of previously calculated gradients for each weight θ at each time t. Keeping track of these values makes it possible to update each weight θ at moment t according to the following formula:

θt+1 = θt − (η / √(Gt + ε)) ◦ g
Gt+1 = Gt + g²    (2.6)

where Gt is the sum of squares of previously computed gradients for the weight θ, g is the error gradient for θ and ε is just a constant to avoid divisions by zero.
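A sketch of the AdaGrad update, as written in equation 2.6, is given below for a small vector of weights; the quadratic toy loss and the constants are placeholders, not values used in the thesis.

    import numpy as np

    theta = np.zeros(3)
    G = np.zeros(3)              # running sum of squared gradients, one entry per weight
    eta, eps = 0.5, 0.1          # eps plays the role of the ridge constant in equation 2.6

    def gradient(theta):
        return 2.0 * (theta - np.array([1.0, -2.0, 0.5]))   # toy quadratic loss

    for step in range(100):
        g = gradient(theta)
        theta = theta - eta / np.sqrt(G + eps) * g   # element-wise product, as in equation 2.6
        G = G + g ** 2
    print(theta)   # moves towards [1.0, -2.0, 0.5] with a per-weight decaying step size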
Mini-batch Learning

So far we have seen that with SGD and AdaGrad, a training example is presented to the network, the network output is computed, the error gradients are calculated and then the weight values are adapted according to the deviation between the output and the target value. This learning mode, in which each example causes an update of the network, is called online learning. Instead of updating after each input, we can also wait until n examples are processed before calculating the corresponding gradients to update the network. We refer to each of these groups of n inputs as a mini-batch of size n. This name is used to contrast this learning mode with (full) batch learning, in which all the training data is processed before calculating the errors and updating the network.
Regularization

A well-known issue that may arise after training an ANN is that it may just “memorize” the training data without being able to generalize when data different from the training data is presented to the network. This effect is named overfitting. A way of preventing networks from overfitting is applying a regularization technique. A usual regularization method is extending the loss function with a regularization term so that the loss function grows with the “size” of the weights, determined for instance by the ℓ1 norm or by the ℓ2 norm. The impact of the regularization term is determined by a constant that needs to be set before training, i.e. the network cannot learn it itself. These kinds of values that need to be fixed a priori are called hyper-parameters. The need to find a good set of hyper-parameters is one of the disadvantages of ANNs since it can make the difference between extremely good results and results close to random guessing.
Figure 2.3: The skip-gram model architecture (extracted from [31]).
2.3 The Continuous Skip-gram model

Mikolov et al. [31] presented this model architecture for learning distributed word representations. It uses a shallow neural network (see figure 2.3) whose characteristics can at first sight be somewhat confusing for readers familiar with artificial neural networks applied in other areas. Its “formal aim” is to predict nearby words. However, its real aim is not obtaining good predictions from this neural network, but the resulting hidden layer weights after the training phase: these are actually the word embeddings we are interested in. The subsections below explain how each layer contributes to this configuration following the classical terminology of artificial neural networks.
2.3.1 Input Layer: One-hot Representation

Words are presented to the network with the one-hot representation. This means that the dimension of the input vector coincides with the size of the vocabulary, that the position of the represented word within the vocabulary determines where the single ‘1’ present in the vector is, and that all other vector entries are zero-valued. For example, if the word “thesis” is placed third within a vocabulary of 10 words, it will be presented to the network as the one-hot vector (0, 0, 1, 0, 0, 0, 0, 0, 0, 0).
2.3.2 Hidden Layer: Projection to Word Embedding

We can model a projection from the one-hot vector to its embedding by means of an identity transfer function. In contrast to the most common neural network models, which require a non-linear transfer function in the hidden layer, the continuous skip-gram model does not use any non-linear transfer function in the hidden layer but only in the output layer. The number of necessary neurons in the hidden layer is determined by the number of features that we want our word embeddings to have: if V is the size of the vocabulary and N the size of our word vectors, we can represent the weights as a V × N matrix, in which each row i is the embedding of the corresponding word placed in position i within the vocabulary.
2.3.3 Output Layer: Context Embeddings

We define the context of a word token by setting a maximum window size m. For a token sequence w and a token in position t, the context of the token w(t) is made of tokens that are at a maximum distance m′ from the central word w(t) (excluding the central word itself), where m′ is sampled from the interval [1, m]. By choosing the window size stochastically for each example, we ensure that word tokens that appear closer to the central word get more importance, as they are more likely to fit within the context window

w(t − m′) · · · w(t − 2) w(t − 1) w(t + 1) w(t + 2) · · · w(t + m′).    (2.7)
Apart from the word embeddings from the hidden layer (extracted from the hidden layer weights), each word type has a different embedding in the output layer. The difference between the two types of embeddings is the role of the respective word type during the training phase: when the word type is found as a central word, the relevant embedding belongs to the hidden layer; when we are considering the word type found in the context of a central word, the respective word embedding has to be found in the output layer (here again as weights of the layer). In order to avoid confusion, when we mention word embedding we refer to the word vector from the hidden layer (i.e. the weight vector of the respective neuron), whereas context embedding refers to the word vector from the output layer. The usage of separate embeddings for central words and their contexts was found to improve the quality of the embeddings [46]. The output layer just performs multinomial logistic regression (without bias term) with V neurons by means of the softmax function: it multiplies the word embedding w from the hidden layer with the weight vector ci of each neuron, i.e. with each context embedding. The outcome of each of these dot products is the logarithm of the estimated likelihood of the context ci appearing nearby w in a document (its estimated probability can be calculated just through normalization). Nonetheless, since we are not interested in optimizing these likelihoods but in the quality of the word embeddings, we may simplify the formulation as long as the resulting word embeddings keep their quality, namely by substituting the original softmax formulation by negative sampling [33]. In a nutshell, with negative sampling we do not need to update all the embeddings but only a few of them. For each context ci found around the central word w, we update the word embedding of w and the context embedding of ci (the positive sample) plus one or several randomly sampled context embeddings (the negative samples). This method approximates the optimization of the softmax objective by differentiating data from noise via logistic regression [33].
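The negative sampling update just described can be sketched as follows. This is an illustrative re-implementation in Python, not the C code used later in the thesis; the vocabulary size, embedding size and token indices are arbitrary.

    import numpy as np

    rng = np.random.default_rng(1)
    V, N = 1000, 50                            # vocabulary size, embedding size
    W = rng.normal(scale=0.1, size=(V, N))     # word (hidden-layer) embeddings
    C = np.zeros((V, N))                       # context (output-layer) embeddings

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sgns_step(center, context, negatives, lr=0.025):
        w = W[center]
        grad_w = np.zeros(N)
        for c, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
            score = sigmoid(np.dot(w, C[c]))
            g = score - label            # logistic-regression error for this sample
            grad_w += g * C[c]
            C[c] -= lr * g * w           # update the context embedding
        W[center] -= lr * grad_w         # update the word embedding of the central word

    sgns_step(center=3, context=17, negatives=[256, 980])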
2.3.4 Table Look-up instead of Matrix Multiplication

A nice property of the continuous skip-gram model is that we can substitute dense matrix multiplications by simple table look-ups [33]. As we can see in figure 2.3, the one-hot input vector from the input layer corresponding to the word w(t) is projected to its embedding in the hidden layer. This consists in just a table look-up. Then, thanks to negative sampling, we do not need to operate on all context embeddings but just on the positive and negative samples. Again, the retrieval of these embeddings can be performed by a table look-up. In the displayed example with a context window of size 2, this means retrieving the context embeddings of the word types to which the tokens w(t − 2), w(t − 1), w(t + 1) and w(t + 2) belong. The reader may notice that formulating the continuous skip-gram model as a neural network may seem an unnecessary (and pretty obscure) theoretical abstraction. Actually, we just retrieve embeddings from a table and operate on them, without needing to define an explicit neural network structure. Nevertheless, modeling our construction as a neural network allows us to take advantage of all the successful learning algorithms developed for neural networks and their solid theoretical basis.
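The equivalence between the one-hot matrix product and a table look-up can be verified directly; the toy matrix below is only for illustration.

    import numpy as np

    V, N = 10, 4
    W = np.arange(V * N, dtype=float).reshape(V, N)   # toy hidden-layer weight matrix
    idx = 2                                           # e.g. the word "thesis"
    one_hot = np.zeros(V)
    one_hot[idx] = 1.0

    assert np.array_equal(one_hot @ W, W[idx])        # the look-up gives the same row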
2.4 Energy-based Learning for Word Embeddings

An energy function Eθ(x, y) scores how “good” (or “bad”, depending on the sign convention) a configuration of a pair of inputs x and y parametrized by θ is [25]. Applied to our approach, the parameter θ corresponds to the word embeddings. So our aim is to find word embeddings such that pairs of words which are found in a same context (positive pairs) are scored higher than random pairs of words (negative pairs). This can be achieved by defining a loss function L which, taking positive and negative pairs, determines the gradients on the parameters. In particular, we use a max-margin ranking objective as loss function:

LmL(w, cp, cn) = max(0, mL − E(w, cp) + E(w, cn))

where w is a central word, cp is a positive context word (a word found nearby w), cn is a negative context word (generally randomly sampled), E is an energy function and mL is a constant margin that puts energy scores for positive pairs above energy scores for negative pairs. By backpropagating this loss to the word embeddings, we bring each positive pair closer while we separate negative pairs. This is indeed equivalent to our final aim: making any two words appearing in the same contexts have close word vectors compared to a randomly sampled pair of word vectors. As a consequence of this formulation, the output of the network can no longer be interpreted as probabilities (like in the continuous skip-gram model), but this does not really matter as long as we can learn good word representations.
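As a small illustration, the max-margin ranking loss can be written down in a few lines; here the energy is a plain dot product between vectors (a stand-in for the Gaussian energies introduced in chapter 3) and the margin value is arbitrary.

    import numpy as np

    def max_margin_loss(w, c_pos, c_neg, margin=1.0):
        energy_pos = np.dot(w, c_pos)          # E(w, c_p)
        energy_neg = np.dot(w, c_neg)          # E(w, c_n)
        return max(0.0, margin - energy_pos + energy_neg)

    w = np.array([0.5, 0.2])
    c_pos = np.array([0.4, 0.1])
    c_neg = np.array([-0.3, 0.6])
    print(max_margin_loss(w, c_pos, c_neg))    # 0.75 for these made-up vectors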
2.5 Dimensionality Reduction for Visualization

Since we work with high-dimensional data, we require a suitable method to transform our embeddings into two-dimensional representations in order to visualize them. The following methods can be applied to address this problem.
2.5.1 Principal Component Analysis (PCA)

A widely used technique for dimensionality reduction is the so-called Principal Component Analysis (PCA), which finds the directions with maximal variance around the mean of the vectors [43]. First, it estimates the covariance matrix of the high-dimensional vectors x ∈ R^d:

Σ̂x = 1/(N − 1) Σ_{k=1}^{N} (xk − x̄)(xk − x̄)ᵀ    (2.8)

where N is the number of vectors to be projected to a lower dimension m and x̄ is their mean vector. Second, the eigendecomposition of the estimated covariance matrix is performed:

Σ̂x vi = λi vi,    i = 1, …, d,    λi ∈ R.    (2.9)

Finally, the high-dimensional vectors are mapped to low-dimensional vectors w ∈ R^m by taking the m eigenvectors whose respective eigenvalues are the highest:

wi = viᵀ (x − x̄),    i = 1, …, m.    (2.10)
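Equations 2.8-2.10 translate directly into a short NumPy sketch; the data matrix below is random stand-in data with one vector per row, not trained embeddings.

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 50))                    # N vectors of dimension d
    x_mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)                     # estimated covariance matrix (eq. 2.8)

    eigvals, eigvecs = np.linalg.eigh(cov)            # eigendecomposition (eq. 2.9)
    top = eigvecs[:, np.argsort(eigvals)[::-1][:2]]   # the 2 leading eigenvectors

    X_2d = (X - x_mean) @ top                         # projection to 2 dimensions (eq. 2.10)
    print(X_2d.shape)                                 # (200, 2)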
2.5.2 Kernel PCA
PCA performs a linear projection of the data onto the directions with highest variance and thus works well for linearly separable data. Nevertheless, the latter is rarely the case. This motivates the use of non-linear techniques such as kernel PCA. Kernel PCA takes advantage of the principle of support vector machines (SVM), which consists in projecting non-linearly separable data to a vector space of higher (possibly infinite) dimension, usually referred to as the feature space, in which the data points are linearly separable. Therefore, it is also possible to understand this problem with a primal-dual SVM formulation, as shown in [44]. We can define this projection from our N d-dimensional vectors x to the feature space as:

φ(x): R^d → R^N    (2.11)

Now, we can approximate the covariance matrix in the feature space (as first proposed by Schölkopf [41]):

Σ̂x = 1/(N − 1) Σ_{k=1}^{N} (φ(xk − x̄))(φ(xk − x̄))ᵀ    (2.12)

Thanks to Mercer's theorem [30], we can map the vectors to the feature space without the need of an explicit mapping to that space by means of a kernel function K(xi, xj): R^d × R^d → R (the so-called kernel trick):

Σ̂x = 1/(N − 1) Σ_{k=1}^{N} (K(xk − x̄))(K(xk − x̄))    (2.13)

There are many possible kernel functions that we can use. Hence, kernel PCA permits a wider variety of mappings for dimensionality reduction. Linear PCA can be seen as just the particular case of kernel PCA obtained by selecting the linear kernel. One of the most frequent choices for a kernel function is the RBF kernel:

K(xi, xj) = e^(−||xi − xj||² / σ)    (2.14)

where σ is a hyper-parameter to be selected a priori.
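For the visualizations in chapter 5, a projection of this kind can be obtained, for example, with scikit-learn; the thesis does not prescribe a specific library, so the sketch below is only one possible realization. Note that scikit-learn's RBF kernel is exp(−γ ||xi − xj||²), so γ = 1/σ matches the parametrization of equation 2.14. The matrix of means is random stand-in data.

    import numpy as np
    from sklearn.decomposition import KernelPCA

    rng = np.random.default_rng(3)
    means = rng.normal(size=(100, 50))        # placeholder for the trained Gaussian means

    sigma = 50.0
    kpca = KernelPCA(n_components=2, kernel="rbf", gamma=1.0 / sigma)
    points_2d = kpca.fit_transform(means)
    print(points_2d.shape)                    # (100, 2)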
2.6 Related Work

The neural network architecture of our model is based on the continuous Skip-gram model with negative sampling, proposed by Mikolov et al. [31, 33]. Together with the Continuous Bag of Words model, it was released within the word2vec toolkit [3]. Thanks to the efficiency of the learning approach and to its high-quality embeddings, which provide state-of-the-art results in different linguistic tasks, word2vec popularized word embeddings and fostered recent studies that analyze the reasons for its impressive performance [27]. Vilnis and McCallum [46] generalized the notion of mapping words to a vector space. Words do not become points in a vector space, but multivariate Gaussian distributions. This is to some extent equivalent to using a Bayesian neural network configuration instead of a classical MLP neural network: the neuron weights and the output of the network are not just vectors but probability distributions [28]. This idea of mapping training examples to probability distributions can also be found in RBF networks. Nonetheless, rather than performing Bayesian inference to learn these parameters, their approach considers the probability distributions as part of the learning objective, which are trained as an “extension” of the word embeddings of the continuous Skip-gram model. Although both mentioned architectures are shallow neural networks, their goal is to learn word embeddings, which ultimately correspond to the trained hidden layer neuron weights. This concept of word embeddings appeared already in the work of Bengio et al. [10], in which a neural network is modeled to predict the next word of a sequence. Also, the concept of extracting latent features from the training data appears in areas other than NLP, since it is just a feature extraction task, which is frequently part of deep learning techniques, for instance by using auto-encoders [9].
Chapter 3
Approach

Our model is trained following the approach proposed by Vilnis and McCallum in [46], with software we built ourselves taking the original word2vec code as a basis [3]. As we detail later, we contacted the authors to clarify missing information such as exact hyper-parameter values that can be used to train the model. After training, the embeddings can be used in different tasks to evaluate the model, as we present in chapter 4. We start this chapter by providing a detailed explanation of the applied training methods, including the calculation of the gradients that are necessary for the weight updates (which is missing in the paper [46]). Additionally, section 3.2 illustrates the main issues we faced during the development of the software we built to train the model.
3.1 Learning Model

The ultimate goal of our model is to learn word representations such that words appearing in similar contexts have “similar” word representations. Following the approach from [46], we learn Gaussian distributed embeddings with a model akin to the word2vec skip-gram model [33]. The two models differ in:

• The geometric objects the embeddings represent: instead of points in a vector space, our embeddings represent multivariate Gaussian distributions.
• The learning algorithm: our embeddings are trained with AdaGrad instead of SGD, since the former is more suitable for very sparse data (like ours) than the latter.
• The choice of the loss function: we take the max-margin function (see section 3.2.2) instead of the logistic classifier from the continuous skip-gram model.
• The choice of the energy function: since we need to determine the similarity between two multivariate Gaussian distributions, we cannot directly apply the dot product any more. We select one of the energy functions presented in section 3.1.1 instead.
• Regularization of the network weights, which is absent in the word2vec skip-gram. The similarity between two embeddings must be determined by both the mean and the covariance matrix. By regularizing them, we avoid that one of these two may dominate the energy function score completely.
• Size of the mini-batches: we use larger mini-batches of variable size compared to word2vec.

All these points are detailed in the following subsections.
3.1.1 Energy Functions

The similarity between two word representations, i.e. between two Gaussian distributions, can be defined in different ways. For the training phase, we choose between two different similarity measures that we take as energy functions: Expected Likelihood (EL) and Kullback-Leibler (KL) divergence. We assume all covariance matrices to be diagonal. This simplification eases the gradient computations every time the inverse of the covariance matrix is required.

Symmetric Similarity: Expected Likelihood or Probability Product Kernel

The first similarity measure that we use for our Gaussian distributions is defined as the inner product of two Gaussian density functions. This is the so-called expected likelihood or probability product kernel [21]:

∫_{x∈Rⁿ} N(x; µi, Σi) N(x; µj, Σj) dx = N(0; µi − µj, Σi + Σj)    (3.1)
Since the probability product kernel is a positive definite function on all its domain, we can define our energy function as its logarithm to ease the computation of the gradients. This energy function is thus defined as:

E(Pi, Pj) = log N(0; µi − µj, Σi + Σj)    (3.2)
Relying on the derivation presented in [46] to obtain the gradients of the energy function with respect to the parameters, we get:

∂E(Pi, Pj)/∂µi = −∂E(Pi, Pj)/∂µj = −∆ij
∂E(Pi, Pj)/∂Σi = ∂E(Pi, Pj)/∂Σj = ½ (∆ij ∆ijᵀ − (Σi + Σj)⁻¹)    (3.3)

where ∆ij = (Σi + Σj)⁻¹ (µi − µj).

These gradients do not suffice yet for the weight updates, since we need the gradients from the cost function instead. From this point forward, we complete the derivation that is not present in [46]. We can now reformulate the max-margin ranking objective as:

LmL(w, p, n) = max(0, u)    with    u = mL − E(w, p) + E(w, n)

where w is a processed word, p a positive sample of a context word and n a negative sample. We now derive the gradients of this loss function with respect to the means (µw, µp, µn) and the covariance matrices (Σw, Σp, Σn) of the Gaussian embeddings by applying the chain rule, for all θ ∈ {µw, µp, µn, Σw, Σp, Σn}:

∂LmL(w, p, n)/∂θ = (∂LmL(w, p, n)/∂u) (∂u/∂θ)

∂LmL(w, p, n)/∂u = 0 if LmL(w, p, n) ≤ 0,    1 if LmL(w, p, n) > 0

∂u/∂θ = −∂E(w, p)/∂θ + ∂E(w, n)/∂θ    (3.4)
Combining equations 3.3 and 3.4, we can finally compute the gradients for the weight updates as follows (when the loss function evaluates to a strictly positive value; otherwise these gradients are zero-valued):

∂LmL(w, p, n)/∂µw = −∂E(w, p)/∂µw + ∂E(w, n)/∂µw = ∆wp − ∆wn
∂LmL(w, p, n)/∂µp = −∂E(w, p)/∂µp = −∆wp
∂LmL(w, p, n)/∂µn = ∂E(w, n)/∂µn = ∆wn
∂LmL(w, p, n)/∂Σw = −∂E(w, p)/∂Σw + ∂E(w, n)/∂Σw = ½ ((∆wn ∆wnᵀ − (Σw + Σn)⁻¹) − (∆wp ∆wpᵀ − (Σw + Σp)⁻¹))
∂LmL(w, p, n)/∂Σp = −∂E(w, p)/∂Σp = −½ (∆wp ∆wpᵀ − (Σw + Σp)⁻¹)
∂LmL(w, p, n)/∂Σn = ∂E(w, n)/∂Σn = ½ (∆wn ∆wnᵀ − (Σw + Σn)⁻¹)    (3.5)
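For diagonal covariances, the EL energy and the gradients of equations 3.2 and 3.3 reduce to element-wise operations. The sketch below is an illustrative Python version (the thesis implementation is in C) with made-up inputs; mu_* are mean vectors and var_* the diagonals of the covariance matrices.

    import numpy as np

    def el_energy_and_grads(mu_i, var_i, mu_j, var_j):
        s = var_i + var_j                        # diagonal of Sigma_i + Sigma_j
        delta = (mu_i - mu_j) / s                # Delta_ij for diagonal covariances
        energy = -0.5 * np.sum(np.log(2.0 * np.pi * s) + (mu_i - mu_j) ** 2 / s)
        d_mu_i = -delta                          # equals -d_mu_j (equation 3.3)
        d_var = 0.5 * (delta ** 2 - 1.0 / s)     # same for Sigma_i and Sigma_j (equation 3.3)
        return energy, d_mu_i, d_var

    mu_i, var_i = np.array([0.1, -0.2]), np.array([0.5, 0.5])
    mu_j, var_j = np.array([0.3, 0.1]), np.array([0.4, 0.6])
    print(el_energy_and_grads(mu_i, var_i, mu_j, var_j))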
Asymmetric Similarity: KL Divergence

An alternative energy function that we use is based on the KL divergence. Since the KL divergence measures a distance between two probability distributions, we need to apply a negative sign to turn it into a similarity function for our energy function:

E(Pi, Pj) = −DKL(Nj || Ni) = −∫_{x∈Rⁿ} N(x; µi, Σi) log [ N(x; µj, Σj) / N(x; µi, Σi) ] dx    (3.6)
In an analogous way, we use the energy gradients from [46]:

∂E(Pi, Pj)/∂µi = −∂E(Pi, Pj)/∂µj = −∆′ij
∂E(Pi, Pj)/∂Σi = ½ (Σi⁻¹ Σj Σi⁻¹ + ∆′ij ∆′ijᵀ − Σi⁻¹)
∂E(Pi, Pj)/∂Σj = ½ (Σj⁻¹ − Σi⁻¹)    (3.7)

where ∆′ij = Σi⁻¹ (µi − µj).

By applying these values to our equation 3.4, we obtain the formulas for the weight updates using the KL divergence:

∂LmL(w, p, n)/∂µw = −∂E(w, p)/∂µw + ∂E(w, n)/∂µw = ∆′wp − ∆′wn
∂LmL(w, p, n)/∂µp = −∂E(w, p)/∂µp = −∆′wp
∂LmL(w, p, n)/∂µn = ∂E(w, n)/∂µn = ∆′wn
∂LmL(w, p, n)/∂Σw = −∂E(w, p)/∂Σw + ∂E(w, n)/∂Σw = ½ (−Σw⁻¹ Σp Σw⁻¹ − ∆′wp ∆′wpᵀ + Σw⁻¹ Σn Σw⁻¹ + ∆′wn ∆′wnᵀ)
∂LmL(w, p, n)/∂Σp = −∂E(w, p)/∂Σp = −½ (Σp⁻¹ − Σw⁻¹)
∂LmL(w, p, n)/∂Σn = ∂E(w, n)/∂Σn = ½ (Σn⁻¹ − Σw⁻¹)    (3.8)
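For diagonal Gaussians the KL divergence also has a simple closed form, which is what makes this energy cheap to evaluate. The sketch below uses the standard formula with made-up inputs; it is illustrative only and independent of the thesis C code.

    import numpy as np

    def kl_divergence(mu_a, var_a, mu_b, var_b):
        # D_KL(N_a || N_b) for diagonal covariances (standard closed form)
        return 0.5 * np.sum(var_a / var_b + (mu_b - mu_a) ** 2 / var_b
                            - 1.0 + np.log(var_b / var_a))

    mu_i, var_i = np.array([0.1, -0.2]), np.array([0.5, 0.5])
    mu_j, var_j = np.array([0.3, 0.1]), np.array([0.4, 0.6])

    # The energy is the negated divergence; unlike the EL energy, swapping the
    # arguments changes the value, which is what makes this measure asymmetric.
    print(-kl_divergence(mu_i, var_i, mu_j, var_j))
    print(-kl_divergence(mu_j, var_j, mu_i, var_i))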
3.1.2 Learning Algorithm
The network weights are updated applying AdaGrad with gradients calculated according to the formulas presented in section 2.4. We process mini-batches of 20 sentences of tokens (including central words, their contexts and a negative sample per central word - context word pair), in contrast to the mini-batches from word2vec, which consist of a central word, a context word and a negative sample. Besides that, in formula 2.6, we decided to take ε = 0.1 after we contacted the authors of [46] to learn the hyper-parameters they used. Due to the sparsity of natural language, some weights are updated much more frequently than others because of the frequency distribution of the words. As a consequence, the weights for infrequent words require a larger learning rate than those for frequent words. This fact makes AdaGrad quite suitable for our task. In fact, our implementation prior to the introduction of AdaGrad, working with SGD, was able to produce good word embeddings only for tiny manually created training corpora, such as a few lines of separable groups of tokens. Any training on a subset of the corpus mentioned in section 4.1 resulted in almost random word embeddings. Additionally, we update the embeddings after processing each mini-batch in an asynchronous mode (known as asynchronous stochastic gradient descent): each mini-batch is processed by a different thread. Despite possible data inconsistency when two words appear in two mini-batches being processed simultaneously, this procedure works well and has been used previously, e.g. in [24, 31, 38]. The high data sparsity lowers the probability of simultaneous updates of the same embeddings. Besides, this undesired effect has an impact mostly on the frequent words, and since their embeddings are updated more often than the rest, they can also be better corrected after an inconsistent update.
3.1.3 Regularization
After each mini-batch is processed and the relevant weights are updated, we regularize both the means and the covariance matrices of the word and context embeddings, but in two different ways (as indicated in [46]). The means are forced to remain below a hard constraint on their ℓ2-norm:

||µi||₂ ≤ C    ∀i    (3.9)

When an updated mean exceeds C, it is normalized so that the resulting mean has exactly norm C. This is equivalent to dividing each of the vector components by its norm (and multiplying by C). The regularization for the covariances is slightly different: we keep them within the hypercube [m, M]^d, where d is the dimension of the distribution and m and M are positive constants:

mI ≺ Σi ≺ MI,    ∀i    (3.10)

Since we use diagonal covariances, this regularization just implies applying the following additional update:

Σii ← max(m, min(M, Σii))    (3.11)
Keeping the covariances not too close to zero is especially important when the EL energy function is used for training, because small covariances may give very high scores, dominating the rest of the energy [46].
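The two regularization steps can be sketched as follows for a single embedding with diagonal covariance; the hyper-parameter values are arbitrary and only illustrate equations 3.9-3.11.

    import numpy as np

    C, m, M = 2.0, 0.05, 5.0

    def regularize(mu, var):
        norm = np.linalg.norm(mu)
        if norm > C:                      # hard constraint on the l2-norm of the mean (eq. 3.9)
            mu = mu * (C / norm)
        var = np.clip(var, m, M)          # keep diagonal covariances inside [m, M] (eq. 3.11)
        return mu, var

    mu, var = regularize(np.array([3.0, 4.0]), np.array([0.01, 10.0]))
    print(mu, var)   # mean rescaled to norm C; variances clipped to [m, M]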
3.2 Implementation Details

The base for the software development was the original word2vec package from Mikolov et al. [33], which is available for public use [3]. This efficient code written in C can be used to train word embeddings with the continuous skip-gram model with negative sampling. Therefore, it is a good basis on which to build our model. The final source code we developed can be found at https://github.com/ebritoc/gaussian_word_embeddings. The following subsections specify the necessary changes to the word2vec code that led to the final implementation, and they also illustrate some of the difficulties of reproducing such a neural network model from a published paper.
3.2.1 Covariance Matrices

The original word2vec implementation uses a large array of floating point numbers for the hidden layer and another one for the output layer of the network. Each central word embedding we want to train is thus a vector within the hidden layer. Each contextual word also has a respective vector in the output layer. During the first read of the corpus file, the vocabulary for the model is created and each word is assigned an id. This id determines its position in both the hidden layer and the output layer. As a consequence, a simple way to extend this to Gaussian distributions without losing efficiency when retrieving embeddings is to append the covariance matrix right after each word vector. Since the covariance matrix is diagonal, it is sufficient to append an additional vector of the same size to each word embedding from the word2vec implementation. The word2vec tool distance, which finds the closest words to a given word according to the cosine distance between word embeddings, was also adapted to work with the new format of embeddings. Additionally, an option to calculate the EL similarity was introduced.
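The resulting memory layout can be pictured as follows. This is a sketch of how the append-the-covariance scheme may look, written in Python for readability; the actual C implementation may organize its arrays differently.

    import numpy as np

    V, N = 5, 3                                    # vocabulary size, embedding size
    table = np.zeros(V * 2 * N, dtype=np.float32)  # per word: mean vector, then diagonal variance

    def mean_of(word_id):
        return table[word_id * 2 * N : word_id * 2 * N + N]

    def var_of(word_id):
        return table[word_id * 2 * N + N : (word_id + 1) * 2 * N]

    mean_of(2)[:] = 1.0        # views into the flat array can be updated in place
    print(table[12:18])        # mean and variance of word 2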
3.2.2 Gradient Calculations

Instead of the word2vec hierarchical softmax or negative sampling objectives, we need the max-margin function (already presented in section 2.4) as the loss function to be minimized during the training phase. The calculation of the gradients which are necessary for the weight updates is determined by the selected loss function. Hence, the weight update mechanism needed to be adapted according to the chosen energy function (EL or KL), reproducing the formulas presented in section 3.1.1.
3.2.3 Regularization Hyper-parameters
The regularization method presented in section 3.1.3 was also implemented to be applied to the word embeddings right after they get updated. 18
3.2.4 Switch to Variable Size Mini-batch Learning with AdaGrad

In word2vec, the output layer is updated in an online fashion, whereas the hidden layer is updated in mini-batches consisting of the central word and the contextual words plus the negative samples (thus quite small mini-batches). In order to reproduce the approach from [46], we instead use mini-batches of 20 sentences for both the hidden and the output layer. As a consequence, we need to store each calculated weight update until the complete respective mini-batch is processed. Only then are the embeddings effectively updated, by taking the sum of the previously calculated update values. Due to the regularization we perform after the updates, we clip these cumulative gradients when they exceed a certain threshold. For the weight values concerning the means, the cumulative gradient values are forced to be between −2C and 2C, where C is the maximum ℓ2-norm we allow for the means of our Gaussian embeddings. We clip the weight values of the variances in a similar way: they stay between m and M as defined in section 3.1.3. AdaGrad requires the cumulative sum of squares of calculated gradients of each weight to calculate the adapted learning rate. Therefore, we also had to adapt the code to store these values.
3.2.5 Sentence Shuffling
In order to reduce variance error, the sentences of the corpus are randomly shuffled before being used for training. This might be specifically relevant for our approach since we apply AdaGrad for learning (see section 3.1.2). With AdaGrad, the local learning rate applied to each word type can decay very aggressively. Since our optimization problem is highly non-convex as well, some word embeddings may be mostly determined by the first word occurrences. If these belong to the same document, the linguistic properties encoded in the resulting word embedding may be too biased by the first document in which the word type appears. Shuffling prevents this undesired effect from taking place.
3.2.6 Multi-threading Issues

One of the reasons for the high training speed of word2vec is the asynchronous update of the weights. It divides the training corpus by the number of available threads so that each thread can process a different text in parallel, but using the same network (i.e. the same embeddings) simultaneously. The data sparsity guarantees a low probability of two threads accessing the same weight. Nevertheless, our batch size (20 sentences of central words plus their contexts) is considerably larger than the batch size used in word2vec (only one central word plus its context). As a consequence of these larger mini-batches, the use of multi-threading leads to much more likely simultaneous accesses to the weights during their update, which may cause inconsistent data. Hence, we have to be careful not to spoil our word embeddings by introducing excessive multi-threading. Although a thread synchronization mechanism can prevent data inconsistency, this was not feasible for our set-up. A thread-safe variant of the implementation turned out to be too slow. An evaluation on the full corpus could not even be completed, since it was projected to finish only after several weeks. A workaround to allow several threads to update the same weights while minimizing the risk of concurrent access to a same weight is to keep the total number of threads low. Also due to the constraints of the machine that we use for training, we use 24 threads at most, each one processing different mini-batches extracted from the corpus. This might deviate from the approach presented in [46], although they claim that they keep their procedure as close as possible to word2vec, which performs asynchronous updates. In spite of this, our asynchronous approach should not perform worse than the synchronous equivalent. Recent research shows that it is possible to exploit data sparsity to speed up learning through distributed systems [24, 38]. The main assumption is that the data is so sparse that lock-free threads can work in parallel and will rarely need to update the same network weight. This condition actually applies, except for the most frequent word types. Nevertheless, the word embeddings involved also have more opportunities to be corrected by a following update due to their higher frequency.
Chapter 4
Experiments

In this chapter, we define the experiments that we use to evaluate the quality of our embeddings, in particular in sections 4.4, 4.6 and 4.7. We first present the training corpus used to train our word embeddings (section 4.1) and we specify the necessary preprocessing to train our model using the presented training corpus (section 4.2). Section 4.3 shows how we subsample the most frequent words instead of performing a more classical stop word removal. Additionally, section 4.5 details our method to find the right hyper-parameters, which is a crucial task to obtain high-quality word embeddings.
4.1 Training Corpus

In line with the set-up of [46], we train our embeddings on the concatenation of the UKWaC and WaCkypedia_EN corpora. UKWaC contains more than one billion word tokens in English extracted through web crawling limited to the .uk domain [7]. WaCkypedia_EN is an English Wikipedia dump from 2009 containing about 800 million tokens. As a result, our training corpus is made of a huge collection of words found in very varied contexts. Therefore, we expect our word embeddings to capture the linguistic properties of the word types from all their different usages.
4.2 Preprocessing

Before we can train our embeddings with the corpus presented in section 4.1, we need to preprocess it. The “training unit” of our implementation is a sentence of words. Each sentence needs to be terminated with an end-of-line character so that it can be correctly processed. As a consequence, the training corpus needs to go through a few preprocessing steps so that our implementation can correctly learn from it.
4.2.1 UKWaC Corpus

Each fetched text from the web is preceded by a URL, which needs to be stripped. Additionally, the character encoding of the resulting text needs to be transformed from ISO-8859-15 to UTF-8. Then, the resulting text can be tokenized, e.g. with the NLTK toolkit [2]. Additionally, we lowercase all tokens.
4.2.2 WaCkypedia_EN

In contrast to the previous corpus, the needed tokens from this corpus appear directly in the first column of the file. The remaining columns contain morphosyntactic information and they are thus not relevant for our purposes. Hence, only the first column is kept for training. Each sentence beginning is indicated with a start-of-sentence marker (which is removed in the resulting text) and each sentence ends with an end-of-sentence marker (which is transformed into an end-of-line character). These simple substitutions and lowercasing can be easily achieved by using e.g. the Unix tools sed and tr. Also for this corpus, the character encoding needs to be adapted, in this case from ISO-8859-1 to UTF-8.
4.2.3 Removal of Infrequent Words
We remove all word tokens belonging to word types that appear less than 100 times. This feature was already included in word2vec, hence we do not need to modify the corpus file itself to discard the infrequent words.
4.3 Subsampling of Frequent Words
Most of the most frequent words, such as “the”, “a” or “it”, are very uninformative when we want to exploit word co-occurrence to compute word embeddings. For example, the co-occurrence of “France” and “Paris” is much more interesting for our model than the co-occurrence of “France” and “the” [33]. Instead of removing most of these so-called stop words (as many NLP approaches do), we follow the subsampling approach proposed by Mikolov et al. in [33]: each word wi of the training set can be randomly skipped (during training) according to the following probability formula:

P(wi) = 1 − √(t / f(wi))    (4.1)

where t is a predefined threshold and f(wi) is the frequency of word wi. Mikolov et al. state that this value is “typically around t = 10⁻⁵”, hence we use this value. Oddly, the word2vec default value for it is t = 10⁻³, despite Mikolov's recommendation and even a code comment stating that “default is 1e-3, useful range is (0, 1e-5)”. Although Mikolov et al. proposed the formula above, word2vec uses a slightly different formula:

P(wi) = (f(wi) − t) / f(wi) − √(t / f(wi))    (4.2)
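Both variants can be sketched in a few lines; the frequency below is a made-up example, with t = 10⁻⁵ as discussed above.

    import math, random

    t = 1e-5

    def p_discard_original(f):          # equation 4.1
        return 1.0 - math.sqrt(t / f)

    def p_discard_word2vec(f):          # equation 4.2
        return (f - t) / f - math.sqrt(t / f)

    f_the = 0.05                        # relative frequency of a very frequent word type
    print(p_discard_original(f_the), p_discard_word2vec(f_the))

    # During training, a token is skipped when a uniform draw falls below P(w_i).
    skip = random.random() < p_discard_word2vec(f_the)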
No. | Task Name   | Word pairs | Pairs found | Correlation (word2vec formula) | Correlation (original formula)
1   | WS-353      | 353        | 323         | 0.4968                         | 0.3558
2   | WS-353-SIM  | 203        | 188         | 0.5853                         | 0.4539
3   | WS-353-REL  | 252        | 235         | 0.4388                         | 0.2866
4   | MC-30       | 30         | 27          | 0.3973                         | 0.3136
5   | RG-65       | 65         | 54          | 0.5592                         | 0.5462
6   | Rare-Word   | 2034       | 368         | 0.3530                         | 0.3573
7   | MEN         | 3000       | 2411        | 0.5681                         | 0.4794
8   | MTurk-287   | 287        | 271         | 0.5767                         | 0.5149
9   | MTurk-771   | 771        | 745         | 0.5136                         | 0.4496
10  | YP-130      | 130        | 94          | 0.1672                         | 0.2099
11  | SimLex-999  | 999        | 955         | 0.2039                         | 0.1876
12  | Verb-143    | 143        | 144         | 0.2485                         | 0.1755
Table 4.1: Similarity test results from wordvectors.org for embeddings trained on the validation set using the original subsampling formula (equation 4.1) and the one implemented in word2vec (equation 4.2).
In the tests that we ran with the threshold t = 10^{-5} to determine whether there is a significant difference between the two variants, the word2vec formula led to slightly better results, so we decided to keep it instead of taking the formula from Mikolov's paper. The results of one of these tests are shown in table 4.1. The details of the performed similarity tests are presented in section 4.4.
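To make the difference between the two variants concrete, the following sketch computes both discard probabilities for a given relative word frequency; the listed frequencies are merely illustrative.

```python
# Discard probabilities for the two subsampling variants (illustrative sketch).
import math
import random

T = 1e-5  # subsampling threshold t

def p_discard_original(freq):
    # Equation 4.1: P(w) = 1 - sqrt(t / f(w))
    return max(0.0, 1.0 - math.sqrt(T / freq))

def p_discard_word2vec(freq):
    # Equation 4.2: P(w) = (f(w) - t) / f(w) - sqrt(t / f(w))
    return max(0.0, (freq - T) / freq - math.sqrt(T / freq))

def keep_token(freq):
    # During training, a token is skipped with the chosen discard probability.
    return random.random() >= p_discard_word2vec(freq)

for f in (1e-2, 1e-3, 1e-4, 1e-5):  # illustrative relative frequencies
    print(f, p_discard_original(f), p_discard_word2vec(f))
```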
4.4 Word Similarity
We evaluate word similarity with the tool available at wordvectors.org [4, 14]. It provides the correlation between similarity rankings produced by our embeddings (i.e. lists of word types close to a query word, sorted by descending similarity) and the rankings from the benchmarks. From the 12 datasets available there, we discard the WS-353 dataset [15] because we already use it as a validation set for hyper-parameter tuning (see section 4.5). For the same reason, we also discard the derived datasets WS-353-SIM and WS-353-REL [6], since any apparently good result on them might just be caused by overfitting on WS-353, on which they obviously depend. From the remaining datasets, we use for evaluation those that also appear in [46], in order to compare our results with theirs. The evaluation datasets are thus: MC-30 [34], RG-65 [39], MEN [11], YP-130 [47] and SimLex-999 [19]. The similarity between two embeddings is calculated as the cosine similarity between the means of their distributions. The reported correlation corresponds to Spearman's rank correlation coefficient [36]. Consequently, the variances need to be removed from the embeddings before performing this evaluation. Additionally, we filter out all the embeddings
Hyper-parameter | word2vec | [46]     | Meaning
iter            | 5        | 1        | Number of epochs
alpha           | 0.025    | 0.5*     | Initial learning rate (η)
min-count       | 5        | 100      | Word types below this value are discarded
sample          | 1e-3     | (1e-5)   | Threshold for word subsampling (t)
batch           | n.a.     | 20       | Number of sentences per mini-batch
epsilon         | n.a.     | 0.1* (1) | AdaGrad ridge (ε)
size            | 100      | 50       | Size of the word vector ≡ neurons per layer
window          | 5        | (5)      | Window size (max. skip length)
energy          | n.a.     | both     | Energy function: EL or KL
loss-margin     | n.a.     | 5*       | Margin of the max-margin loss function
C               | n.a.     | 100*     | Max. ℓ2-norm for the distribution means
M               | n.a.     | 1000*    | Max. value for a covariance matrix entry
m               | n.a.     | 0.25*    | Min. value for a covariance matrix entry
Table 4.2: Hyper-parameters from our model with their respective default values in word2vec (when defined) and the values used in [46] for evaluation. The values marked with ’*’ were obtained after contacting the authors. The values between brackets were assumed to be the same as in the word2vec paper [33], in line with the evaluation description from [46].
that are not relevant for the evaluation by means of the Python script available at [4].
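The core of this evaluation can be summarized with the following sketch (it is not the wordvectors.org script): the cosine similarity between the distribution means is compared to the human judgements with Spearman's rank correlation. The data structures and names are hypothetical.

```python
# Sketch of the similarity evaluation: cosine between distribution means,
# then Spearman correlation with the human judgements.
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate_similarity(means, benchmark):
    """means: dict word -> mean vector (the variances are ignored here).
    benchmark: list of (word1, word2, human_score) tuples."""
    model_scores, human_scores = [], []
    for w1, w2, score in benchmark:
        if w1 in means and w2 in means:   # skip pairs with out-of-vocabulary words
            model_scores.append(cosine(means[w1], means[w2]))
            human_scores.append(score)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho, len(model_scores)          # correlation and number of "pairs found"
```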
4.5 Hyper-parameter Tuning
One of the main disadvantages of training word representations with a neural network model like ours is the need for hyper-parameter tuning. The full list of hyper-parameters of our model is presented in table 4.2. The quality of the word embeddings (regardless of the selected learning approach) depends heavily on a good set of hyper-parameters [27]. Indeed, we checked empirically that a bad choice of hyper-parameters usually results in word embeddings that are hardly better than random embeddings. Furthermore, we found out that the evaluation results are very unstable with respect to some hyper-parameters: a minimal change in one of them may drop the performance dramatically, as shown in figure 4.1. Since the embeddings trained according to the hyper-parameters suggested by the authors performed very poorly in the similarity evaluation (see table 4.3), we had to devise a clear hyper-parameter tuning strategy to obtain acceptable results. For the sake of reducing the search space, we first took all the values mentioned in [46] and then the default values of word2vec, with two exceptions: the sample
(1) We had not considered this value as a hyper-parameter but as a constant that we had set to 1e-8 until we contacted the authors of [46]. Taking 0.1 instead proved to be generally a better choice in the tests we had carried out.
Figure 4.1: Instability of the quality of the embeddings trained on the validation set with respect to the hyper-parameter C. For each value of C, the other hyper-parameters were kept constant as presented in table 5.2. A higher correlation value indicates better performance in the similarity test.
hyper-parameter (we took 1e-5, as explained in section 4.3) and the initial learning rate, because we discovered a lot of potential improvement by tuning it. This leaves six hyper-parameters to be tuned: alpha, energy, loss-margin, C, M and m. As a performance measure, we carry out a similarity test on the WS-353 dataset in the same way as presented in section 4.4, but training on a validation set. This validation set is created by extracting one million random sentences from the corpus. With this set-up, we perform a grid search taking the following steps:

1. Either EL or KL divergence is considered for tuning.
2. The similarity test with arbitrary hyper-parameters is run. This constitutes the first baseline.
3. For each hyper-parameter to be tuned, we define two hyper-parameter sets to be validated: one with a higher value with respect to the baseline and one with a lower value. The rest of the hyper-parameters remain equal to the baseline.
4. After testing the ten resulting hyper-parameter sets, the one with the highest performance becomes the new baseline.
5. The two previous steps are repeated until no new hyper-parameter set improves the baseline by more than a threshold that we set to 0.001.

When we tried this method running just one epoch on the validation set (as we do to train the final embeddings on the training set), the displayed correlation was too
Hyper-parameter | Value
iter            | 1
min-count       | 100
sample          | 1e-5
batch           | 20
size            | 50
window          | 5
loss-margin     | 5
C               | 100
M               | 1000
m               | 0.25

Task Name   | Word pairs | Pairs found | EL (η = 0.5) | KL (η = 0.5) | EL (η = 0.025) | KL (η = 0.025) | [46]
MC-30       | 353        | 353         | 12.45        | 24.59        | 0.49           | -3.09          | 68.50
RG-65       | 65         | 65          | 28.62        | 30.19        | 14.56          | 12.27          | 77.00
MEN         | 3000       | 3000        | 16.65        | 15.87        | 5.27           | 5.45           | 70.18
YP-130      | 130        | 94          | 0.40         | 11.63        | 8.93           | 5.65           | 39.30
SimLex-999  | 999        | 998         | 17.88        | 17.16        | -1.06          | 0.75           | 30.50

(Correlations are given in %.)

Table 4.3: Results of our similarity evaluation compared with the values published in [46], applying the hyper-parameters suggested by the authors with the two available energy functions (EL and KL) and two different learning rates η. A higher correlation means better performance in the similarity task.
low to distinguish clear differences in performance among the hyper-parameter sets. Therefore, we decided to increase the number of epochs from 1 to 4 so that we could see a more significant result. For the opposite reason (to avoid oversimplifying the task), we reduced the minimum count for a word type to appear in the vocabulary from 100 to 10 during the tuning on the validation set.
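The coordinate-wise search of section 4.5 can be sketched as follows; evaluate_on_validation is a hypothetical stand-in for training on the validation set (4 epochs, min-count 10) and scoring on WS-353, and the step functions that propose a higher and a lower value per hyper-parameter are left abstract.

```python
# Sketch of the coordinate-wise search described above (names are hypothetical).
def tune(baseline, step_up, step_down, evaluate_on_validation, eps=0.001):
    best_params = dict(baseline)
    best_score = evaluate_on_validation(best_params)      # step 2: first baseline
    improved = True
    while improved:                                        # step 5: repeat until converged
        improved = False
        candidates = []
        for name in ("alpha", "loss-margin", "C", "M", "m"):
            for step in (step_up, step_down):              # step 3: one higher, one lower value
                params = dict(best_params)
                params[name] = step(name, best_params[name])
                candidates.append(params)
        for params in candidates:                          # step 4: keep the best of the ten sets
            score = evaluate_on_validation(params)
            if score > best_score + eps:
                best_score, best_params, improved = score, params, True
    return best_params, best_score
```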
4.6 Specificity and Uncertainty
Reproducing the experiment described in [46], we sort the 100 nearest neighbors (according to the cosine distance) of some query words by descending variance, where the variance is defined as the determinant of the covariance matrix. Since the covariance matrices are always diagonal in our model, the variance is determined by the product of all the elements of the diagonal. This may cause numeric issues when we allow these values to be high, so instead of calculating the product, we compute the sum of logarithms. The logarithm is a monotone function, hence it preserves the order of the calculated variances, which is all that matters for this experiment.
We expect the variances (defined as the determinant of the covariance matrix) to capture the degree of specificity: very specific word types should have a lower variance than more general ones. The variety of contexts in which a word type is found can also play a significant role in determining the variance: word types with many different meanings should have a higher variance than words with just one concrete meaning.
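For diagonal covariance matrices, ranking by variance thus reduces to comparing sums of logarithms of the diagonal entries, as in the following sketch (variable names are hypothetical).

```python
# Rank nearby words by the (log-)determinant of their diagonal covariance matrices.
import numpy as np

def log_variance(diag_cov):
    # log det(Sigma) for a diagonal Sigma: sum of the logs of the diagonal entries.
    # Numerically safer than multiplying the entries directly.
    return float(np.sum(np.log(diag_cov)))

def sort_by_specificity(neighbors, diag_covs):
    """neighbors: list of words; diag_covs: dict word -> diagonal of its covariance.
    Returns the words sorted by descending variance (most general first)."""
    return sorted(neighbors, key=lambda w: log_variance(diag_covs[w]), reverse=True)
```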
4.7 Visualization
We visualize our word embeddings by means of kernel PCA with the LS-SVMlab toolbox [1]. Due to the higher expressive power of kernel PCA compared to linear PCA, we expect better visualizations than if we applied linear PCA alone. Furthermore, this method has already been successfully applied to data visualization, e.g. in [42]. Since the mean µ ∈ R^N and the covariance matrix Σ ∈ R^{N×N} of a Gaussian embedding are different geometric objects to be visualized, we perform kernel PCA first on µ. As a result, we obtain a projection matrix V ∈ R^{2×N} with which we compute the 2-dimensional projected mean µ′:

$$ \mu' = V \mu \qquad (4.3) $$

Then we project the covariance matrix the way it is defined for multivariate Gaussian distributions:

$$ \Sigma' = V \Sigma V^\top \qquad (4.4) $$
Finally, we obtain a 2-dimensional multivariate Gaussian distribution that we can plot with the help of any suitable software such as Matlab.
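Assuming the 2×N projection matrix V has already been obtained from the kernel PCA step, equations 4.3 and 4.4 amount to the following sketch.

```python
# Project a Gaussian embedding (mu, Sigma) to two dimensions,
# assuming a 2 x N projection matrix V has already been computed.
import numpy as np

def project_gaussian(V, mu, Sigma):
    mu_2d = V @ mu               # equation 4.3: mu' = V mu
    Sigma_2d = V @ Sigma @ V.T   # equation 4.4: Sigma' = V Sigma V^T
    return mu_2d, Sigma_2d

# Example with random data of dimension N = 50:
# V = np.random.randn(2, 50)
# mu = np.random.randn(50)
# Sigma = np.diag(np.abs(np.random.randn(50)))
# mu_2d, Sigma_2d = project_gaussian(V, mu, Sigma)
```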
Chapter 5
Results
In this chapter, we present the results of the experiments defined in chapter 4. We first evaluate on different similarity tasks in section 5.1. Since these tests essentially evaluate the quality of the means, we also check whether the variances of our embeddings capture relevant information by evaluating specificity and uncertainty in section 5.2. Finally, we visualize some embeddings according to the proposed technique in section 5.3.
5.1 Similarity Evaluation
As we already observed in table 4.3, the hyper-parameter values provided by Vilnis and McCallum did not help us obtain word embeddings that perform well on similarity tasks. During the development of our model, we discovered word2gauss, a Python implementation for training Gaussian embeddings that was recently released at [5]. The word2gauss default values for most of the hyper-parameters are several orders of magnitude lower than those shown in table 4.3. They proved to perform much better with our implementation than the values given by Vilnis and McCallum, but still far from the results claimed in [46]. After this finding, we contacted the developers of word2gauss to ask whether they had managed to reproduce the results of [46]. They reported that they had trouble finding the right hyper-parameters too and that word2gauss was not as accurate as a simple word2vec model.
Despite the feedback from the word2gauss developers, we continued our search for hyper-parameters that produce a model that at least approaches the results claimed in [46]. Of all the models we trained, the best one is presented in table 5.1. We can see that our model outperforms neither the skip-gram model nor the results of Vilnis and McCallum. Although a set of hyper-parameters that reproduces the results claimed in [46] might exist (we cannot rule this out given the difficulty of finding good hyper-parameters), the considerable time investment required to tune the hyper-parameters makes us doubt the eventual advantages of this model over the continuous skip-gram model. Moreover, the embeddings presented in table 5.1 were computed with a strong regularization of the variances: they may only stay within the range [3, 4]. Considering
Hyper-parameter | Value
alpha           | 0.135
loss-margin     | 4
C               | 10
M               | 4
m               | 3.00

Task Name   | Word pairs | Pairs found | SG-50 | SG-100 | [46]  | Brito
MC-30       | 30         | 27          | 63.96 | 66.76  | 68.50 | 67.56
RG-65       | 65         | 54          | 70.01 | 69.38  | 77.00 | 64.53
MEN         | 3000       | 2411        | 70.27 | 70.70  | 70.18 | 65.98
YP-130      | 130        | 94          | 39.34 | 35.76  | 39.30 | 31.38
SimLex-999  | 999        | 955         | 29.39 | 31.13  | 30.50 | 28.75

(Correlations are given in %.)

Table 5.1: Similarity results of our approach (Brito) compared with the values published in [46] for the skip-gram model with vector sizes 50 (SG-50) and 100 (SG-100) and for the Gaussian embeddings model of [46]. The learned word embeddings were trained using the shown hyper-parameters. The values of the other hyper-parameters are identical to those displayed in table 4.3.
this important limitation on the variances, our model is actually behaving much like the skip-gram model, i.e. the embeddings are essentially determined by the means (the vectors in the skip-gram model) whereas the variances are almost constant. All the embeddings we trained that performed well on similarity tasks always involved, to some extent, a significant regularization of the variances. Therefore, we suspect that the variances do not add any advantage for similarity tasks. However, we would need to train embeddings using an even wider range of hyper-parameters to confirm this point.
5.2 Specificity and Uncertainty
Although we limit the evaluation of our embeddings on specificity and uncertainty to an exploratory analysis, in the same way as it is done in [46], it should suffice to check whether not only the means of our embeddings but also the covariance matrices capture linguistic properties. Few hyper-parameter sets resulted in word embeddings in which the specificity of the word types can be (to some extent) related to the variance of their respective embeddings. The results of one of them are presented in table 5.2. There we show some words closely related to the query words, sorted by descending variance. In order to compare our results with those from [46], we use the same query words. We could find some good partial orders within the presented words. For instance, “healthy” is clearly a more general word than “gluten-free”. Also, “mappings” refers
Hyper-parameter | Value
alpha           | 0.5
loss-margin     | 4
C               | 1
M               | 10
m               | 0.1

Query word | Nearby words sorted by descending variance
rock       | indie, song/, nazz, hipsters
food       | healthy, aromas, low-salt, paninis, desserts, breads, flavouring, linseeds, coffee, nectar, pre-cooked, sugar, fondant, beefburgers, non-kosher, lasagne, gluten-free, mouthwatering, home-cooked
feeling    | scared, worry, silly, drunk, awful, sorrow, hurts, uncomfortable, mood, amaze, filthy, blessedness, forgetful, freaked, flippant, tedious
algebra    | boolean, mappings, graphs, topological, isomorphic, meromorphic, riemannian, hermitian, injective, integrable, centroid, hyperelliptic

Table 5.2: Some words found within the 100 nearest words according to the cosine distance, sorted by descending variance. The learned word embeddings were trained using the shown hyper-parameters. The values of the other hyper-parameters are identical to those displayed in table 4.3.
to a much more general concept than “integrable” (we can indeed entail one from the other if we consider semantic properties exclusively). However, this kind of order is reversed when we consider “isomorphic” and “injective”: the former is a specific case of the latter and should thus have a lower variance. We must also admit that we found many unrelated words among the presented lists, such as “doorbells” within the words close to “food”, partially because of the suboptimal performance on word similarity of these particular embeddings. Nonetheless, this might also be the case among the omitted nearby words of the query words presented in [46], since they do not show the full list of 100 nearest neighbors for each query word.
In general, most of our collections of embeddings trained on the full training corpus performed very poorly on word specificity. This may be caused by the fact that we optimized our embeddings by validating the hyper-parameters on a similarity task, whereas “it is generally advisable to tune all hyperparameters, as well as algorithm-specific hyperparameters, for the task at hand” [27]. In order to do this, we would need to define a “specificity dataset” containing lists of related words sorted by variance, so that we can calculate the correlation between our validation set and this new dataset. This is beyond the scope of this work, though.
It is also remarkable that all our embeddings that performed somewhat well in this evaluation had rather bad results in the word similarity tests. This fact makes
Figure 5.1: Visualizations of the words “madrid”, “spain”, “paris” and “france” after dimensionality reduction with kernel PCA with RBF kernel parameter σ² = 50. The means of the words are at the left of each label. The plotted lines are level curves where the density function equals 0.05.
us suspect that both tasks are less related than we expected and that a good result in one of the tasks is a poor predictor of the performance in the other.
5.3 Visualization
As a last evaluation task, we visualize some word embeddings via kernel PCA as explained in section 4.7. On the whole, the visualizations show that the means of our Gaussian embeddings capture some linguistic properties very precisely. For example, the capital-country relation can be clearly seen in figure 5.1, where madrid − spain + france ≈ paris. Semantic properties are also clear, e.g. in figure 5.2, where the word “baroque” appears closer to “Bach” than any other word, since it is more related to that name than any of the adjectives displayed. Even clearer is the case of figure 5.3. Here there is an evident co-occurrence of the word pairs “queen” with “Elizabeth” and “king” with
Figure 5.2: Visualizations of the words “bach”, “composer”, “classical”, “baroque”, “famous” and “man” after dimensionality reduction with kernel PCA with RBF kernel parameter σ² = 100. The means of the words are at the left of each label. The plotted lines are level curves where the density function equals 0.021.
“Baudouin”, but at the same time there is a strong semantic relationship between “king” and “queen”. As a result, the means of the “king” and “queen” Gaussian embeddings appear very close to each other, while the other word embeddings are located farther away, with “Elizabeth” closer to “queen” than to “king” and “Baudouin” closer to “king” than to “queen”. On the other hand, the visualizations of the variances are not very informative. For instance, although it is reasonable that “Bach” and “baroque” have a lower variance than “classical”, we would not expect “man” to have a lower variance than “classical”. For this reason, together with the results of the specificity evaluation, we think that our model is not learning the variances optimally.
Figure 5.3: Visualizations of the words “baudouin”, “king”, “elizabeth” and “queen” after dimensionality reduction with kernel PCA with RBF kernel parameter σ² = 1. The means of the words are at the left of each label. The plotted lines are level curves where the density function equals 0.15.
Chapter 6
Conclusion
Although the output of the neural network is not the final objective of our model and the implemented network does not follow the usual standards of artificial neural networks, formulating our problem as training a neural network allows us to profit from advanced techniques developed in this area, such as backpropagating the error gradients with AdaGrad, which is known to be especially suitable for very sparse data such as natural language documents.

Hyper-parameter tuning is a general pitfall of neural network based models. This is in line with the need to define the number of neurons per layer in the standard multilayer perceptron formulation. However, in our approach this problem becomes even harder due to the high number of hyper-parameters to determine, despite fixing the values mentioned in [46]. If we had had to perform an exhaustive grid search on all the hyper-parameters, our search method would have taken even longer to find acceptable hyper-parameters.

We could not reproduce the results claimed in [46], perhaps because of the difficulty of finding good hyper-parameters. We only observed performance comparable to the continuous skip-gram model. Despite some relatively good results when evaluating word specificity, these only occur in a few embedding sets and always include considerable “noise” within the words sorted by variance. This may also be caused by insufficiently tuned hyper-parameters, or it may indicate that we should configure a validation set where the variances play a more important role. But again, the potential advantages of adding variance to word vectors cannot be exploited if good hyper-parameters are so hard (and time-consuming) to find.

We could also visualize some interesting linguistic relations among our word embeddings by applying kernel PCA. The meaning of the positions of the distribution means was quite evident in the presented examples, but the plotted variances were only sometimes informative. This is, in our view, another signal that the variances are the weak spot of our embeddings.

Regarding our software implementation, the training speed can still be improved by approximating the energy function calculation with a precomputed table, similarly to what word2vec does. The weight updates could also be optimized through a
more efficient algorithm, e.g. by using hash tables instead of storing all the mini-batch updates in a large array whose length equals the vocabulary size.

If we continued our work, we would try to find hyper-parameters with a different validation test. In particular, we would check whether calculating similarities with the EL energy instead of the cosine distance during validation would result in more informative variances of the Gaussian embeddings. We also omitted using spherical variances as suggested in [46], which might have some impact on the resulting word embeddings.

To sum up, adding an uncertainty measure to word vectors via multivariate Gaussian distributions enables some new possibilities, such as determining the specificity of a word compared to similar ones. However, a long process of hyper-parameter tuning is required to achieve any acceptable result. Therefore, the few advantages of this model are, in our opinion, not worth the increase in complexity compared to the continuous skip-gram model.
Bibliography
[1] LS-SVMlab toolbox. URL: http://www.esat.kuleuven.be/sista/lssvmlab/.
[2] Natural language toolkit. URL: http://nltk.org/.
[3] Original word2vec package. URL: https://code.google.com/archive/p/word2vec/.
[4] Word vector evaluation. URL: http://wordvectors.org/.
[5] word2gauss. URL: https://github.com/seomoz/word2gauss.
[6] E. Agirre, E. Alfonseca, K. Hall, J. Kravalova, M. Paşca, and A. Soroa. A study on similarity and relatedness using distributional and wordnet-based approaches. In NAACL-HLT, pages 19–27, 2009.
[7] M. Baroni, S. Bernardini, A. Ferraresi, and E. Zanchetta. The WaCky Wide Web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226, 2009.
[8] M. Baroni, G. Dinu, and G. Kruszewski. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In ACL (1), pages 238–247, 2014.
[9] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
[10] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.
[11] E. Bruni, N.-K. Tran, and M. Baroni. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49:1–47, 2014.
[12] S. Dreiseitl and L. Ohno-Machado. Logistic regression and artificial neural network classification models: a methodology review. Journal of Biomedical Informatics, 35(5-6):352–359, 2002.
[13] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
[14] M. Faruqui and C. Dyer. Community evaluation and exchange of word vectors at wordvectors.org. In 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, USA, June 2014. Association for Computational Linguistics.
[15] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin. Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1):116–131, January 2002.
[16] J. R. Firth. A synopsis of linguistic theory 1930-1955. Studies in Linguistic Analysis, pages 1–32, 1957.
[17] Z. S. Harris. Distributional structure. Word, 10(3):146–162, 1954.
[18] R. Hecht-Nielsen. Kolmogorov's mapping neural network existence theorem. In International Conference on Neural Networks, volume 3, pages 11–13. New York: IEEE Press, 1987.
[19] F. Hill, R. Reichart, and A. Korhonen. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. CoRR, abs/1408.3456, 2014.
[20] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
[21] T. Jebara, R. Kondor, and A. Howard. Probability product kernels. Journal of Machine Learning Research, 5:819–844, 2004.
[22] D. Jurafsky and J. H. Martin. Speech and Language Processing. Pearson Education, 2nd edition, 2009.
[23] N. Kalchbrenner and P. Blunsom. Recurrent continuous translation models. In EMNLP, volume 3, page 413, 2013.
[24] J. Keuper and F. Pfreundt. Asynchronous parallel stochastic gradient descent - A numeric core for scalable distributed machine learning algorithms. CoRR, abs/1505.04956, 2015.
[25] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang. A tutorial on energy-based learning. Predicting Structured Data, 2006.
[26] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867, 1993.
[27] O. Levy, Y. Goldberg, and I. Dagan. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225, 2015.
[28] D. J. MacKay. Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6(3):469–505, 1995.
[29] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.
[30] J. Mercer. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 209:415–446, 1909.
[31] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In ICLR Workshop, 2013.
[32] T. Mikolov, Q. V. Le, and I. Sutskever. Exploiting similarities among languages for machine translation. ArXiv e-prints, September 2013.
[33] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119, 2013.
[34] G. A. Miller and W. G. Charles. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28, 1991.
[35] M. Minsky and S. Papert. Perceptrons. 1969.
[36] J. L. Myers and A. D. Well. Research Design & Statistical Analysis. Routledge, 1st edition, June 1995.
[37] A. Passos, V. Kumar, and A. McCallum. Lexicon infused phrase embeddings for named entity resolution. ArXiv e-prints, April 2014.
[38] B. Recht, C. Ré, S. Wright, and F. Niu. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.
[39] H. Rubenstein and J. B. Goodenough. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633, October 1965.
[40] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by backpropagating errors. Nature, 323(6088):533–536, 1986.
[41] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.
[42] J. A. K. Suykens. Data visualization and dimensionality reduction using kernel maps with a reference point. IEEE Transactions on Neural Networks, 19(9):1501–1517, 2008.
[43] J. A. K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle. Least Squares Support Vector Machines. World Scientific Pub. Co, Singapore, 2012.
[44] J. A. K. Suykens, T. Van Gestel, J. Vandewalle, and B. De Moor. A support vector machine formulation to PCA analysis and its kernel version. IEEE Transactions on Neural Networks, 14(2):447–450, 2003.
[45] J. Turian, L. Ratinov, and Y. Bengio. Word representations: A simple and general method for semi-supervised learning. In 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394, 2010.
[46] L. Vilnis and A. McCallum. Word representations via Gaussian embedding. In ICLR, 2015.
[47] D. Yang and D. M. W. Powers. Verb similarity on the taxonomy of WordNet. In 3rd International WordNet Conference (GWC-06), Jeju Island, Korea, 2006.
KU Leuven
2015 – 2016
Master thesis filing card
Student: Eduardo Alfredo Brito Chacón
Title: Learning Gaussian Word Representations with Neural Networks
Dutch title: Gaussische Woordrepresentaties Leren met Neurale Netwerken
UDC: 681.3*I20
Abstract: Word embeddings map each word type to a point in a vector space, enabling operations on words as simple linear combinations of vectors. Thanks to recent neural network based models that can train them efficiently, they have been very successful in various NLP tasks in the past years due to the rich linguistic properties that these distributed word representations can capture. However, they cannot measure any uncertainty of the represented words or any asymmetric relationship between words such as entailment. Hence, a model was proposed that learns multivariate Gaussian distributions instead of just vectors. We reproduce this model and evaluate the word similarity and word specificity that these alternative embeddings can achieve, obtaining less impressive results than those claimed by the authors, and at the cost of a significant increase in complexity compared to previous models. Additionally, we show how to visualize these word representations, as well as all the details necessary to implement such a model, whose lack of extensive documentation makes the work considerably harder.
Thesis submitted for the degree of Master of Science in Artificial Intelligence, option Engineering and Computer Science Thesis supervisor: Prof. dr. Marie-Francine Moens Assessor: Prof. dr. ir. Johan Suykens Mentor: Geert Heyman