Sketching Word Vectors Through Hashing

Behrang QasemiZadeh, DFG SFB 991, Universität Düsseldorf, Düsseldorf, Germany
Laura Kallmeyer, DFG SFB 991, Universität Düsseldorf, Düsseldorf, Germany
Peyman Passban, ADAPT Centre, Dublin City University, Dublin, Ireland
[email protected]
[email protected]
[email protected]
Abstract
We propose a new fast word embedding technique based on hash functions. The method is a derandomization of a new type of random projection: by disregarding the classic constraint used in designing random projections (i.e., preserving pairwise distances in a particular normed space), our solution exploits extremely sparse non-negative random projections. Our experiments show that the proposed method achieves competitive results, comparable to neural embedding learning techniques, at only a fraction of their computational cost. While the proposed derandomization improves the computational and space efficiency of our method, the possibility of applying weighting methods such as positive pointwise mutual information (PPMI) to our models after their construction (and at a reduced dimensionality) imparts a high discriminatory power to the resulting embeddings. The method also comes with the other known benefits of random-projection-based techniques, such as ease of update.
1 Introduction
Word embedding techniques (i.e., using distributional frequencies to produce word vectors of reduced dimensionality) have become a cornerstone of modern natural language processing and information retrieval systems. These quantifications of words are often justified by Harris' Distributional Hypothesis [Harris, 1954]: words of comparable linguistic properties appear with/within a similar set of 'contexts'. For instance, words of similar meaning co-occur with a similar set of context words {c1, ..., cn}. This hypothesis implies that if this set of context words is partitioned randomly into m buckets, e.g., {{c1, ..., cx}1, ..., {cy, ..., cn}m}, then related words still co-occur with a similar set of buckets. We design a new random projection that exploits this presumption and, accordingly, propose a new hash-based word embedding technique. In the remainder of this paper, we describe the method in Section 2. Section 3 provides a brief overview of related
work. Section 4 reports results from a number of experiments, followed by a conclusion in Section 5.
2 The Proposed Embedding Technique
Let us assume that we build an m-dimensional embedding for a word w that co-occurs with a number of context words c. To build an embedding for each w (denoted by ~w), the following steps are taken:

    1: initialize an m-dimensional zero vector ~w
    2: for each context word c co-occurring with w do
    3:     d <- abs(hash(c) % m)
    4:     ~w_d <- ~w_d + 1
    5: return ~w

Above, ~w_d is the d-th component of ~w; the hash function assigns an (ideally unique, and independently and identically distributed) hash code (e.g., an integer) to each context word (object). The abs function returns the absolute value of an input number; % is the modulus operator, which gives the remainder of the division of the generated hash code by the chosen value m. We choose the Jenkins hash function in our implementation since it shows a low collision rate for short words (byte sequences):

    int hash(byte[] key) {
      int i = 0;
      int hash = 0;
      while (i != key.length) {
        hash += key[i++];
        hash += hash << 10;
        hash ^= hash >> 6;
      }
      hash += hash << 3;
      hash ^= hash >> 11;
      hash += hash << 15;
      return hash;
    }

Vectors built this way can be compared directly using a rank correlation measure: we use Goodman and Kruskal's γ [Goodman and Kruskal, 1954], a measure closely related to Kendall's τ [Kendall, 1938]. For two embeddings ~w_x and ~w_y and a pair of component indices i and j, let v = (w_xi − w_xj)(w_yi − w_yj); the pair is concordant if v > 0 and discordant if v < 0. If v = 0, the pair is neither concordant nor discordant. Let p and q be the number of concordant and discordant pairs; γ is then given by [Chen and Popovich, 2002, p. 86]:

$\gamma = \frac{p - q}{p + q}.$

As with other statistical models, normalizing the counted frequencies in the obtained embeddings using a 'weighting process' enhances results. Evidently, a weighting process can eliminate uninformative and irrelevant frequencies and boost the impact of informative and discriminatory ones, e.g., by normalizing raw frequencies by the expected and marginal frequencies, as in the popular PPMI weighting [Turney, 2001; Bouma, 2009]. If W_{p×m} is the matrix consisting of the p row vectors ~w (i.e., the set of obtained embeddings in our model), the PPMI weight for a component w_xy of W is:

$\mathrm{ppmi}(w_{xy}) = \max\left(0,\ \log \frac{w_{xy} \times \sum_{i=1}^{p}\sum_{j=1}^{m} w_{ij}}{\sum_{i=1}^{p} w_{iy} \times \sum_{j=1}^{m} w_{xj}}\right).$

For the weighted vectors (which often have a rectified Gaussian distribution), we expect a correlation measure such as Pearson's r to be a more suitable means of computing similarities between vectors. Note that the procedure suggested above can be serialized in various ways to meet the requirements of the applications for which the method is used. For example, when processing text streams (or sequentially scanning very large corpora), a new vector can be added or an existing one updated on the fly. These vectors can be used at any time since no actual training phase is involved; not to mention that the process can easily be carried out in parallel. To facilitate the weighting process, one m-dimensional vector that holds the sum of the coordinates of all vectors (or of a random subset of them) can reside in memory at all times.
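To make these post-processing steps concrete, the following is a minimal sketch (in Java, matching the style of the listing above, but not the authors' implementation) of PPMI weighting applied to the p-by-m count matrix W and of Goodman and Kruskal's γ between two unweighted count vectors; the class and method names are ours, chosen for illustration.

    public class PostProcessing {
        // PPMI weighting of the count matrix W:
        // ppmi(w_xy) = max(0, log(w_xy * total / (rowSum_x * colSum_y))).
        static double[][] ppmi(int[][] W) {
            int p = W.length, m = W[0].length;
            double[] rowSum = new double[p];
            double[] colSum = new double[m];
            double total = 0.0;
            for (int x = 0; x < p; x++) {
                for (int y = 0; y < m; y++) {
                    rowSum[x] += W[x][y];
                    colSum[y] += W[x][y];
                    total += W[x][y];
                }
            }
            double[][] out = new double[p][m];
            for (int x = 0; x < p; x++) {
                for (int y = 0; y < m; y++) {
                    if (W[x][y] > 0) {
                        out[x][y] = Math.max(0.0,
                                Math.log(W[x][y] * total / (rowSum[x] * colSum[y])));
                    }
                }
            }
            return out;
        }

        // Goodman-Kruskal's gamma: (concordant - discordant) / (concordant + discordant),
        // computed over all component pairs (i, j) of the two unweighted vectors.
        static double gamma(int[] a, int[] b) {
            long concordant = 0, discordant = 0;
            for (int i = 0; i < a.length; i++) {
                for (int j = i + 1; j < a.length; j++) {
                    long v = (long) (a[i] - a[j]) * (b[i] - b[j]);
                    if (v > 0) concordant++;
                    else if (v < 0) discordant++;
                }
            }
            return (concordant + discordant == 0)
                    ? 0.0
                    : (double) (concordant - discordant) / (concordant + discordant);
        }
    }

For PPMI-weighted vectors, Pearson's r would be used in place of γ, as noted above.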
2.1 Method's Justification
The proposed method can be explained mathematically using the principles of dimensionality reduction by random projections. Let C_{p×n} denote the set of p word vectors obtained by counting the co-occurrences of each target word ~w_i (1 ≤ i ≤ p) with each context element c_j (1 ≤ j ≤ n). In most applications, n is so large that it hinders subsequent processing (the so-called curse of dimensionality). To address this problem, C_{p×n} undergoes a 'transformation' T that maps the high-dimensional space C_{p×n} onto a space of reduced dimensionality W_{p×m} (m ≪ n): C_{p×n} × T_{n×m} = W_{p×m}. In our proposed method, T_{n×m} is a sparse, randomly generated matrix whose elements t_{ij} have the following distribution:

$t_{ij} = \begin{cases} 0 & \text{with probability } \frac{m-1}{m},\\ 1 & \text{with probability } \frac{1}{m}, \end{cases} \qquad (1)$
such that each row vector of T has exactly one component with value 1, and these non-zero entries of T are independently and identically distributed. It can be verified that the procedure proposed in the previous section computes the desired embeddings (i.e., W) by (a) derandomizing T using a hash function and the modulus operator, and (b) serializing the matrix multiplication C × T into a set of addition operations (based on the distributive property of multiplication over addition). The major novelty of our method lies in the way we construct T. Previously, random-projection-based methods computed T with the goal of introducing the least distortion in pairwise (α-normed) distances when mapping from C onto W, e.g., the well-known lemmas proposed in [Johnson and Lindenstrauss, 1984] (for ℓ2-normed spaces) and [Indyk, 2000] (for ℓ1-normed spaces), and their subsequent refinements and generalizations such as [Li et al., 2006]. In contrast to this line of research, we disregard the classic desideratum of preserving distances and the goal of a minimum-distortion correspondence. In return, we motivate our proposed random projection directly by the implications of Harris' Distributional Hypothesis.
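As an illustration of this equivalence (a sketch under our own naming, not code from the paper), the following Java method computes one row of W = C × T without ever materializing T: the hash function, passed in as a parameter, plays the role of the derandomized T of Eq. (1), assigning to the j-th context element the column abs(hash(c_j) % m) that holds the single 1 in row j of T.

    import java.nio.charset.StandardCharsets;
    import java.util.function.ToIntFunction;

    public class DerandomizedProjection {
        // Computes one row of W = C x T: adding counts[j] to component abs(hash(c_j) % m)
        // is exactly the product C[i][j] * T[j][d] for the single non-zero entry of row j of T.
        static int[] projectRow(int[] counts, String[] contextWords, int m,
                                ToIntFunction<byte[]> hash) {
            int[] w = new int[m];
            for (int j = 0; j < contextWords.length; j++) {
                byte[] key = contextWords[j].getBytes(StandardCharsets.UTF_8);
                int d = Math.abs(hash.applyAsInt(key) % m); // column of the 1 in row j of T
                w[d] += counts[j];                          // accumulate C[i][j] into W[i][d]
            }
            return w;
        }
    }

Setting counts[j] = 1 for every observed co-occurrence event recovers the streaming update of Section 2.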
3 Related Work
Random projections and hash kernels have been vibrant research areas in artificial intelligence and theoretical computer science. These methods have been employed to provide viable solutions for a range of problems requiring a notion of (approximate) nearest neighbor search, particularly in information retrieval tasks, e.g., the identification of duplicate and near-duplicate documents [Manku et al., 2007], string matching [Dalvi et al., 2013; Michael, 1981], semantic labeling [Yang et al., 2016], and cross-media retrieval [Wang et al., 2015], to name a few. Naturally, these methods have also been applied to the problem of word embedding, either as a dimensionality reduction method [Bingham and Mannila, 2001; Kaski, 1998] or in the form of an incremental vector space construction technique in which the random-projection-based dimensionality reduction process is merged with the process of collecting word co-occurrence frequencies, e.g., [Kanerva et al., 2000; Geva and De Vries, 2011; Q. Zadeh and Handschuh, 2014]. Kanerva et al. employed sparse Gaussian random projections [QasemiZadeh, 2015] to build word vectors directly at a reduced dimensionality and showed that their proposed 'random indexing' technique yields results comparable to the latent semantic analysis technique of [Deerwester et al., 1990], which employs truncated singular value decomposition for dimensionality reduction. By the same token, Geva and De Vries and Q. Zadeh and Handschuh extended the idea proposed by Kanerva et al. to the comparison of textual similarities using the Hamming and Manhattan distances, respectively. However, when it comes to comparing semantic similarities between words, these methods fail to compete with the more recent neural embedding techniques, e.g., word2vec [Mikolov et al., 2013b] and GloVe [Pennington et al., 2014]. It is known that crude distances between words are not discriminatory enough to address the
so-called semantic relatedness tasks, and a weighting process such as PPMI is crucial for achieving a high score. Unfortunately, since these methods use projections with zero expectation,[2] the sum of the components in their resulting vectors is always zero. Thus, weighting techniques such as PPMI cannot be applied to these models after their construction (simply due to the problem of division by zero). Compared to these techniques, the method proposed in this paper has a better computational complexity (thanks to its sparser projections) and yields better results in semantic relatedness tasks, given the possibility of applying weighting techniques after the construction of the randomly projected spaces.

[2] Note that for these methods, projections with zero expectation are essential for achieving an acceptable performance when computing projected spaces (i.e., to guarantee the sparsity of projection matrices).
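To spell out the division-by-zero issue (using the notation of Section 2.1): if every row of a signed projection matrix T sums to zero, as it does for the index vectors used in random indexing, which contain equally many +1 and -1 entries, then every row of W = C × T sums to zero as well, and the marginal in the PPMI denominator vanishes:

    \[
    \sum_{j=1}^{m} w_{xj}
      = \sum_{j=1}^{m}\sum_{k=1}^{n} c_{xk}\, t_{kj}
      = \sum_{k=1}^{n} c_{xk} \Big( \sum_{j=1}^{m} t_{kj} \Big)
      = 0 .
    \]

With the non-negative projection of Eq. (1), each row of T sums to one instead, so this marginal equals the total co-occurrence count of the word and PPMI remains well defined.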
4 Empirical Evaluation
We report results from empirical evaluations in both intrinsic and extrinsic setups. Besides an intrinsic evaluation using word relatedness tests, we report results when the method is used to train a neural part-of-speech tagger.
4.1 Intrinsic Evaluation
Datasets and Material: We evaluate our method on a number of word relatedness tests. Each relatedness test comprises a set of word pairs associated with similarity/relatedness ratings obtained from human annotators. For each test, evaluation takes the form of calculating the harmonic mean of the Pearson and Spearman correlations between the list sorted by the human-induced scores and the list sorted by the scores that the system assigns to the word pairs. The relatedness tests in our experiments are WS353 [Finkelstein et al., 2001], its WSS and WSR subsets [Agirre et al., 2009], the classic MC30 [Miller and Charles, 1991] and RG65 [Rubenstein and Goodenough, 1965] tests, the Stanford Rare Word dataset RW [Luong et al., 2013], M287 [Radinsky et al., 2011], M771 [Halawi et al., 2012], SimLex-999 (SL) [Hill et al., 2015], the YP130 verb relatedness test [Yang and Powers, 2006], and MEN [Bruni et al., 2014]. As input corpus, we build embeddings from a recent dump of Wikipedia (January 1st, 2017; 2.9 billion tokens).

Baselines: For this input, as one baseline, we train a word2vec CBOW model [Mikolov et al., 2013b], one of the most popular word embedding techniques. To train this model, we use parameters that are commonly known to give the best performance across the chosen tasks (i.e., 10 negative samples, 1e-5 for subsampling, and context windows that extend 7 tokens around target words). We set the dimensionality of this model to 500 (setting aside its impact on computational complexity, we observed that, for our input corpus and the targeted tests, using a dimensionality larger than 500 has an adverse effect on word2vec's CBOW algorithm and deteriorates its performance). We use W2V-CBOW to denote the set of embeddings obtained from this trial. As an additional baseline, we report the performance of a classic unweighted and PPMI-weighted high-dimensional
model trained over the same input corpus. In this case, the dimensionality of the classic model amounts to 10.671 million. To confirm our earlier statement on sparse Gaussian random projections, we also run Kanerva et al.'s random indexing method (using projection vectors of dimensionality 2000 with 8 non-zero components, of which four are set to +1 and another four to -1); as expected, the obtained results (averaged over 5 independent runs of the system) are almost identical to those reported for the unweighted high-dimensional classic model. In the experiments above, we use cosine as the similarity estimator between vectors.

Results: We build word vectors of dimension m = 500, 1000, 2000, and 4000 using our proposed hash-based embedding technique. As for the classic baseline method, apart from white-space tokenization, we do not use any additional heuristics or information (such as the subsampling criteria used in word2vec). Table 1 summarizes the observed results, as well as the average performance across these tasks. As shown, regardless of m, our unweighted and PPMI-weighted models on average outperform their classic high-dimensional counterparts. Moreover, for large values of m, the weighted models tend to perform as well as word2vec's CBOW model. Interestingly, increasing m hurts the task-based performance of the unweighted models, while it has a positive impact on the PPMI-weighted models. Overall, when it comes to choosing between a neural embedding learning method and our method for addressing a semantic similarity task, our experiments suggest that the hash-based method is more suitable for similarity identification than for relatedness (in the sense described in [Agirre et al., 2009]), and for tasks that involve the comparison of frequent and infrequent words (due to the nature of PPMI weighting). Naturally, the hash-based method is a better choice than a neural embedding learning technique when dealing with frequently updated text data and streams,[3] and when computational resources for training neural nets are insufficient or the learning process is constrained by hard time limits (note that hash-based embeddings can be fetched at any time during the construction of vectors). In our experiments and using the specific configurations, building word vectors with the hash-based method took almost one third of the time needed for building the CBOW model (using 16 threads on a dual-CPU machine). Note that the computational complexity of our method, in contrast to word2vec, is independent of m. For m = 500, our method requires half of the memory used by word2vec when a sparse vector representation is employed (note that we use unsigned integers for collecting co-occurrence frequencies and postpone floating-point arithmetic until we weight the vectors); this is not surprising given the very long tail of co-occurrence frequencies. Word2vec's CBOW-induced vectors (like other neural embeddings) are dense: contrary to our method, even for words of low frequency, all the elements of neural-based embeddings are non-zero.

We would like to emphasize that the real computational advantages of the proposed method become apparent when building vectors for really large corpora and vocabularies. At any rate, a neural embedding method such as word2vec requires two passes over an input corpus: one pass to build a frequency profile of words (which is used, e.g., for subsampling), and a second pass to adjust and optimise weights (i.e., the actual training phase). This requirement does not apply to our method (although a two-pass strategy for subsampling and the selection of a subset of co-occurrence events can be employed to enhance the discriminatory power of the hash-based embeddings). Regarding the classic method, compared to the hash-based method, building vectors and frequency profiles may require slightly less processing (given that, theoretically,[4] no hash code must be computed); however, compared to our method, storing these vectors demands far more memory. Moreover, the process of weighting these large vectors is computationally intense and time-consuming. In comparison, the proposed hash-based method dramatically reduces the time required for the weighting process: in our experiments, PPMI weighting of the high-dimensional vectors took almost two hours, while this process was carried out in only a few minutes with the hash-based method.
[3] And, in scenarios in which vectors are built using a strategy other than the offline scanning of large text corpora, e.g., by querying a Web search engine as explained in [Turney, 2001].
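For reference, the evaluation score described under 'Datasets and Material' can be computed as in the following sketch (our own helper, not the evaluation code used for the paper): it takes the human ratings and the system scores for the word pairs of a test, computes Pearson's r and Spearman's ρ (Pearson over ranks; ties are broken arbitrarily in this simplified version), and returns their harmonic mean.

    import java.util.Arrays;

    public class RelatednessScore {
        // Pearson correlation between two score vectors of equal length.
        static double pearson(double[] a, double[] b) {
            double ma = Arrays.stream(a).average().orElse(0);
            double mb = Arrays.stream(b).average().orElse(0);
            double num = 0, da = 0, db = 0;
            for (int i = 0; i < a.length; i++) {
                num += (a[i] - ma) * (b[i] - mb);
                da += (a[i] - ma) * (a[i] - ma);
                db += (b[i] - mb) * (b[i] - mb);
            }
            return num / Math.sqrt(da * db);
        }

        // Ranks of the values in x (1 = smallest); ties get an arbitrary order here.
        static double[] ranks(double[] x) {
            Integer[] idx = new Integer[x.length];
            for (int i = 0; i < x.length; i++) idx[i] = i;
            Arrays.sort(idx, (i, j) -> Double.compare(x[i], x[j]));
            double[] r = new double[x.length];
            for (int pos = 0; pos < idx.length; pos++) r[idx[pos]] = pos + 1;
            return r;
        }

        // Harmonic mean of Pearson's r and Spearman's rho (Pearson over ranks).
        static double score(double[] human, double[] system) {
            double r = pearson(human, system);
            double rho = pearson(ranks(human), ranks(system));
            return 2 * r * rho / (r + rho);
        }
    }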
Vector Set | YP130 | RW | M287 | M771 | MC30 | MEN | WS353 | WSS | WSR | RG65 | SL | Mean (A) | Mean (G)
Baseline: Classic-Unweighted | 0.21 | 0.07 | 0.24 | 0.16 | 0.28 | 0.22 | 0.30 | 0.39 | 0.21 | 0.28 | 0.11 | 0.224 | 0.204
Baseline: Classic-PPMI | 0.47 | 0.32 | 0.57 | 0.57 | 0.75 | 0.68 | 0.49 | 0.53 | 0.63 | 0.76 | 0.25 | 0.547 | 0.521
Baseline: W2V-CBOW | 0.45 | 0.22 | 0.53 | 0.64 | 0.72 | 0.76 | 0.70 | 0.78 | 0.65 | 0.75 | 0.40 | 0.600 | 0.567
Our method, Unweighted, m=500 | 0.32 | 0.14 | 0.36 | 0.44 | 0.58 | 0.53 | 0.40 | 0.48 | 0.36 | 0.70 | 0.28 | 0.417 | 0.387
Our method, Unweighted, m=1000 | 0.34 | 0.12 | 0.37 | 0.40 | 0.63 | 0.50 | 0.40 | 0.48 | 0.36 | 0.65 | 0.24 | 0.408 | 0.375
Our method, Unweighted, m=2000 | 0.32 | 0.12 | 0.41 | 0.40 | 0.59 | 0.48 | 0.37 | 0.44 | 0.36 | 0.60 | 0.22 | 0.392 | 0.361
Our method, Unweighted, m=4000 | 0.33 | 0.15 | 0.44 | 0.38 | 0.59 | 0.46 | 0.33 | 0.38 | 0.34 | 0.62 | 0.19 | 0.383 | 0.356
Our method, PPMI, m=500 | 0.39 | 0.28 | 0.61 | 0.61 | 0.64 | 0.71 | 0.63 | 0.67 | 0.65 | 0.75 | 0.33 | 0.570 | 0.545
Our method, PPMI, m=1000 | 0.45 | 0.29 | 0.58 | 0.55 | 0.73 | 0.72 | 0.64 | 0.71 | 0.64 | 0.75 | 0.34 | 0.582 | 0.558
Our method, PPMI, m=2000 | 0.47 | 0.30 | 0.62 | 0.57 | 0.75 | 0.73 | 0.65 | 0.72 | 0.65 | 0.76 | 0.35 | 0.597 | 0.573
Our method, PPMI, m=4000 | 0.49 | 0.30 | 0.61 | 0.57 | 0.75 | 0.74 | 0.69 | 0.75 | 0.68 | 0.79 | 0.34 | 0.610 | 0.584
Table 1: Evaluation Results: Performance of the proposed hash-based embedding method in semantic similarity and relatedness tasks and its comparison to baselines. The last two columns report arithmetic and geometric mean of the observed performances across tasks. In our baselines, we do not list Kanerva et al.’s random indexing performance since it is (on average) identical to the unweighted classic method. For the hash-based method, we report results for both unweighted and PPMI-weighted vectors of various dimensionality. We use Goodman and Kruskal’s γ and Pearson’s r for estimating similarities for unweighted and PPMI weighted vectors, respectively. For baselines, we use cosine as the similarity estimator.
Boosting Performances in Relatedness Tasks: The performance of the proposed hash-based embedding method in semantic similarity tasks can be improved through several means. The easiest is, perhaps, to use models of higher dimensionality (as the results reported in the previous section imply) and/or to enlarge the input corpora. As reported in Table 1, we observe that increasing the dimensionality, together with PPMI weighting of the models, enhances task-based performance. Obviously, increasing the dimensionality does not affect the computational complexity of building a model; however, it increases the cost of similarity computation.

[4] We say 'theoretically' since an inverted index of context elements (often using hashing) is necessary for building classic models.
Figure 1: Changes in the averaged performance across tasks when increasing the dimensionality of the models (1a) and the size of the input corpus (1b). In Figure 1b, the black and red lines plot the averaged performances for models of dimensionality m = 500 and 2000, respectively.
Figure 1a plots changes in the method's performance (i.e., the arithmetic average of performances across the tasks in our experiments) as the dimensionality of the models is increased (the evaluation setup, i.e., the input corpus and context window size, remains the same as reported earlier). As shown, up to m = 13000, the performance mostly improves. We admit that this solution may not look desirable at first sight. However, we note that these models of higher dimensionality (e.g., m = 4000) can be compressed (e.g., to m = 250) using matrix factorization techniques without hurting task-based performance. Hence, we suggest that the proposed hash-based solution can safely replace the construction of classic PPMI-weighted high-dimensional models in pipelines where the obtained models are subsequently compressed using matrix factorization. To show the effect of enlarging the input corpus on the method's performance, in addition to the Wikipedia corpus used initially, we feed another 4 billion tokens of web-crawled text data [Schafer and Bildhauer, 2012] to our models. As shown in Figure 1b, regardless of the dimension of the models (here 500 and 2000), using larger input corpora enhances the method's performance in relatedness tasks.

Second, like many embedding techniques, our method can be tuned for each task, e.g., by choosing an appropriate context window size and, in general, by making (let's call it) linguistically more informed decisions. For instance, we could boost the performance on the MEN relatedness test from 0.71 to 0.76 using a larger context window (i.e., 13+11 instead of 5+5). Similarly, we could raise the performance on the YP130 verb relatedness test from 0.39 to 0.72 (for m = 500) by filtering context words using dependency parses (context elements were limited to words with a direct syntactic relationship to the targeted verbs). Apart from these, our method allows for heterogeneous context types: for example, in addition to context words, we use document-level co-occurrences to boost performance on the WS353 test from 0.67 to 0.72. For each w appearing in document di, we pass di's identifier to the hash function and update ~w as instructed in the algorithm. Note that there is no limit on the type of context elements that can be fed to our method as long as these elements can be converted to byte sequences and subsequently to a hash code. In comparison, this is not so straightforward with neural embedding techniques: the objective function of the neural nets (which defines their underlying optimization goal) must often be modified and adapted for new context types.

Apart from the methods discussed above, 'retrofitting' (refining word embeddings using lexical relations available in lexical resources [Faruqui et al., 2015]) is another methodology for improving results. Our method can easily be adapted for retrofitting. We implement a notion of retrofitting by simply treating lexical resources like any other text files; however, instead of scanning these resources using a sliding context window, we make sure that we encode co-occurrence information about all related items in each entry of an input lexical resource. For example, given a synset S of words in WordNet (i.e., S = {w1, ..., wn}), for each wi ∈ S, we consider all the remaining words in S as context elements and update ~wi with them accordingly.
To show the impact of this 'retrofitting' method, we take the vectors of dimensionality m = 500 from our earlier experiment and update them by reading WordNet [Miller, 1995] and PPDB [Ganitkevitch et al., 2013] (i.e., in addition to the Wikipedia corpus, WordNet and PPDB are also fed to our algorithm). Table 2 reports the results. As discussed in [Faruqui et al., 2015], encoding this knowledge into neural embedding models is not straightforward and demands certain modifications of their learning algorithms. For instance, we observe that retraining the CBOW algorithm with WordNet and PPDB as additional inputs leads to a negligible improvement in task-based performance (an improvement of 0.002 on average), whereas the performance gain of the retrofitted hash-based model is considerably larger (from 0.417 to 0.436 for the unweighted model, and from 0.570 to 0.647 for the PPMI-weighted model). In this experiment, we provide the word2vec CBOW algorithm with the same input used for building our hash-based model (i.e., in addition to the Wikipedia corpus, we list the set of all tuples generated from WordNet and PPDB as input).
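A minimal sketch of this retrofitting strategy, under our own naming and with the hash function passed in as a parameter (standing in for the Jenkins hash of Section 2), looks as follows: every member of a WordNet synset (or of a PPDB paraphrase set) is treated as a context element for every other member, and the usual additive update is applied.

    import java.nio.charset.StandardCharsets;
    import java.util.Map;
    import java.util.function.ToIntFunction;

    public class Retrofit {
        // Updates the embeddings of all members of a synset by treating the other
        // members as context elements, exactly like any other co-occurrence event.
        // 'embeddings' maps a word to its m-dimensional count vector.
        static void updateWithSynset(Map<String, int[]> embeddings, String[] synset,
                                     ToIntFunction<byte[]> hash) {
            for (String target : synset) {
                int[] vec = embeddings.get(target);
                if (vec == null) continue;                 // unseen word: skip (or allocate)
                for (String context : synset) {
                    if (context.equals(target)) continue;  // a word is not its own context
                    byte[] key = context.getBytes(StandardCharsets.UTF_8);
                    int d = Math.abs(hash.applyAsInt(key) % vec.length);
                    vec[d] += 1;                           // same additive update as Section 2
                }
            }
        }
    }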
4.2 Extrinsic Evaluation
When developing models for upstream NLP tasks, an important application of embedding learning methods is to provide a numerical representation of words for the employed (neural) learning algorithms. As with the relatedness tasks, the word2vec family of embedding methods [Mikolov et al., 2013a; Le and Mikolov, 2014] is a popular choice for this pre-training phase. Hence, we provide a comparison between the embeddings obtained using the method proposed in this paper and those of the word2vec model in a part-of-speech tagging task. We observe that the vectors produced by our hash-based technique are more suitable than the word2vec embeddings, at least in our part-of-speech tagging experiment using Long Short-Term Memory (LSTM) units [Hochreiter and Schmidhuber, 1997].

The configuration used for obtaining the embeddings is similar to the one explained in the previous section, and the hash-based embeddings are weighted using PPMI prior to feeding them to the network. However, to make sure that the evaluation setup is realistic, we set the size of the embeddings to 300 for both the hash-based and word2vec models. For evaluation, we use the tagged version of the Brown corpus [Francis and Kucera, 1979], which has 57,067 sentences and 313 different tags. Note that although the tag set consists of only 85 categories, the tagged Brown corpus contains combined tags for contracted words; e.g., the sequence we'll is tagged as PPSS+MD (i.e., personal pronoun followed by modal auxiliary); hence, the number of tags in our experiment increases to 313. We randomly select 2000 sentences as the validation set, another 2000 as the test set, and use the remaining sentences for training.

The neural tagger has a straightforward architecture and consists of four layers, of which the first layer is a look-up table. This look-up table is a trainable weight matrix whose values are updated during training. The embeddings reside in this look-up table and are retrieved at each forward pass. The second and third layers consist of LSTM units of size 300. The last softmax layer scales the values of the network to the range [0,1] in order to produce prediction probabilities over classes, i.e., part-of-speech categories in our problem.

We keep our neural architecture as simple as possible and avoid using more sophisticated models for two reasons. Firstly, we believe this makes it easier to reproduce the results reported in this section. Secondly, this property enables us to focus on the quality of the employed embeddings rather than the power of the network, and thus to attribute performance gains to the embeddings themselves, not to the neural architecture. We design two evaluation settings. In the first setting, we freeze the look-up table so that the network does not manipulate the values of our embeddings: they are kept unchanged and the other network parameters are tuned to make the best prediction. In the second setting, embeddings are treated as part of the network and, like the other network parameters, they are updated during the training phase. Table 3 reports the observed results. In both settings, we observe better performance (accuracy) using the proposed hash-based embedding technique. We interpret the result of the first setting (frozen look-up table) as evidence that, by nature, the hash-based embeddings preserve richer and more suitable information for our sequence labelling task. The results observed in the second setting reconfirm our claim that the hash-based embeddings are a better choice than the word2vec embeddings, at least for this task: hash-based embeddings not only yield a better performance in our sequence labelling task, but they can also be obtained faster and using fewer computational resources than word2vec.
Vector Set | YP130 | RW | M287 | M771 | MC30 | MEN | WS353 | WSS | WSR | RG65 | SL | Mean (A) | Mean (G)
W2V-CBOW | 0.46 | 0.24 | 0.52 | 0.63 | 0.74 | 0.75 | 0.70 | 0.78 | 0.66 | 0.77 | 0.38 | 0.602 | 0.571
Hash-D500: Unweighted + γ | 0.46 | 0.16 | 0.37 | 0.47 | 0.66 | 0.45 | 0.39 | 0.47 | 0.35 | 0.72 | 0.30 | 0.436 | 0.408
Hash-D500: PPMI + r | 0.56 | 0.30 | 0.63 | 0.68 | 0.83 | 0.78 | 0.70 | 0.75 | 0.67 | 0.85 | 0.37 | 0.647 | 0.619
Table 2: Results from retrofitting: Both the word2vec-CBOW and hash-based models of dimensionality 500 are trained on the Wikipedia corpus as well as WordNet and PPDB. Comparing these results to those reported in Table 1 shows a considerable improvement in the hash-based method's performance.

Setting | word2vec | hash-based
Setting 1 | 93.12 | 95.01
Setting 2 | 95.31 | 95.78*
Table 3: Results from the POS-tagging experiment. Numbers in the table indicate the accuracy score.
5 Conclusion
A method for sketching word vectors using a hash-based algorithm has been proposed. The method is fast and requires only a small amount of computational resources to build a model; yet, as shown empirically, it delivers acceptable performance in both intrinsic and extrinsic evaluation setups. Thanks to the random projection that the method implements, developing embeddings requires no offline training: vectors can be used, updated, added, and removed at any stage during a system's life cycle. The method can be particularly useful when building embeddings from text streams or very large corpora, or when the training of
embeddings is constrained by time.
References [Agirre et al., 2009] Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, and Aitor Soroa. A study on similarity and relatedness using distributional and wordnet-based approaches. In HLT-NAACL ’09, pages 19– 27, Stroudsburg, PA, USA, 2009. ACL. [Bingham and Mannila, 2001] Ella Bingham and Heikki Mannila. Random projection in dimensionality reduction: Applications to image and text data. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’01, pages 245–250, New York, NY, USA, 2001. ACM. [Bouma, 2009] Gerlof Bouma. Normalized (pointwise) mutual information in collocation extraction. In Proceedings of the Biennial GSCL Conference, 2009. [Bruni et al., 2014] Elia Bruni, Nam Khanh Tran, and Marco Baroni. Multimodal distributional semantics. J. Artif. Int. Res., 49(1):1–47, January 2014. [Chen and Popovich, 2002] Peter Y. Chen and Paula M. Popovich. Correlation: Parametric and Nonparametric Measures. Quantitative Applications in the Social Sciences. Sage Publications, 2002. [Dalvi et al., 2013] Nilesh Dalvi, Vibhor Rastogi, Anirban Dasgupta, Anish Das Sarma, and Tamas Sarlos. Optimal hashing schemes for entity matching. In WWW ’13, pages 295–306, New York, NY, USA, 2013. ACM. [Deerwester et al., 1990] Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. Indexing by latent semantic analysis. JASIS, 41(6):391–407, 1990. [Faruqui et al., 2015] Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard H. Hovy, and Noah A. Smith. Retrofitting word vectors to semantic lexicons. In NAACL-HLT ’15, pages 1606–1615, 2015. [Finkelstein et al., 2001] Lev Finkelstein, Gabrilovich Evgenly, Matias Yossi, Rivlin Ehud, Solan Zach, Wolfman Gadi, and Ruppin Eytan. Placing search in context: the concept revisited. In WWW ’10, 2001. [Francis and Kucera, 1979] W. Nelson Francis and Henry Kucera. Brown corpus manual. Technical report, Brown University, 1979. [Ganitkevitch et al., 2013] Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. PPDB: The paraphrase
database. In Proceedings of NAACL-HLT, pages 758–764, Atlanta, Georgia, June 2013. ACL. [Geva and De Vries, 2011] Shlomo Geva and Christopher M. De Vries. Topsig: Topology preserving document signatures. In CIKM ’11, pages 333–338, New York, NY, USA, 2011. ACM. [Goodman and Kruskal, 1954] Leo A. Goodman and William H. Kruskal. Measures of association for cross classifications. Journal of the American Statistical Association, 49(268):732–764, 1954. [Halawi et al., 2012] Guy Halawi, Gideon Dror, Evgeniy Gabrilovich, and Yehuda Koren. Large-scale learning of word relatedness with constraints. In KDD ’12, pages 1406–1414, New York, NY, USA, 2012. ACM. [Harris, 1954] Zellig S Harris. Distributional structure. Word, 10(2-3):146–162, 1954. [Hill et al., 2015] Felix Hill, Roi Reichart, and Anna Korhonen. SimLex-999: Evaluating semantic models with genuine similarity estimation. Comput. Linguist., 41(4):665– 695, December 2015. [Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997. [Indyk, 2000] Piotr Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In 41st Annual Symposium on Foundations of Computer Science, pages 189–197, 2000. [Johnson and Lindenstrauss, 1984] William Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in modern analysis and probability, volume 26, pages 189–206. AMS, 1984. [Kanerva et al., 2000] P. Kanerva, J. Kristofersson, and A. Holst. Random indexing of text samples for latent semantic analysis. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society, volume 1036, 2000. [Kaski, 1998] S. Kaski. Dimensionality reduction by random mapping: fast similarity computation for clustering. In IEEE International Joint Conference on Neural Networks Proceedings, volume 1, pages 413–418 vol.1, May 1998. [Kendall, 1938] Maurice G. Kendall. A new measure of rank correlation. Biometrika, 30(1-2):81–93, 1938. [Le and Mikolov, 2014] Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, volume 14, pages 1188–1196, 2014. [Li et al., 2006] Ping Li, Trevor J. Hastie, and Kenneth W. Church. Very sparse random projections. In KDD ’06, pages 287–296, New York, NY, USA, 2006. ACM. [Luong et al., 2013] Thang Luong, Richard Socher, and Christopher D Manning. Better word representations with recursive neural networks for morphology. In CoNLL, pages 104–113. ACL, 2013. [Manku et al., 2007] Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma. Detecting near-duplicates for web
crawling. In WWW ’07, pages 141–150, New York, NY, USA, 2007. ACM. [Michael, 1981] Michael. Fingerprinting by random polynomials. Technical report, Center of Research in Computer Technology, 1981. [Mikolov et al., 2013a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013. [Mikolov et al., 2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS 26, pages 3111–3119, 2013. [Miller and Charles, 1991] George A. Miller and Walter G. Charles. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28, 1991. [Miller, 1995] George A. Miller. Wordnet: A lexical database for english. Commun. ACM, 38(11):39–41, November 1995. [Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In EMNLP ’14, pages 1532–1543, Doha, Qatar, 2014. ACL. [Q. Zadeh and Handschuh, 2014] Behrang Q. Zadeh and Siegfried Handschuh. Random manhattan integer indexing: Incremental l1 normed vector space construction. In EMNLP, pages 1713–1723, Doha, Qatar, 2014. ACL. [QasemiZadeh, 2015] Behrang QasemiZadeh. Random indexing revisited. In NLDB, pages 437–442, 2015. [Radinsky et al., 2011] Kira Radinsky, Eugene Agichtein, Evgeniy Gabrilovich, and Shaul Markovitch. A word at a time: Computing word relatedness using temporal semantic analysis. In WWW ’11, pages 337–346, New York, NY, USA, 2011. ACM. [Rubenstein and Goodenough, 1965] Herbert Rubenstein and John B. Goodenough. Contextual correlates of synonymy. Commun. ACM, 8(10):627–633, 1965. [Schafer and Bildhauer, 2012] Roland Schafer and Felix Bildhauer. Building large corpora from the web using a new efficient tool chain. In LREC’12, pages 486–493, 2012. [Turney, 2001] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL ’01, pages 491–502, London, UK, UK, 2001. Springer-Verlag. [Wang et al., 2015] Di Wang, Xinbo Gao, Xiumei Wang, and Lihuo He. Semantic topic multimodal hashing for crossmedia retrieval. In IJCAI’15, pages 3890–3896. AAAI Press, 2015. [Yang and Powers, 2006] Dongqiang Yang and David M. W. Powers. Verb similarity on the taxonomy of wordnet. In GWC-06, Jeju Island, Korea, 2006. [Yang et al., 2016] Dingqi Yang, Bin Li, and Philippe CudréMauroux. Poisketch: Semantic place labeling over user activity streams. In IJCAI 2016, pages 2697–2703, 2016.