MACHINE LEARNING-BASED EMBEDDING & EVENT EXTRACTION TECHNIQUES
1st: 08 March 2017 / 2nd: 15 March 2017
Kyong-Ha Lee ([email protected])
Korea Institute of Science & Technology Information

CONTENTS
1. Bag-of-words
2. Vector representations for words
   1. Word2Vec: a popular word embedding technique
   2. GloVe: another word embedding technique
   3. Other embedding techniques for phrases, sentences, and n-grams
3. Character-level encodings (TBD)
4. Event extraction techniques
   1. Joint model vs. pipeline model
   2. CNN vs. RNN models
5. Further studies
6. Wrap-up
7. Discussion

BAG-OF-WORDS MODEL
Simplified representation for language processing
• A text (a sentence or a document) is represented as the bag (multiset) of its words
• Grammar and word order are disregarded

• Many algorithms have been devised on top of this model
  • e.g., inverted index, TF-IDF, ...

Limitations
• The ordering of words is lost
  • Different sentences can have the same representation as long as they use the same words (see the sketch below)
• Semantic ignorance
  • Very little sense of the semantics of the words
  • No meaning in distances between the words
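
A minimal sketch of the bag-of-words representation, using two hypothetical example sentences; it also shows the order-loss limitation in action:

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace; a real tokenizer would also handle punctuation.
    return Counter(text.lower().split())

# Two hypothetical sentences with the same words in a different order
# map to exactly the same bag, so ordering information is lost.
a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")
print(a)       # Counter({'the': 2, 'dog': 1, 'bit': 1, 'man': 1})
print(a == b)  # True
```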

WORD VECTOR
Definition
• A vector representation for each word, based on bag-of-words
  • One-hot encoding, or 1-of-N encoding (see the sketch below)
  • Word embedding (distributed representation)

Two main model families for learning word vectors
• Global matrix factorization methods such as LSA
  • Leverage global statistical information, but perform poorly on the word analogy task
  • SVD is required to shorten the word vectors
• Local context window methods
  • Known to perform better on the analogy task
  • Poorly utilize the statistics of the corpus, since they train on separate local context windows instead of on global co-occurrence counts
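
A minimal sketch of one-hot (1-of-N) encoding over a hypothetical toy vocabulary, illustrating why such vectors are long, sparse, and say nothing about word similarity:

```python
import numpy as np

vocab = ["strong", "powerful", "paris", "rome"]   # hypothetical toy vocabulary
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # 1-of-N encoding: a |V|-dimensional vector with a single 1 at the word's index.
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

# Any two distinct words have dot product 0, so one-hot vectors
# carry no notion of similarity or distance between word meanings.
print(one_hot("strong"))                               # [1. 0. 0. 0.]
print(np.dot(one_hot("strong"), one_hot("powerful")))  # 0.0
```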

NEURAL LANGUAGE MODEL
The model behind neural network-based word vectors

Main theme
• Initially devised by Bengio et al., 2006
• Word vectors are usually trained with stochastic gradient descent, where the gradient is obtained via backpropagation
  • e.g., word2vec, GloVe, sentence2vec, paragraph2vec, ...
• The objective is to maximize the average log probability of w_t given a sequence of training words w_1, w_2, w_3, ..., w_T (see the formulation below)
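
One common way to write this objective (a sketch following the word2vec-style formulation, where k is the context window size and p is the model's predicted probability):

```latex
\frac{1}{T}\sum_{t=k}^{T-k} \log p\left(w_t \mid w_{t-k}, \ldots, w_{t+k}\right)
```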

FEATURES
Dimension reduction
• N words are mapped to positions in k dimensions (N >> k)
• In fact, word vectors are very similar to the result of an SVD of a word-similarity matrix

Similar meanings are mapped to similar positions in the vector space
• "Strong" and "powerful" are close to each other
• whereas "powerful" and "Paris" are more distant

Linearity and compositionality
• Analogy questions can be answered with simple vector algebra (see the sketch below)
  • "France" : "Paris" = "Italy" : ? (Rome)
  • "king" - "man" + "woman" = ? (queen)
  • "king" - "queen" = "man" - "woman"
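
A minimal sketch of the analogy arithmetic, assuming a hypothetical dictionary `vectors` that maps words to pretrained embedding arrays (any word2vec or GloVe vectors would do):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, vectors):
    # Solve "a is to b as c is to ?" with simple vector algebra:
    # the answer should lie near vec(b) - vec(a) + vec(c).
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# With real embeddings, analogy("man", "king", "woman", vectors) is expected
# to return "queen", and analogy("France", "Paris", "Italy", vectors) "Rome".
```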

LONG VS. SHORT VECTORS
Long sparse vectors vs. short dense vectors
• Short vectors are easier to include as features in ML systems
• They reduce the amount of weight computation
  • e.g., with 100-dimensional word vectors, a classifier only has to learn 100 weights (see the sketch below)
• With fewer parameters, dense vectors may generalize better and help avoid overfitting

Distributed representation
• Each word is represented by a distribution of weights across the elements of its vector
[Figure: a matrix of words w1, w2, ..., wn against embedding dimensions e1, e2, ..., ek]
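
A minimal sketch of the parameter-count argument, using hypothetical sizes for the vocabulary and the embedding dimension:

```python
import numpy as np

vocab_size, dim = 50_000, 100   # hypothetical vocabulary size and embedding dimension

# With one-hot features, a linear classifier needs one weight per vocabulary word.
one_hot_weights = np.zeros(vocab_size)

# With dense embedding features, it needs only `dim` weights, whatever the vocabulary size.
dense_weights = np.zeros(dim)

def score(word_vector, weights, bias=0.0):
    # Linear score over a word's feature vector (one-hot or dense).
    return float(word_vector @ weights + bias)

print(one_hot_weights.size, dense_weights.size)   # 50000 100
```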

PRELIMINARIES
Dimension reduction techniques
• A family of methods that approximate an N-dimensional dataset using fewer dimensions
  • e.g., Singular Value Decomposition (SVD), Principal Component Analysis (PCA), Factor Analysis (FA)

• Basic concept (see the sketch below)
  • Rotate the axes of the original dataset into a new space in which the leading dimensions capture the most variance
  • Drop the trailing dimensions, so that the dimensions that remain preserve as much variance as any subset of that size could
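
A minimal sketch of this rotate-then-truncate idea, using SVD on a hypothetical centered data matrix and keeping only the top k dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))      # hypothetical dataset: 200 samples in 10 dimensions
Xc = X - X.mean(axis=0)             # center the data before rotating the axes

# SVD of the centered data: the rows of Vt are the rotated axes,
# ordered by how much variance they capture.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                               # keep only the top-k dimensions
X_reduced = Xc @ Vt[:k].T           # project onto the leading axes
print(X_reduced.shape)              # (200, 2)
```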

LATENT SEMANTIC ANALYSIS
A particular application of SVD
• |V| x c term-document matrix X
  • |V| words and their co-occurrences with c documents or contexts
• Factorization of X (via SVD, X = W S C^T)
  • In W, each column represents one of m dimensions in a latent space
  • The columns of W are ordered by the amount of variance they capture
• By using only k