Attention Enhanced Chinese Word Embeddings

Xingzhang Ren1,2, Leilei Zhang1,2, Wei Ye2(B), Hang Hua1,2, and Shikun Zhang2

1 School of Software and Microelectronics, Peking University, Beijing, China
2 National Engineering Research Center for Software Engineering, Peking University, Beijing, China
{xzhren,leilei zhang,wye,huahang,zhangsk}@pku.edu.cn
Abstract. We introduce a new Chinese word embedding method called AWE, which utilizes an attention mechanism to enhance Mikolov's CBOW. Considering the shortcomings of existing word representation methods, we improve CBOW in two respects. First, the context vector in CBOW is obtained by simply averaging the representations of the surrounding words, whereas our AWE model aligns the surrounding words with the central word through a global attention mechanism and a self-attention mechanism. Second, CBOW is a bag-of-words model that ignores the order of the surrounding words, so we further enhance AWE with positional encoding and propose P&AWE. We design both qualitative and quantitative experiments to analyze the effectiveness of the models. The results indicate that the AWE models substantially outperform the CBOW model and achieve state-of-the-art performance on the word similarity task. Finally, we further verify the AWE models through attention visualization and case analysis.

Keywords: Word embedding · Attention mechanism · Representation learning · Natural language processing
1 Introduction
The task of word representation is of paramount importance in many natural language processing (NLP) systems and techniques, because the word is the basic unit of linguistic structure. In recent years, word representation approaches have been investigated quite intensively. Among them, one-hot encoding is the simplest and most commonly used method. However, it cannot meet the needs of practical applications because of its high dimensionality and limited ability to express semantics. Hinton proposed distributed representation [6] in 1986, which not only alleviates the curse of dimensionality but also establishes a notion of "distance" between words. Afterwards, a growing number of studies built on this principle and applied it to word representation, giving rise to the field of word embeddings. Representative models include CBOW, Skipgram [11,12] and GloVe [13]. They have
been widely used in various NLP tasks such as part-of-speech tagging [3,16], sentence classification [7,9], text summarization [20] and question answering [4,8,10,15].

Different from English, Chinese is a logographic language whose characters express both meaning and shape. Therefore, a multitude of research teams have made unique improvements to Chinese word embeddings. CWE [2] assumes that the semantic meaning of a Chinese word is related to the meanings of its composing characters. In view of character ambiguity and non-compositional words, it also proposes multiple-prototype character embeddings and an effective word selection method to jointly learn character and word embeddings. After CWE, MGE [18], JWE [19], GWE [14] and cw2vec [1] exploit radicals, character pixel information, character substructure information and character stroke information, respectively, to enhance Chinese word embeddings.

Owing to the good structure and scalability of word2vec (https://code.google.com/archive/p/word2vec/), most works on Chinese word embeddings are based on CBOW or Skipgram. In practice, the sliding window in word2vec is usually small. This is not because a larger window would increase the computational load; in fact, the window size has no effect on the computational cost of the CBOW model. Rather, a larger window degrades model performance, as discussed in detail in Sect. 2.1. A small sliding window cannot precisely gather semantic information, which makes a word similar to its antonym when the two share the same surrounding words. For example, given the sentences "I love you" and "I hate you", word2vec learns the central words "love" and "hate" from the same surrounding words "I" and "you", so the vectors of "love" and "hate" end up close together in the vector space. Such word embeddings are nearly irrelevant to semantic information; the word most similar to "love" should be "like", while the word most similar to "hate" should be "disgust".

This paper examines a new way of utilizing an attention mechanism to enhance Chinese word embeddings on top of existing word embedding methods, called AWE. We propose a model that unites the attention mechanism with CBOW, which can handle a larger sliding window and thus allows word embeddings to accommodate richer semantics. The attention mechanism emphasizes surrounding words that are useful for the central word and suppresses the influence of useless environmental words, while the larger sliding window lets the central word obtain a more comprehensive semantic representation by capturing more information from the surrounding words. We perform extensive experiments with both qualitative and quantitative analysis, which shows that our model learns better Chinese word representations than other state-of-the-art approaches.
2 Model

2.1 Motivation
The original CBOW model consists of an input layer, a projection layer, and an output layer, as Fig. 1 shows. The inputs, one-hot vectors of the surrounding
words, are mapped to the projection layer, which has embedding-size neurons, through a fully connected network. In the projection layer, we obtain the context vector by averaging the embeddings of these surrounding words. The order of the words in the sentence has no effect on the result, which is why the model is called the continuous bag-of-words model. Finally, the context vector of the projection layer is fully connected to the output layer, which has vocabulary-size neurons. Because the dimension of the vocabulary in the output layer is extremely large, negative sampling or hierarchical softmax is used for optimization in practical applications [5,12].

We divide the training process of CBOW into two parts. The first part is the fully connected network from the input layer to the projection layer, which obtains the context vector by averaging the surrounding word vectors; the second part is the fully connected network between the projection layer and the output layer, which calculates the probability of each word in the vocabulary being the target word.

The bag-of-words model averages the vectors of the surrounding words as the context vector, which does not fit well with how humans understand language. Three major problems have yet to be addressed. First, the contributions of the surrounding words should not be equal but should depend on their importance: context words such as "is" and "a", which carry little substantive meaning, should receive smaller weights. Second, the order of the context words should be taken into account; for instance, "A loves B" is very different from "B loves A". Finally, the contribution of a surrounding word may also depend on the central word. For example, in the sentence "good rather than bad", "rather" is important for predicting "than", while "bad" is important for predicting "good".

The attention mechanism was first proposed in the computer vision area. In NLP, it is often used to align the source text and the translation in machine translation. This paper uses the attention mechanism to improve the construction of the context vector from the surrounding words, and thus generates more precise word vectors from the enhanced context vector.

2.2 CBOW
As Fig. 1 shows, the word embedding matrix is represented as W ∈ R^{d×|V|}, where d is the embedding dimension and |V| is the size of the vocabulary V = {v_1, v_2, ..., v_{|V|}}. The output embedding matrix is W' ∈ R^{|V|×d}. x_i and y_i are one-hot encodings of the word v_i, and w_i is the word embedding of v_i, thus W x_i = w_i. The CBOW model predicts the central word w_t given the distributed representations of the surrounding words in a sliding window H = {w_{t-b}, ..., w_{t-1}, w_{t+1}, ..., w_{t+b}}, where b is a hyper-parameter defining the sliding window size. The CBOW model calculates the average of the input word vectors as the context vector:

c_t = \frac{1}{2b} \sum_{i \in [t-b,\, t+b] - \{t\}} w_i    (1)
Fig. 1. In Mikolov's CBOW model, the projection layer's w_i is the vector representation of the surrounding word x_i in the current sliding window, and the context vector c_t is calculated as the average of the surrounding word representations. The CBOW model maximizes the conditional probability of the central word, p(y_t | w_{[t-b,t+b]-{t}}).
The objective is to maximize the log-likelihood function L:

p(y_t \mid w_{[t-b,t+b]-\{t\}}) = \frac{\exp\left(y_t^{\top} W' c_t\right)}{\sum_{x \in V} \exp\left(x^{\top} W' c_t\right)}    (2)

L = \sum_{y_t \in V} \log p(y_t \mid w_{[t-b,t+b]-\{t\}})    (3)
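To make the two-step computation concrete, the following minimal NumPy sketch mirrors Eqs. (1)-(3). It is an illustration only, not the optimized word2vec implementation; the matrices W and W_out, the toy dimensions, and the word indices are assumptions chosen for demonstration.

import numpy as np

# Toy dimensions (assumed): vocabulary size |V|, embedding size d, half window size b.
V_size, d, b = 10, 8, 2
rng = np.random.default_rng(0)
W = rng.normal(size=(d, V_size))       # input embedding matrix W, one column w_i per word
W_out = rng.normal(size=(V_size, d))   # output embedding matrix W'

def cbow_context(surrounding_ids):
    """Eq. (1): average the embeddings of the 2b surrounding words."""
    return W[:, surrounding_ids].mean(axis=1)

def predict_probs(c_t):
    """Eq. (2): softmax over the vocabulary given the context vector c_t."""
    scores = W_out @ c_t
    scores -= scores.max()             # subtract the maximum for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# One training example: the window [t-b, t+b] without the central word t.
surrounding = [1, 2, 4, 5]             # indices of the surrounding words
central = 3                            # index of the central word
c_t = cbow_context(surrounding)
log_p = np.log(predict_probs(c_t)[central])   # one term of the log-likelihood in Eq. (3)

In the real model, negative sampling or hierarchical softmax replaces the full softmax above, since |V| is far too large to normalize over directly.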
2.3 Attention Enhanced CBOW (AWE)
In this paper, we propose Attention Enhanced Chinese Word Embeddings (AWE), which uses an attention mechanism to obtain a more meaningful context vector. The general attention mechanism is applied in encoder-decoder architectures by plugging a context vector into the gap between the encoder and the decoder. In the AWE model, we calculate the context vector c_t^{attention} as

c_t^{attention} = \sum_{i \in [t-b,\, t+b] - \{t\}} a(w_i)\, w_i    (4)
where a(w_i) is a function that calculates the attention of word w_i and returns a scalar. We define a(w_i) as

a(w_i) = \frac{\exp\left(\mathrm{score}(w_i)\right)}{\sum_{w \in H} \exp\left(\mathrm{score}(w)\right)}    (5)
where score(w_i) can be defined by different attention mechanisms. Naturally, we can use the global attention mechanism to focus on different surrounding words conditioned on the central word. Considering that involving the central word may affect the acquisition of the context vector, we also performed experiments using the self-attention mechanism to enhance CBOW for comparison. The two attention score functions are calculated as

\mathrm{score}(w_i)^{global} = w_g^{\top} \tanh\left(W_g [w_i, w_t]\right)    (6)

\mathrm{score}(w_i)^{self} = w_s^{\top} \tanh\left(W_s w_i\right)    (7)

where W_g ∈ R^{a×2d}, W_s ∈ R^{a×d}, w_g, w_s ∈ R^a, and a is the attention dimension (Fig. 2).
Fig. 2. In the AWE model, the context vector ct is calculated by aligning the attention vector at to each surrounding word wi . (a) In the global attention mechanism, the attention vector at is simultaneously related to the central word wt and environmental words wi . (b) In the self attention mechanism, the attention vector at does not rely on the central word wt and only depends on the surrounding words wi .
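As a rough sketch of how Eqs. (4)-(7) fit together, the NumPy code below computes both score functions, normalizes them with the softmax of Eq. (5), and forms the attention-weighted context vector of Eq. (4). The randomly initialized parameters W_g, w_g, W_s, w_s stand in for learned parameters, and all dimensions are illustrative assumptions.

import numpy as np

d, a = 8, 4                                                  # embedding and attention dimensions (assumed)
rng = np.random.default_rng(1)
W_g, w_g = rng.normal(size=(a, 2 * d)), rng.normal(size=a)   # global attention parameters
W_s, w_s = rng.normal(size=(a, d)), rng.normal(size=a)       # self attention parameters

def score_global(w_i, w_t):
    # Eq. (6): the score depends on both the surrounding word and the central word.
    return w_g @ np.tanh(W_g @ np.concatenate([w_i, w_t]))

def score_self(w_i):
    # Eq. (7): the score depends only on the surrounding word itself.
    return w_s @ np.tanh(W_s @ w_i)

def attention_context(surrounding, w_t=None):
    # Eqs. (4)-(5): softmax-normalized, attention-weighted sum of the surrounding words.
    if w_t is None:
        scores = np.array([score_self(w) for w in surrounding])
    else:
        scores = np.array([score_global(w, w_t) for w in surrounding])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return (weights[:, None] * np.array(surrounding)).sum(axis=0)

# Example: four surrounding word vectors and one central word vector.
H = [rng.normal(size=d) for _ in range(4)]
w_t = rng.normal(size=d)
c_global = attention_context(H, w_t)   # context vector under global attention
c_self = attention_context(H)          # context vector under self attention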
2.4 Position & Attention Enhanced CBOW (P&AWE)
Since the bag-of-words model dispenses with recurrence and convolutions entirely, we must add some extra information about the relative or absolute position of the words in the sequence so that the model can utilize word order. To this end, we add positional encoding to the original surrounding word embeddings to obtain new surrounding word embeddings with position information, and we call the resulting model P&AWE. The positional encoding has the same dimension d as the word embeddings, so the two can be added directly. The context vector
c_t^{position} of P&AWE is then computed as follows, adding the positional encoding PE_{(i,j)} to the raw word embedding w_i, where i is the position and j is the dimension:

c_t^{position} = \sum_{i \in [t-b,\, t+b] - \{t\}} a(w_i^p)\, w_i, \qquad w_i^p = w_i + PE_{(i,j)}    (8)
Following the work of [17], we use sine and cosine functions of different frequencies to define the positional encoding PE_{(i,j)}:

PE_{(i,2j)} = \sin\left(i / 10000^{2j/d}\right)    (9)

PE_{(i,2j+1)} = \cos\left(i / 10000^{2j/d}\right)    (10)
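The sketch below implements the sinusoidal encoding of Eqs. (9)-(10) and shows how the position-augmented embeddings of Eq. (8) would be formed. The dimensions are assumptions, and the attention step itself would reuse the attention_context function from the previous sketch.

import numpy as np

def positional_encoding(num_positions, d):
    # Eqs. (9)-(10): sinusoidal positional encoding of shape (num_positions, d); d is assumed even.
    pe = np.zeros((num_positions, d))
    positions = np.arange(num_positions)[:, None]      # position index i
    div = 10000 ** (np.arange(0, d, 2) / d)            # 10000^(2j/d)
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe

# Eq. (8): attention scores are computed on w_i^p = w_i + PE_(i,j),
# while the weighted sum is still taken over the raw embeddings w_i.
d, window = 8, 4
rng = np.random.default_rng(2)
surrounding = rng.normal(size=(window, d))             # raw surrounding embeddings w_i
surrounding_with_pos = surrounding + positional_encoding(window, d)   # w_i^p
# c_t^position is then obtained by scoring surrounding_with_pos (as in the AWE sketch
# above) and applying the resulting weights to the raw vectors in surrounding.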
3 Experimental Setup

3.1 Preprocessing
We downloaded the Chinese Wikipedia dump on May 1, 2018, which consists of about 4.05 million lines of Chinese Wikipedia articles. We treated each article as a single sentence and slid the context window from the beginning of an article to its end. We used a script from the gensim toolkit to convert the data from XML into text format. Based on our observation, the corpus contains both simplified and traditional Chinese characters, so we used the opencc toolkit to normalize all characters to simplified Chinese. Word segmentation was performed with the open-source Python package jieba. All English words and numerical tokens were encoded as "W" and "D", respectively. Furthermore, among all 222,822,535 segmented words, we removed words whose frequency