Distributed Keyword Vector Representation for Document Categorization

Yu-Lun Hsieh 1,2, Shih-Hung Liu 3,5, Yung-Chun Chang 4,5, Wen-Lian Hsu 5
1 Social Networks and Human-Centered Computing Program, TIGP, IIS, Academia Sinica, Taiwan
2 Department of Computer Science, National Chengchi University, Taiwan
3 Graduate Institute of Electrical Engineering, National Taiwan University, Taiwan
4 Department of Information Management, National Taiwan University, Taiwan
5 Institute of Information Science, Academia Sinica, Taiwan
Email: {morphe, journey, changyc, hsu}@iis.sinica.edu.tw

Abstract-In the age of information explosion, efficiently categorizing the topic of a document can assist our organization
and comprehension of the vast amount of text. In this paper, we propose a novel approach, named DKV, for document categorization using distributed real-valued vector representations of keywords learned from neural networks. Such a representation can project rich context information (or embedding) into the vector space, and can subsequently be used to infer similarity measures among words, sentences, and even documents. Using a Chinese
news corpus containing over 100,000 articles and five topics, we
provide a comprehensive performance evaluation to demonstrate
that by exploiting the keyword embeddings, DKV paired with
support vector machines can effectively categorize a document into the predefined topics. Results demonstrate that our method achieves the best performance compared to several other approaches.
Keywords-neural network; word embedding; document representation
I. INTRODUCTION
Due to recent technological advances, we are overwhelmed by the sheer number of documents available online. How to quickly categorize this huge amount of text has become a challenging problem in the information retrieval (IR) and natural language processing (NLP) communities. While keyword search systems nowadays can efficiently retrieve documents, we would benefit more from an automatic categorization of these documents to quickly identify the topic of a new document and decide whether it is of interest to us. Categorization can be considered as a ranking problem, in which a document is represented as a vector for machine learning methods to learn a classifier. This line of research includes the vector space model [1], support vector machines [2], k-nearest neighbors [3], decision trees [4], and logistic regression [5]. Based on measures of similarity with respect to a labeled corpus as a reference, text categorization systems can automatically assign one or several predefined category labels to texts.

A different angle of approach uses latent semantic information to model the relationships between text and its topic in order to perform document classification. In this approach, classification is done by first constructing a set of latent topics and then computing the probability that a topic is related to a certain document. Such works include latent semantic analysis (LSA) [6], probabilistic LSA [7], and latent Dirichlet allocation (LDA) [8]. The main advantage of this approach is that it can avoid the problem of sparseness at the word level, and discover the semantic connection between text and topic.

In recent years, we have witnessed an increasing interest in neural network-based vector space representations for words, or word embeddings, e.g., [9]-[12]. The most important feature of such a representation is that it can preserve syntactic and semantic information in a dense, real-valued vector. For example, it has been demonstrated that syntactic relations like adjectives and their corresponding adverbs can be encoded into the vectors, and later recovered through simple vector arithmetic. We can find relations like "quick":"quickly" ↔ "clear":"clearly" using the following calculation: v("quickly") - v("quick") + v("clear") = v("clearly"), in which v(w) denotes the vector representation of word w. Such pairs of an adjective and its adverb will have a similar distance between each other in the vector space. Moreover, deeper semantic relations, such as the one in "king":"man" ↔ "queen":"woman", can also be computed in an identical manner. Subsequent work also tries to convert longer text, such as a paragraph, into a vector [13]. In this way, we can compare or categorize these longer texts using calculations similar to those used at the word level.
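As an illustrative sketch (ours, not part of the original paper), such an analogy can be queried directly from a set of pretrained embeddings with the gensim library; the vector file name below is hypothetical:

```python
# Minimal sketch: querying the adjective/adverb analogy with pretrained vectors.
# "vectors.txt" is a hypothetical file in word2vec text format.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("vectors.txt", binary=False)

# v("quickly") - v("quick") + v("clear") should land near v("clearly")
print(wv.most_similar(positive=["quickly", "clear"], negative=["quick"], topn=1))
```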
The capability of encoding word relations into a vector inspires us to propose a novel approach that represents a document using distributed keyword vectors, or DKVs, and subsequently categorizes it into different topics. It has been proposed that a topic is essentially associated with specific times, places, and persons [14]. These can be recognized as keywords, and utilized by a discriminative tool for classification purposes. We examine the power of vector-based representations in capturing the relations between those surface keywords and the topic of the document. Experiments show that our proposed method can achieve better performance than other well-known text modeling methods, including the vector space model [1], document embedding [13], and the LDA model [8]. We also investigate the impact of choosing different numbers of keywords on the categorization system.

The rest of this paper is structured as follows. Section II describes the proposed method and the models on which it builds. Section III contains the experimental settings and comparisons of our method with other approaches. Then, we briefly discuss related work in Section IV. Finally, some conclusions are drawn in Section V.
Fig. 1. (a) The CBOW model uses the context words w_{t-c}, ..., w_{t+c} in the window as inputs to predict the current word w_t. (b) The SG model predicts the context words w_{t-c}, ..., w_{t+c} using the current word w_t as the input.
II. METHOD
One of the pioneering studies on developing a neural network-based language model was presented in [9]. It estimates a statistical language model, formalized as a feed-forward neural network, for predicting future words in the context while inducing word embeddings (or representations) as a by-product. It has motivated extensive explorations on developing similar methods for learning latent semantic and syntactic regularities in various NLP applications. Representative methods include the continuous bag-of-words (CBOW) model and the skip-gram (SG) model [15]. They have been proven to be successful in many tasks within and beyond NLP. However, little work has been done to utilize these methods for Chinese text categorization tasks. We briefly introduce these models and our proposed method in the following sections.

A. Continuous Bag-of-Words (CBOW) Model

The concept of CBOW is motivated by the distributional hypothesis [16], which states that words with similar meanings often occur in similar contexts, and thus suggests looking for word representations that capture their context distributions. Rather than seeking to learn a statistical language model, the CBOW model tries to obtain a dense vector representation (embedding) of each word directly [15]. The structure of CBOW is similar to that of a feed-forward neural network, with the exclusion of the non-linear hidden layer. In this way, the model can still retain good performance and be trained on much more data efficiently, while getting around the heavy computational burden incurred by the non-linear hidden layer. The concept of the CBOW model is illustrated in Fig. 1a.
Formally, given a sequence of words w_1, w_2, \ldots, w_T, the objective function of CBOW is to maximize the log-probability expressed in (1):

\sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c}),    (1)

where c is the window size of the training context for the central word w_t, and T denotes the length of the training corpus. The conditional probability P in Eq. (1) is defined as:

P(w_t \mid w_{t-c}, \ldots, w_{t+c}) = \frac{e^{v_{w_t}^{\top} \bar{v}_{w_t}}}{\sum_{i=1}^{V} e^{v_{w_i}^{\top} \bar{v}_{w_t}}},    (2)
where v_{w_i} denotes the vector representation of the i-th word in the vocabulary, v_{w_t} that of the word at position t, V indicates the size of the vocabulary, and \bar{v}_{w_t} denotes the sum of the vector representations of the context words of w_t.

B. Skip-gram (SG) Model

In contrast to the CBOW model, the SG model employs an inverse training objective for learning word representations with a simplified feed-forward neural network [13], [15], [17]. Formally, given a sequence of words w_1, w_2, \ldots, w_T, the objective function of SG is to maximize the following log-probability:
\sum_{t=1}^{T} \sum_{j=-c,\, j \neq 0}^{c} \log P(w_{t+j} \mid w_t),    (3)
where c is the window size of the training context for the central word w_t, and the conditional probability can be calculated by:

P(w_{t+j} \mid w_t) = \frac{e^{v_{w_t}^{\top} v_{w_{t+j}}}}{\sum_{i=1}^{V} e^{v_{w_t}^{\top} v_{w_i}}},    (4)
where v_{w_{t+j}} and v_{w_t} denote the word representations of the words at positions t + j and t, respectively. The concept of the SG model is illustrated in Fig. 1b.

In addition, improvements to the training procedure have been proposed to increase speed and effectiveness. They include negative sampling (NS) and hierarchical soft-max (HS) [10], [17], [18]. In NS, instead of updating the whole output layer, the words in the context window and k random words not in the context are drawn and evaluated as positive and negative input, respectively. Only those word vectors are updated in this process, hence the speedup. On the other hand, HS first constructs a hierarchical structure over the vocabulary, and then samples from the tree the relevant words for a training run. This structure can be a binary Huffman tree built using the frequency of each word, so that more frequent words have a shorter code length. The input thus consists of only those words that are on the path from the root to the current word. By reducing the number of vectors that need to be updated, these two methods have been shown to substantially decrease training time while retaining the high quality of the word vectors [17]. Since both CBOW and SG adopt a sequential training process, the learned model may be drastically affected by the order of the training samples when using the above speedup techniques. Therefore, randomization and/or multiple iterations over the corpus are often utilized when training these models.
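As a concrete illustration (our sketch, not the authors' code), both models and both speed-up techniques can be trained with the gensim library; the toy corpus and hyperparameters below are assumptions:

```python
# Minimal sketch of training CBOW and SG embeddings with gensim >= 4.0.
from gensim.models import Word2Vec

# corpus: an iterable of tokenized (e.g., word-segmented Chinese) sentences
corpus = [["總統", "選舉", "政見"], ["棒球", "投手", "三振"]]  # toy placeholder

# CBOW (sg=0) trained with hierarchical soft-max (hs=1, negative=0)
cbow = Word2Vec(corpus, vector_size=100, window=5, sg=0, hs=1, negative=0,
                min_count=1, epochs=5)

# Skip-gram (sg=1) trained with negative sampling (k=5 noise words per update)
sg = Word2Vec(corpus, vector_size=100, window=5, sg=1, hs=0, negative=5,
              min_count=1, epochs=5)

vec = cbow.wv["棒球"]  # a 100-dimensional word embedding
```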
C. Distributed Representation for Documents
Recently, neural network-based approaches for learning vector representations of longer text segments, such as paragraphs or documents, have been proposed. Among them, [13] proposed two approaches that can be seen as direct extensions of word embeddings, namely DBOW and DM. In essence, the DBOW model is an extension of SG, and the DM model of CBOW. Originally, the word vectors are learned using context words. This idea can be extended to document vectors in a similar manner. If every document is mapped to an ID, which can be thought of as a special word, we can use it to predict other words in the same document. During training, the word vectors are learned first. Then, a sliding window over the whole document is used to sample every word. Eventually, we obtain a vector for each document that aggregates the embedding information of the whole document. More specifically, document vectors (and word vectors) are learned with stochastic gradient descent, where the gradient is obtained via backpropagation. In each iteration, the vector for a document is fed through the neural network; we then compute the error gradient from the network and use it to update the document vector. Distributed document vectors have some advantages over traditional bag-of-words models. First, since they are based on word vectors, the semantics of the words can also be
incorporated. Second, they can include information from a much broader context, i.e., the whole document. Such a feature usually requires a very large n in n-gram models, and hence a heavy toll on memory. Lastly, since the document vectors are learned by sequentially feeding the word vectors into the network, the ordering of the words can also be taken into account. Due to these improvements, they have been shown to achieve strong results in text and sentiment classification tasks [13].
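For reference, a minimal sketch (ours, with assumed hyperparameters) of the DM and DBOW models as provided by gensim's Doc2Vec implementation is given below:

```python
# Minimal sketch of paragraph vectors (DM and DBOW) with gensim >= 4.0.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["棒球", "投手", "三振"], tags=["doc_0"]),
        TaggedDocument(words=["總統", "選舉", "政見"], tags=["doc_1"])]

dm_model = Doc2Vec(docs, dm=1, vector_size=100, window=5, min_count=1, epochs=20)   # DM
dbow_model = Doc2Vec(docs, dm=0, vector_size=100, min_count=1, epochs=20)           # DBOW

train_vec = dm_model.dv["doc_0"]                   # vector of a training document
new_vec = dm_model.infer_vector(["健康", "醫師"])  # vector inferred for unseen text
```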
Fig. 2. The DKV model estimates the document representation D_t using the keyword vectors KW_1, ..., KW_n in the document; λ_1, ..., λ_n denote the weighted LLR scores of the respective keywords.

D. Distributed Keyword Vectors for Topic Classification

In this work, we propose Distributed Keyword Vectors (DKV) for the representation and classification of a document. Previous research has found that keyword information is very effective in text classification [19]. This motivates us to use only keyword vectors to represent a document in the topic classification task. First, sets of topic-specific keywords are identified using the log likelihood ratio (LLR) [20], an effective feature selection method. It calculates the likelihood of the assumption that the occurrence of a word w in topic T is not random, and a higher LLR value indicates that w is more closely associated with the topic. The LLR value of each word w is calculated as follows. Given a training dataset, we first obtain four frequencies {k, l, m, n}, defined as:
k = N(w \wedge T),      l = N(w \wedge \neg T),
m = N(\neg w \wedge T),  n = N(\neg w \wedge \neg T),
where N(w \wedge T) denotes the number of documents that contain w and belong to topic T, N(w \wedge \neg T) denotes the number of documents that contain w but do not belong to topic T, and so on. Then, we employ Eq. (5) to calculate the likelihood of the assumption that the occurrence of a word w in the topic T is not random. In Eq. (5), the probabilities p(w), p(w|T), and p(w|\neg T) are estimated using maximum likelihood estimation. We rank the words in the training dataset based on their LLR values and select those with the highest values to compile a topical keyword list.
LLR(w, T) = 2 \log \frac{p(w \mid T)^{k} \, (1 - p(w \mid T))^{m} \, p(w \mid \neg T)^{l} \, (1 - p(w \mid \neg T))^{n}}{p(w)^{k+l} \, (1 - p(w))^{m+n}}    (5)
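As an illustrative sketch (our code, not the authors'), Eq. (5) can be computed directly from the four document counts; the maximum likelihood estimates follow the definitions of k, l, m, and n above:

```python
# Minimal sketch of the LLR keyword score in Eq. (5).
import math

def llr(k, l, m, n):
    """Log-likelihood ratio of word w and topic T, from document counts."""
    p_w_t = k / (k + m)              # p(w|T)
    p_w_not_t = l / (l + n)          # p(w|~T)
    p_w = (k + l) / (k + l + m + n)  # p(w)

    def log_lh(p, pos, neg):
        # log of p^pos * (1 - p)^neg, guarding against p in {0, 1}
        eps = 1e-12
        p = min(max(p, eps), 1 - eps)
        return pos * math.log(p) + neg * math.log(1 - p)

    numerator = log_lh(p_w_t, k, m) + log_lh(p_w_not_t, l, n)
    denominator = log_lh(p_w, k + l, m + n)
    return 2 * (numerator - denominator)

# e.g., a word appearing in 40 of 50 'Sport' documents but only 5 of 200 others
print(llr(k=40, l=5, m=10, n=195))
```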
Next, word embeddings are learned on the training corpus using CBOW. Finally, a document is also represented as a
vector, using a weighted average of the vectors that correspond to words in the keyword list. Fig. 2 illustrates the DKV model, in which the document D_t is represented as a weighted average of the keyword vectors, and the weight λ_i for a keyword KW_i is determined by its LLR value. If there is no keyword in a document, we calculate the mean of all word vectors in this document and compute its cosine similarity against all keyword vectors to find the closest ones to represent this document. Conceptually, we are projecting each document onto a high-dimensional vector space constructed from keywords, to which subsequent clustering or classification can be applied.
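A minimal sketch (ours; variable names and the fallback details are assumptions based on the description above) of constructing the DKV representation and handing it to an SVM is:

```python
# Minimal sketch of the DKV document representation described above.
import numpy as np

def dkv_vector(doc_tokens, keyword_llr, wv, dim=100, scale=2.0):
    """doc_tokens: tokenized document; keyword_llr: keyword -> LLR score;
    wv: word -> embedding lookup (e.g., a gensim KeyedVectors object)."""
    vecs, weights = [], []
    for w in doc_tokens:
        if w in keyword_llr and w in wv:
            vecs.append(wv[w])
            # weight = 2 * log(LLR); assumes selected keywords have LLR > 1
            weights.append(scale * np.log(keyword_llr[w]))
    if vecs:
        return np.average(vecs, axis=0, weights=weights)
    # Fallback when no keyword occurs: use the document's mean word vector and
    # back off to the most similar keyword vector by cosine similarity.
    in_vocab = [wv[w] for w in doc_tokens if w in wv]
    mean = np.mean(in_vocab, axis=0) if in_vocab else np.zeros(dim)
    kw_words = [w for w in keyword_llr if w in wv]
    kw_mat = np.stack([wv[w] for w in kw_words])
    sims = kw_mat @ mean / (np.linalg.norm(kw_mat, axis=1) * np.linalg.norm(mean) + 1e-12)
    return kw_mat[np.argmax(sims)]

# The resulting document vectors X can then be classified with an SVM, e.g.:
# from sklearn.svm import SVC; SVC(kernel="linear").fit(X_train, y_train)
```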
III. EXPERIMENTS
A. Dataset and Setting

We collected a corpus¹ of Chinese news articles from the Yahoo! online news website, in which each article is categorized into one of five topics, namely, Sports, Health, Politics, Travel, and Education. A total of 100,000 articles are kept and divided into training and test sets, each containing 50,000 articles. Each set contains roughly the same number of articles for each of the five topics. For evaluation metrics, we adopt the convention of using F1-scores.

Five text representation models are implemented and compared with the proposed method. First, a baseline system that uses Naive Bayes for classification is denoted as NB [21]. Next is a vector space model [1], denoted as VSM. In addition, we evaluate the latent Dirichlet allocation (LDA) method [8] for representing a document, which is subsequently used to train an SVM classifier (denoted as LDA).² Lastly, we compare two neural network-based document representations, DM and DBOW, proposed in [13].³ They are trained with a dimensionality of 100, i.e., every document is represented as a vector containing 100 elements.

Our system, denoted as DKV, is constructed as follows. At the outset of the training stage, word vectors are learned using CBOW with HS. The dimensionality is set to 100, identical to that of DM and DBOW. Next, we calculate the LLR value of every word in the corpus, and select the top 200 distinctive ones for each topic. This threshold was found to be useful in [19]. The weight of a keyword vector is set to its log LLR value times a scaling factor of two. Afterwards, the sum of the weighted keyword vectors is used as the representation of each document, and a support vector machine (SVM) [23] is trained to classify the topic of the document. To ensure the fairness of this experiment, the keyword vectors are frozen at test time, so that new context words in the testing documents will not be learned by the model.

In order to study the effectiveness as well as the characteristics of keyword vectors, we conducted two experiments.

¹ Should we clear copyright issues, this corpus will be made publicly available.
² The dictionary required by all compared methods is constructed by removing the stop words in [22], and retaining tokens that make up 90% of the accumulated frequency. For unseen events, we use Laplace smoothing in NB. We use the toolkit at http://nlp.stanford.edu/software/tmt/tmt-0.4/ to implement LDA.
³ Implemented using https://github.com/piskvorky/gensim/
TABLE I. F1-SCORES (%) OF SIX TOPIC DETECTION SYSTEMS ON THE CHINESE NEWS CORPUS. THE BEST SCORE ACROSS ALL SYSTEMS FOR EACH TOPIC IS MARKED WITH *.

Topic     | NB    | VSM   | LDA    | DM    | DBOW  | DKV
Sport     | 67.07 | 79.13 | 80.20  | 90.67 | 90.74 | 92.22*
Health    | 40.41 | 63.65 | 80.35  | 86.73 | 86.67 | 90.29*
Politics  | 42.86 | 66.89 | 67.31  | 85.41 | 85.70 | 86.78*
Travel    | 42.52 | 66.31 | 80.37* | 74.08 | 74.40 | 72.01
Education | 28.25 | 41.07 | 58.01  | 71.64 | 71.61 | 74.54*
Average   | 44.22 | 63.41 | 73.25  | 81.71 | 81.82 | 83.17*
In the first experiment, we compare the ability of the six methods to categorize news documents into five predefined topics. Then, the second experiment investigates the effect of the number of keywords on the proposed method. More specifically, we want to know whether more keywords lead to better performance.

B. Results & Discussion

Table I shows a comprehensive evaluation of DKV and the other methods in the first experiment. The baseline method NB only obtains a low average F1-score of around 44%, and VSM surpasses the overall performance of NB by about 20% absolute. This indicates that using only surface word weightings and ignoring inter-word relations cannot lead to a satisfactory result. In contrast, the LDA model outperforms the above two methods with an overall F1-score of 73%, an absolute 10% improvement. The ability to include both local and long-distance word relations may be the reason for its success. For example, names of important athletes that are far apart from each other in the text can still be found to have strong relations to the topic 'Sport' using the LDA model. This suggests that, in order to capture the more profound context hidden in the text, one has to consider not only surface words, but also the relations and semantics within them. By considering long-distance relations, the LDA model obtains a better result. It even obtains the highest F1-score of 80.37% among all compared methods for the topic 'Travel'; all other methods fail to achieve such an outstanding performance.

On the other hand, neural network-based methods like DM and DBOW further improve upon the aforementioned methods by another 10% absolute. They demonstrate the ability to overcome the weaknesses of bag-of-words models thanks to the combinatorial nature of a vector-based model. In addition, since the learning process considers all words in the document, these methods, like LDA, can include longer context information in the vector representation. We also observe that the difference between these two methods is very small.

Last but not least, DKV further surpasses the other methods and obtains the best overall F1-score. These results indicate that the topic can be sufficiently recognized by using only the information from topic-specific keywords. This demonstrates that word embeddings can encode the complex relations between keywords and their topics into a dense vector. By means of an effective keyword weighting like LLR, we can give more discriminating power to those unique vectors. Paired with a robust vector-based classifier like SVM, our system can provide substantial performance.
TABLE II. F1-SCORES (%) OF DIFFERENT SETTINGS FOR KEYWORD SIZE IN DKV.

#KWs | Sport | Health | Politics | Travel | Education | Average
 200 | 92.22 | 90.29  | 86.78    | 72.01  | 74.54     | 83.17
 400 | 92.69 | 90.50  | 87.66    | 73.98  | 75.01     | 83.97
 600 | 92.65 | 90.62  | 87.87    | 74.32  | 75.26     | 84.14
 800 | 92.73 | 90.78  | 88.01    | 74.72  | 75.32     | 84.31
1000 | 92.77 | 90.68  | 88.19    | 75.10  | 75.62     | 84.47
2000 | 93.06 | 90.44  | 88.26    | 76.31  | 76.75     | 84.96
3000 | 93.07 | 90.26  | 88.44    | 76.65  | 76.91     | 85.07
4000 | 92.96 | 90.07  | 88.40    | 76.99  | 76.96     | 85.08
Moreover, such an approach has the advantage of requiring very little supervision and feature engineering. It automatically learns the importance, or weights, of various words inherently. However, for the topic 'Travel', we are only able to achieve a less satisfactory result. This suggests that the distributions of keywords in the training set and the test set are not very consistent for this topic. Further research on how to remedy this weakness is needed for us to boost the performance of DKV.

In the second experiment, we want to explore whether including different numbers of keywords in DKV affects its performance. We set the number of keywords from 200 up to 4000 with the other parameters unchanged, and observe the variation in F1-scores. Table II shows an evaluation of the effect of keyword size. Although the F1-score generally increases when more keywords are included, the difference becomes negligible beyond a certain amount. As Table II shows, using more than 2000 keywords only leads to a less than 0.1% absolute gain overall. For some topics, the score is even slightly lower when more keywords are used, indicating that some non-distinctive keywords may have been included. Fig. 3 plots the results for different keyword sizes to better visualize the trend of F1-scores. It shows that the improvement is very limited once we surpass 2000 keywords. This suggests that the contribution from keywords has saturated in our framework, and simply adding more keywords would not lead to obvious improvements. In order to further increase the effectiveness of DKV, we may have to devise a way of learning broader information beyond keywords in our text representation framework.
Fig. 3. Comparison of keyword size and overall F1-scores of DKV (overall F1-score (%) plotted against the number of keywords, from 200 to 4000).
IV. RELATED WORK
Automatic categorization of documents has been an important research area. Most previous methods rely on some measure of the importance of keyword features. The weights of the keyword features are usually based on traditional statistical methods such as TF*IDF, conditional probability, and/or generation probability. For instance, [24] proposes the TF*PDF algorithm, which extends the above metric to avoid the collapse of important terms when they appear in many text documents. As stated in prior work, the IDF decreases the frequency value for a keyword when it is frequently used. Thus, considering different news sources or channels, the weight of a term from a single channel is linearly proportional to the term's frequency within it, while exponentially proportional to the ratio of documents in the channel that contain the term. Such a property can be utilized to automatically extract hot topics across different sources. Thus, the weighting of words has been proven to be useful in text categorization.

Others have adopted machine learning approaches to automatically recognize discriminative features for a topic. Topic detection can be formulated as a supervised classification problem [8], [25]. Given a training corpus containing a set of manually tagged examples of predefined topics, a supervised classifier is employed to train a topic detection model that assigns (i.e., classifies) a topic to a document. A Naive Bayes classifier that uses semantic information has been proposed to categorize text [26]. Alternatively, categorization can be considered as a clustering task. For instance, [14] attempted to find topics by clustering keywords using a statistical similarity measure for grouping documents, each of which represents a topic. The clusters are then connected chronologically to form a timeline of the topic. Furthermore, [27] uses the tolerance rough set model to populate a set of feature words into an approximated latent semantic space; a complete-link clustering algorithm can then be applied to extract hot topic sets. As for neural network-based methods, a weighted average of vectors for word representation has also been proposed in [28]. The advantage of machine learning approaches is that they can achieve substantial performance without much human involvement.

On the other hand, knowledge-based approaches have been proposed to incorporate knowledge such as ontologies into topic modeling and categorization. An ontology is a representation that formulates entities, attributes, relationships, and axioms within a domain in a human-understandable, machine-readable format [29]. This information, in addition to other linguistic resources, can be incorporated into templates for detecting the topic of a document [19]. Ontologies can also be utilized in many research fields besides text categorization. For instance, automatic knowledge construction from documents on the web can be done using entity relationships [30]. Also, keywords or key terms in a document can be identified through the use of an ontology [31]. More extensive applications include a recruitment system that provides intelligent matching between employer advertisements and the curricula vitae of candidates [32], and a travel route recommendation system that finds customized routes for each tourist based on one's preferences [33]. These show that knowledge can be useful in a variety of ways to support automatic systems. However, the quality of the knowledge can have a large impact on the performance of these methods.
Our method shares some advantages of existing approaches, but is distinguished by a number of aspects. First, our novel document representation method uses a weighted combination of keyword vectors learned for distinctive keywords. They can help eliminate confusion in the classifier and produce a better prediction. Second, the importance (or weight) of a vector is automatically ranked by its LLR. This further assists the classification by giving a higher weight to a more prominent keyword. Lastly, as a semi-supervised method, our approach can be adapted quickly to other categorization tasks.
V. CONCLUSION

In this paper, we present a novel approach for topic classification that uses a weighted average of keyword vectors as features for an SVM classifier. The contributions of this work are two-fold. First, we demonstrate that our method can yield a substantial improvement over other models on a Chinese news corpus. Second, we find that the number of keywords is positively related to the performance of our system, though the gains diminish beyond a certain limit. In the future, we want to investigate algorithms for combining keyword vectors that can lead to a better representation of a document. Also, exploring how to project semantic knowledge into the vector space is another interesting direction of research. Lastly, we want to extend this approach to other NLP and IR applications.

ACKNOWLEDGMENT

This research was supported by the Ministry of Science and Technology of Taiwan under grant MOST 103-3111-Y001-027. We are grateful for the insightful comments from three anonymous reviewers.

REFERENCES

[1] G. Salton, A. Wong, and C.-S. Yang, "A vector space model for automatic indexing," Communications of the ACM, vol. 18, no. 11, pp. 613-620, 1975.
[2] T. Joachims, Text categorization with support vector machines: Learning with many relevant features. Springer, 1998.
[3] O.-W. Kwon and J.-H. Lee, "Text categorization based on k-nearest neighbor approach for web site classification," Information Processing & Management, vol. 39, no. 1, pp. 25-44, 2003.
[4] F. De Comite, R. Gilleron, and M. Tommasi, "Learning multi-label alternating decision trees from texts and data," in Machine Learning and Data Mining in Pattern Recognition. Springer, 2003, pp. 35-49.
[5] A. Genkin, D. D. Lewis, and D. Madigan, "Sparse logistic regression for text categorization," DIMACS Working Group on Monitoring Message Streams Project Report, 2005.
[6] J. Bellegarda, "Latent semantic mapping," Signal Processing Magazine, IEEE, vol. 22, no. 5, pp. 70-80, Sept 2005.
[7] T. Hofmann, "Probabilistic latent semantic indexing," in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR '99. New York, NY, USA: ACM, 1999, pp. 50-57. [Online]. Available: http://doi.acm.org/10.1145/312624.312649
[8] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.
[9] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, "A neural probabilistic language model," The Journal of Machine Learning Research, vol. 3, pp. 1137-1155, 2003.
[10] F. Morin and Y. Bengio, "Hierarchical probabilistic neural network language model," in Proceedings of the international workshop on artificial intelligence and statistics, 2005, pp. 246-252.
[11] R. Collobert and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," in Proceedings of the 25th international conference on Machine learning. ACM, 2008, pp. 160-167.
[12] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, "Natural language processing (almost) from scratch," The Journal of Machine Learning Research, vol. 12, pp. 2493-2537, 2011.
[13] Q. Le and T. Mikolov, "Distributed representations of sentences and documents," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1188-1196.
[14] R. Nallapati, A. Feng, F. Peng, and J. Allan, "Event threading within news topics," in Proceedings of the thirteenth ACM international conference on Information and knowledge management. ACM, 2004, pp. 446-453.
[15] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," in Proceedings of Workshop at ICLR, 2013.
[16] G. A. Miller and W. G. Charles, "Contextual correlates of semantic similarity," Language and cognitive processes, vol. 6, no. 1, pp. 1-28, 1991.
[17] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in neural information processing systems, 2013, pp. 3111-3119.
[18] A. Mnih and K. Kavukcuoglu, "Learning word embeddings efficiently with noise-contrastive estimation," in Advances in Neural Information Processing Systems, 2013, pp. 2265-2273.
[19] Y.-C. Chang, Y.-L. Hsieh, C.-C. Chen, and W.-L. Hsu, "A semantic frame-based intelligent agent for topic detection," Soft Computing, pp. 1-11, 2015. [Online]. Available: http://dx.doi.org/10.1007/s00500-015-1695-4
[20] C. D. Manning and H. Schütze, Foundations of statistical natural language processing. MIT press, 1999.
[21] A. McCallum, K. Nigam et al., "A comparison of event models for naive Bayes text classification," in AAAI-98 workshop on learning for text categorization, vol. 752, 1998, pp. 41-48.
[22] F. Zou, F. L. Wang, X. Deng, S. Han, and L. S. Wang, "Automatic construction of Chinese stop word list," in Proceedings of the 5th WSEAS International Conference on Applied Computer Science, 2006, pp. 1010-1015.
[23] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, pp. 27:1-27:27, 2011.
[24] K. K. Bun and M. Ishizuka, "Topic extraction from news archive using TF*PDF algorithm," in Web Information Systems Engineering, International Conference on. IEEE Computer Society, 2002, p. 73.
[25] X. Zhang and T. Wang, "Topic tracking with dynamic topic model and topic-based weighting method," Journal of Software, vol. 5, no. 5, pp. 482-489, 2010.
[26] H. Jing, Y. Tsao, K.-Y. Chen, and H.-M. Wang, "Semantic naive Bayes classifier for document classification," in Proceedings of the Sixth International Joint Conference on Natural Language Processing. Nagoya, Japan: Asian Federation of Natural Language Processing, October 2013, pp. 1117-1123. [Online]. Available: http://www.aclweb.org/anthology/I13-1158
[27] Y. Wu, Y. Ding, X. Wang, and J. Xu, "On-line hot topic recommendation using tolerance rough set based topic clustering," Journal of Computers, vol. 5, no. 4, 2010.
[28] L. Qiu, Y. Cao, Z. Nie, Y. Yu, and Y. Rui, "Learning word representation considering proximity and ambiguity," in Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
[29] Q. T. Tho, S. C. Hui, A. C. M. Fong, and T. H. Cao, "Automatic fuzzy ontology generation for semantic web," Knowledge and Data Engineering, IEEE Transactions on, vol. 18, no. 6, pp. 842-856, 2006.
[30] H. Alani, S. Kim, D. E. Millard, M. J. Weal, W. Hall, P. H. Lewis, and N. R. Shadbolt, "Automatic ontology-based knowledge extraction from web documents," Intelligent Systems, IEEE, vol. 18, no. 1, pp. 14-21, 2003.
[31] M. Grineva, M. Grinev, and D. Lizorkin, "Extracting key terms from noisy and multitheme documents," in Proceedings of the 18th international conference on World wide web. ACM, 2009, pp. 661-670.
[32] F. Garcia-Sanchez, R. Martinez-Bejar, L. Contreras, J. T. Fernandez-Breis, and D. Castellanos-Nieves, "An ontology-based intelligent system for recruitment," Expert Systems with Applications, vol. 31, no. 2, pp. 248-263, 2006.
[33] C.-S. Lee, Y.-C. Chang, and M.-H. Wang, "Ontological recommendation multi-agent for Tainan city travel," Expert Systems with Applications, vol. 36, no. 3, pp. 6740-6753, 2009.