Report written for the course “Deep Learning for Natural Language Processing”, Prof. Thang Vu, University of Stuttgart, Institut für Maschinelle Sprachverarbeitung.

Sentiment Classification with Convolutional Neural Networks

Roozbeh Bandpey, Aida Zoriyatkha
University of Stuttgart, Institute of Natural Language Processing
{bandperh, zoriyaaa}@ims.uni-stuttgart.de

Abstract

This paper describes our deep learning system for sentiment analysis of texts. The main focus of this work is a model that initializes the parameter weights of the convolutional neural network, which makes it possible to train an accurate model even on a small data-set. In a nutshell, we use a model to train initial word embeddings that are further tuned by our deep learning model. At a final stage, the pre-trained parameters of the network are used to initialize the model. We then compare this model with a baseline model, which takes randomly initialized word vectors as input.

1 Introduction

In this work we describe our convolutional neural network model for sentiment analysis. The goal is to classify short texts, such as single sentences or micro-blog posts like customer reviews of products, into the positive or negative sentiment that they express. This task can be challenging because only a limited amount of contextual data is available in this type of text. To solve the task effectively, we need to fill this gap of contextual information in a scalable manner. To achieve that, we use semantic features of words in the form of word vectors (Mikolov et al., 2013), wherein words are projected onto a vector space. In this representation, semantically close words are likewise close in Euclidean or cosine distance in the vector space.
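As a small illustration of this notion (with made-up three-dimensional vectors rather than real embeddings), the cosine similarity between two word vectors can be computed as follows:

# Cosine similarity between word vectors (vectors are made up for illustration).
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

v_good  = np.array([0.9, 0.1, 0.3])
v_great = np.array([0.8, 0.2, 0.4])
v_bad   = np.array([-0.7, 0.5, 0.1])

print(cosine(v_good, v_great))  # close to 1: semantically similar words
print(cosine(v_good, v_bad))    # much lower: semantically distant words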

2 Model overview

The architecture of our convolutional neural network for sentiment classification is shown in Figure 1; it is mainly inspired by the architectures used in (Kim, 2014) for various sentence classification tasks. Given that our training process requires running the network on a rather small corpus, our design choices focus mainly on feature exploitation, and the computational efficiency of the network is not a primary concern. In the following, we give a brief explanation of the main components of our network.

2.1 Methods

Convolutional neural networks (CNNs) utilize layers with convolving filters that are applied to local features. Originally invented for computer vision, CNN models have subsequently been shown to be effective for NLP tasks as well. The input to our model consists of customer reviews of products, each treated as a sequence of words [w_1, ..., w_n], where each word is drawn from a vocabulary and words are represented by distributional vectors.

Let x_i ∈ R^k be the k-dimensional word vector corresponding to the i-th word in the sentence. A sentence of length n is then represented as

    x_{1:n} = x_1 ⊕ x_2 ⊕ ... ⊕ x_n,    (1)

where ⊕ is the concatenation operator. A convolution operation involves a filter w ∈ R^{hk}, which is applied to a window of h words to produce a new feature. A feature c_i is generated from a window of words x_{i:i+h−1} by

    c_i = f(w · x_{i:i+h−1} + b),    (2)

where b ∈ R is a bias term and f is a non-linear function such as the hyperbolic tangent or rectified linear units.
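As a toy worked example of Equation (2), with k = 2, h = 2, and all numbers made up purely for illustration, one feature is computed as follows:

# Toy illustration of Eq. (2): one filter applied to one window of h = 2 words
# with k = 2 dimensional word vectors; all numbers are made up for illustration.
import numpy as np

x1 = np.array([0.2, -0.1])            # word vector x_1
x2 = np.array([0.5, 0.3])             # word vector x_2
window = np.concatenate([x1, x2])     # x_{1:2} = x_1 concatenated with x_2

w = np.array([0.4, -0.2, 0.1, 0.3])   # filter weights, w in R^{hk}
b = 0.05                              # bias term

c1 = max(0.0, float(np.dot(w, window) + b))  # f chosen as ReLU
print(c1)  # approximately 0.29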

We apply a max-pooling operation over the feature map and take the maximum value ĉ = max{c} as the feature corresponding to this particular filter. The idea is to capture the most important feature for each feature map. The model uses multiple filters to obtain multiple features. These features form the penultimate layer and are passed to a fully connected sigmoid layer whose output is the probability distribution over the labels.
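The following is a minimal sketch of this architecture, assuming Keras (the report does not prescribe a framework and its exact implementation is not reproduced here). The two embedding channels of Figure 1 are simplified to a single channel, and reading Table 2's filter settings as 3 filters with a window of 4 words is our interpretation:

# Minimal sketch of the described architecture (Keras assumed, single channel).
from tensorflow.keras import layers, models

vocab_size = 3381   # vocabulary size of the data-set (Table 1)
embed_dim = 300     # dimensionality of the word vectors (Table 2)
max_len = 105       # maximum sentence length (Table 1)

model = models.Sequential([
    layers.Input(shape=(max_len,)),
    # map every word index to its k-dimensional vector x_i
    layers.Embedding(vocab_size, embed_dim),
    # filters w applied to windows of h words produce features c_i (Eq. 2);
    # 3 filters with a window of 4 words is our reading of Table 2's "3x4"
    layers.Conv1D(filters=3, kernel_size=4, activation="relu"),
    layers.Dropout(0.7),            # dropout probability for the convolution layer (Table 2)
    # max-over-time pooling: keep the largest c_i of each feature map
    layers.GlobalMaxPooling1D(),
    layers.Dropout(0.8),            # dropout probability for the hidden layer (Table 2)
    # fully connected sigmoid layer giving the probability of a positive review
    layers.Dense(1, activation="sigmoid"),
])
model.summary()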


Figure 1: Model architecture with two channels for an example sentence.

2.2 Data

We ran our experiments on a data-set of customer reviews of various products, in which sentences are annotated as expressing positive or negative sentiment. Summary statistics of the data-set are given in Table 1. We randomly select 20% of the training data as the development set. For extracting vectors for the words, and for constructing the matrices that represent sentences, we use the Google word2vec vectors that were trained on 100 billion words from Google News (Mikolov et al., 2013). For this purpose we used the Python library gensim, which loads the trained Google model and makes it usable with any vocabulary.

Number of sentences      1729
Vocabulary size          3381
Positive reviews         1088
Negative reviews         641
Max sentence length      105

Table 1: Overall statistics of the review data-set.
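In practice, the sentence matrices can be obtained through gensim's KeyedVectors interface. The following sketch assumes the standard GoogleNews-vectors-negative300.bin file and uses an illustrative fallback for out-of-vocabulary words, which the report does not specify:

# Sketch of building a sentence matrix from the pre-trained vectors with gensim.
import numpy as np
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def sentence_matrix(tokens, dim=300):
    """Stack the word vectors of a tokenized review into an n x dim matrix."""
    rows = []
    for tok in tokens:
        if tok in w2v:
            rows.append(w2v[tok])
        else:
            # unknown words get a small random vector here (illustrative choice)
            rows.append(np.random.uniform(-0.25, 0.25, dim))
    return np.stack(rows)

matrix = sentence_matrix("this product works great".split())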

2.3 Training the network

Convolutional neural networks can be tricky to train, as they are often severely subject to overfitting when trained on small data-sets. We use stochastic gradient descent (SGD) to optimize the network and the backpropagation algorithm to compute the gradients. To train the network efficiently, the hyperparameter values were chosen via trial and error. The values, shown in Table 2, are specific to this data-set and yielded the highest accuracy, without regard for runtime. For regularization we employ dropout on the penultimate layer; dropout prevents co-adaptation of hidden units by randomly dropping them, that is, setting some proportion of the hidden units to zero during the forward pass. For this specific data-set, higher dropout probabilities for both the convolution layer and the hidden layers proved to be a good regularizer, consistently adding 1%–2% relative performance.

Embeddings dimension                         300
Filter size                                  3x4
Number of filters                            3
Dropout probability for convolution layer    70%
Dropout probability for hidden layers        80%
Activation function                          ReLU

Table 2: Overview of the hyperparameters for our CNN model, which is fed with the pre-trained word vectors.
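Continuing the Keras sketch from Section 2.1, a training call reflecting these settings might look as follows; x_train and y_train stand for the padded word-index sequences and the binary labels, which are assumed to have been prepared, and the learning rate and batch size are illustrative since the report does not state them:

# Training sketch matching the settings above (SGD, 100 epochs, 20% dev split).
from tensorflow.keras import optimizers

model.compile(optimizer=optimizers.SGD(learning_rate=0.01),  # learning rate is illustrative
              loss="binary_crossentropy",
              metrics=["accuracy"])

history = model.fit(x_train, y_train,
                    validation_split=0.2,   # 20% of the training data as development set
                    epochs=100,             # as in Figures 2-4
                    batch_size=50)          # illustrative; not stated in the report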

3 Results

The baseline model was designed by randomly initializing the word vectors. Our baseline model does not perform well on its own. While we had expected performance gains through the use of pre-trained vectors, we have seen only a slight improvement in results.

Models      Acc      #epoch
W2VCNN      82.92    85
Baseline    79.48    7

Table 3: Overall performance of the system.

The best hyperparameters yielded by trial and error for the baseline model are shown in Table 4. Evidently, when the model is initialized with random vectors, it needs more filters and lower dropout probabilities to capture the features well.

Embeddings dimension                         20
Filter size                                  3x4
Number of filters                            150
Dropout probability for convolution layer    25%
Dropout probability for hidden layers        50%
Activation function                          ReLU

Table 4: Overview of the hyperparameters for the baseline model.

Figure 2 shows how accuracy and error rate change over 100 epochs for the model that uses the pre-trained embedding weights, and Figure 3 shows the same for the baseline model. The dramatic increase of the loss on the test data before epoch 20 for both models is an obvious sign of overfitting. Across our experiments, the best performance¹ was achieved by the model that uses 20-dimensional word vectors; all settings for this model are the same as the hyperparameters in Table 2, except that the vector dimensionality is 20 and the number of filters is 10. Reducing the dimensionality of the word vectors loses information, but that information turns out to be redundant for the model, which we attribute to the small size of our data-set. The performance of this model is shown in Table 5, and its accuracy and error rate over 100 epochs are plotted in Figure 4.

¹ Best performance considering speed, accuracy, and the overfitting problem.
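Curves like those in Figures 2–4 can be drawn from the Keras training history, for example as in the following sketch (the history object is assumed to come from the training sketch in Section 2.3):

# Sketch of plotting per-epoch accuracy and loss curves, as in Figures 2-4.
import matplotlib.pyplot as plt

plt.plot(history.history["accuracy"], label="train accuracy")
plt.plot(history.history["val_accuracy"], label="dev accuracy")
plt.plot(history.history["loss"], label="train loss")
plt.plot(history.history["val_loss"], label="dev loss")
plt.xlabel("epoch")
plt.legend()
plt.show()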

Models            Acc      #epoch
20-dim W2VCNN     81.21    99

Table 5: Overall performance of the model with 20-dimensional word vectors.

Figure 2: Performance over 100 epochs for the model with pre-trained word vectors.

Figure 3: Performance over 100 epochs for the baseline model.

Figure 4: Performance over 100 epochs for the model which uses 20-dimensional word vectors.

4 Summary & Conclusion

We have described a series of experiments with convolutional neural networks built on top of word2vec. Despite little tuning of hyperparameters, a simple CNN with one layer of convolution performs remarkably well. Our results show that pre-trained word vectors are an important ingredient in deep learning for NLP.

Acknowledgments

We would like to thank Prof. Thang Vu for his great teaching and supervision.

References

Y. Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of EMNLP 2014.

T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS 2013.



