2017 16th IEEE International Conference on Machine Learning and Applications

Efficient Deep Learning Model for Text Classification Based on Recurrent and Convolutional Layers

Abdalraouf Hassan
Dept. of Computer Science and Engineering, University of Bridgeport, CT, 06604, USA
[email protected]

Ausif Mahmood
Dept. of Computer Science and Engineering, University of Bridgeport, CT, 06604, USA
[email protected]

Abstract—Natural Language Processing (NLP) systems conventionally treat words as distinct atomic symbols, so a model can leverage only a small amount of information regarding the relationships between the individual symbols. For text, one common technique to extract fixed-length features is bag-of-words. Despite its popularity, the bag-of-words representation has two major weaknesses: it ignores the semantics of words and the order of words. In this paper, we propose a neural language model that relies on a Convolutional Neural Network (CNN) and a Bidirectional Recurrent Neural Network (BRNN) over pre-trained word vectors. We utilize bidirectional layers as a substitute for the pooling layers in the CNN in order to reduce the loss of detailed local information and to capture long-term dependencies across input sequences. We validate the proposed model on two benchmark sentiment analysis datasets, the Stanford Large Movie Review dataset (IMDB) and the Stanford Sentiment Treebank (SSTb). Our model achieves competitive results compared with neural language models on these sentiment analysis datasets.

Keywords—Convolutional Neural Network, Bidirectional Recurrent Neural Network, Long Short-Term Memory, Recurrent Neural Network.

I. INTRODUCTION

Language modeling is a fundamental task in artificial intelligence and Natural Language Processing (NLP), and is used in NLP applications such as speech recognition, text generation, and machine translation. A language model is formalized as a probability distribution over a sequence of words. Traditional methods (e.g. n-gram models) usually apply a Markov assumption to estimate the probability of an n-gram; such models are simple to train. However, despite smoothing techniques, n-gram models estimate poor probabilities because of the data sparsity problem [1]. The standard method for text classification is to represent a sentence as a bag-of-words and then feed a fixed-length sequence of vectors to a linear classifier (e.g. logistic regression). Linear classifiers remain strong baselines for text classification problems [2]. However, linear classifiers do not share parameters between features and classes, which may limit generalization in settings with large output spaces where some classes have few examples. One general way to address this issue is to factorize the linear classifier into low-rank matrices [3]; another is to use multilayer neural networks [4].
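As a point of reference for the bag-of-words baseline described above, the following is a minimal sketch of a unigram bag-of-words pipeline with a logistic regression classifier using scikit-learn. The library choice, the toy data, and all parameter values are illustrative assumptions, not the setup used in this paper.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy sentiment data (hypothetical examples for illustration only).
texts = ["a wonderful, moving film", "dull plot and wooden acting",
         "brilliant performances throughout", "a tedious waste of time"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# Bag-of-words counts ignore word order and semantics, as noted above.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["a brilliant and moving film"]))  # expected: [1]
```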


Neural language models outperform n-gram models and overcome the data sparsity issue [1], since semantically similar words are close in the vector space. However, embeddings of rare words are poorly estimated, which leads to high perplexities for rare words. The convolution-gated recurrent network is possibly the most relevant prior work [5]. That method uses hierarchical processing of documents and a bidirectional recurrent network to extract feature vectors of the document, which are then mapped into a classification layer; the recurrent network focuses on modeling inter-sentence structure. The combination of convolutional and recurrent layers in one architecture has been explored for image segmentation [6], and a similar approach has also been applied to speech recognition [7]. Using a vanilla Convolutional Neural Network (CNN) has one disadvantage: the network must have many layers in order to capture long-term dependencies [4, 8]. Contrary to the convolutional layer, a Recurrent Neural Network (RNN) is able to capture long-term dependencies even with a single layer [9]. This is especially true for the Bidirectional Recurrent Neural Network (BRNN), because each hidden state is computed based on the whole input sequence. Therefore, to overcome the problem of long-term dependencies in CNN-based architectures, in this paper we propose an efficient deep learning language model that utilizes both convolutional and recurrent layers. Our approach can be summarized as follows: we train a simple CNN with one convolutional layer on top of word vectors obtained from an unsupervised neural language model [3, 10]. We then utilize a BRNN as an alternative to the pooling layers in order to reduce the loss of detailed local information. We empirically validate the proposed model on two datasets and compare our experimental results to the best recent deep learning methods. Our results show that it is possible to use a much smaller model to achieve the same level of classification performance with fewer parameters.

II. RELATED WORK

Deep neural network methods jointly implement feature extraction and classification for document classification [11, 12]. CNNs have recently accomplished remarkable performance on NLP tasks [11, 13, 14]. However, CNN architectures require experts to set up the network parameters. [4] proposed a

CNN with multiple layers and max pooling, similar to the model proposed in computer vision [6]. The CNN proposed in [4] required multiple convolutional layers to capture long-term dependencies because of the locality of the convolutional and pooling layers; as the length of the input grows, it becomes crucial to capture long-term dependencies. The objective of [4] was to investigate a deep convolutional network that overcomes these issues. [8] investigated a combined architecture of convolutional and recurrent networks to encode character input, building a high-level feature sequence over character-level input to capture sub-word information. However, this model performs better only when a large number of classes is available. RNNs are able to capture long-term dependencies even with a single layer, and RNNs are the leading approach for many NLP tasks today.

For sentiment analysis there are a few notable neural network architectures. A semi-supervised method based on a recursive autoencoder was proposed in [15]; the network learns vector representations for phrases and exploits the recursive nature of sentences. Similar work was proposed in [16]: a matrix-vector recursive model for semantic compositionality, where a recursive neural network is applied to sentence classification. A composition function is defined and recursively applied at each node of the parse tree of an input sentence in order to extract a feature vector for the sentence; this model relies on an external parser.

Generally, deep learning approaches start with an input sentence represented as a sequence of words. Each word is represented as a one-hot vector, and each word in the sequence is then projected into a continuous vector space by multiplication with a weight matrix, forming a sequence of real-valued dense vectors. This sequence is fed into a deep neural network, which processes it in multiple layers, resulting in a prediction probability. The whole network is tuned jointly to maximize the classification accuracy on a training set. However, the one-hot vector makes no assumption about the similarity of words and is also very high dimensional [8, 11].

A standard RNN makes predictions by taking only past words into account. This is suitable for predicting the next word in context. However, for some tasks it is beneficial to use both past and future words, for example in tagging tasks such as part-of-speech tagging, where we need to assign a tag to each word in a sentence [17, 18]. In this case we already know the whole sequence of words, and for each word we want to take both the words to the left (past) and the words to the right (future) into consideration when making a prediction. That is exactly what a bidirectional network does: it consists of two Long Short-Term Memory (LSTM) networks, one running forward from left to right and the other running backward from right to left. This technique is successful in tagging tasks and for embedding a sequence into a fixed-length vector [8].

Our work is inspired by the method proposed in [8], where an RNN was used to support the convolutional layer in capturing long-term dependencies across the whole document,

and by the successful application of CNNs with pre-trained word vectors in [4, 11, 19]. In this work, we present a neural network architecture that utilizes convolutional and recurrent layers on top of pre-trained word vectors obtained from an unsupervised neural language model [10]. Pre-trained word vectors are able to capture syntactic and semantic relationships among words, which helps reduce the loss of detailed local information. The recurrent layer has the ability to remember important information across long stretches of time; we exploit the CNN and RNN in a single architecture to capture long-term dependencies. The unsupervised neural language model was trained on 100 billion words from Google News [10]. We exploit the word2vec model in order to improve performance in the absence of a large supervised training set [19]; in NLP applications, a simple way to improve accuracy is to use unsupervised word representations as extra word features.

III. NEURAL NETWORK ARCHITECTURE

The network takes a sequence of words as input and processes it through a sequence of layers to extract features.

A. Word-Level Embedding
Word-level embeddings play an important role in the proposed model. Pre-trained word vectors help to capture semantic and syntactic information, which is significant for sentiment analysis tasks; pre-trained vectors are also feature extractors that can be utilized for various classification tasks. We use word2vec, a tool that implements the skip-gram and CBOW architectures for computing vector representations of words. These vectors were trained on 100 billion words from Google News [3].

Figure 1. Word2vec architecture.
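For concreteness, below is a minimal sketch of how the pre-trained Google News vectors can be loaded and queried with the gensim library. The library choice, the local file name, and the example words are assumptions for illustration; the paper itself only states that word2vec vectors trained on Google News are used.

```python
from gensim.models import KeyedVectors

# Load the pre-trained 300-dimensional Google News vectors
# (the binary file path is a local assumption).
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

print(vectors["movie"].shape)                      # (300,) -> k-dimensional word vector
print(vectors.most_similar("excellent", topn=3))   # semantically close words
```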

B. Convolutional Neural Network
The model shown in Figure 2 is a variant of the convolutional model of [19]. Let x_i \in R^k be the k-dimensional word vector corresponding to the i-th word in the sentence. A sentence of length n (padded where necessary) is represented as:

x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n \qquad (1)

where \oplus is the concatenation operator. A convolution operation involves a filter w \in R^{hk}, which is applied to a window of h words to produce a new feature. For instance, a feature c_i is generated from a window of words x_{i:i+h-1} by

c_i = f(w \cdot x_{i:i+h-1} + b) \qquad (2)

where b \in R is a bias term and f is a nonlinear function such as a rectifier or tanh. This filter is applied to every possible window of words in the sequence {x_{1:h}, x_{2:h+1}, \ldots, x_{n-h+1:n}} to produce a feature map

c = [c_1, c_2, \ldots, c_{n-h+1}] \qquad (3)

We substitute the pooling layer with a recurrent layer and feed this feature map into a bidirectional layer.
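The following NumPy sketch mirrors equations (1)-(3): it stacks word vectors into a sentence matrix and slides a single filter of width h over it to produce one feature map. The dimensions, the random initialization, and the ReLU choice are assumptions made only to keep the example self-contained.

```python
import numpy as np

k, n, h = 300, 12, 5                 # embedding size, sentence length, filter width
rng = np.random.default_rng(0)

X = rng.normal(size=(n, k))          # x_1 ... x_n stacked row-wise (eq. 1)
w = rng.normal(size=(h * k,))        # one filter w in R^{hk}
b = 0.0                              # bias term

relu = lambda z: np.maximum(z, 0.0)  # nonlinearity f

# Slide the filter over every window x_{i:i+h-1} to build the feature map c (eqs. 2-3).
c = np.array([relu(w @ X[i:i + h].ravel() + b) for i in range(n - h + 1)])
print(c.shape)                       # (n - h + 1,)
```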

Figure 2. The proposed CNN-LSTM architecture for document classification.

C. Recurrent Neural Network
RNNs make use of sequential information, where the output is based on previous computations. In a traditional neural network all inputs are independent of each other, which is inefficient for many NLP tasks, e.g. predicting the next word in a sentence, where it is important to know the previous words. RNNs have shown great success in many NLP tasks. We can think of an RNN as having a memory that captures information over arbitrarily long sequences, as illustrated in Figure 3:

h_t = f(x_t, h_{t-1}) \qquad (4)

where x_t \in R^d is one time step from the input sequence (x_1, x_2, \ldots, x_T), and h_0 \in R^d is often initialized to all zeros.

Recursive neural networks have proved to be efficient in constructing sentence representations. The model has a tree structure that is able to capture the semantics of a sentence. However, this is a time-consuming approach because of the complexity of constructing the textual tree [15]. A recurrent neural network has better time complexity: text is analyzed word by word, and the semantics of all the previous text is preserved in a fixed-size hidden layer [20]. The ability to capture such contextual information can be valuable for capturing the semantics of long texts. However, the recurrent network is a biased model, because recent words are more significant than earlier words, whereas key components may appear anywhere in a document, not only at the end. This can reduce its effectiveness at capturing the semantics of whole documents. The LSTM model was introduced to overcome these difficulties. The most naive recursive function is known to suffer from the problem of vanishing gradients [21]; more recently it has become common to use Long Short-Term Memory (LSTM) [9, 22].

Figure 3. Recurrent and forward neural network.

D. LSTM
The LSTM is a more complicated function that learns to control the flow of information so as to prevent the vanishing gradient problem and to allow the recurrent layer to more easily capture long-term dependencies. The LSTM was initially proposed in [9] and later modified in [23]. RNNs suffer from gradient vanishing or explosion: since RNNs can be viewed as deep neural networks unrolled across many time steps, the gradient at the end of a sentence may not be able to back-propagate to the beginning of the sentence because of the many nonlinear transformations [6, 20]. These issues are the main motivation behind the LSTM model, which introduces a new structure called a memory cell, shown in Figure 4. The memory cell consists of four main components: input, output, and forget gates, and a candidate memory cell.

The equations below describe how the memory cell layers are updated at every time step t. First, we compute the values of i_t, the input gate, and C̃_t, the candidate value for the states of the memory cells at time t:

i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \qquad (5)

\tilde{C}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c) \qquad (6)

Second, we compute the value of f_t, the activation of the memory cells' forget gates at time t:

f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \qquad (7)

Knowing the new value of the input gate activation i_t, the forget gate activation f_t, and the candidate state value C̃_t, we can compute C_t, the new state of the memory cells at time t:

C_t = i_t * \tilde{C}_t + f_t * C_{t-1} \qquad (8)

With the new state of the memory cells, we compute the value of their output gates and, subsequently, their outputs:

o_t = \sigma(W_o x_t + U_o h_{t-1} + V_o C_t + b_o) \qquad (9)

h_t = o_t * \tanh(C_t) \qquad (10)

where x_t is the input to the memory cell layer at time t; W_i, W_f, W_c, W_o, U_i, U_f, U_c, U_o, and V_o are weight matrices; and b_i, b_f, b_c, and b_o are bias vectors.

Figure 4. The LSTM framework, showing the network using only the last hidden state for text prediction.
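A compact NumPy sketch of a single memory-cell update, following equations (5)-(10), is given below. The hidden size and the random parameter initialization are illustrative assumptions; V_o is treated here as a full matrix, whereas implementations often use a diagonal peephole weight.

```python
import numpy as np

d, m = 8, 16                       # input size, hidden size (illustrative values)
rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Weight matrices W_*, U_*, V_o and bias vectors b_* (randomly initialized here).
W = {g: rng.normal(scale=0.1, size=(m, d)) for g in "ifco"}
U = {g: rng.normal(scale=0.1, size=(m, m)) for g in "ifco"}
Vo = rng.normal(scale=0.1, size=(m, m))
b = {g: np.zeros(m) for g in "ifco"}

def lstm_step(x_t, h_prev, C_prev):
    """One memory-cell update, mirroring equations (5)-(10)."""
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])              # (5) input gate
    C_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])          # (6) candidate state
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])              # (7) forget gate
    C_t = i_t * C_tilde + f_t * C_prev                                  # (8) new cell state
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + Vo @ C_t + b["o"])   # (9) output gate
    h_t = o_t * np.tanh(C_t)                                            # (10) hidden output
    return h_t, C_t

h, C = np.zeros(m), np.zeros(m)
for x in rng.normal(size=(5, d)):   # run over a toy sequence of 5 input vectors
    h, C = lstm_step(x, h, C)
print(h.shape)                      # (16,)
```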

IV. BIDIRECTIONAL RECURRENT LAYER

Bidirectional Recurrent Neural Networks (BRNN) [18, 24] do not require their input data to be fixed, and future input information is reachable from the current state. One property of a unidirectional recurrent layer is that there is an imbalance in the amount of information seen by the hidden states at different time steps: the earlier hidden states observe only a few vectors from the lower layer, while the later ones are computed based on most of the lower-layer vectors. The objective is therefore to connect two hidden layers of opposite directions to the same output, so that the output layer can obtain information from both past and future states. A BRNN is composed of two recurrent layers working in opposite directions, which return two sequences of hidden states from the forward and reverse recurrent layers, respectively.

In the proposed model, the input sequence is turned into a sequence E of dense, real-valued embedding vectors, and we apply one convolutional layer to E to obtain a shorter sequence of feature vectors:

F = (f_1, f_2, \ldots, f_{T'}) \qquad (11)

This feature sequence is then fed into a bidirectional recurrent layer, resulting in two sequences of hidden states:

H_{forward} = (\overrightarrow{h_1}, \overrightarrow{h_2}, \ldots, \overrightarrow{h_{T'}}) \qquad (12)

H_{reverse} = (\overleftarrow{h_1}, \overleftarrow{h_2}, \ldots, \overleftarrow{h_{T'}}) \qquad (13)

We take the last hidden states of both directions and concatenate them to form a fixed-dimensional vector:

h = [\overrightarrow{h_{T'}}; \overleftarrow{h_1}] \qquad (14)

Finally, the fixed-dimensional vector h is fed into the classification layer to compute the predictive probabilities p(y = k | X) for all categories k = 1, \ldots, K given the input sequence X.
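To make the overall pipeline concrete, here is a minimal Keras sketch of an architecture in the spirit of the model described above: an embedding layer, one convolutional layer, a bidirectional LSTM replacing the pooling layer, and a softmax classifier over the concatenated last hidden states. The vocabulary size, sequence length, filter width, hidden size, and the use of randomly initialized (rather than word2vec-initialized) embeddings are assumptions for brevity, not the exact configuration reported in Section VII.

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, seq_len, emb_dim = 20000, 400, 300   # illustrative sizes
num_filters, hidden_units, num_classes = 256, 128, 2

model = keras.Sequential([
    keras.Input(shape=(seq_len,), dtype="int32"),
    # Word embeddings; in the paper these would be initialized from word2vec.
    layers.Embedding(vocab_size, emb_dim),
    # One convolutional layer producing the feature sequence F (eq. 11).
    layers.Conv1D(num_filters, kernel_size=5, activation="relu"),
    # Bidirectional LSTM replacing pooling; Keras concatenates the final
    # forward and backward states, matching h = [h_T' forward; h_1 backward] (eq. 14).
    layers.Bidirectional(layers.LSTM(hidden_units)),
    # Dropout after the recurrent layer, as described in Section VII.
    layers.Dropout(0.5),
    # Softmax classification layer (eq. 15).
    layers.Dense(num_classes, activation="softmax"),
])
model.summary()
```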

V. CLASSIFICATION LAYER

The classification layer is in essence a logistic regression classifier. Given the fixed-dimensional input from the lower layer, the classification layer affine-transforms it and applies a softmax activation function to compute the predictive probabilities for all categories:

p(y = k \mid X) = \frac{\exp(W_k^T X + b_k)}{\sum_{k'=1}^{K} \exp(W_{k'}^T X + b_{k'})} \qquad (15)

where the W_k and b_k are the weight and bias vectors, and we assume there are K categories.
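A small NumPy illustration of equation (15), using arbitrary weights and a random input vector purely as assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(2)
K, dim = 5, 256                      # number of categories, size of the input vector
W = rng.normal(size=(K, dim))        # one weight vector W_k per category
b = np.zeros(K)                      # bias vector
h = rng.normal(size=dim)             # fixed-dimensional input from the lower layer

logits = W @ h + b
probs = np.exp(logits - logits.max())   # subtract max for numerical stability
probs /= probs.sum()                    # p(y = k | X) as in eq. (15)
print(probs, probs.sum())               # class probabilities summing to 1
```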

VI. DATASETS

To evaluate the performance of our model we use two sentiment analysis datasets: the Stanford Large Movie Review dataset (IMDB) [25] and the Stanford Sentiment Treebank (SSTb) [16].

A. IMDB Sentiment Analysis Dataset
The Stanford Large Movie Review dataset was proposed by [25]. It consists of 100,000 movie reviews taken from IMDB, where each review contains several sentences. The dataset is divided into three parts: 25,000 labeled training instances, 25,000 labeled test instances, and 50,000 unlabeled training reviews. Positive and negative labels are balanced in the training and test sets.

B. Stanford Sentiment Treebank Dataset
The Stanford Sentiment Treebank dataset (SSTb), introduced by [16], consists of 11,855 reviews from Rotten Tomatoes, split into train, dev, and test sets of 8,544, 1,101, and 2,210 reviews, respectively. SSTb reviews are also labeled on a five-point scale: very negative, negative, neutral, positive, and very positive.
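For readers who want to reproduce experiments on IMDB, one convenient (though not necessarily the authors') way to obtain a tokenized version of the Maas et al. reviews is the copy bundled with Keras; the vocabulary size and padding length below are assumptions.

```python
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size, seq_len = 20000, 400    # illustrative limits
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)

# Pad/truncate every review to a fixed length so batches have uniform shape.
x_train = pad_sequences(x_train, maxlen=seq_len)
x_test = pad_sequences(x_test, maxlen=seq_len)
print(x_train.shape, x_test.shape)  # (25000, 400) (25000, 400)
```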

VII. EXPERIMENTAL SETUP

A. Model Settings
Many different combinations of hyper-parameters can give similar results. We devoted extra time to tuning the learning rate and dropout, which have a large impact on prediction performance. The learning rate and the number of units in the convolutional layer that extracts sentence features are the only two parameters that differ across datasets. The number of epochs varies between five and ten for both datasets. We believe that by adding recurrent layers we can reduce the number of convolutional layers needed to capture long-term dependencies. Therefore, for each dataset, we model with only one convolutional layer, multiple filter widths (4, 5, 5), and 256 feature maps. For the activation function in the convolutional layer we use the rectified linear unit (ReLU), with a receptive field size of three or five depending on depth; to reduce overfitting we apply dropout of 0.5 before the recurrent layer. The main component of our work employs a recurrent layer as a substitute for the pooling layer in the convolutional network; the recurrent layer is fixed to a single BRNN layer [18, 24]. We also use gradient clipping [26] on the cell outputs and gradients. The tasks are sentiment analysis on IMDB and the Stanford Sentiment Treebank. In our model we set the number of filters in the convolutional layer to twice the dimension of the hidden states in the recurrent layer. Dropout is an effective way to regularize deep neural networks [26]. We observe that applying dropout both before and after the recurrent layer hurts performance; therefore, we only apply dropout after the recurrent layer, set to 0.5. Dropout prevents co-adaptation of hidden units by randomly dropping units out. The datasets are publicly available.

B. Optimization
Training is done through stochastic gradient descent over shuffled mini-batches. We randomly split the full training examples into training and validation sets; the size of the validation set is the same as the corresponding test set and is balanced across classes. We train the model by minimizing the negative log-likelihood (cross-entropy) loss, and an early stopping strategy is used to prevent overfitting. In our work, we employ unsupervised learning of word-level embeddings using word2vec [10], which implements the continuous bag-of-words and skip-gram architectures for computing vector representations of words. We validate the proposed model on two datasets, considering the difference in the number of parameters. However, the accuracy of the model does not increase with the number of convolutional layers: one layer is enough for the model to peak, and more pooling layers mostly lead to the loss of long-term dependencies. Therefore, in our model we remove the pooling layer from the convolutional network and replace it with a BRNN to reduce the loss of local information; one recurrent layer is enough to capture long-term dependencies.
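The training choices described above (SGD over shuffled mini-batches, cross-entropy loss, gradient clipping, and early stopping) can be expressed in Keras roughly as follows, assuming the `model` from the earlier architecture sketch and the padded IMDB arrays; the learning rate, clipping threshold, batch size, and validation fraction are assumptions rather than reported values.

```python
from tensorflow import keras

# Cross-entropy loss, SGD with gradient clipping (threshold is an assumption).
optimizer = keras.optimizers.SGD(learning_rate=0.01, clipnorm=5.0)
model.compile(optimizer=optimizer,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Early stopping on a held-out validation split to prevent overfitting.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=2,
                                           restore_best_weights=True)

model.fit(x_train, y_train,
          batch_size=64, epochs=10,
          validation_split=0.2,       # held-out validation fraction (an assumption)
          shuffle=True,               # shuffled mini-batches
          callbacks=[early_stop])
```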

Table 1. The performance of our method compared to other approaches on the SSTb dataset.

Models                  Fine-Grained   Binary
Naïve Bayes [27]        41.0%          81.8%
SVMs [27]               40.7%          79.4%
RNN [16]                43.2%          85.4%
CNN-non-static [11]     48.0%          87.2%
CNN-multichannel [16]   47.4%          88.1%
RNTN [16]               45.7%          85.4%
MV-RNN [16]             44.4%          82.9%
Proposed model          48.9%          89.6%

Table 2. The performance of our method compared to other approaches on the IMDB dataset.

Models                      Positive/Negative
NBSVM-uni [16]              88.2%
SVM-uni [16]                89.1%
SVM-bi [27]                 91.2%
Full+Unlabeled+BoW [27]     89.8%
BoW-bnc [27]                88.8%
Proposed model              93.4%

C. Discussion and Analysis
We performed several experiments to offer a fair comparison to recently presented deep learning and traditional methods; the results are shown in Tables 1 and 2. For the IMDB dataset the previous baselines are bag-of-words models; since the documents are long, one might expect it to be difficult for recurrent networks to learn them. We find, however, that with tuning it is possible to train our networks to fit the training set. Our proposed model achieves comparable performance with significantly fewer parameters. We achieve better results than convolution-only models, which likely lose more detailed local features because of their pooling layers. The proposed model uses recurrent layers as substitutes for pooling layers, which helps maintain detailed information and performs better. We believe the proposed model is more compact because of its small number of parameters and is less prone to overfitting; hence, it generalizes better when the training size is limited.

We observe in our experiments that model accuracy becomes worse as we increase the number of convolutional layers: performance increases with few layers, varying between two and four, and declines as we add more. More convolutional layers lead to a loss of detailed information, which affects the ability of the recurrent layer to capture long-term dependencies. It is possible to use more filters in the convolutional layers without changing the dimensions of the recurrent layer, which can increase performance without substantially increasing the number of parameters. We observed that many factors control the performance of deep learning methods, such as the size of the dataset, vanishing and exploding gradients, and the choice of feature extractors and classifiers, and these remain open research areas. However, there is no single model that fits all types of datasets.

VIII. CONCLUSION

The convolutional layer learns to extract higher-level features that are invariant to local translation, and the network can efficiently extract such features from an input sequence; however, it requires stacking multiple convolutional layers in order to capture long-term dependencies. Recurrent layers, in contrast, are expected to preserve ordering information even with a single layer. As a result of these observations, we proposed to combine convolutional and recurrent layers into a single model that efficiently captures long-term dependencies in a document for classification tasks, exploiting a bidirectional recurrent layer as a substitute for the pooling layer in the convolutional network in order to reduce the loss of detailed local information and capture long-term dependencies. We found that using a pre-trained unsupervised model as an extra feature gives the model the ability to obtain high-level feature representations and to capture long-term dependencies in a sequence of sentences. It would be interesting for future research to apply this architecture to other applications such as spam filtering or web search. Using other variants of recurrent neural networks as substitutes for pooling layers is also worth exploring.

REFERENCES

[1] Bengio, Y., et al., A neural probabilistic language model. Journal of Machine Learning Research, 2003. 3(Feb): p. 1137-1155.
[2] Joachims, T. Text categorization with support vector machines: Learning with many relevant features. In European Conference on Machine Learning. 1998. Springer.
[3] Mikolov, T., et al., Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[4] Zhang, X., J. Zhao, and Y. LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems. 2015.
[5] Tang, D., B. Qin, and T. Liu. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015.
[6] Krizhevsky, A., I. Sutskever, and G.E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 2012.
[7] Sainath, T.N., et al., Deep convolutional neural networks for large-scale speech tasks. Neural Networks, 2015. 64: p. 39-48.
[8] Xiao, Y. and K. Cho, Efficient character-level document classification by combining convolution and recurrent layers. arXiv preprint arXiv:1602.00367, 2016.
[9] Hochreiter, S. and J. Schmidhuber, Long short-term memory. Neural Computation, 1997. 9(8): p. 1735-1780.
[10] Mikolov, T., et al. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 2013.
[11] Kim, Y., Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746-1751, Doha, Qatar, October 2014. Association for Computational Linguistics.
[12] Shen, Y., et al. Learning semantic representations using convolutional neural networks for web search. In Proceedings of the 23rd International Conference on World Wide Web. 2014. ACM.
[13] Johnson, R. and T. Zhang, Effective use of word order for text categorization with convolutional neural networks. arXiv preprint arXiv:1412.1058, 2014.
[14] Kalchbrenner, N., E. Grefenstette, and P. Blunsom, A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188, 2014.
[15] Socher, R., et al. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11). 2011.
[16] Socher, R., et al. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 2013.
[17] Hochreiter, S., et al., Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.
[18] Schuster, M. and K.K. Paliwal, Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 1997. 45(11): p. 2673-2681.
[19] Collobert, R., et al., Natural language processing (almost) from scratch. Journal of Machine Learning Research, 2011. 12(Aug): p. 2493-2537.
[20] Elman, J.L., Finding structure in time. Cognitive Science, 1990. 14(2): p. 179-211.
[21] Bengio, Y., P. Simard, and P. Frasconi, Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 1994. 5(2): p. 157-166.
[22] Gers, F.A., J. Schmidhuber, and F. Cummins, Learning to forget: Continual prediction with LSTM. Neural Computation, 2000. 12(10): p. 2451-2471.
[23] Graves, A., Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
[24] Baldi, P., et al., Exploiting the past and the future in protein secondary structure prediction. Bioinformatics, 1999. 15(11): p. 937-946.
[25] Maas, A.L., et al. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1. 2011. Association for Computational Linguistics.
[26] Pascanu, R., T. Mikolov, and Y. Bengio, On the difficulty of training recurrent neural networks. ICML (3), 2013. 28: p. 1310-1318.
[27] Wang, S. and C.D. Manning. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2. 2012. Association for Computational Linguistics.
