Dependency-based Convolutional Neural Networks for Sentence Embedding∗

Mingbo Ma†  Liang Huang†‡  Bing Xiang‡  Bowen Zhou‡

†Graduate Center & Queens College, City University of New York
‡T. J. Watson Research Center, IBM Watson Group

{mma2,lhuang}@gc.cuny.edu  {lhuang,bingxia,zhou}@us.ibm.com
Abstract

In sentence modeling and classification, convolutional neural network approaches have recently achieved state-of-the-art results, but all such efforts process word vectors sequentially and neglect long-distance dependencies. To combine deep learning with linguistic structures, we propose a dependency-based convolution approach, making use of tree-based n-grams rather than surface ones, thus utilizing nonlocal interactions between words. Our model improves sequential baselines on all four sentiment and question classification tasks, and achieves the highest published accuracy on TREC.
1 Introduction
Convolutional neural networks (CNNs), originally invented in computer vision (LeCun et al., 1995), have recently attracted much attention in natural language processing (NLP) on problems such as sequence labeling (Collobert et al., 2011), semantic parsing (Yih et al., 2014), and search query retrieval (Shen et al., 2014). In particular, recent work on CNN-based sentence modeling (Kalchbrenner et al., 2014; Kim, 2014) has achieved excellent, often state-of-the-art, results on various classification tasks such as sentiment, subjectivity, and question-type classification. However, despite their celebrated success, there remains a major limitation from the linguistics perspective: CNNs, being invented on pixel matrices in image processing, only consider sequential n-grams that are consecutive on the surface string and neglect long-distance dependencies, while the latter play an important role in many linguistic phenomena such as negation, subordination, and wh-extraction, all of which might duly affect the sentiment, subjectivity, or other categorization of the sentence.

∗ This work was done at both IBM and CUNY, and was supported in part by DARPA FA8750-13-2-0041 (DEFT) and NSF IIS-1449278. We thank Yoon Kim for sharing his code, and James Cross and Kai Zhao for discussions.
Indeed, in the sentiment analysis literature, researchers have incorporated long-distance information from syntactic parse trees, but the results are somewhat inconsistent: some reported small improvements (Gamon, 2004; Matsumoto et al., 2005), while others found otherwise (Dave et al., 2003; Kudo and Matsumoto, 2004). As a result, syntactic features have yet to become popular in the sentiment analysis community. We suspect that one of the reasons is data sparsity (according to our experiments, tree n-grams are significantly sparser than surface n-grams), but this problem has largely been alleviated by recent advances in word embedding. Can we combine the advantages of both worlds?

We therefore propose very simple dependency-based convolutional neural networks (DCNNs). Our model is similar to Kim (2014), but while his sequential CNNs put a word in its sequential context, ours considers a word together with its parent, grandparent, great-grandparent, and siblings on the dependency tree. This way we incorporate long-distance information that is otherwise unavailable on the surface string. Experiments on three classification tasks demonstrate the superior performance of our DCNNs over the baseline sequential CNNs. In particular, our accuracy on the TREC dataset outperforms all previously published results in the literature, including those with heavy hand-engineered features. Independently of this work, Mou et al. (2015, unpublished) reported related efforts; see Sec. 3.3.
2 Dependency-based Convolution
The original CNN, first proposed by LeCun et al. (1995), applies convolution kernels over a series of continuous areas of a given image, and was adapted to NLP by Collobert et al. (2011). Following Kim (2014), one-dimensional convolution operates the convolution kernel in sequential order as in Equation 1, where x_i ∈ R^d represents the d-dimensional word representation of the i-th word in the sentence, and ⊕ is the concatenation operator. Thus x_{i,j} refers to the concatenated word vector from the i-th word to the (i+j)-th word:

    x_{i,j} = x_i ⊕ x_{i+1} ⊕ ··· ⊕ x_{i+j}    (1)

Sequential word concatenation x_{i,j} works like an n-gram model, feeding local information into the convolution operations. However, this setting cannot capture long-distance relationships unless we enlarge the window indefinitely, which would inevitably cause the data sparsity problem.
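To make the windowing concrete, the following is a minimal numpy sketch of the concatenation in Eq. 1; the function name, sentence length, and dimensionality are illustrative choices, not taken from the paper's implementation.

```python
# A minimal sketch of the window concatenation in Eq. 1; the sentence length
# and dimensionality below are illustrative, not the paper's actual setup.
import numpy as np

def seq_concat(X, i, j):
    """x_{i,j} = x_i (+) x_{i+1} (+) ... (+) x_{i+j}: a (j+1)-word window."""
    return np.concatenate([X[t] for t in range(i, i + j + 1)])

X = np.random.randn(10, 300)      # a 10-word sentence with d = 300
x_2_4 = seq_concat(X, 2, 2)       # words 2..4 concatenated -> shape (900,)
```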
Figure 1: Dependency tree of an example sentence from the Movie Reviews dataset: "Despite the film 's shortcomings the stories are quietly moving ." (with ROOT at the top).
In order to capture long-distance dependencies, we propose the dependency-based convolution model (DCNN). Figure 1 illustrates an example from the Movie Reviews (MR) dataset (Pang and Lee, 2005). The sentiment of this sentence is obviously positive, but this is quite difficult for sequential CNNs because many n-gram windows would include the highly negative word "shortcomings", and the distance between "Despite" and "shortcomings" is quite long. DCNN, however, can capture the tree-based bigram "Despite – shortcomings", thus flipping the sentiment, and the tree-based trigram "ROOT – moving – stories", which is highly positive.

2.1 Convolution on Ancestor Paths

We define our concatenation based on the dependency tree for a given modifier x_i:

    x_{i,k} = x_i ⊕ x_{p(i)} ⊕ ··· ⊕ x_{p^{k−1}(i)}    (2)

where the function p^k(i) returns the index of the i-th word's k-th ancestor, which is recursively defined as:

    p^k(i) = p(p^{k−1}(i)) if k > 0;   p^k(i) = i if k = 0.    (3)

Figure 2 (left) illustrates ancestor-path patterns of various orders. We always start the convolution with x_i and concatenate with its ancestors. If the root node is reached, we add "ROOT" as dummy ancestors (vertical padding).

For a given tree-based concatenated word sequence x_{i,k}, the convolution operation applies a filter w ∈ R^{k×d} to x_{i,k} with a bias term b, as described in Equation 4:

    c_i = f(w · x_{i,k} + b)    (4)

where f is a non-linear activation function such as a rectified linear unit (ReLU) or sigmoid. The filter w is applied to each word in the sentence, generating the feature map c ∈ R^l:

    c = [c_1, c_2, ···, c_l]    (5)

where l is the length of the sentence.
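The following is a minimal sketch of Eqs. 2–4, assuming the parse is given as a head array (head[i] is the index of word i's parent, with -1 marking the word attached to ROOT); the helper name ancestor_concat, the zero ROOT padding vector, and the toy tree are illustrative assumptions, not the paper's released code.

```python
# A minimal sketch of Eqs. 2-4, assuming the parse is given as a "head" array
# (head[i] = index of word i's parent; -1 marks the word attached to ROOT).
# The zero ROOT padding vector and the toy tree below are illustrative.
import numpy as np

def ancestor_concat(X, head, i, k, root_vec):
    """x_{i,k} (Eq. 2): word i concatenated with its k-1 nearest ancestors,
    padding with a dummy ROOT vector once the root has been passed."""
    vecs, node = [], i
    for _ in range(k):
        if node is None:                  # past the root: vertical padding
            vecs.append(root_vec)
        else:
            vecs.append(X[node])
            node = head[node] if head[node] >= 0 else None   # p(node), Eq. 3
    return np.concatenate(vecs)

d, k = 300, 3
X = np.random.randn(6, d)                  # embeddings for a 6-word sentence
head = [2, 2, -1, 4, 2, 4]                 # a toy dependency tree; word 2 is the root
w, b = 0.01 * np.random.randn(k * d), 0.0  # one filter in R^{k x d}, flattened
c_0 = max(w @ ancestor_concat(X, head, 0, k, np.zeros(d)) + b, 0.0)  # Eq. 4, ReLU
```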
2.2 Max-Over-Tree Pooling and Dropout

The filters convolved with different word concatenations in Eq. 4 can be regarded as pattern detectors: only the pattern most similar to the filter returns the maximum activation. In sequential CNNs, max-over-time pooling (Collobert et al., 2011; Kim, 2014) operates over the feature map to extract the maximum activation ĉ = max c as the representation of the entire feature map. Our DCNNs likewise pool the maximum activation from the feature map, detecting the strongest activation over the whole tree (i.e., over the whole sentence). Since the tree no longer defines a sequential "time" direction, we refer to our pooling as "max-over-tree" pooling.

In order to capture enough variation, we randomly initialize the set of filters to detect different structural patterns. Each filter's height is the number of words considered, and its width is always equal to the dimensionality d of the word representation. Each filter is represented by only one feature after max-over-tree pooling. After a series of convolutions with filters of different heights, multiple features carrying different structural information become the final representation of the input sentence. This sentence representation is then passed to a fully connected soft-max layer, which outputs a distribution over the labels.

Neural networks often suffer from overtraining. Following Kim (2014), we employ random dropout on the penultimate layer (Hinton et al., 2014) in order to prevent co-adaptation of hidden units. In our experiments, we set the dropout rate to 0.5 and the learning rate to 0.95 by default. Also following Kim (2014), training is done through stochastic gradient descent over shuffled mini-batches with the Adadelta update rule (Zeiler, 2012).
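Continuing the sketch above (and reusing X, head, and ancestor_concat from it), the following illustrates max-over-tree pooling: each filter is reduced to a single feature, the maximum of its activations over all words of the tree. The filter bank here is illustrative.

```python
# A sketch of max-over-tree pooling (Sec. 2.2), reusing X, head, and
# ancestor_concat from the previous sketch; the filter bank is illustrative.
import numpy as np

def max_over_tree(X, head, filters, biases, k, root_vec):
    """For each filter w_j: c_hat_j = max_i relu(w_j . x_{i,k} + b_j)."""
    feats = []
    for w, b in zip(filters, biases):
        c = [max(w @ ancestor_concat(X, head, i, k, root_vec) + b, 0.0)
             for i in range(len(X))]       # the feature map c over the sentence (Eq. 5)
        feats.append(max(c))               # one pooled feature per filter
    return np.array(feats)

n_filters = 100                            # 100 filters per template (Sec. 2.4)
filters = [0.01 * np.random.randn(3 * 300) for _ in range(n_filters)]
pooled = max_over_tree(X, head, filters, [0.0] * n_filters, k=3, root_vec=np.zeros(300))
```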
Figure 2: Convolution patterns on trees (ancestor paths on the left, siblings on the right). Word concatenation always starts with m, while h, g, and g² denote the parent, grandparent, and great-grandparent, etc.; words excluded from the convolution are marked in the figure.

2.3 Convolution on Siblings

Ancestor paths alone are not enough to capture many linguistic phenomena such as conjunction. Inspired by higher-order dependency parsing (McDonald and Pereira, 2006; Koo and Collins, 2010), we also incorporate the siblings of a given word in various ways. See Figure 2 (right) for details.

2.4 Combined Model

Powerful as it is, structural information still does not fully cover sequential information. Also, parsing errors (which are common especially for informal text such as online reviews) directly affect DCNN performance, while sequential n-grams are always correctly observed. To best exploit both kinds of information, we want to combine both models. The easiest way to combine them is to concatenate the representations together and feed the result into a fully connected soft-max layer; combining features from different types of sources also stabilizes performance. The final sentence representation is thus

    ĉ = [ĉ_a^(1), ..., ĉ_a^(N_a); ĉ_s^(1), ..., ĉ_s^(N_s); ĉ^(1), ..., ĉ^(N)]    (6)

where the first block collects the ancestor features, the second the sibling features, and the third the sequential features, and N_a, N_s, and N are the numbers of ancestor, sibling, and sequential filters, respectively. In practice, we use 100 filters for each template in Figure 2. The fully combined representation is 1,100-dimensional, compared to 300-dimensional for the sequential CNN.
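The combination in Eq. 6 amounts to one long concatenation of pooled features. The sketch below reproduces the 1,100 dimensions mentioned above; only the totals (100 filters per template, three sequential window sizes) follow from the text, while the 5 ancestor / 3 sibling template split is an assumption made purely for illustration.

```python
# A sketch of the combined representation in Eq. 6. Only the totals follow
# from the text (1,100 dims at 100 filters per template, 3 sequential
# templates); the 5 ancestor / 3 sibling split is an assumed illustration.
import numpy as np

n_filters = 100
anc = np.random.randn(5 * n_filters)       # assumed: 5 ancestor templates
sib = np.random.randn(3 * n_filters)       # assumed: 3 sibling templates
seq = np.random.randn(3 * n_filters)       # window sizes 3-5, as in the baseline CNN
c_hat = np.concatenate([anc, sib, seq])    # Eq. 6: 1,100-dim sentence vector
# c_hat is then fed to the fully connected soft-max layer over the labels.
```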
3 Experiments

We implement our DCNN on top of the open-source CNN code by Kim (2014).[1] Table 1 summarizes our results in the context of other high-performing efforts in the literature. We use three benchmark datasets in two categories: sentiment analysis on both the Movie Review (MR) (Pang and Lee, 2005) and Stanford Sentiment Treebank (SST-1) (Socher et al., 2013) datasets, and question classification on TREC (Li and Roth, 2002).

For all datasets, we first obtain the dependency parse tree from the Stanford parser (Manning et al., 2014).[2] The window sizes for the different choices of convolution are shown in Figure 2. For the dataset without a development set (MR), we randomly choose 10% of the training data to indicate early stopping. In order to have a fair comparison with the baseline CNN, we also use 3 to 5 as our window sizes. Most of our results are generated by GPU due to its efficiency, though CPU could potentially produce better results.[3] Our implementation, built on top of Kim (2014)'s code, can be found on GitHub.[4]

3.1 Sentiment Analysis

Both sentiment analysis datasets (MR and SST-1) are based on movie reviews. The differences between them are mainly the number of categories and whether a standard split is given. There are 10,662 sentences in the MR dataset; each instance is labeled positive or negative, and in most cases contains one sentence. Since no standard data split is given, following the literature we use 10-fold cross validation so that every sentence appears in training and testing at least once. Concatenating with sibling and sequential information obviously improves DCNNs, and the final model outperforms the baseline sequential CNN by 0.4 and ties with Zhu et al. (2015).

Different from MR, the Stanford Sentiment Treebank (SST-1) annotates finer-grained labels (very positive, positive, neutral, negative, and very negative) on an extension of the MR dataset. There are 11,855 sentences with a standard split.[5] Our model achieves an accuracy of 49.5, which is second only to Irsoy and Cardie (2014).

[1] https://github.com/yoonkim/CNN_sentence
[2] The phrase-structure trees in SST-1 are actually automatically parsed and thus cannot be used as gold-standard trees.
[3] GPU only supports float32 while CPU supports float64.
[4] https://github.com/cosmmb/DCNN
[5] We set batch size to 100 for this task.
| Category | Model | MR | SST-1 | TREC | TREC-2 |
|---|---|---|---|---|---|
| This work | DCNNs: ancestor | 80.4† | 47.7† | 95.4† | 88.4† |
| | DCNNs: ancestor+sibling | 81.7† | 48.3† | 95.6† | 89.0† |
| | DCNNs: ancestor+sibling+sequential | 81.9 | 49.5† | 95.4† | 88.8† |
| CNNs | CNNs-non-static (Kim, 2014) – baseline | 81.5 | 48.0 | 93.6 | 86.4∗ |
| | CNNs-multichannel (Kim, 2014) | 81.1 | 47.4 | 92.2 | 86.0∗ |
| | Deep CNNs (Kalchbrenner et al., 2014) | – | 48.5 | 93.0 | – |
| Recursive NNs | Recursive Autoencoder (Socher et al., 2011) | 77.7 | 43.2 | – | – |
| | Recursive Neural Tensor (Socher et al., 2013) | – | 45.7 | – | – |
| | Deep Recursive NNs (Irsoy and Cardie, 2014) | – | 49.8 | – | – |
| Recurrent NNs | LSTM on tree (Zhu et al., 2015) | 81.9 | 48.0 | – | – |
| Other deep learning | Paragraph-Vec (Le and Mikolov, 2014) | – | 48.7 | – | – |
| Hand-coded rules | SVMs (Silva et al., 2011) | – | – | 95.0 | 90.8 |

Table 1: Results on the Movie Review (MR), Stanford Sentiment Treebank (SST-1), and TREC datasets. TREC-2 is TREC with fine-grained labels. † Results generated by GPU (all others generated by CPU). ∗ Results generated from Kim (2014)'s implementation.

3.2 Question Classification

In the TREC dataset, the entire set of 5,952 sentences is classified into the following 6 categories: abbreviation, entity, description, human, location, and numeric.[6] In this experiment, DCNNs easily outperform all other methods even with ancestor convolution only, and DCNNs with siblings achieve the best performance in the published literature. DCNNs combining sibling and sequential information might suffer from overfitting on the training data, based on our observation. One thing to note here is that our best result even exceeds the SVMs of Silva et al. (2011) with 60 hand-coded rules.

The TREC dataset also provides subcategories such as numeric:temperature, numeric:distance, and entity:vehicle. To make our task more realistic and challenging, we also test the proposed model with respect to the 50 subcategories. The last column of Table 1 shows obvious improvements over sequential CNNs. Like ours, Silva et al. (2011) is a tree-based system, but it uses constituency trees whereas ours uses dependency trees. They report a higher fine-grained accuracy of 90.8, but their parser is trained only on the QuestionBank (Judge et al., 2006), while we used the standard Stanford parser trained on both the Penn Treebank and the QuestionBank. Moreover, as mentioned above, their approach is rule-based while ours is automatically learned.

[6] We set batch size to 30 for this task.

3.3 Discussions and Examples

Compared with sentiment analysis, the advantage of our proposed model is obviously more substantial on the TREC dataset. Based on our error analysis, we conclude that this is mainly due to the difference in parse tree quality between the two tasks.
Figure 3: Examples from TREC (a–c), SST-1 (d), and TREC with fine-grained labels (e–f) that are misclassified by the baseline CNN but correctly labeled by our DCNN (shown as gold label ⇒ CNN output):
(a) "What is Hawaii 's state flower ?" — enty ⇒ loc
(b) "What is natural gas composed of ?" — enty ⇒ desc
(c) "What does a defibrillator do ?" — desc ⇒ enty
(d) "Nothing plot wise is worth emailing home about" — mild negative ⇒ mild positive
(e) "What is the temperature at the center of the earth ?" — NUM:temp ⇒ NUM:dist
(f) "What were Christopher Columbus ' three ships ?" — ENTY:veh ⇒ LOC:other
For example, (a) should be entity but is labeled location by CNN.
Figure 4: Examples from TREC that are misclassified by DCNN but correctly labeled by the baseline CNN (shown as gold label ⇒ DCNN output):
(a) "What is the speed hummingbirds fly ?" — num ⇒ enty (the parser tags "fly" as a noun)
(b) "What body of water are the Canary Islands in ?" — loc ⇒ enty
(c) "What position did Willie Davis play in baseball ?" — hum ⇒ enty
For example, (a) should be numerical but is labeled entity by DCNN.

Figure 5: Examples from TREC that are misclassified by both DCNN and the baseline CNN (shown as gold label ⇒ DCNN output and CNN output):
(a) "What is the melting point of copper ?" — num ⇒ enty and desc
(b) "What did Jesse Jackson organize ?" — hum ⇒ enty and enty
(c) "What is the electrical output in Madrid , Spain ?" — enty ⇒ num and num
For example, (a) should be numerical but is labeled entity by DCNN and description by CNN.

In sentiment analysis, the dataset is collected from the Rotten Tomatoes website, which includes many irregular usages of language; some of the sentences even come from languages other than English. The errors in parse trees inevitably affect classification accuracy. The parser works substantially better on the TREC dataset, however, since all questions are in formal written English, and the training set for the Stanford parser already includes the QuestionBank (Judge et al., 2006), which contains 2,000 TREC sentences.[7]

Figure 3 visualizes examples where the baseline CNN errs while DCNN does not. For example, CNN labels (a) as location due to "Hawaii" and "state", while the long-distance backbone "What – flower" is clearly asking for an entity. Similarly, in (d), DCNN captures the obviously negative tree-based trigram "Nothing – worth – emailing". Note that our model also works on non-projective dependency trees such as the one in (b). The last two examples in Figure 3 visualize cases where DCNN outperforms the baseline CNN on fine-grained TREC. In example (e), the word "temperature" is second from the top of the tree and is the root of the 8-word span "the ... earth". When we use a window of size 5 for tree convolution, every word in that span gets convolved with "temperature", which should be why DCNN gets this example correct.

Figure 4 showcases examples where the baseline CNN does better than DCNN. Example (a) is misclassified as entity by DCNN due to a parsing/tagging error (the Stanford parser performs its own part-of-speech tagging): the word "fly" at the end of the sentence should be a verb instead of a noun, and "hummingbirds fly" should be a relative clause modifying "speed".

Some sentences are misclassified by both the baseline CNN and DCNN; Figure 5 shows three such examples. Example (a) is not classified as numerical by either method due to the ambiguous meaning of the word "point", which is difficult to capture with word embeddings: the word can mean location, opinion, etc., and apparently its numerical aspect is not captured. Example (c) might be an annotation error.

From the mistakes made by DCNN, we find that its performance is mainly limited by two factors: the accuracy of the parser and the quality of the word embedding. Future work will focus on these two issues.

Shortly before submitting to ACL 2015, we learned that Mou et al. (2015, unpublished) had independently reported concurrent and related efforts.[8] Their constituency model, based on their unpublished work on programming languages (Mou et al., 2014), performs convolution on pretrained recursive node representations rather than on word embeddings, and thus bears little, if any, resemblance to our dependency-based model. Their dependency model is related to ours, but always includes a node and all its children (resembling Iyyer et al. (2014)), which makes it a variant of our sibling model and always flat. By contrast, our ancestor model looks at the vertical path from any word to its ancestors, and is linguistically motivated (Shen et al., 2008).
4 Conclusions and Future Work

We have presented a very simple dependency-based convolution framework which outperforms sequential CNN baselines on sentence modeling and classification tasks. Extensions of this model would consider dependency labels and constituency trees, and we would also evaluate on gold-standard parse trees.
[7] http://nlp.stanford.edu/software/parser-faq.shtml
[8] Both their 2014 and 2015 reports proposed (independently of each other and independently of our work) the term "tree-based convolution" (TBCNN).
References

R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12.

Kushal Dave, Steve Lawrence, and David M. Pennock. 2003. Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. In Proceedings of World Wide Web.

Michael Gamon. 2004. Sentiment classification on customer feedback data: noisy data, large feature vectors, and the role of linguistic analysis. In Proceedings of COLING.

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Improving neural networks by preventing co-adaptation of feature detectors. Journal of Machine Learning Research, 15.

Ozan Irsoy and Claire Cardie. 2014. Deep recursive neural networks for compositionality in language. In Advances in Neural Information Processing Systems, pages 2096–2104.

Mohit Iyyer, Jordan Boyd-Graber, Leonardo Claudino, Richard Socher, and Hal Daumé III. 2014. A neural network for factoid question answering over paragraphs. In Proceedings of EMNLP.

John Judge, Aoife Cahill, and Josef van Genabith. 2006. QuestionBank: Creating a corpus of parse-annotated questions. In Proceedings of COLING.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of ACL.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of EMNLP.

Terry Koo and Michael Collins. 2010. Efficient third-order dependency parsers. In Proceedings of ACL.

Taku Kudo and Yuji Matsumoto. 2004. A boosting algorithm for classification of semi-structured text. In Proceedings of EMNLP.

Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of ICML.

Y. LeCun, L. Jackel, L. Bottou, A. Brunot, C. Cortes, J. Denker, H. Drucker, I. Guyon, U. Müller, E. Säckinger, P. Simard, and V. Vapnik. 1995. Comparison of learning algorithms for handwritten digit recognition. In Int'l Conf. on Artificial Neural Nets.

Xin Li and Dan Roth. 2002. Learning question classifiers. In Proceedings of COLING.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of ACL: Demonstrations, pages 55–60.

Shotaro Matsumoto, Hiroya Takamura, and Manabu Okumura. 2005. Sentiment classification using word sub-sequences and dependency sub-trees. In Proceedings of PA-KDD.

Ryan McDonald and Fernando Pereira. 2006. Online learning of approximate dependency parsing algorithms. In Proceedings of EACL.

Lili Mou, Ge Li, Zhi Jin, Lu Zhang, and Tao Wang. 2014. TBCNN: A tree-based convolutional neural network for programming language processing. Unpublished manuscript: http://arxiv.org/abs/1409.5718.

Lili Mou, Hao Peng, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. 2015. Discriminative neural sentence modeling by tree-based convolution. Unpublished manuscript: http://arxiv.org/abs/1504.01106v5. Version 5 dated June 2, 2015; Version 1 ("Tree-based Convolution: A New Architecture for Sentence Modeling") dated Apr 5, 2015.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of ACL, pages 115–124.

Libin Shen, Lucas Champollion, and Aravind K. Joshi. 2008. LTAG-spinal and the treebank. Language Resources and Evaluation, 42(1):1–19.

Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. Learning semantic representations using convolutional neural networks for web search. In Proceedings of WWW.

J. Silva, L. Coheur, A. C. Mendes, and Andreas Wichert. 2011. From symbolic to sub-symbolic information in question classification. Artificial Intelligence Review, 35.

Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of EMNLP.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP.

Wen-tau Yih, Xiaodong He, and Christopher Meek. 2014. Semantic parsing for single-relation question answering. In Proceedings of ACL.

Matthew Zeiler. 2012. ADADELTA: An adaptive learning rate method. Unpublished manuscript: http://arxiv.org/abs/1212.5701.

Xiaodan Zhu, Parinaz Sobhani, and Hongyu Guo. 2015. Long short-term memory over tree structures. In Proceedings of ICML.