Dependency-based Convolutional Neural Networks for Sentence Embedding∗

Mingbo Ma† Liang Huang†‡ Bing Xiang‡ Bowen Zhou‡
†Graduate Center & Queens College, City University of New York
‡IBM Watson Group, T. J. Watson Research Center, IBM
{mma2,lhuang} {lhuang,bingxia,zhou}

Abstract

In sentence modeling and classification, convolutional neural network approaches have recently achieved state-of-the-art results, but all such efforts process word vectors sequentially and neglect long-distance dependencies. To combine deep learning with linguistic structures, we propose a dependency-based convolution approach, making use of tree-based n-grams rather than surface ones, thus utilizing nonlocal interactions between words. Our model improves sequential baselines on all four sentiment and question classification tasks, and achieves the highest published accuracy on TREC.

1 Introduction

Convolutional neural networks (CNNs), originally invented in computer vision (LeCun et al., 1995), have recently attracted much attention in natural language processing (NLP) on problems such as sequence labeling (Collobert et al., 2011), semantic parsing (Yih et al., 2014), and search query retrieval (Shen et al., 2014). In particular, recent work on CNN-based sentence modeling (Kalchbrenner et al., 2014; Kim, 2014) has achieved excellent, often state-of-the-art, results on various classification tasks such as sentiment, subjectivity, and question-type classification. However, despite their celebrated success, there remains a major limitation from the linguistics perspective: CNNs, being invented on pixel matrices in image processing, only consider sequential n-grams that are consecutive on the surface string and neglect long-distance dependencies, while the latter play an important role in many linguistic phenomena such as negation, subordination, and wh-extraction, all of which can affect the sentiment, subjectivity, or other categorization of the sentence.

Indeed, in the sentiment analysis literature, researchers have incorporated long-distance information from syntactic parse trees, but the results are somewhat inconsistent: some reported small improvements (Gamon, 2004; Matsumoto et al., 2005), while others reported otherwise (Dave et al., 2003; Kudo and Matsumoto, 2004). As a result, syntactic features have yet to become popular in the sentiment analysis community. We suspect one of the reasons for this is data sparsity (according to our experiments, tree n-grams are significantly sparser than surface n-grams), but this problem has largely been alleviated by the recent advances in word embedding.

Can we combine the advantages of both worlds? We propose very simple dependency-based convolutional neural networks (DCNNs). Our model is similar to Kim (2014), but while his sequential CNNs put a word in its sequential context, ours considers a word and its parent, grandparent, great-grandparent, and siblings on the dependency tree. This way we incorporate long-distance information that is otherwise unavailable on the surface string. Experiments on three classification tasks demonstrate the superior performance of our DCNNs over the baseline sequential CNNs. In particular, our accuracy on the TREC dataset outperforms all previously published results in the literature, including those with heavy hand-engineered features. Independently of this work, Mou et al. (2015, unpublished) reported related efforts; see Sec. 3.3.


2 Dependency-based Convolution

∗ This work was done at both IBM and CUNY, and was supported in part by DARPA FA8750-13-2-0041 (DEFT), and NSF IIS-1449278.

Figure 1: Dependency tree of an example sentence from the Movie Reviews dataset.

The original CNN, first proposed by LeCun et al. (1995), applies convolution kernels on a series of continuous areas of given images, and was adapted to NLP by Collobert et al. (2011). Following Kim (2014), one-dimensional convolution operates the convolution kernel in sequential order in Equation 1, where $x_i \in \mathbb{R}^d$ represents the d-dimensional word representation for the i-th word in the sentence, and $\oplus$ is the concatenation operator. Therefore $\tilde{x}_{i,j}$ refers to the concatenated word vector from the i-th word to the (i + j)-th word:

$$\tilde{x}_{i,j} = x_i \oplus x_{i+1} \oplus \cdots \oplus x_{i+j} \qquad (1)$$
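The sequential concatenation of Equation 1 can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `concat_window`, the 0-based indexing, and the toy two-dimensional vectors are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def concat_window(words: List[Vector], i: int, j: int) -> Vector:
    """Concatenate the word vectors x_i, ..., x_{i+j} (0-based i) into one flat vector."""
    if i < 0 or j < 0 or i + j >= len(words):
        raise IndexError("window falls outside the sentence")
    out: Vector = []
    for x in words[i:i + j + 1]:
        out.extend(x)
    return out


# Toy sentence of four words with d = 2 (made-up values, not real embeddings).
sentence = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
trigram = concat_window(sentence, 1, 2)  # x_1 ⊕ x_2 ⊕ x_3, a vector of length 3 * d
```

A window covering j + 1 words always yields a vector of length (j + 1) · d, which is the input size a sequential filter of that height expects.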



Sequential word concatenation $\tilde{x}_{i,j}$ works as n-gram models which feed local information into convolution operations. However, this setting cannot capture long-distance relationships unless we enlarge the window indefinitely, which would inevitably cause the data sparsity problem.

In order to capture the long-distance dependencies we propose the dependency-based convolution model (DCNN). Figure 1 illustrates an example from the Movie Reviews (MR) dataset (Pang and Lee, 2005). The sentiment of this sentence is obviously positive, but this is quite difficult for sequential CNNs because many n-gram windows would include the highly negative word “shortcomings”, and the distance between “Despite” and “shortcomings” is quite long. DCNN, however, could capture the tree-based bigram “Despite – shortcomings”, thus flipping the sentiment, and the tree-based trigram “ROOT – moving – stories”, which is highly positive.

2.1 Convolution on Ancestor Paths

We define our concatenation based on the dependency tree for a given modifier $x_i$:

$$x_{i,k} = x_i \oplus x_{p(i)} \oplus \cdots \oplus x_{p^{k-1}(i)} \qquad (2)$$

where function $p^k(i)$ returns the i-th word’s k-th ancestor index, which is recursively defined as:

$$p^k(i) = \begin{cases} p(p^{k-1}(i)) & \text{if } k > 0 \\ i & \text{if } k = 0 \end{cases} \qquad (3)$$

Figure 2 (left) illustrates ancestor path patterns with various orders. We always start the convolution with $x_i$ and concatenate with its ancestors. If the root node is reached, we add “ROOT” as dummy ancestors (vertical padding).

For a given tree-based concatenated word sequence $x_{i,k}$, the convolution operation applies a filter $w \in \mathbb{R}^{k \times d}$ to $x_{i,k}$ with a bias term b described in equation 4:

$$c_i = f(w \cdot x_{i,k} + b) \qquad (4)$$

where f is a non-linear activation function such as rectified linear unit (ReLu) or sigmoid function. The filter w is applied to each word in the sentence, generating the feature map $c \in \mathbb{R}^l$:

$$c = [c_1, c_2, \cdots, c_l] \qquad (5)$$

where l is the length of the sentence.

2.2 Max-Over-Tree Pooling and Dropout

Convolving the filters with different word concatenations in Eq. 4 can be regarded as pattern detection: only the most similar pattern between the words and the filter could return the maximum activation. In sequential CNNs, max-over-time pooling (Collobert et al., 2011; Kim, 2014) operates over the feature map to get the maximum activation $\hat{c} = \max c$ representing the entire feature map. Our DCNNs also pool the maximum activation from the feature map to detect the strongest activation over the whole tree (i.e., over the whole sentence). Since the tree no longer defines a sequential “time” direction, we refer to our pooling as “max-over-tree” pooling.

In order to capture enough variations, we randomly initialize the set of filters to detect different structure patterns. Each filter’s height is the number of words considered and the width is always equal to the dimensionality d of word representation. Each filter will be represented by only one feature after max-over-tree pooling. After a series of convolutions with different filters of different heights, multiple features carrying different structural information become the final representation of the input sentence. Then, this sentence representation is passed to a fully connected soft-max layer, which outputs a distribution over different labels.

Neural networks often suffer from overtraining. Following Kim (2014), we employ random dropout on the penultimate layer (Hinton et al., 2012) in order to prevent co-adaptation of hidden units. In our experiments, we set our dropout rate as 0.5 and learning rate as 0.95 by default. Following Kim (2014), training is done through stochastic gradient descent over shuffled mini-batches with the Adadelta update rule (Zeiler, 2012).
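The ancestor-path concatenation (Eq. 2 and 3), the convolution of Eq. 4, and max-over-tree pooling can be put together in a small sketch. It rests on several assumptions that do not come from the paper: the dependency tree is given as a hypothetical `parents` list in which `None` marks a word whose head is ROOT, the ROOT padding uses a fixed vector `root_vec`, the filter is flattened to length k · d, ReLU is chosen for f, and all numbers are toy values.

```python
from typing import List, Optional

Vector = List[float]


def ancestor_path(parents: List[Optional[int]], i: int, k: int) -> List[Optional[int]]:
    """Return [i, p(i), ..., p^{k-1}(i)]; None stands for the dummy ROOT ancestor."""
    path: List[Optional[int]] = []
    node: Optional[int] = i
    for _ in range(k):
        path.append(node)
        # Once ROOT is reached, keep padding with ROOT (vertical padding).
        node = parents[node] if node is not None else None
    return path


def tree_convolution(words: List[Vector], parents: List[Optional[int]],
                     w: Vector, b: float, k: int, root_vec: Vector) -> List[float]:
    """Compute c_i = ReLU(w · x_{i,k} + b) for every word i in the sentence."""
    feature_map = []
    for i in range(len(words)):
        x_ik: Vector = []
        for node in ancestor_path(parents, i, k):
            x_ik.extend(root_vec if node is None else words[node])
        activation = sum(wj * xj for wj, xj in zip(w, x_ik)) + b
        feature_map.append(max(0.0, activation))
    return feature_map


def max_over_tree(feature_map: List[float]) -> float:
    """Keep only the strongest activation over the whole tree."""
    return max(feature_map)


# Toy tree: word 1 is attached to ROOT, words 0 and 2 are its children (d = 1).
parents = [1, None, 1]
words = [[1.0], [2.0], [-3.0]]
feature_map = tree_convolution(words, parents, w=[1.0, 1.0], b=0.0, k=2, root_vec=[0.0])
pooled = max_over_tree(feature_map)
```

Each filter contributes a single pooled value, so a bank of N filters yields an N-dimensional sentence representation regardless of sentence length.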

Figure 2: Convolution patterns on trees. Word concatenation always starts with m, while h, g, and g² denote parent, grandparent, and great-grandparent, etc., and “ ” denotes words excluded in convolution.

2.3 Convolution on Siblings

Ancestor paths alone are not enough to capture many linguistic phenomena such as conjunction. Inspired by higher-order dependency parsing (McDonald and Pereira, 2006; Koo and Collins, 2010), we also incorporate siblings for a given word in various ways. See Figure 2 (right) for details.

2.4 Combined Model

Powerful as it is, structural information still does not fully cover sequential information. Also, parsing errors (which are common especially for informal text such as online reviews) directly affect DCNN performance while sequential n-grams are always correctly observed. To best exploit both kinds of information, we want to combine both models. The easiest way of combination is to concatenate these representations together, then feed them into a fully connected soft-max neural network. In these cases, combining different features from different types of sources could stabilize the performance. The final sentence representation is thus:

$$\hat{c} = [\underbrace{\hat{c}_a^{(1)}, \ldots, \hat{c}_a^{(N_a)}}_{\text{ancestors}};\ \underbrace{\hat{c}_s^{(1)}, \ldots, \hat{c}_s^{(N_s)}}_{\text{siblings}};\ \underbrace{\hat{c}^{(1)}, \ldots, \hat{c}^{(N)}}_{\text{sequential}}]$$

where $N_a$, $N_s$, and $N$ are the number of ancestor, sibling, and sequential filters. In practice, we use 100 filters for each template in Figure 2. The fully combined representation is 1,100-dimensional by contrast to 300-dimensional for sequential CNN.
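A rough sketch of the combination step follows: the pooled ancestor, sibling, and sequential features are concatenated and passed through a fully connected soft-max layer. The names `combine` and `classify`, the zero weights, and the tiny feature vectors are assumptions for illustration only; real models would use learned weights and the filter counts given above.

```python
import math
from typing import List

Vector = List[float]


def combine(ancestor: Vector, sibling: Vector, sequential: Vector) -> Vector:
    """Concatenate the three groups of pooled features into one sentence vector."""
    return list(ancestor) + list(sibling) + list(sequential)


def softmax(scores: Vector) -> Vector:
    """Numerically stable soft-max over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(c_hat: Vector, weights: List[Vector], biases: Vector) -> Vector:
    """Fully connected layer followed by soft-max: one probability per label."""
    scores = [sum(w * c for w, c in zip(row, c_hat)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)


# Toy pooled features: 2 ancestor filters, 1 sibling filter, 2 sequential filters.
c_hat = combine([0.5, 1.0], [0.2], [0.0, 0.3])

# Untrained two-label classifier with all-zero parameters (hypothetical values).
weights = [[0.0] * len(c_hat), [0.0] * len(c_hat)]
probs = classify(c_hat, weights, biases=[0.0, 0.0])
```

With 100 filters per template, the same concatenation gives the 1,100-dimensional combined vector and the 300-dimensional sequential one mentioned in the text.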

3 Experiments

Table 1 summarizes results in the context of other high-performing efforts in the literature. We use three benchmark datasets in two categories: sentiment analysis on both Movie Review (MR) (Pang and Lee, 2005) and Stanford Sentiment Treebank (SST-1) (Socher et al., 2013) datasets, and question classification on TREC (Li and Roth, 2002).

For all datasets, we first obtain the dependency parse tree from Stanford parser (Manning et al., 2014).1 Different window sizes for different choices of convolution are shown in Figure 2. For the dataset without a development set (MR), we randomly choose 10% of the training data to indicate early stopping. In order to have a fair comparison with baseline CNN, we also use 3 to 5 as our window size. Most of our results are generated by GPU due to its efficiency, however CPU could potentially get better results.2 Our implementation, on top of Kim (2014)’s code,3 will be released.4

1 The phrase-structure trees in SST-1 are actually automatically parsed, and thus can not be used as gold-standard trees.
2 GPU only supports float32 while CPU supports float64.
3 https://github.com/yoonkim/CNN_sentence
4 We set batch size to 100 for this task.

3.1 Sentiment Analysis

Both sentiment analysis datasets (MR and SST-1) are based on movie reviews. The differences between them are mainly in the different numbers of categories and whether the standard split is given. There are 10,662 sentences in the MR dataset. Each instance is labeled positive or negative, and in most cases contains one sentence. Since no standard data split is given, following the literature we use 10-fold cross validation to include every sentence in training and testing at least once. Concatenating with sibling and sequential information obviously improves DCNNs, and the final model outperforms the baseline sequential CNNs by 0.4, and ties with Zhu et al. (2015).

Different from MR, the Stanford Sentiment Treebank (SST-1) annotates finer-grained labels, very positive, positive, neutral, negative and very negative, on an extension of the MR dataset. There are 11,855 sentences with standard split. Our model achieves an accuracy of 49.5 which is second only to Irsoy and Cardie (2014).
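The 10-fold cross-validation protocol used for MR can be sketched as below. This only illustrates the general technique; the function name `k_fold_splits`, the fixed seed, and the interleaved fold assignment are assumptions, not details reported in the paper.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_splits(n: int, folds: int = 10, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs that partition range(n) into `folds` test sets."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # shuffle once so folds are not ordered by position
    for f in range(folds):
        test = indices[f::folds]          # every `folds`-th shuffled item goes to fold f
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


# 10,662 is the number of MR sentences stated in the text.
splits = list(k_fold_splits(10662))
```

Every sentence lands in exactly one test fold and in the training set of the other nine, so it is used for both training and testing across the ten runs.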

3.2 Question Classification

In the TREC dataset, the entire dataset of 5,952 sentences is classified into the following 6 categories: abbreviation, entity, description, human, location and numeric. In this experiment, DCNNs easily outperform any other methods even with ancestor convolution only. DCNNs with siblings achieve the best performance in the published literature. DCNNs combined with sibling and sequential information might suffer from overfitting on the training data based on our observation. One thing to note here is that our best result even exceeds SVM$_S$ (Silva et al., 2011) with 60 hand-coded rules.

The TREC dataset also provides subcategories such as numeric:temperature, numeric:distance, and entity:vehicle. To make our task more realistic and challenging, we also test the proposed model with respect to the 50 subcategories. There are obvious improvements over sequential CNNs from the last column of Table 1. Like ours, Silva et al. (2011) is a tree-based system, but it uses constituency trees whereas ours uses dependency trees. They report a higher fine-grained accuracy of 90.8 but their parser is trained only on the QuestionBank (Judge et al., 2006) while we used the standard Stanford parser trained on both the Penn Treebank and QuestionBank. Moreover, as mentioned above, their approach is rule-based while ours is automatically learned.

[Figure: dependency trees of example sentences, each panel with its label pair]
(a) What is Hawaii ’s state flower ? (enty ⇒ loc)
(b) What is natural gas composed of ? (enty ⇒ desc)
(c) What does a defibrillator do ? (desc ⇒ enty)
(d) Nothing plot wise is worth emailing home about (mild negative ⇒ mild positive)
(e) What is the temperature at the center of the earth ? (NUM:temp ⇒ NUM:dist)
What were Christopher Columbus ’ three ships ?

3.3 Discussions and Examples

Compared with sentiment analysis, the advantage of our proposed model is obviously more substantial on the TREC dataset. Based on our error analysis, we conclude that this is mainly due to the

We have presented a very simple dependency-based convolution framework which outperforms sequential CNN baselines on modeling sentences.

6 Both their 2014 and 2015 reports proposed (independently of each other and independently of our work) the term “tree-based convolution” (TBCNN).



