A Token Classification Approach to Dependency Parsing

Ruy L. Milidiú¹, Carlos E. M. Crestana¹, Cícero Nogueira dos Santos¹,²

¹Departamento de Informática – PUC-Rio – Rio de Janeiro, RJ – Brazil

²Mestrado em Informática Aplicada – Universidade de Fortaleza – Fortaleza, CE – Brazil

{milidiu,ccrestana}@inf.puc-rio.br, [email protected]

Abstract. The dependency-based syntactic parsing task consists of identifying a head word for each word in an input sentence. Hence, its output is a rooted tree where the nodes are the words in the sentence. State-of-the-art dependency parsing systems use transition-based or graph-based models. We present a token classification approach to dependency parsing, in which any classification algorithm can be used. To evaluate its effectiveness, we apply the Entropy Guided Transformation Learning algorithm to the CoNLL 2006 corpus, using UAS as the accuracy metric. Our results show that the generated models are close to the average performance of the CoNLL systems, and they indicate that the token classification approach is a promising one.

1. Introduction

Tesnière [Tesnière 1959] introduced the idea of a dependency tree, in which words hold direct head-dependent relations. These relations capture the syntactic structure of a sentence. In dependency-based syntactic parsing, the task is to derive a syntactic structure for an input sentence by identifying the syntactic head of each word in the sentence. In labeled dependency parsing, we additionally require the parser to assign a specific type to each dependency relation holding between a head word and a dependent word [Nivre et al. 2007]. Figure 1 [Nivre 2005] shows an example of a dependency structure for an English sentence. In this case, the arcs go from each token to its children and are labeled with the corresponding dependency relation types.

Figure 1. An example of a dependency graph

The dependency tree is regarded as an important source of information in Natural Language Processing, particularly due to the improvement it brings to the Semantic Role Labeling task [Hacioglu 2004]. Hence, dependency parsing was chosen as the CoNLL shared task in 2006 and 2007, and as part of the shared task in 2008 and 2009.

There are two main Machine Learning paradigms for dependency tree modeling: transition-based parsers and graph-based parsers. Transition-based parsers build dependency trees by performing a sequence of actions, or transitions. These transitions represent either an iteration step over the sequence of tokens or the creation of a dependency relation between two tokens. In this case, the trained model has to correctly predict the next parsing transition for a given sentence. There are several powerful systems that use transition-based models [Nivre et al. 2006, Attardi et al. 2007, Duan et al. 2007, Hall et al. 2007a, Johansson and Nugues 2007, Titov and Henderson 2007]. Instead of treating dependency relations locally, graph-based parsers learn models that treat the sentence as a whole. Given a sentence, they assign a score to every possible dependency relation. Next, the parser uses this information to find the best dependency tree for the sentence. The learning step is then used to find good scoring functions and estimate their parameters. Examples of graph-based systems are presented by McDonald et al. [McDonald et al. 2005], Shimizu and Nakagawa [Shimizu and Nakagawa 2007], Hall et al. [Hall et al. 2007b] and Carreras [Carreras 2007]. These two approaches also lead to different types of errors, as shown in [McDonald and Nivre 2007].

We present a token classification approach to dependency parsing, in which the trained model assigns to each token a class that uniquely identifies its head. Using this approach, we apply the Entropy Guided Transformation Learning (ETL) algorithm to three languages: Danish, Dutch and Portuguese. As far as we know, this is the first study that effectively solves the dependency parsing task using a token classification approach. Our initial results show that this approach is a promising one.

The remainder of this work is organized as follows. In Section 2, we comment on projectivity and on metrics for the dependency parsing task. In Section 3, we present our token classification approach. In Section 4, we describe the ETL algorithm and its parameters. In Section 5, we report our experiments and results. Finally, in Section 6, we present some conclusions as well as future work.

2. Dependency Parsing

In this section, we describe the issue of projectivity and non-projectivity in dependency trees and how it is currently dealt with. We also present some of the metrics used to evaluate dependency parsers.

2.1. Projectivity

One of the advantages of a dependency grammar is that it can deal with variable word order languages in a more adequate way [Mel'cuk 1988]. Nevertheless, this is only true when the projectivity constraint is not assumed. According to Nivre and Nilsson [Nivre and Nilsson 2005], "an arc (i, j) is projective iff all nodes occurring between i and j are dominated by i (where dominates is the transitive closure of the arc relation)". Therefore, to better deal with variable word order languages, a dependency grammar framework must allow non-projective arcs. The inclusion of non-projective structures makes the parsing problem more complex and compromises efficiency, accuracy and robustness [Nivre and Nilsson 2005], which leads most transition-based parsers to build only projective dependency graphs [Nivre et al. 2007]. An alternative [Nivre and Nilsson 2005] is pseudo-projective dependency parsing. This method consists of projectivizing the training data and encoding this information in modified arc labels. This way, the parser also learns how to assign these modified labels to the arcs. After the extraction of the dependency graph, one can convert the arcs with non-standard labels back to non-projective ones. In Section 3, we describe our model, which works with both projective and non-projective arcs.

2.2. Evaluation Metrics

The three most common metrics used to evaluate dependency parsers are the labeled attachment score (LAS), the unlabeled attachment score (UAS) and the label accuracy (LA). The LAS is the percentage of tokens for which the system correctly predicts both the head and the relation type that the token holds with its head. The UAS is the percentage of tokens for which the system correctly predicts the head, while the LA is the percentage of tokens with the correct relation type. For both the CoNLL 2006 [Buchholz and Marsi 2006] and the CoNLL 2007 [Nivre et al. 2007] shared tasks, the LAS was used as the main evaluation metric, but the systems' results were reported for all three metrics. In this work, we use the UAS as our metric, since our concern is the prediction of correct token heads and how to model this as a token classification problem.
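For concreteness, the following sketch shows how the three metrics can be computed from parallel gold and predicted analyses. It is an illustration only, not the official CoNLL evaluation script, and the token representation (one (head, relation) pair per token) is our own assumption.

def attachment_scores(gold, predicted):
    """Compute LAS, UAS and LA for one sentence.

    `gold` and `predicted` are parallel lists of (head_index, relation)
    pairs, one per token. Illustrative only: the official CoNLL scripts
    additionally handle punctuation and corpus-level aggregation.
    """
    total = len(gold)
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    label_ok = sum(g[1] == p[1] for g, p in zip(gold, predicted))
    both_ok = sum(g == p for g, p in zip(gold, predicted))
    return {"UAS": head_ok / total,    # correct head
            "LA": label_ok / total,    # correct relation type
            "LAS": both_ok / total}    # head and relation both correct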

3. The Token Classification Approach

In a token classification task, a model has to predict the class that each token belongs to. Hence, in the case of dependency parsing, an algorithm has to predict, for each token, a class that identifies its head. One way of identifying the token's head is by its position in the sentence. Figure 2 shows an example extracted from the Portuguese corpus, with three of the available columns. The ID column is simply an identifier of the position of the token in the sentence, whereas the Head column identifies the head of each token by its ID, with 0 used when the token is the root of the sentence. Therefore, one way of applying token classification is to treat each position as a class, e.g., 'class 1' if the head is the first token in the sentence, 'class 2' if the head is the second token, and so on. However, this approach leads to poor generalization, since it is heavily dependent on word order. The addition of a single word to a sentence would change the classification of several tokens.

In our model, instead of using an absolute position tag, like the token ID, we use a relative one. To identify the head of a token, we use three pieces of information: the part-of-speech of the head; how many tokens with the same part-of-speech as the head occur between the token and its head; and whether the head is to the left or to the right of the token. The combinations of these three pieces of information define the 'relative head class' tagset that can be attributed to a token. For instance, if the head of a token is the first noun to its left, its tag is '1 noun left'. In Figure 2, the Relative Head column shows the relative head class of each token.

Figure 2. Portuguese corpus example
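A minimal sketch of this encoding is given below. The token representation (dicts with 'pos' and 'head' fields, 0-based indices, None for the root) and the underscore-separated tag format are illustrative assumptions; the paper does not prescribe a concrete implementation.

def relative_head_tag(tokens, index):
    """Encode the head of tokens[index] as a relative head class tag.

    `tokens` is a list of dicts with at least 'pos' (coarse POS) and
    'head' (0-based index of the head, or None for the root). Returns,
    e.g., '1_noun_left' when the head is the first noun to the left.
    """
    head = tokens[index]["head"]
    if head is None:
        return "root"
    head_pos = tokens[head]["pos"]
    direction = "left" if head < index else "right"
    lo, hi = (head, index) if head < index else (index + 1, head + 1)
    # Count tokens sharing the head's POS between the token and its head,
    # including the head itself: the head is the n-th such token.
    distance = sum(1 for t in tokens[lo:hi] if t["pos"] == head_pos)
    return f"{distance}_{head_pos.lower()}_{direction}"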

Therefore, our classification algorithm has to predict not the absolute position of a token's head, but this more general tag that identifies it. Whereas the first, naïve model presented above is very sensitive to any addition or removal of words in a sentence, our proposed model is more robust: a token will only have a different class if words with the same part-of-speech as its head are added or removed between them. Like the graph-based models, this token classification approach is projectivity insensitive. Hence, there is no need for extra modeling effort to deal with non-projective arcs. Moreover, it can be used to create a baseline system for the dependency parsing task by attributing to each token its most frequently observed tag in the training set, as sketched below.
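Such a baseline can be built in a few lines. Conditioning the most frequent tag on the coarse part-of-speech is an assumption, made consistent with the example in Section 5, where nouns receive the tag 'first noun to the right' and verbs receive 'root'.

from collections import Counter, defaultdict

def train_baseline(training_sentences):
    """For each coarse POS, find the relative head tag it most frequently
    receives in the training set. Assumes tokens are dicts with 'pos' and
    a gold 'relative_head_tag' field (see the encoding sketch above)."""
    counts = defaultdict(Counter)
    for sentence in training_sentences:
        for token in sentence:
            counts[token["pos"]][token["relative_head_tag"]] += 1
    return {pos: tags.most_common(1)[0][0] for pos, tags in counts.items()}

def baseline_tag(model, token):
    """Tag a token with its POS's most frequent tag; 'root' as fallback."""
    return model.get(token["pos"], "root")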

4. Entropy Guided Transformation Learning

To evaluate our model's effectiveness, we use Entropy Guided Transformation Learning [dos Santos and Milidiú 2009], a classification algorithm that generalizes Transformation Based Learning (TBL) by automatically generating rule templates. ETL employs an entropy guided template generation approach, which uses Information Gain (IG) in order to select the feature combinations that provide good template sets. ETL has been successfully applied to part-of-speech (POS) tagging [dos Santos and Milidiú 2009], phrase chunking [dos Santos et al. 2008] and named entity recognition [dos Santos and Milidiú 2009, Milidiú et al. 2008b], producing results at least as good as those of TBL with handcrafted templates. Several ETL-based language processors, for different languages, are freely available on the Web through the F-EXT-WS service (http://www.learn.inf.puc-rio.br/) [Fernandes et al. 2009].

The ETL algorithm is illustrated in Figure 3. Information Gain, which is based on the data entropy, is a key strategy for feature selection. The most popular Decision Tree (DT) learning algorithms [Quinlan 1993, Su and Zhang 2006] implement this strategy. Hence, they provide a quick way to obtain entropy guided feature selection. In the ETL strategy, we use DT induction algorithms to automatically generate template sets.

Figure 3. Entropy Guided Transformation Learning.

ETL uses a very simple DT decomposition scheme to extract templates. The decomposition process consists of a depth-first traversal of the DT. For each visited tree node, we create a template that combines the features in the path from the root to that node. A detailed description of ETL can be found in [Milidiú et al. 2008a, dos Santos and Milidiú 2009]. ETL training time is highly sensitive to the number and complexity of the templates it generates. However, ETL offers a technique called template evolution that greatly enhances its speed without compromising its descriptive power [Milidiú et al. 2008a].
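The decomposition can be sketched as a simple recursion over the tree, as below. The Node structure is an assumption made for illustration; it does not reproduce the authors' actual implementation.

from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Node:
    feature: Optional[str]                  # feature tested here; None at leaves
    children: List["Node"] = field(default_factory=list)

def extract_templates(node: Node, path: Tuple[str, ...] = ()) -> list:
    """Depth-first traversal: each visited internal node yields a template
    combining the features on the path from the root to that node."""
    if node is None or node.feature is None:  # leaf: no new template
        return []
    path = path + (node.feature,)
    templates = [path]
    for child in node.children:
        templates.extend(extract_templates(child, path))
    return templates

# Example: a tree testing POS at the root, then word form on one branch,
# yields the templates ('pos',) and ('pos', 'word').
tree = Node("pos", [Node("word"), Node(None)])
assert extract_templates(tree) == [("pos",), ("pos", "word")]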

5. Experiments

This section presents the experimental setup and results of our token classification approach to dependency parsing. We use the corpora from the CoNLL 2006 shared task that are publicly available [Buchholz and Marsi 2006]. Hence, to evaluate our model's effectiveness, we apply the ETL algorithm to three languages: Danish, Dutch and Portuguese. These corpora provide the following input features: word form, token position, lemma of the word, coarse-grained part-of-speech, fine-grained part-of-speech and a list of set-valued syntactic and morphological features. Using the input features, we create three derived token features: the total number of verbs before the token, the total number of verbs after the token and the lemma of the nearest verb before the token. These features increase the ETL classifier's accuracy (see the sketch below). Since the Swedish corpus provides only one input feature, the part-of-speech of each token, it is not used in our work.

To create our tagset, we use the coarse-grained part-of-speech, which gives better accuracy than the fine-grained part-of-speech. When analyzing the statistical distribution of our tagset, we noticed that, although there are tags whose distance information is greater than 2, they are statistically rare. For instance, in the case of the Portuguese corpus, they correspond to only 5% of the training corpus. The tags with distance greater than 4 correspond to only 1%.

The three provided corpora have a training and a test set already defined. We also create a development set to calibrate the ETL parameters and to verify the effectiveness of the derived features. To create the development set, we randomly select 10% of the sentences of each training set.
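As an illustration of the three derived verb features described above, the sketch below computes them for one sentence. The token fields and the 'verb' coarse POS value are assumptions about the corpus representation.

def add_verb_features(sentence):
    """Add three derived features to each token: the number of verbs before
    it, the number of verbs after it, and the lemma of the nearest verb to
    its left (None if there is none). Tokens are assumed to be dicts with
    'pos' and 'lemma' fields; 'verb' is an assumed coarse POS value."""
    verb_positions = [i for i, t in enumerate(sentence) if t["pos"] == "verb"]
    for i, token in enumerate(sentence):
        before = [v for v in verb_positions if v < i]
        token["verbs_before"] = len(before)
        token["verbs_after"] = len([v for v in verb_positions if v > i])
        token["prev_verb_lemma"] = sentence[before[-1]]["lemma"] if before else None
    return sentence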

Table 1. Number of Generated Templates.

                  Number of features
Language       2      3      4      5    Total
Portuguese    18     43    179    314      554
Danish        29     46    133    201      409
Dutch         39    131    158    157      485

The initial classifier assigns to each token its most frequent tag in the training set. For instance, in the case of the Portuguese corpus, each noun receives the tag 'first noun to the right' and each verb receives the tag 'root'. Our ETL models are trained with the following parameters: a window size of 7, a rule threshold of 4 and a template evolution window with rules using from 2 to 5 features. In Table 1, we show how many templates are automatically generated, broken down by language and by the number of features in the template. Observe that, for the Portuguese language, ETL generates a total of 554 templates. At the extraction step, after the algorithm classifies each token, we use the attributed tag to identify its head. When it is not possible to consistently identify the head, e.g., the token is tagged as having the third verb to its left as head but there are only two verbs to its left, the token is simply classified as root. This decoding is sketched below.
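The extraction step can be written as the inverse of the encoding in Section 3, with the root fallback just described. The tag format mirrors the earlier encoding sketch and is illustrative only.

def decode_head(tokens, index, tag):
    """Resolve a predicted relative head tag to a head position, or None
    for root. Falls back to root when the tag cannot be resolved, e.g.
    'third verb to the left' with only two verbs to the left."""
    if tag == "root":
        return None
    parts = tag.split("_")                  # e.g. '2_verb_left'
    distance, direction = int(parts[0]), parts[-1]
    head_pos = "_".join(parts[1:-1])
    step = -1 if direction == "left" else 1
    seen, i = 0, index + step
    while 0 <= i < len(tokens):
        if tokens[i]["pos"].lower() == head_pos:
            seen += 1
            if seen == distance:
                return i
        i += step
    return None                             # unresolvable: classify as root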

Table 2. Unlabeled Attachment Score for Test set.

System              Danish (%)   Dutch (%)   Portuguese (%)
State-of-the-art      90.58        83.57        91.36
ETL                   83.97        75.21        87.92
Average               84.52        75.07        86.46
Baseline              33.09        41.24        51.31

To evaluate our systems, we use the evaluation script of the CoNLL 2006 task and report the Unlabeled Attachment Score (UAS). Table 2 shows our experimental results. For each language, we present the results of the baseline system, the ETL algorithm, the average score of the 18 systems that took part in the CoNLL 2006 shared task, and the state-of-the-art at the time of the task. In two of the three languages, our system has an above-average performance. Moreover, in all three cases our results are within a 10% error margin of the state-of-the-art systems, which suggests that this is a promising approach.

When analyzing the most frequent errors of our models, we observe that the most misclassified tags are those where the head of the token is either a verb or a noun. Table 3 shows the percentage of errors corresponding to the tags where the head is the first verb, first noun, second verb or second noun, either to the left or to the right of the token. These results suggest a deeper investigation of noun and verb heads, which could lead to new derived features that may improve our models' accuracy.

Table 3. Most Common Errors

Head of the Token   Danish (%)   Dutch (%)   Portuguese (%)
First verb             24.8         28.5         24.0
First noun             12.1          9.9         14.5
Second verb             6.5         12.6         13.4
Second noun             5.4          4.0          8.8
Total                  48.7         55.0         60.7

6. Conclusions

The dependency-based syntactic parsing task consists of identifying a head word for each word in an input sentence. Hence, its output is a rooted tree where the nodes are the words in the sentence. We propose a token classification approach to the dependency parsing problem by creating a special tagset that helps to correctly find the head of a token. Using this tagging style, any classification algorithm can be trained to identify the syntactic head of each word in a sentence. In addition, this classification model treats projective and non-projective dependency graphs equally, avoiding pseudo-projective approaches. It can also be used to generate a baseline system for the dependency parsing task, by applying the most frequent token tag observed in the training set.

To evaluate our approach's effectiveness, we use the ETL algorithm with three languages of the CoNLL 2006 corpus: Danish, Dutch and Portuguese. As far as we know, this is the first study that effectively treats dependency parsing as a token classification problem. Our findings indicate that this is a promising modeling approach. Our experiments also suggest that better results can be achieved. We shall explore the use of other classification algorithms, new enhanced features for them, and a post-processing step that enforces the tree structural constraints.

References

Attardi, G., Dell'Orletta, F., Simi, M., Chanev, A., and Ciaramita, M. (2007). Multilingual dependency parsing and domain adaptation using DeSR. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 1112–1118, Prague, Czech Republic. Association for Computational Linguistics.

Buchholz, S. and Marsi, E. (2006). CoNLL-X shared task on multilingual dependency parsing. In Proceedings of CoNLL, pages 149–164.

Carreras, X. (2007). Experiments with a higher-order projective dependency parser. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 957–961, Prague, Czech Republic. Association for Computational Linguistics.

dos Santos, C. N. and Milidiú, R. L. (2009). Foundations of Computational Intelligence, Volume 1: Learning and Approximation, volume 201 of Studies in Computational Intelligence, chapter Entropy Guided Transformation Learning, pages 159–184. Springer.

dos Santos, C. N., Milidiú, R. L., and Rentería, R. P. (2008). Portuguese part-of-speech tagging using entropy guided transformation learning. In Proceedings of PROPOR 2008, pages 143–152, Aveiro, Portugal.

Duan, X., Zhao, J., and Xu, B. (2007). Probabilistic parsing action models for multilingual dependency parsing. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 940–946, Prague, Czech Republic. Association for Computational Linguistics.

Fernandes, E. R., Milidiú, R. L., and Santos, C. N. (2009). Portuguese language processing service. In 18th International World Wide Web Conference.

Hacioglu, K. (2004). Semantic role labeling using dependency trees. In COLING '04: Proceedings of the 20th International Conference on Computational Linguistics, page 1273, Morristown, NJ, USA. Association for Computational Linguistics.

Hall, J., Nilsson, J., Nivre, J., Eryigit, G., Megyesi, B., Nilsson, M., and Saers, M. (2007a). Single malt or blended? A study in multilingual parser optimization. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 933–939, Prague, Czech Republic. Association for Computational Linguistics.

Hall, K., Havelka, J., and Smith, D. A. (2007b). Log-linear models of non-projective trees, k-best MST parsing and tree-ranking. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 962–966, Prague, Czech Republic. Association for Computational Linguistics.

Johansson, R. and Nugues, P. (2007). Incremental dependency parsing using online learning. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 1134–1138, Prague, Czech Republic. Association for Computational Linguistics.

McDonald, R. and Nivre, J. (2007). Characterizing the errors of data-driven dependency parsing models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 122–131, Prague, Czech Republic. Association for Computational Linguistics.

McDonald, R., Pereira, F., Ribarov, K., and Hajic, J. (2005). Non-projective dependency parsing using spanning tree algorithms. In Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 523–530, Vancouver, British Columbia, Canada. Association for Computational Linguistics.

Mel'cuk, I. A. (1988). Dependency Syntax: Theory and Practice. State University of New York Press, New York.

Milidiú, R. L., dos Santos, C. N., and Duarte, J. C. (2008a). Phrase chunking using Entropy Guided Transformation Learning. In Proceedings of ACL 2008, Columbus, Ohio.

Milidiú, R. L., dos Santos, C. N., and Duarte, J. C. (2008b). Portuguese corpus-based learning using ETL. Journal of the Brazilian Computer Society, 14(4).

Nivre, J. (2005). Dependency grammar and dependency parsing. Technical report, Växjö University: School of Mathematics and Systems Engineering.

Nivre, J., Hall, J., Kübler, S., McDonald, R., Nilsson, J., Riedel, S., and Yuret, D. (2007). The CoNLL 2007 shared task on dependency parsing. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 915–932, Prague, Czech Republic. Association for Computational Linguistics.

Nivre, J., Hall, J., Nilsson, J., and Marinov, S. (2006). Labeled pseudo-projective dependency parsing with support vector machines. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL), pages 221–225.

Nivre, J. and Nilsson, J. (2005). Pseudo-projective dependency parsing. In ACL '05: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 99–106, Morristown, NJ, USA. Association for Computational Linguistics.

Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

Shimizu, N. and Nakagawa, H. (2007). Structural correspondence learning for dependency parsing. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 1166–1169, Prague, Czech Republic. Association for Computational Linguistics.

Su, J. and Zhang, H. (2006). A fast decision tree learning algorithm. In Proceedings of the Twenty-First AAAI Conference on Artificial Intelligence.

Tesnière, L. (1959). Éléments de Syntaxe Structurale. Klincksieck, Paris.

Titov, I. and Henderson, J. (2007). Fast and robust multilingual dependency parsing with a generative latent variable model. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 947–951, Prague, Czech Republic. Association for Computational Linguistics.