Building an Efficient Functional Labeling System for Vietnamese

Nguyen Thanh Huy
Faculty of Information Technology, Hanoi Open University
[email protected]

Nguyen Kim Anh
Faculty of Information Technology, Hung Vuong University
[email protected]

Nguyen Phuong Thai
University of Engineering and Technology, Vietnam National University, Hanoi
[email protected]
Abstract

Functional tags represent semantic roles and abstract syntactic information of constituents, such as noun phrases and prepositional phrases, in a parse tree. Functional-tag labeling has been studied for languages such as English and Chinese. In this paper, we present a new system for functionally tagging Vietnamese sentences. We used a maximum entropy model with six tree-based features, and additionally made use of a new feature based on word clusters. Our experiments on the Vietnamese treebank showed that the system achieved a good performance, with a precision of 87.77%, and that the word cluster feature made a good contribution to that precision.
Keywords: natural language processing, functional tag, word cluster, maximum entropy model.
1. Introduction

In a syntactic tree there are two kinds of tags: constituent tags and functional tags. The target of the syntactic parsing task is to recognize the tree structure and the constituent labels, whereas predicting functional labels is the objective of the functional labeling task. Functional labels are the key to answering questions such as who, whom, where, and what, which provide the meaning of a sentence in its context. Figure 1 shows a sample parse tree from the Vietnamese treebank with functional labels. For English and Chinese, linguistic resources have been available for research purposes for a long time. The Vietnamese treebank1, however, has been developed only recently by Nguyen et al. [17], learning from the experiences and approaches of the Penn English and Chinese treebanks [15].
1 http://vlsp.vietlp.org:8080/demo/?page=resources
Figure 1. A parse tree from the Vietnamese treebank

Functional labeling is an important processing step for many natural language processing applications, such as question answering and information extraction. There have been several studies on this problem. Collins' work [7] can be considered a very simple form of functional labeling. Motivated by the idea that complements and adjuncts should be discriminated to improve parsing accuracy, his training phase kept complement and adjunct information instead of removing all functional tags as a preprocessing step. As a result, Collins' parser can produce constituent-structure syntactic trees with complement and adjunct functional tags. The functional tagging task was defined more precisely by Blaheta [2], who proposed a classification approach to functional labeling and focused on English. Following Blaheta's research, there have been various other studies, such as Merlo and Musillo [16], Blaheta and Charniak [3], Chrupala et al. [6], and Sun and Sui [19]. These studies extended the functional-labeling research topic by focusing on a new language such as Chinese, proposing a new approach such as sequential labeling, or investigating new features such as lexical features. In the next section we discuss these studies in more detail.

This paper presents our research on the Vietnamese functional labeling problem. To our knowledge, although there have been studies on Vietnamese syntactic parsing such as [13], our study on functional labeling is the first. We formulate the task as a classification problem and solve it with a maximum entropy model [1]. We analyze various aspects of our functional labeling system, including accuracy, the learning curve, and error analyses. We also investigate the effectiveness of the word cluster feature in alleviating the data sparseness problem. Our experimental results showed that our system achieved a good performance.

The rest of this paper is organized as follows: Section 2 presents related work. Section 3 describes our method, including the system architecture and features. Section 4 shows experimental results and discussions. Finally, Section 5 gives conclusions and future work.
2. Related Works

There have been three main approaches to functional tagging. The major differences between them are summarized in Table 1.

Approach               Input            Technique                                    Features
1st (Gabbard et al.)   word sequence    parsing (PCFG)                               tree-based
2nd (Blaheta)          syntactic tree   classification (decision tree, perceptron)   tree-based
3rd (Yuan et al.)      word sequence    sequential labeling (HM-SVM)                 word-based
Ours                   syntactic tree   classification (MEM)                         tree-based, word clusters

Table 1. Functional Labeling Approaches

Note that there is a strong relation between the functional labeling (FL) task and the semantic role labeling (SRL) task [5, 10]. SRL is more complex than FL. However, studying SRL requires a proposition bank. Since no such corpus is available for Vietnamese yet, we focus on the FL problem only.

2.1. Functional Labeling by Parsing

In this by-parsing approach [7, 9], functional labels are determined immediately during the parsing process. Traditionally, syntactic parsing is based on formal grammars such as context-free grammars, in which non-terminal symbols represent linguistic constituents such as noun phrases (NP tag). In parsing adapted to include functional labeling, non-terminal symbols represent linguistic constituents with richer information, such as temporal noun phrases (NP-TMP tag). The temporal noun phrase tag combines a constituent tag (NP) and a functional tag (TMP). The advantage of this approach is its simplicity. Gabbard et al. [9] claimed that by modifying fewer than ten lines of code, Bikel's parser, a state-of-the-art implementation of Collins' method, achieved near state-of-the-art performance in recovering functional tags. However, since parsing models are often designed based on probabilistic context-free grammars (PCFGs), it is not easy to extend these models with new features. This is a limitation if researchers intend to improve functional labeling accuracy using new features.

2.2. Functional Labeling by Classification

This approach carries out functional labeling after syntactic parsing. Functional labeling is considered a classification problem, or more specifically a tree-based classification problem. Syntactic trees, the output of parsing, are used as the input of functional labeling. Tree nodes, i.e. syntactic constituents, are labeled with functional tags independently. Following this approach, many machine learning techniques can be applied. This labeling approach was first used by Blaheta [2] for English. He employed two machine learning techniques, decision trees and perceptrons, with nine features:

- Label: the syntactic label of the constituent.
- cc-Label: whether the constituent contains a coordinating conjunction.
- Head word: a very basic feature in parsing studies such as [7].
- Head POS: the POS tag of the head word.
- Alternative head: if the constituent is a prepositional phrase, the head word of its noun phrase is the alternative head.
- Alternative POS tag: the POS tag of the alternative head.
- Functional tags: exploiting dependencies between functional tags.
- Label clusters: labels are manually grouped into clusters.
- Word clusters: an algorithm groups all words with a given POS tag into a relatively small number of clusters.

The advantage of the classification strategy is that new features, both local and non-local, can be incorporated easily. Because of this advantage, we decided to follow the classification approach in our research.
2.3. Sequential Functional Labeling

The third approach [22] formulates functional tagging as a sequential labeling problem, a formulation that has also been applied to other important natural language processing tasks such as named entity recognition and chunking. This approach does not require tree-based information: all features are word-based, including surrounding words and their part-of-speech (POS) tags. The sequence prediction can be learned with techniques such as Hidden Markov Models [18] or Conditional Random Fields [12]. Yuan et al. [22] chose the Hidden Markov Support Vector Machine (HM-SVM) technique for their learning model. Their tagger achieved a very high accuracy of 96.18%. The features proposed by Yuan et al. include:

- Words and POS tags: the context made by surrounding words can increase the accuracy of prediction. In their experiments, they started from a [-2, +2] context window and went up to [-4, +4] words.
- Bigrams of POS tags: capturing local dependencies between words.
- Verbs: functional tags like subject and object describe the relations between a verb and its arguments. Besides, each class of verbs is associated with a set of syntactic frames. In this sense, Yuan et al. relied on the surface verb for distinguishing syntactic roles.
- POS tags of verbs: reducing the data sparseness problem of the previous feature.
- Position indicators: whether a constituent occurs before or after the verb is highly correlated with its grammatical function. For example, in Chinese, subjects generally appear before a verb and objects after it. This feature was used to compensate for the lack of syntactic structure that could otherwise be read from a parse tree.
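As a rough illustration of these word-based features, the sketch below builds a feature dictionary from a [-2, +2] window with a POS bigram and a position indicator. The toy sentence and the helper function are our own; they are not taken from Yuan et al.'s implementation.

```python
# Sketch of word-based window features in the style of Yuan et al. [22]:
# surrounding words/POS tags in a [-2, +2] window, a POS bigram, and a
# position indicator relative to the main verb. Toy example, not the paper's code.
def window_features(words, tags, i, verb_index, k=2):
    feats = {}
    for d in range(-k, k + 1):
        j = i + d
        w, t = (words[j], tags[j]) if 0 <= j < len(words) else ("<PAD>", "<PAD>")
        feats[f"w[{d}]"] = w   # surrounding word
        feats[f"t[{d}]"] = t   # its POS tag
    # POS bigram capturing a local dependency, and the position indicator
    feats["t[-1]t[0]"] = feats["t[-1]"] + "+" + feats["t[0]"]
    feats["before_verb"] = i < verb_index
    return feats

words = ["tôi", "đọc", "sách"]   # "I read books"
tags = ["P", "V", "N"]
f = window_features(words, tags, 0, verb_index=1)
print(f["w[0]"], f["before_verb"])
```

Out-of-range positions are padded so every token yields a fixed-size feature set, which is what a sequential labeler expects.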
3. Our Method
Figure 2. System architecture
3.1. System Architecture

Figure 2 represents the architecture of our functional labeling system, covering both the training and testing phases. In the training phase, two resources are used: a Vietnamese treebank and an unlabeled corpus collected from online newspapers. The two main training steps are feature extraction and maximum entropy model (MEM) training. In addition, a preliminary step computes word clusters over the unlabeled corpus. In the testing phase, the input is a syntactic tree and the output is functional labels over the input tree. The two main testing steps are feature extraction and MEM classification. As mentioned in previous sections, many machine learning techniques can solve the classification problem, and some of them, such as the perceptron and decision trees, have been used for functional tag labeling. We chose the maximum entropy model [1] since it is fast in training and testing and has been applied successfully to many other natural language processing tasks.
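The train-then-classify pipeline can be sketched as follows. This is not the authors' implementation (they used a dedicated MEM library); it uses scikit-learn's LogisticRegression, which with its multinomial objective is equivalent to a maximum entropy classifier, and hypothetical training examples in place of treebank-extracted ones.

```python
# Sketch of the pipeline: feature extraction -> MEM training -> classification.
# Training examples are hypothetical (one feature dict per constituent + gold tag).
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

train_examples = [
    ({"label": "NP", "father": "S",  "head": "tôi",  "head_pos": "P"}, "SUB"),
    ({"label": "NP", "father": "VP", "head": "sách", "head_pos": "N"}, "DOB"),
    ({"label": "PP", "father": "S",  "head": "ở",    "head_pos": "E"}, "LOC"),
]

# One-hot encode the categorical features
vec = DictVectorizer()
X = vec.fit_transform([feats for feats, _ in train_examples])
y = [tag for _, tag in train_examples]

# Multinomial logistic regression == maximum entropy model
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Testing phase: extract the same features from a parsed constituent, then classify
test_features = {"label": "NP", "father": "S", "head": "tôi", "head_pos": "P"}
predicted = model.predict(vec.transform([test_features]))[0]
print(predicted)
```

In the real system, each constituent node of each parsed sentence contributes one such example, so a single tree yields many training or testing instances.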
3.2. Features

In this paper, we use seven linguistically motivated features to recognize Vietnamese functional tags. Several of them were also used by previous authors such as Blaheta [2] and Sun and Sui [19]. The features are described as follows2:

- Label: the syntactic label of the current constituent, which is being functionally classified, is very important in recognizing its role.
- Father's label: this feature is useful in certain cases. For example, suppose the current constituent is a noun phrase (NP). If its father is a clause (S), it is more likely to be a subject (SUB); if its father is a verb phrase (VP), it is more likely to be a direct object (DOB).
- Head word: this feature has proved useful in parsing. It is also important for discriminating functions, for example between temporal (TMP) and locative (LOC).
- POS of head word: the part-of-speech tag of the head word.
- Left sister's label: the label of the constituent to the left of the current one.
- Right sister's label: the label of the constituent to the right of the current one.
- Word cluster: the cluster of the head word. This feature is used to alleviate the data sparseness problem of the head word feature. For example, in the training data, the word "cạnh" (next) is the head word of a locative constituent. In the testing data, there are other locative words that do not occur in the training data, such as "bên kia" (there), "hố" (hole), and "bán cầu" (hemisphere). However, these four words belong to the same word cluster. Therefore, although the head word feature is not effective in this case, the word cluster feature is.

2 Underlined features are different from [2] and [19].

Figure 3 gives an illustration of the features on a parse tree. Six features are shown in this figure; the word cluster feature is presented separately in Figure 4.

Figure 3. Features for functional classification
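The six tree-based features above can be read directly off a parse tree. The sketch below uses NLTK's Tree class and a deliberately simplified head-finding rule (the leftmost word of the constituent); real head-percolation rules are language-specific, so treat the head features here as placeholders.

```python
# Sketch: extracting the six tree-based features for one constituent.
# Head finding is simplified (leftmost leaf), not the system's actual rule.
from nltk import Tree

def features(parent, index):
    """Feature dict for the constituent parent[index], given its parent node."""
    node = parent[index]
    return {
        "label": node.label(),                    # constituent's own label
        "father": parent.label(),                 # father's label
        "head": node.leaves()[0],                 # head word (simplified)
        "head_pos": node.pos()[0][1],             # POS of head word (simplified)
        "left": parent[index - 1].label() if index > 0 else "NONE",
        "right": parent[index + 1].label() if index < len(parent) - 1 else "NONE",
    }

# Toy tree: "tôi đọc sách" ("I read books")
t = Tree.fromstring("(S (NP (P tôi)) (VP (V đọc) (NP (N sách))))")
print(features(t, 0))
```

For the NP "tôi" under S, this yields label NP, father S, head "tôi", head POS P, no left sister, and VP as the right sister, matching the kind of configuration the Father's-label example in the text describes.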
3.3. Word Clustering

Word clustering is often used as a method for estimating the probabilities of low-frequency events that are likely unobserved in a small annotated training corpus. To construct word clusters, we rely on a similarity definition [4]: two words are similar if they occur in similar contexts, or equivalently if they can be used interchangeably to some extent. For example, "chủ tịch" (president) and "tổng thống" (president), or "kéo" (scissors) and "dao" (knife), are similar under this definition. In this research, we use the Brown clustering algorithm [4] to compute word clusters for use as the seventh feature described in the previous subsection. The output of the clustering algorithm is a binary tree in which each inner node represents a cluster, as in Figure 4. Initially, each word in the training corpus belongs to a distinct cluster. The algorithm then iteratively merges the pair of clusters that causes the smallest decrease in the likelihood of the corpus, according to a class-based bigram language model defined on the word clusters.
Figure 4. Example of word cluster hierarchy
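Since the cluster hierarchy is a binary tree, each word can be addressed by the bit string of left/right choices from the root, and a fixed-length prefix of that string names an inner node, i.e. a coarser cluster. The sketch below shows this lookup; the bit strings are hypothetical, not taken from the paper's actual clusters.

```python
# Sketch: turning Brown-cluster bit strings into a cluster feature.
# Bit strings below are hypothetical, invented for illustration.
clusters = {
    "chủ_tịch":   "110100",   # "president"
    "tổng_thống": "110101",   # "president"
    "kéo":        "011100",   # "scissors"
    "dao":        "011101",   # "knife"
}

def cluster_feature(word, prefix_len=4):
    """A fixed-length bit-string prefix, so similar words share a feature value."""
    bits = clusters.get(word)
    return bits[:prefix_len] if bits else "UNK"

# Words never seen as a head word in training can still share a cluster value
# with words that were, which is exactly how this feature fights sparseness.
print(cluster_feature("chủ_tịch"))
print(cluster_feature("tổng_thống"))
```

Here the two words for "president" map to the same 4-bit prefix, so a classifier that never saw one of them as a head word can still generalize from the other.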
4. Experiments

4.1. Corpora and Tools

In our experiments, the most important resource is a hand-crafted Vietnamese treebank [17]. This treebank contains 10,471 sentences tagged with both constituent and functional labels. The treebank has been under development since 2006 and has been updated regularly to support Vietnamese language processing research. We used about 9,000 trees for training and the remaining 1,471 trees for testing. Table 2 shows statistics of the treebank. Table 3 describes the Vietnamese functional tags in four groups: clause types, syntactic roles, adverbials, and miscellaneous.

Sentences   10,471
Words       225,085
Syllables   271,268

Table 2. Vietnamese treebank statistics

Clause types:    CMD (Command), EXC (Exclamation)
Syntactic roles: SUB (Subject), TPC (Topic), PRD (Predicate), EXT (Extent), LGS (Logical subject), DOB (Direct object), IOB (Indirect object), VOC (Vocative)
Adverbials:      TMP (Time), MNR (Manner), PRP (Purpose), CND (Condition), LOC (Location), DIR (Direction), CNC (Concession), ADV (Adverbial)
Miscellaneous:   TTL (Title), SPL (Special)

Table 3. Vietnamese functional tags

The MEM tool3 we used in this paper is a maximum entropy classification library from Tsujii's laboratory. In its latest version (2006), the library has several advanced features such as fast parameter estimation using the BLVM algorithm and smoothing with Gaussian priors. Training and testing examples were extracted from syntactic trees as described in Figure 2 (the feature extraction step). For word clustering, we used an open-source tool2 by Liang (2005) [14], an efficient implementation of Brown's algorithm [4]. An unlabeled corpus of about 700,000 Vietnamese sentences was collected from online newspapers including Lao Dong, PC World, and Tuoi Tre. This corpus was pre-processed (sentence splitting and word segmentation4) before word clustering. We ran the word clustering tool with 700 clusters and obtained about 670 good clusters usable for functional labeling. Figure 5 shows an example of a good cluster. The first line of each cluster gives its name and identification; each word in a cluster is listed with its bit string and its frequency in the corpus.

Figure 5. An example of word clusters
4.2. Functional Labeling Precisions

To evaluate the functional labeling system, we used the precision measure familiar in classification studies: the proportion of testing examples that are correctly labeled by the system to the total number of testing examples. The precision is defined as:

Precision = #correctly labeled examples / #testing examples

2 http://www.cs.berkeley.edu/~pliang
3 http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/maxent/
4 http://www.loria.fr/~lehong/tools/vnTokenizer.php

The reason we did not compute recall and F-score measures is that the number of constituents in a syntactic tree is fixed. In addition to the functional tags described in Table 3, we used NoneLBL to label constituents which do not have a functional tag.

Among 16,997 testing examples5, 14,913 were correctly predicted and 2,084 were incorrectly predicted. The overall precision was 87.77%. In more detail, the precision and frequency of each functional label are presented in Table 4.

Label     Frequency (Testing/Training)   Precision
Overall                                  87.77%
ADV       104/989                        40.38%
CND       5/168                          20%
CNC       1/42                           0%
CMD       4/24                           0%
DIR       51/192                         29.41%
DOB       1451/4933                      77.33%
EXC       5/59                           60%
EXT       9/219                          44.44%
IOB       121/417                        40.5%
LOC       511/1414                       63.01%
MDP       6/282                          50%
MNR       100/924                        24%
PRD       232/1730                       75.86%
PRP       331/1230                       55.59%
SPL       37/73                          18.91%
SUB       1823/4891                      95%
SQ        57/128                         100%
TMP       642/1255                       72.27%
TPC       71/801                         42.25%
TTL       41/58                          73.17%
VOC       18/163                         50%
NoneLBL   11286/13380                    94.12%

Table 4. Evaluation of Vietnamese functional labeling

To investigate the relation between training corpus size and precision, we ran our system with training corpora of different sizes, doubling the size at each step. The learning curve in Figure 6 shows that the precision increased fastest, by around 2%, when the number of training sentences grew from 4,000 to 8,000.

Figure 6. Learning curve (precision, from about 0.81 to 0.88, against the number of training trees: 1000, 2000, 4000, 8000, 9000)

4.3. Error Analyses

In this section, we analyze the errors that occurred in the testing phase. Table 4 shows that several functional labels, such as CNC and CMD, have zero precision because there are too few testing examples in these categories. Another type of error is caused by the dependency between functional labels. Note that we do not use functional labels as features. Figure 7 shows an example of the dependency between the TC and TPC labels. According to the Vietnamese treebank guidelines, if a clause (S) has a topicalized phrase (PP in this example), the clause is labeled with the TC tag.
5 Note that from one syntactic tree there can be many classification examples extracted, depending on the number of constituents in that tree.
Figure 7. The dependency between two functional labels
4.4. The Effectiveness of the Word Cluster Feature

The experimental results in Table 4 were achieved using all seven features. To evaluate the effectiveness of the word cluster feature, we carried out an experiment using the other six features only. As a result, the overall precision of our system decreased by 0.5%. Table 5 shows the changes in precision obtained by using the word cluster feature; labels with no change are omitted. Though the overall increase (0.5%) is not high, several specific changes are relatively large, such as manner (MNR), vocative (VOC), and exclamation (EXC). According to our observations, the head word feature was important in identifying these functional tags. However, this feature was sparse in our training corpus. Therefore, the word cluster feature, trained on a large corpus, was useful in reducing the sparseness of the head word feature.

Label     Change in precision
TTL       -7%
SUB       +1%
TPC       +3%
MNR       +22%
SPL       +5%
IOB       -2%
VOC       +6%
CMD       -20%
DOB       +1%
TMP       +2%
ADV       +1%
PRD       +3%
SQ        +2%
EXT       -11%
EXC       +20%
Overall   +0.5%

Table 5. Changes in precision by using the word cluster feature
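An ablation like the one in Table 5 amounts to computing per-label precision twice and differencing. The sketch below uses toy gold/predicted sequences, not the paper's data; per-label precision here follows the paper's definition (correct predictions over testing examples carrying that gold label).

```python
# Sketch of the Table 5 ablation: per-label precision with vs. without a feature.
# Gold and predicted tag sequences are toy values, not the paper's results.
from collections import Counter

def per_label_precision(gold, predicted):
    """Correct predictions / number of testing examples with that gold label."""
    total, correct = Counter(), Counter()
    for g, p in zip(gold, predicted):
        total[g] += 1
        if g == p:
            correct[g] += 1
    return {lab: correct[lab] / total[lab] for lab in total}

gold         = ["SUB", "DOB", "MNR", "MNR", "LOC"]
with_cluster = ["SUB", "DOB", "MNR", "MNR", "DOB"]   # all features
without      = ["SUB", "DOB", "TMP", "MNR", "DOB"]   # word cluster feature removed

p_with = per_label_precision(gold, with_cluster)
p_without = per_label_precision(gold, without)
delta = {lab: p_with[lab] - p_without[lab] for lab in p_with}
print(delta["MNR"])  # MNR gains precision when the cluster feature is used
```

In this toy run, MNR precision rises from 0.5 to 1.0 when the cluster feature is present, mirroring the kind of MNR gain reported in Table 5.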
5. Conclusion and Future Work

In this research, we have investigated the Vietnamese functional labeling problem and made a number of contributions. First, we built the first Vietnamese functional labeling system, which achieves a high precision. Second, we carried out various experiments, including a learning curve and error analyses, to give a better understanding of this system. Additionally, we showed the effectiveness of the word cluster feature. In the future, we would like to find more useful features for functional classification. We would also like to compare our approach with others, such as functional labeling by parsing and functional labeling using sequence learning.
References

[1] A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 1996.
[2] Don Blaheta. Function Tagging. PhD thesis, 2003.
[3] Don Blaheta and Eugene Charniak. Assigning Function Tags to Parsed Text. In Proceedings of the 1st Annual Meeting of the North American Chapter of the Association for Computational Linguistics, 2000.
[4] P. F. Brown, V. J. Della Pietra, P. V. de Souza, J. C. Lai, and R. L. Mercer. Class-based n-gram Models of Natural Language. Computational Linguistics, 18(4):467-479, 1992.
[5] Xavier Carreras and Lluís Màrquez. Introduction to the CoNLL-2004 Shared Task: Semantic Role Labeling. In Proceedings of CoNLL-2004, 2004.
[6] Grzegorz Chrupala, Nicolas Stroppa, and Josef van Genabith. Better Training for Function Labeling. 2007.
[7] Michael Collins. Three Generative, Lexicalised Models for Statistical Parsing. In Proceedings of the ACL, 1997.
[8] Charles Fillmore. Frame Semantics and the Nature of Language. Annals of the New York Academy of Sciences, 280:20-32, 1976.
[9] Ryan Gabbard, Mitchell Marcus, and Seth Kulick. Fully Parsing the Penn Treebank. 2006.
[10] Daniel Gildea and Daniel Jurafsky. Automatic Labeling of Semantic Roles. 2002.
[11] T. Koo, X. Carreras, and M. Collins. Simple Semi-supervised Dependency Parsing. In Proceedings of the ACL, 2008, pp. 595-603.
[12] J. Lafferty, A. McCallum, and F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of ICML 2001, pages 282-289, Williamstown, USA.
[13] Anh-Cuong Le, Phuong-Thai Nguyen, Hoai-Thu Vuong, Minh-Thu Pham, and Tu-Bao Ho. An Experimental Study on Lexicalized Statistical Parsing for Vietnamese. In Proceedings of KSE 2009, pp. 162-167, 2009.
[14] Percy Liang. Semi-Supervised Learning for Natural Language. Master's thesis, Massachusetts Institute of Technology, 2005.
[15] Mitchell P. Marcus et al. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 1993.
[16] P. Merlo and G. Musillo. Accurate Function Parsing. In Proceedings of EMNLP 2005, pages 620-627, Vancouver, Canada, 2005.
[17] Phuong-Thai Nguyen, Xuan-Luong Vu, Minh-Huyen Nguyen, Van-Hiep Nguyen, and Hong-Phuong Le. Building a Large Syntactically-Annotated Corpus of Vietnamese. In The 3rd Linguistic Annotation Workshop (LAW), ACL-IJCNLP 2009.
[18] L. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2):257-286, 1989.
[19] Weiwei Sun and Zhifang Sui. Chinese Function Tag Labeling. 2009.
[20] Honglin Sun and Daniel Jurafsky. Shallow Semantic Parsing of Chinese. In HLT-NAACL 2004: Main Proceedings.
[21] Nianwen Xue and Martha Palmer. Automatic Semantic Role Labeling for Chinese Verbs. 2004.
[22] Caixia Yuan, Fuji Ren, and Xiaojie Wang. Accurate Learning for Chinese Function Tags from Minimal Features. 2009.