
ACML Workshop: Machine Learning and Its Applications in Vietnam 1–9, 2014

MLAVN 2014

Punctuation Prediction for Vietnamese Texts Using Conditional Random Fields

Quang H. Pham        [email protected]
Binh T. Nguyen       [email protected]
Department of Computer Science, University of Science, Vietnam

Nguyen Viet Cuong    [email protected]
Department of Computer Science, National University of Singapore

Abstract

We present the first results for the punctuation prediction task for the Vietnamese language. This task is important because punctuation prediction can be used to add punctuation to machine-transcribed speech, which usually does not contain such information. Similar to previous work for English and Chinese, we model the problem using the conditional random field model and propose a set of features that are useful for prediction. We report the experimental results of our model on a corpus collected from online news texts.

Keywords: punctuation prediction, Vietnamese, conditional random field

1. Introduction

Punctuation prediction (Beeferman et al., 1998; Huang and Zweig, 2002) is an important problem in language and speech processing. Punctuation prediction models can be used to annotate machine-transcribed speech, which usually does not come with punctuation information. The problem has been extensively researched for major languages such as English (Beeferman et al., 1998; Huang and Zweig, 2002; Lu and Ng, 2010; Cuong et al., 2014) and Chinese (Lu and Ng, 2010; Zhao et al., 2012). However, to the best of our knowledge, punctuation prediction has not been investigated for the Vietnamese language.

In this paper, we report preliminary results for the first Vietnamese punctuation prediction system. Our system is based on linear-chain conditional random fields (CRFs) (Lafferty et al., 2001), a powerful sequence labeling model that has been used in many applications such as part-of-speech tagging (Lafferty et al., 2001), phrase chunking (Sha and Pereira, 2003), and named entity recognition (McCallum and Li, 2003). CRFs have been applied to English punctuation prediction (Lu and Ng, 2010; Cuong et al., 2014) and Chinese punctuation prediction (Lu and Ng, 2010; Zhao et al., 2012) with promising results.

We first describe our corpus for the Vietnamese punctuation prediction task. The corpus is constructed from online Vietnamese news sources and is, to our knowledge, the first such corpus for Vietnamese punctuation prediction. We then describe how to model the problem as a sequence labeling task and apply the CRF model; in particular, we describe our label sets and feature sets. We consider two different label sets for the CRF models: one with a single label O indicating that a word is not followed by any punctuation, and one in which the label O is expanded to capture long-range dependencies. Previous work on English has shown that capturing long-range dependencies is helpful for punctuation prediction (Lu and Ng, 2010; Cuong et al., 2014).


Our experiments also show that using the expanded label set improves the performance of our model.

The rest of this paper is structured as follows. In Section 2, we review related work on punctuation prediction and Vietnamese language processing. A brief introduction to the conditional random field model is given in Section 3. We describe our approach to the Vietnamese punctuation prediction task in Section 4 and report the experimental results in Section 5. Finally, we conclude and discuss some future directions in Section 6.

2. Related Works

Punctuation prediction has been broadly investigated for many years. One of the first punctuation prediction systems was created by Beeferman et al. (1998) to automatically insert commas into texts. The system uses a finite state transition model and a Viterbi decoder to predict the positions of commas in a sentence. Huang and Zweig (2002) proposed a maximum entropy model for the task with three punctuation marks: period, comma, and question mark. Using CRF models, Lu and Ng (2010) achieved better performance on the punctuation prediction task on both the English and Chinese data sets of the IWSLT corpus (Paul, 2009). Notably, they showed that using a dynamic CRF to jointly model word-level and sentence-level labeling tasks, and thus capture some long-range dependencies, is useful for punctuation prediction. Similarly, Cuong et al. (2014) used high-order semi-Markov CRFs to capture long-range dependencies between punctuations and achieved better prediction performance than linear-chain CRFs. Zhao et al. (2012) investigated Chinese punctuation prediction by formulating the problem as a multiple-pass labeling task and applying the CRF model. Cho et al. (2012) studied a segmentation and punctuation prediction problem for German-English with a monolingual translation system and demonstrated their results in oracle experiments.

Our work also falls within Vietnamese language processing, where there have been various works in different directions such as word segmentation (Dien et al., 2001; Nguyen et al., 2006) and part-of-speech (POS) tagging (Tran et al., 2009). Using a weighted finite state transducer and a neural network, Dien et al. (2001) built a Vietnamese word segmentation system with high precision. Nguyen et al. (2006) also investigated the Vietnamese word segmentation problem using CRF and SVM models. The Vietnamese POS tagging task was studied by Tran et al. (2009) with three different techniques: maximum entropy models, CRFs, and SVMs. However, to our knowledge, the problem of punctuation prediction has not been investigated before for Vietnamese.

3. Conditional Random Fields

Conditional random fields (Lafferty et al., 2001) are discriminative, undirected Markov models which can capture various dependencies between a structured observation x and its corresponding structured label y. In this section, we give a brief introduction to linear-chain CRFs, a special type of CRF that is widely used in practice, especially for sequence labeling tasks. Since we only work with linear-chain CRFs, we will use the term CRF to refer to a linear-chain CRF throughout this paper.



To formally define a CRF, let X and Y be random vectors, x be a value of X, and y be a value of Y such that x = (x_1, x_2, ..., x_T) and y = (y_1, y_2, ..., y_T). Assume we have a set of real-valued feature functions {f_k(y, y', x_t)}, k = 1, ..., K, for the CRF, and each feature f_k has a corresponding weight λ_k. Note that the feature function f_k(y, y', x_t) may depend on the whole sequence x instead of just the one element x_t. A linear-chain conditional random field defines the conditional distribution p(Y | X) of the form

    p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{k=1}^{K} \sum_{t=1}^{T} \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}_t) \right),

where

    Z(\mathbf{x}) = \sum_{\mathbf{y}} \exp\left( \sum_{k=1}^{K} \sum_{t=1}^{T} \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}_t) \right)

is the normalization function, also called the partition function.

Given the training data D = {(x^i, y^i)}_i, the CRF model is trained by choosing the parameters λ = (λ_1, λ_2, ..., λ_K) that maximize the following regularized conditional log-likelihood of the data:

    L(\lambda) = \sum_{i} \ln p(\mathbf{y}^i \mid \mathbf{x}^i) - \sum_{k=1}^{K} \frac{\lambda_k^2}{2\sigma^2},

where σ is a parameter that controls the degree of regularization. This function is concave, and thus the global optimum can be found using any convex optimization algorithm. Optimization algorithms for L(λ) usually require inference on the CRF. Similar to hidden Markov models (Rabiner, 1989), inference on CRFs is done by defining a set of forward and backward variables and using dynamic programming to compute them efficiently. During testing, the label sequence for a new test instance is determined by a Viterbi-like algorithm (Sutton and McCallum, 2006), which returns the label sequence with the highest probability according to the trained model. We use these algorithms in our punctuation prediction system.
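To make the decoding step concrete, the following is a minimal Python sketch of Viterbi decoding for a linear-chain CRF. It assumes the weighted feature sums have already been collapsed into an emission matrix and a label-transition matrix, and it ignores a dedicated start transition; our system relies on the decoder built into the CRF++ toolkit rather than on this code.

    import numpy as np

    def viterbi_decode(emission, transition):
        """Return the highest-scoring label sequence for one sequence.

        emission:   (T, L) array; emission[t, y] is the sum of weighted
                    unigram feature scores for label y at position t.
        transition: (L, L) array; transition[p, y] is the weighted score
                    for moving from label p to label y.
        """
        T, L = emission.shape
        score = np.empty((T, L))
        backptr = np.zeros((T, L), dtype=int)

        score[0] = emission[0]  # no previous label at the first position
        for t in range(1, T):
            # cand[p, y]: score of reaching label y at t from label p at t-1
            cand = score[t - 1][:, None] + transition + emission[t][None, :]
            backptr[t] = cand.argmax(axis=0)
            score[t] = cand.max(axis=0)

        # Follow the back-pointers from the best final label.
        best = [int(score[-1].argmax())]
        for t in range(T - 1, 0, -1):
            best.append(int(backptr[t, best[-1]]))
        return best[::-1]

The same dynamic-programming structure, with sums of exponentiated scores in place of maximizations, yields the forward and backward variables used during training.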

4. Punctuation Prediction for Vietnamese Texts Using CRFs

4.1. Corpus

Since there was no standard corpus available in Vietnamese for the punctuation prediction task, we create our own corpus from online news sources. More specifically, we build a corpus from 500 online newspaper articles available from VnExpress, a popular news website in Vietnam. In this work, we only focus on one domain, namely, the news section of the website [1].

[1] The URL for the news section is: http://vnexpress.net/tin-tuc/thoi-su.

As a pre-processing step, we clean the data by fixing common writing errors and non-standard uses of punctuation, such as two or more punctuation marks at the end of a sentence (e.g., an ellipsis followed by a period, triple question marks, etc.) and punctuation placed outside a quotation mark. After the pre-processing step, we end up with a corpus of about 240,000 words and 7 types of punctuations: comma (,), period (.), colon (:), semicolon (;), ellipsis (...), question mark (?), and exclamation mark (!).
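The following Python sketch illustrates the kind of normalization performed in this step. The exact cleaning rules used to build the corpus are not published, so the patterns below only cover the cases mentioned above and are assumptions for illustration.

    import re

    def normalize_punctuation(text):
        """Illustrative cleanup of non-standard punctuation usage."""
        text = re.sub(r"\?{2,}", "?", text)             # e.g. "???" -> "?"
        text = re.sub(r"!{2,}", "!", text)              # e.g. "!!!" -> "!"
        text = re.sub(r"\.{3,}(\s*\.+)*", "...", text)  # ellipsis variants -> "..."
        # Punctuation placed outside a closing quotation mark is also
        # normalized in the corpus; the corresponding rule is not shown here.
        return text

    print(normalize_punctuation("Thật sao???"))  # -> "Thật sao?"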



4.2. Punctuation Prediction as Sequence Labeling

Like previous works for English and Chinese (Lu and Ng, 2010; Zhao et al., 2012), we model the punctuation prediction task as a sequence labeling problem. In particular, we treat each paragraph in a news article as a sequence and aim to label each word in the sequence with the punctuation that immediately follows it. In the simple case, we use the label O to indicate that a word is not followed by any punctuation. For example, consider the following paragraph in Vietnamese [2]:

    Khu vực Đà Nẵng - Bình Định có tần số bão ít hơn và bão thường tập trung tháng 10 và 11. Hai vùng còn lại Phú Yên - Khánh Hòa, Ninh Thuận - Cà Mau bão ít đổ bộ hơn các vùng khác.

    (The area of Da Nang - Binh Dinh has a lower storm frequency and storms usually occur in October and November. In the other two areas, Phu Yen - Khanh Hoa and Ninh Thuan - Ca Mau, storms occur less often than in other areas.)

[2] The paragraph is extracted from the following article: http://vnexpress.net/tin-tuc/thoi-su/bao-xuat-hien-nhieu-nhat-o-quang-ninh-thanh-hoa-3077937.html.

This paragraph can be labeled as follows. Note that all the words are in lower case because word case information is usually not available for the punctuation prediction task; for instance, when the texts are transcribed from speech, we do not have case information for the words.

    khu/O vực/O đà/O nẵng/O -/O bình/O định/O có/O tần/O số/O bão/O ít/O hơn/O và/O bão/O thường/O tập/O trung/O tháng/O 10/O và/O 11/Period hai/O vùng/O còn/O lại/O phú/O yên/O -/O khánh/O hòa/Comma ninh/O thuận/O -/O cà/O mau/O bão/O ít/O đổ/O bộ/O hơn/O các/O vùng/O khác/Period

For punctuation prediction, we also observe that there is usually a dependency between the current punctuation and its immediately preceding punctuation in the sequence. For example, when we have just ended a sentence (with a period, question mark, etc.), a comma is more likely to occur next than other punctuation marks. Therefore, it is a good idea to keep track of what the previous punctuation is. This is a type of long-range dependency between the labels, and it can be captured by expanding the label O to include information about the previous punctuation. More specifically, we replace the label O with the following 8 labels: O-Comma, O-Period, O-Colon, O-Semicolon, O-Ellipsis, O-QuestionMark, O-ExclamMark, and O-None. The first 7 labels indicate that the label of the current word is O and the immediately preceding punctuation is, respectively, a comma, period, colon, semicolon, ellipsis, question mark, or exclamation mark. The last label O-None indicates that the current label is O and there is no preceding punctuation (i.e., the word is at the beginning of the sequence).
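As a concrete illustration of this labeling scheme, the Python sketch below converts a lower-cased paragraph into (word, label) pairs with the expanded label set. The simple regular-expression tokenizer is an assumption for illustration; it follows the description above rather than the actual preprocessing code of our system.

    import re

    # Map each punctuation mark to its label name.
    PUNCT_LABELS = {
        ",": "Comma", ".": "Period", ":": "Colon", ";": "Semicolon",
        "...": "Ellipsis", "?": "QuestionMark", "!": "ExclamMark",
    }

    def label_paragraph(text):
        """Turn a punctuated paragraph into (word, label) pairs.

        A word is labeled with the punctuation that immediately follows it;
        otherwise it gets O-<previous punctuation>, or O-None at the start.
        """
        # Split off punctuation marks as separate tokens ("..." before ".").
        tokens = re.findall(r"\.\.\.|[,.:;?!]|[^\s,.:;?!]+", text.lower())
        pairs, prev_punct = [], "None"
        for i, tok in enumerate(tokens):
            if tok in PUNCT_LABELS:
                continue  # punctuation is not labeled itself; it becomes a label
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            if nxt in PUNCT_LABELS:
                label = PUNCT_LABELS[nxt]
                prev_punct = label
            else:
                label = "O-" + prev_punct
            pairs.append((tok, label))
        return pairs

    # label_paragraph("tháng 10 và 11. hai vùng") returns
    # [('tháng', 'O-None'), ('10', 'O-None'), ('và', 'O-None'),
    #  ('11', 'Period'), ('hai', 'O-Period'), ('vùng', 'O-Period')]

Dropping the suffix after "O-" recovers the simple label set used in the first example.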



As an example, the above sample sequence can be labeled as follows using the new label set.

    khu/O-None vực/O-None đà/O-None nẵng/O-None -/O-None bình/O-None định/O-None có/O-None tần/O-None số/O-None bão/O-None ít/O-None hơn/O-None và/O-None bão/O-None thường/O-None tập/O-None trung/O-None tháng/O-None 10/O-None và/O-None 11/Period hai/O-Period vùng/O-Period còn/O-Period lại/O-Period phú/O-Period yên/O-Period -/O-Period khánh/O-Period hòa/Comma ninh/O-Comma thuận/O-Comma -/O-Comma cà/O-Comma mau/O-Comma bão/O-Comma ít/O-Comma đổ/O-Comma bộ/O-Comma hơn/O-Comma các/O-Comma vùng/O-Comma khác/Period

4.3. Features for CRFs

We construct a set of features that are useful for the punctuation prediction task. Our baseline zeroth-order CRF features include the following unigram features: the current label without any word, the current word itself, the words within 4 positions preceding the current word, and the words within 2 positions succeeding the current word. We then add all pairs of consecutive words within a window of size 2 before and after the current position as bigram features. For first-order CRF features, we include the transitions between two labels. Table 1 details our set of features; the table follows the template format of the CRF++ toolkit (Kudo, 2005), which we use to train and test our model.

ID    Feature template     Feature definition
U     N/A                  The current label without any word
U00   %x[0,0]              The current word and its label
U01   %x[-1,0]             The preceding word and the current label
U02   %x[-2,0]             The word at position -2 and the current label
U03   %x[1,0]              The succeeding word and the current label
U04   %x[2,0]              The word at position +2 and the current label
U05   %x[-3,0]             The word at position -3 and the current label
U06   %x[-4,0]             The word at position -4 and the current label
U07   %x[-2,0]/%x[-1,0]    Bigram of words at positions -2 and -1, plus the current label
U08   %x[-1,0]/%x[0,0]     Bigram of preceding and current words, plus the current label
U09   %x[0,0]/%x[1,0]      Bigram of current and succeeding words, plus the current label
U10   %x[1,0]/%x[2,0]      Bigram of words at positions 1 and 2, plus the current label
B     N/A                  Transitions between labels

Table 1: Set of features for the CRF model. The feature template %x[r,0] denotes the word at the r-th position relative to the current position.

We also note that although including other words surrounding the current position as unigram features (such as the words at the 3rd or 4th position succeeding the current position) and trigram features is helpful for punctuation prediction in English (Lu and Ng, 2010; Cuong et al., 2014), our experiments indicate that they are not helpful for Vietnamese. Thus, we do not include these features in our model.
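The sketch below shows how labeled sequences and the feature templates of Table 1 can be prepared for CRF++, which expects one token and its label per line with a blank line between sequences. The file names and the label_paragraph helper from the earlier sketch are illustrative assumptions; only the general input format and command names of CRF++ are standard.

    # Feature templates mirroring Table 1, in CRF++ template syntax.
    CRFPP_TEMPLATE = "\n".join([
        "U",             # label-only (bias) feature
        "U00:%x[0,0]", "U01:%x[-1,0]", "U02:%x[-2,0]",
        "U03:%x[1,0]", "U04:%x[2,0]", "U05:%x[-3,0]", "U06:%x[-4,0]",
        "U07:%x[-2,0]/%x[-1,0]", "U08:%x[-1,0]/%x[0,0]",
        "U09:%x[0,0]/%x[1,0]", "U10:%x[1,0]/%x[2,0]",
        "B",             # label transition features
    ]) + "\n"

    def write_crfpp_data(paragraphs, path):
        """Write sequences in CRF++ format: one 'word<TAB>label' pair per
        line, with a blank line separating consecutive sequences."""
        with open(path, "w", encoding="utf-8") as f:
            for text in paragraphs:
                for word, label in label_paragraph(text):  # earlier sketch
                    f.write(word + "\t" + label + "\n")
                f.write("\n")

    with open("template", "w", encoding="utf-8") as f:
        f.write(CRFPP_TEMPLATE)

    # Training and testing then use the CRF++ command-line tools, e.g.
    #   crf_learn template train.data model
    #   crf_test -m model test.data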



Punctuation            Training set                   Testing set
                       Number    Percentage (%)       Number    Percentage (%)
Comma                  8015      5.33                 4698      5.23
Period                 4881      3.24                 2966      3.31
Colon                  420       0.279                243       0.27
Semicolon              258       0.171                115       0.12
Ellipsis               171       0.113                112       0.12
Question mark          74        0.05                 45        0.047
Exclamation mark       3         0.002                5         0.005

Table 2: Distribution of punctuations in the training and testing data sets. Note that the rest of the data sets contain no punctuation (label O).

5. Experimental Results

5.1. Setup

For the experiments, we split our corpus into two parts: 66% for training and the rest for testing. Table 2 summarizes the punctuation distribution over the training and testing data sets. In our data sets, comma and period are the most common punctuations, while question marks and exclamation marks are very rare.

We note that our data sets have a very imbalanced class distribution. We could try to balance the non-O classes by adding more training sequences that contain the rare classes; however, we leave this for future work. In this work, we work directly with the original imbalanced data sets.

We use three metrics to measure the performance of our system: precision (denoted by P), recall (denoted by R), and F1 (denoted by F). As Table 2 shows, the punctuations are not equally distributed; hence we use micro-averaged scores (Manning et al., 2008) instead of macro-averaged scores for the overall performance of the system. The micro-averaged precision and recall are given by

    P = \frac{\sum_j tp_j}{\sum_j (tp_j + fp_j)}, \qquad R = \frac{\sum_j tp_j}{\sum_j (tp_j + fn_j)},

where tp_j is the number of instances correctly classified as class j (true positives), fp_j is the number of instances incorrectly classified as class j (false positives), and fn_j is the number of instances in class j that are misclassified (false negatives). The micro-averaged F1 score is computed as

    F = \frac{2PR}{P + R}.

In our experiments, we illustrate the effect of different combinations of features on the performance of our Vietnamese punctuation prediction system. We begin with the unigram word features and subsequently add the bigram word features as well as the label transition features. After obtaining the best set of features, we use it to train a CRF model with the expanded label set described in Section 4.2 and compare this model with the CRF that uses the original label set. For all the experiments, we train the CRF models on the training set with the regularization parameter σ = 1 and then test the models on the testing set. Our scores are computed at the token level.
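The micro-averaged scores can be computed with a few lines of Python; the sketch below uses hypothetical per-class counts purely to show how pooling the counts differs from averaging per-class scores.

    def micro_scores(counts):
        """counts: dict mapping class -> (tp, fp, fn); returns (P, R, F1).

        All punctuation classes are pooled before averaging, which is what
        makes the scores micro-averaged rather than macro-averaged.
        """
        tp = sum(c[0] for c in counts.values())
        fp = sum(c[1] for c in counts.values())
        fn = sum(c[2] for c in counts.values())
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    # Hypothetical counts for two classes, just to show the pooling:
    # tp = 80, fp = 20, fn = 60, so P = 0.8, R ~ 0.571, F ~ 0.667.
    print(micro_scores({"Comma": (30, 10, 40), "Period": (50, 10, 20)}))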


Features                                                         P (%)    R (%)    F (%)
U, U00, U01, U02, U03, U04, U05, U06                             75.07    35.67    48.36
U, U00, U01, U02, U03, U04, U05, U06, U07, U08, U09, U10         82.99    37.80    51.95
U, U00, U01, U02, U03, U04, U05, U06, U07, U08, U09, U10, B      81.24    39.21    52.89

Table 3: Results for different feature combinations.

5.2. Results

In Table 3, we show the results for different combinations of features for our Vietnamese punctuation prediction system. Using the unigram word features alone achieves a 48.36% F1 score, while using both unigram and bigram word features achieves a 51.95% F1 score, an increase of 3.59%. Adding the label transition features to the word features further increases the F1 score to 52.89%, although it decreases the precision of the model to 81.24%. It is also interesting to note that considering words at positions further back (i.e., features U05 and U06) increases the performance of the model. This may be because appositions are often used to add extra information in news texts, which creates a dependency between the punctuation of the current word and the words at positions further back.

In Table 4, we compare the CRF trained using the expanded label set with the CRF trained using the original label set. Both models are trained with the best set of features (i.e., the third set in Table 3). For the CRF trained using the expanded label set, the scores are obtained by first removing the suffixes of the O labels and then computing the scores as for the original label set. From Table 4, using the expanded label set increases the F1 score of the system from 52.89% to 56.12%. Notably, it considerably increases the F1 scores for the comma, period, and colon, although it decreases the precision for the comma and period. Using the expanded label set also decreases the scores for the question mark, but this does not affect the overall scores significantly since the frequency of this label in our data set is very small.
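The suffix-removal step amounts to a simple label mapping, sketched below with the label names defined in Section 4.2.

    def collapse_label(label):
        """Map an expanded label such as 'O-Period' back to 'O' for scoring;
        punctuation labels ('Comma', 'Period', ...) are left unchanged."""
        return "O" if label.startswith("O-") else label

    # collapse_label("O-Comma") == "O"; collapse_label("Period") == "Period"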

Punctuation            Expanded Label Set               Original Label Set
                       P (%)     R (%)     F (%)        P (%)     R (%)     F (%)
Comma                  59.72     35.48     44.52        64.33     28.94     39.92
Period                 82.58     59.27     69.01        83.46     56.78     67.58
Colon                  95.45     71.18     81.55        91.01     66.39     76.77
Semicolon              0         0         0            0         0         0
Ellipsis               0         0         0            0         0         0
Question mark          0         0         0            50.0      2.27      4.34
Exclamation mark       0         0         0            0         0         0
Micro-average          76.51     44.31     56.12        81.24     39.21     52.89

Table 4: Comparison of the CRF trained using the expanded label set and the CRF trained using the original label set.

6. Conclusion and Future Works

We have studied the punctuation prediction task for Vietnamese texts and presented preliminary results for a system trained using the CRF model. The system achieves promising results on three punctuation marks: period, comma, and colon; further work is needed to improve the performance on the other punctuation marks.

For future work, we would like to include more sophisticated types of features, such as POS tags and a person name dictionary, in order to achieve better prediction results. Since the poor performance on the rare punctuations may be due to the lack of training examples, we would also like to increase the size of our corpus to include more examples with those punctuations. Finally, our experiments are somewhat limited because we constructed our corpus from online news texts, not from real transcribed speeches. In the future, we may therefore need to consider a domain where the data come from real transcribed speeches; in that case, it would be interesting to investigate how to transfer the knowledge learned from the online news domain to this new domain.

References

Doug Beeferman, Adam Berger, and John Lafferty. Cyberpunc: A lightweight punctuation annotation system for speech. In IEEE International Conference on Acoustics, Speech and Signal Processing, 1998.

Eunah Cho, Jan Niehues, and Alex Waibel. Segmentation and punctuation prediction in speech language translation using a monolingual translation system. In Workshop on Spoken Language Translation, 2012.

Nguyen Viet Cuong, Nan Ye, Wee Sun Lee, and Hai Leong Chieu. Conditional random field with high-order dependencies for sequence labeling and segmentation. Journal of Machine Learning Research, 15:981–1009, 2014.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Dinh Dien, Hoang Kiem, and Nguyen Van Toan. Vietnamese word segmentation. In Natural Language Processing Pacific Rim Symposium, 2001.

Jing Huang and Geoffrey Zweig. Maximum entropy model for punctuation annotation from speech. In International Conference on Spoken Language Processing, 2002.

Taku Kudo. CRF++: Yet another CRF toolkit. Software available at http://crfpp.sourceforge.net, 2005.



John Lafferty, Andrew McCallum, and Fernando C.N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning, 2001.

Wei Lu and Hwee Tou Ng. Better punctuation prediction with dynamic conditional random fields. In Conference on Empirical Methods in Natural Language Processing, 2010.

Andrew McCallum and Wei Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Conference on Natural Language Learning, 2003.

Cam-Tu Nguyen, Trung-Kien Nguyen, Xuan-Hieu Phan, Le-Minh Nguyen, and Quang-Thuy Ha. Vietnamese word segmentation with CRFs and SVMs: An investigation. In Pacific Asia Conference on Language, Information and Computation, 2006.

Michael Paul. Overview of the IWSLT 2009 evaluation campaign. In Workshop on Spoken Language Translation, 2009.

Lawrence Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

Fei Sha and Fernando Pereira. Shallow parsing with conditional random fields. In Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, 2003.

Charles Sutton and Andrew McCallum. An introduction to conditional random fields for relational learning. Introduction to Statistical Relational Learning, pages 93–128, 2006.

Oanh Thi Tran, Cuong Anh Le, Thuy Quang Ha, and Quynh Hoang Le. An experimental study on Vietnamese POS tagging. In International Conference on Asian Language Processing, 2009.

Yanqing Zhao, Chaoyue Wang, and Guohong Fu. A CRF sequence labeling approach to Chinese punctuation prediction. In Pacific Asia Conference on Language, Information and Computation, 2012.
