Part Of Speech Tagging and Chunking with HMM and CRF

Pranjal Awasthi, Delip Rao and Balaraman Ravindran
Dept. of CSE, IIT Madras
Abstract

In this paper we propose an approach to Part of Speech (PoS) tagging that combines a Hidden Markov Model with error-driven learning. For the NLPAI joint task, we also implement a chunker using Conditional Random Fields (CRFs). The results for the PoS tagging and chunking tasks are reported separately, along with the results of the joint task.
1 Introduction

Part of Speech tagging is an important preprocessing step in many natural language processing applications and the first step in the syntactic analysis of a language. We propose a combination of statistical and rule-based techniques for Part of Speech tagging of Indian languages and demonstrate its performance on a Hindi dataset. Shallow parsing, or chunking, is the task of segmenting text into chunks of syntactically related word groups. Apart from reducing the search space of deep parsers, shallow parsing is very useful in applications such as Named Entity Recognition, Information Extraction, Summarization, Question Answering and Automatic Thesaurus Generation. In this paper, the task of chunking is attempted using Conditional Random Fields, and the results for the combined and individual tasks are reported.
2 The Part Of Speech Tagger

Our tagging process consists of two stages: an initial stochastic tagging using the TnT tagger (Brants, 2000), a second-order Hidden Markov Model (HMM) (Rabiner and Juang, 1986) based tagger, and a post-processing stage using error-driven learning, akin to (Brill, 1995). The main idea is as follows: use the TnT tagger to perform the initial tagging and then apply a set of transformation rules to correct the errors introduced by the TnT tagger. These transformation rules are induced during the training phase by iteratively extracting a set of candidate transformations from the transformation templates listed in Table 1 and selecting those transformations that maximize the error reduction on the entire training data. Thus, in each iteration, new training data is generated by applying the transformation selected in that iteration.

Table 1: Transformation templates for a given word w_i, from (Brill, 1995)

Change tag a to b if:
- The previous word (w_{i-1}) is tagged z
- The next word (w_{i+1}) is tagged z
- The word w_{i-2} is tagged z
- The word w_{i+2} is tagged z
- The word w_{i-1} or w_{i-2} is tagged z
- The word w_{i+1} or w_{i+2} is tagged z
- The word w_{i-1} or w_{i-2} or w_{i-3} is tagged z
- The word w_{i+1} or w_{i+2} or w_{i+3} is tagged z
- The previous word (w_{i-1}) is tagged z and the next word (w_{i+1}) is tagged x
- The previous word (w_{i-1}) is tagged z and w_{i-2} is tagged x
- The next word (w_{i+1}) is tagged z and w_{i+2} is tagged x
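To make the rule-induction procedure concrete, the following Python sketch shows one way such a loop could be implemented over the templates of Table 1. It is illustrative only and not the code used for this work; the function names (propose_candidates, score, apply_rule, learn_rules, prev_tag) and the data layout (tags as flat lists over the corpus) are our own assumptions.

# Illustrative sketch of transformation-based error-driven learning.
# Each template is a function mapping (tags, i) to a context value,
# e.g. the tag of the previous word.
from collections import Counter

def propose_candidates(current_tags, gold_tags, templates):
    # Candidate rules (template, context, from_tag, to_tag) at every
    # position where the current tag disagrees with the gold tag.
    candidates = Counter()
    for i, (cur, gold) in enumerate(zip(current_tags, gold_tags)):
        if cur == gold:
            continue
        for template in templates:
            context = template(current_tags, i)
            if context is not None:
                candidates[(template.__name__, context, cur, gold)] += 1
    return candidates

def score(rule, current_tags, gold_tags, templates):
    # Net error reduction if `rule` were applied wherever it matches.
    name, context, from_tag, to_tag = rule
    template = templates[name]
    delta = 0
    for i, cur in enumerate(current_tags):
        if cur == from_tag and template(current_tags, i) == context:
            delta += int(gold_tags[i] == to_tag) - int(gold_tags[i] == from_tag)
    return delta

def apply_rule(rule, current_tags, templates):
    name, context, from_tag, to_tag = rule
    template = templates[name]
    return [to_tag if t == from_tag and template(current_tags, i) == context
            else t for i, t in enumerate(current_tags)]

def learn_rules(current_tags, gold_tags, template_fns, n_iter=50):
    # Greedily pick the transformation with the largest error reduction,
    # apply it, and repeat until no transformation helps.
    templates = {t.__name__: t for t in template_fns}
    rules = []
    for _ in range(n_iter):
        cands = propose_candidates(current_tags, gold_tags, template_fns)
        if not cands:
            break
        best = max(cands, key=lambda r: score(r, current_tags, gold_tags, templates))
        if score(best, current_tags, gold_tags, templates) <= 0:
            break
        rules.append(best)
        current_tags = apply_rule(best, current_tags, templates)
    return rules

# One template from Table 1: "change tag a to b if the previous word is tagged z".
def prev_tag(tags, i):
    return tags[i - 1] if i > 0 else None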
3 The Chunker

The chunker implementation is a linear-chain CRF provided by MALLET[1]. Our work on chunking closely follows that of (Sha and Pereira, 2003). Let Y denote the chunk label sequence and X the corresponding observation sequence. A linear-chain CRF (Lafferty et al., 2001) models the conditional probability P(Y|X) as

P(Y|X) = (1/Z) exp{ Σ_k Σ_i λ_k f_k(y_{i-1}, y_i, X) + Σ_k Σ_i μ_k g_k(y_i, X) }

where Z is a normalization constant.

[1] http://mallet.cs.umass.edu/index.php/Main_Page

The feature set {f, g} used in this work is similar to that used in (Sha and Pereira, 2003). Table 2 describes the feature set, where t_i and c_i denote the PoS tag and the chunk tag, respectively, of the word w_i.

Table 2: Feature set used for the chunker

w_{i-2} = w
w_{i-1} = w
w_i = w
w_{i+1} = w
w_{i+2} = w
w_{i-1} = w', w_i = w
w_{i+1} = w', w_i = w
t_{i-2} = t
t_{i-1} = t
t_i = t
t_{i+1} = t
t_{i+2} = t
t_{i-1} = t', t_i = t
t_{i-2} = t', t_{i-1} = t
t_i = t', t_{i+1} = t
t_{i+1} = t', t_{i+2} = t
t_{i-2} = t'', t_{i-1} = t', t_i = t
t_{i-1} = t'', t_i = t', t_{i+1} = t
t_i = t'', t_{i+1} = t', t_{i+2} = t
c_{i-1} = c
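For illustration, the features of Table 2 for a token at position i could be generated as in the Python sketch below. This is not the actual feature pipeline (feature extraction in our system is handled within MALLET); the function name, padding symbol and feature-string conventions are our own, and the chunk-label feature c_{i-1} is left to the CRF's edge features.

# Illustrative extraction of the Table 2 observation features for
# position i of a sentence; `words` and `pos_tags` are parallel lists.
def chunk_features(words, pos_tags, i):
    def w(k):  # word at offset k, or a boundary marker
        return words[i + k] if 0 <= i + k < len(words) else "<PAD>"
    def t(k):  # PoS tag at offset k, or a boundary marker
        return pos_tags[i + k] if 0 <= i + k < len(pos_tags) else "<PAD>"

    return {
        # word unigrams in a +/-2 window and word bigrams around w_i
        "w-2=" + w(-2): 1, "w-1=" + w(-1): 1, "w0=" + w(0): 1,
        "w+1=" + w(1): 1,  "w+2=" + w(2): 1,
        "w-1|w0=" + w(-1) + "|" + w(0): 1,
        "w0|w+1=" + w(0) + "|" + w(1): 1,
        # PoS unigrams, bigrams and trigrams in a +/-2 window
        "t-2=" + t(-2): 1, "t-1=" + t(-1): 1, "t0=" + t(0): 1,
        "t+1=" + t(1): 1,  "t+2=" + t(2): 1,
        "t-1|t0=" + t(-1) + "|" + t(0): 1,
        "t-2|t-1=" + t(-2) + "|" + t(-1): 1,
        "t0|t+1=" + t(0) + "|" + t(1): 1,
        "t+1|t+2=" + t(1) + "|" + t(2): 1,
        "t-2|t-1|t0=" + t(-2) + "|" + t(-1) + "|" + t(0): 1,
        "t-1|t0|t+1=" + t(-1) + "|" + t(0) + "|" + t(1): 1,
        "t0|t+1|t+2=" + t(0) + "|" + t(1) + "|" + t(2): 1,
    }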
4 Experiments and Results

We use the limited category dataset supplied for the NLPAI shared task[2] for training our PoS tagger and chunker. This is referred to as the "training data" unless explicitly stated otherwise. We report the precision, recall and F1 scores for each of these tasks on the testing set. The reported results were derived from a modified CoNLL evaluation script[3] for the same task.

[2] http://ltrc.iiit.net/nlpai_contest06/
[3] http://www.cnts.ua.ac.be/conll2000/chunking/conlleval.txt

4.1 Part of Speech Tagging

We tried the Part of Speech tagging task using Conditional Random Fields (CRFs) (Lafferty et al., 2001), using w_{i-1}, w_{i-1}w_i, w_{i+1}, and w_i w_{i+1} as features for the current word w_i. In addition to CRFs, we also tried Brants' TnT tagger, which uses Hidden Markov Models (HMMs). From Table 3 it is clear that TnT outperforms the CRF on the current PoS tagging task. This is probably due to the large number of output labels for the task (26 PoS tags) and the relatively small amount of training data.

Table 3: Part of Speech tagging results

Model     Precision   Recall   Fβ=1
CRF       69.40       69.40    69.40
TnT       78.94       78.94    78.94
TnT+TBL   80.74       80.74    80.74

Part of Speech tagging for the final system was performed as follows. First we split the training data randomly into two halves. The first half is used to train the TnT tagger and the second half is used for testing. Any error in this process results in the learning of appropriate transformation rules, as explained in Section 2. These transformation rules are then used to correct the results produced by the TnT tagger on the test set. We report the performance measures averaged over five random 50:50 splits of the training data. Table 5 and Table 4 show the results before and after the application of transformation rules, respectively.

Table 4: Part of Speech tagging with error correction

Tag       Precision   Recall     Fβ=1
CC        95.54%      88.13%     91.69
INTF      41.67%      100.00%    58.82
JJ        46.63%      56.85%     51.23
JVB       54.72%      38.67%     45.31
NEG       97.83%      80.36%     88.24
NLOC      74.47%      71.43%     72.92
NN        70.79%      82.85%     76.35
NNC       18.89%      19.10%     18.99
NNP       62.99%      30.53%     41.13
NNPC      46.34%      19.39%     27.34
NVB       40.20%      45.05%     42.49
PREP      95.91%      95.43%     95.67
PRP       91.09%      96.35%     93.65
QF        82.14%      70.77%     76.03
QFN       87.96%      93.14%     90.48
QW        89.47%      100.00%    94.44
RB        63.38%      72.58%     67.67
RBVB      0.00%       0.00%      0.00
RP        86.45%      93.71%     89.93
SYM       99.07%      98.53%     98.80
UH        100.00%     100.00%    100.00
VAUX      90.62%      87.09%     88.82
VFM       82.21%      83.14%     82.67
VJJ       25.00%      6.67%      10.53
VNN       75.61%      88.57%     81.58
VRB       79.41%      87.10%     83.08
VV        0.00%       0.00%      0.00
Overall   80.74%      80.74%     80.74

Table 5: Part of Speech tagging without error correction

Tag       Precision   Recall     Fβ=1
CC        95.54%      88.13%     91.69
INTF      41.67%      100.00%    58.82
JJ        42.71%      56.16%     48.52
JVB       50.88%      38.67%     43.94
NEG       97.83%      80.36%     88.24
NLOC      71.43%      71.43%     71.43
NN        69.46%      75.67%     72.43
NNC       43.03%      37.97%     40.34
NNP       58.96%      30.15%     39.90
NVB       32.79%      43.96%     37.56
PREP      95.78%      95.17%     95.47
PRP       91.09%      96.35%     93.65
QF        82.14%      70.77%     76.03
QFN       87.74%      91.18%     89.42
QW        89.47%      100.00%    94.44
RB        63.38%      72.58%     67.67
RBVB      0.00%       0.00%      0.00
RP        85.90%      93.71%     89.63
SYM       99.07%      98.34%     98.71
UH        100.00%     100.00%    100.00
VAUX      89.47%      86.79%     88.11
VFM       81.24%      80.68%     80.96
VJJ       20.00%      13.33%     16.00
VNN       75.61%      88.57%     81.58
VRB       79.41%      87.10%     83.08
VV        0.00%       0.00%      0.00
Overall   79.66%      79.66%     79.66
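A minimal sketch of this evaluation protocol follows, assuming sentences are given as lists of (word, tag) pairs and build_tagger is a caller-supplied (hypothetical) function that trains TnT plus the transformation rules on one half and returns a function mapping a word sequence to a tag sequence; it is not the scripts actually used.

# Sketch of averaging token-level per-tag scores over five random
# 50:50 splits of the training data.
import random
from collections import defaultdict

def per_tag_prf(gold, pred):
    # Token-level precision, recall and F1 for each tag.
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    scores = {}
    for t in set(tp) | set(fp) | set(fn):
        prec = tp[t] / (tp[t] + fp[t]) if tp[t] + fp[t] else 0.0
        rec = tp[t] / (tp[t] + fn[t]) if tp[t] + fn[t] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[t] = (prec, rec, f1)
    return scores

def evaluate(sentences, build_tagger, n_splits=5):
    # One result dictionary per random 50:50 split.
    results = []
    for seed in range(n_splits):
        shuffled = sentences[:]
        random.Random(seed).shuffle(shuffled)
        half = len(shuffled) // 2
        train, test = shuffled[:half], shuffled[half:]
        tagger = build_tagger(train)          # TnT + transformation rules (placeholder)
        gold = [t for sent in test for _, t in sent]
        pred = [t for sent in test for t in tagger([w for w, _ in sent])]
        results.append(per_tag_prf(gold, pred))
    return results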
4.2 Chunk labeling

We train a linear-chain CRF on the training data with 77620 features. The training process converged after 100 iterations of L-BFGS; for details, refer to (Lafferty et al., 2001). The induced model is then used to label the test set. Table 6 shows the results of chunking using the PoS tags provided in the test set, while Table 7 shows the results of chunking using the PoS tags generated by our tagger.

Table 6: Chunking with reference POS tags
Chunk label   Precision   Recall     Fβ=1
B-BLK         80.49%      76.39%     78.38
B-JJP         100.00%     72.73%     84.21
B-NP          87.25%      89.98%     88.60
B-RBP         89.83%      96.36%     92.98
B-VG          90.80%      89.43%     90.11
I-BLK         30.00%      13.64%     18.75
I-JJP         100.00%     38.46%     55.56
I-NP          90.20%      91.23%     90.71
I-RBP         72.73%      80.00%     76.19
I-VG          94.26%      92.06%     93.15
Overall       89.69%      89.69%     89.69
Table 7: Chunking with generated POS tags
Chunk label   Precision   Recall     Fβ=1
B-BLK         80.33%      68.06%     73.68
B-JJP         15.00%      13.64%     14.29
B-NP          76.66%      83.10%     79.75
B-RBP         60.94%      70.91%     65.55
B-VG          64.75%      63.77%     64.26
I-BLK         22.22%      9.09%      12.90
I-JJP         14.29%      7.69%      10.00
I-NP          84.34%      84.75%     84.54
I-RBP         59.09%      65.00%     61.90
I-VG          87.09%      81.06%     83.97
Overall       79.58%      79.58%     79.58
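The chunking scores were produced with a modified CoNLL evaluation script. For readers unfamiliar with CoNLL-style chunk evaluation, the sketch below (our own simplification, not the script itself) shows how chunks are read off BIO label sequences and scored at the chunk level; stray I- labels without a preceding B- are simply ignored here.

def extract_chunks(labels):
    # Return the set of (chunk_type, start, end) spans in a BIO sequence.
    chunks, start, ctype = set(), None, None
    for i, lab in enumerate(labels + ["O"]):    # sentinel flushes the last chunk
        starts_new = lab.startswith("B-")
        breaks_current = lab == "O" or starts_new or (
            lab.startswith("I-") and lab[2:] != ctype)
        if breaks_current and ctype is not None:
            chunks.add((ctype, start, i))
            ctype = None
        if starts_new:
            start, ctype = i, lab[2:]
    return chunks

def chunk_prf(gold_labels, pred_labels):
    # Chunk-level precision, recall and F1 over exact span matches.
    gold, pred = extract_chunks(gold_labels), extract_chunks(pred_labels)
    correct = len(gold & pred)
    prec = correct / len(pred) if pred else 0.0
    rec = correct / len(gold) if gold else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# Example: chunk_prf(["B-NP", "I-NP", "B-VG"], ["B-NP", "I-NP", "O"])
# gives precision 1.0, recall 0.5 and F1 about 0.67 (only the NP chunk matches).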
5 Discussion of the Results

We obtain an overall F-measure of 79.66 using the TnT tagger. This low score could be attributed to the sparsity of the training data used. The use of transformations in post-processing improves the overall F-measure to 80.74. It should be noted that this result is an average over five random 50:50 splits of the training data, as described in Section 4.1. To understand more clearly the improvements obtained by transformation-based learning, Table 8 shows the change in F-measure for the tags where the difference is nonzero.

Table 8: Changes in POS tagging F-score before and after the application of transformation rules

POS tag   Before   After    Difference
JJ        48.52    51.23     2.71
JVB       43.94    45.31     1.37
NLOC      71.43    72.92     1.49
NN        72.43    76.35     3.92
NNC       40.34    18.99    -21.35
NNP       39.90    41.13     1.23
NVB       37.56    42.49     4.93
PREP      95.47    95.67     0.20
QFN       89.42    90.48     1.06
RP        89.63    89.93     0.30
SYM       98.71    98.80     0.09
VAUX      88.11    88.82     0.71
VFM       80.96    82.67     1.71
VJJ       16.00    10.53    -5.47

It is interesting to note that the transformation rules improve the F-measures of all tags except NNC and VJJ. This reduction in F-measure could have been avoided by selecting a richer set of transformation templates than those listed in Table 1, as the transformation-based learning process is highly sensitive to the templates used. In general, the chunking accuracy for the combined task is lower than that of the chunking task alone with reference tags. This is caused by the propagation of errors introduced during the tagging stage to the chunking stage.
6 Conclusion

We have demonstrated the use of an off-the-shelf statistical tagger combined with an error-driven learning procedure for Part of Speech tagging of Hindi. The chunking task with Conditional Random Fields was also explored. We have reported our results for each of these tasks separately, as well as the results for the joint task of POS tagging and chunking.
References

Thorsten Brants. 2000. TnT – A statistical part-of-speech tagger. In Proceedings of the Sixth Applied Natural Language Processing Conference (ANLP), pages 224–231.

Eric Brill. 1995. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543–565.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, pages 282–289. Morgan Kaufmann, San Francisco, CA.

L. R. Rabiner and B. H. Juang. 1986. An introduction to hidden Markov models. IEEE ASSP Magazine, pages 4–16, January.

Fei Sha and Fernando C. N. Pereira. 2003. Shallow parsing with conditional random fields. In Proceedings of HLT-NAACL.