Arbitrary Phrase Identification using Linear Kernel with Mask Method

Chia-Hui Chang and Yu-Chieh Wu

Department of Computer Science, National Central University, No. 300, Jhongda Rd., Jhongli City, Taoyuan County 32001, Taiwan (R.O.C.)
{bcbb, chia}@db.csie.ncu.edu.tw

Abstract. In this paper, we propose an efficient and accurate text chunking system using a linear SVM kernel and a new technique called the mask method. Previous research indicated that system combination or external parsers can boost chunking performance. However, the cost of constructing multiple classifiers is much higher than that of developing a single one, and the use of external resources complicates the original tagging process. To remedy these problems, we employ richer features and propose a mask-based method that addresses the unknown word problem to enhance system performance. In this way, no external resources or complex heuristics are necessary for the chunking system. The experiments show that when training with the CoNLL-2000 chunking data set, our system achieves an F(β) rate of 94.12, and 94.21 with an SVM POS tagger. Furthermore, our chunker is quite efficient since it adopts a linear kernel SVM: the turn-around tagging time on the CoNLL-2000 test data is less than 52 seconds.

1 Introduction

The goal of shallow parsing is the same as that of automatic text chunking: to recognize non-overlapping phrases in a given text. These phrases are non-overlapping, i.e., they cannot be included in other chunks [1]. A chunk is usually meaningful and contains more than a single word. For example, the phrase "Natural Language Processing" is much less ambiguous than its individual words "Natural", "Language", or "Processing". Noun Phrase chunk (NP-chunk) recognition is similar to the text chunking problem, except that it focuses on extracting noun phrases rather than all chunk types.

Chunk information is usually used to extract shallow syntactic features from text. In many Natural Language Processing (NLP) areas, the chunker provides down-stream syntactic features for further analysis, such as chunking-based full parsing [1][17][19], clause identification [3][19], semantic role labeling (SRL) [4], kernel methods [16], and text categorization [13]. The text chunking problem can be viewed as a sequence of word-tagging decisions [8][14], assigning the optimal tag sequence to an input sentence. Traditionally, a set of hand-crafted rules was used to extract the phrases in a text. Such rule-based systems can perform well on a specific domain, but they are difficult to port to another. Therefore, Machine Learning (ML) approaches have been proposed to learn such rules and predict new data without human intervention.

Recently, many machine learning-based chunking systems have been presented, e.g., transformation-based error-driven learning [2][14], Hidden Markov Models (HMM) [12], Winnow [20], memory-based learning [7][8][19], voted-perceptrons [3], and voted-SVMs [10]. In terms of performance, the SVM, Winnow, and voted-perceptron systems achieved better results than the others (> 93.53 in F rate), while the HMM and memory-based chunkers are more efficient. In most cases, a high-performance chunker is required; for example, the error propagation problem of chunking-based parsers [17][19] stems from the initial chunking errors. Similarly, for chunking-based SRL systems, the chunk information is important for identifying the boundaries of an argument. Thus, the higher the performance a chunker achieves, the more accurate the information it provides to other applications.

Zhang et al. [20] proposed a text chunking system based on the Winnow algorithm that employed an external parser called the English Slot Grammar (ESG). The learning speed of Winnow is quite fast (about 12 minutes), and the system achieves the best performance on the Computational Natural Language Learning (CoNLL) chunking dataset (94.17 in F(β) rate). However, using a parser runs counter to the purpose of the chunking task: the chunker is supposed to supply phrase-level granules to the parser, which then constructs the dependency and tree structures of the sentence. Without ESG, their generalized-Winnow chunker obtained a 93.57 F(β) rate.

Kudoh and Matsumoto [10] developed an SVM chunker with a polynomial kernel. Their system reached the highest performance in the CoNLL-2000 shared task of that year (93.48 F(β) rate). However, training and tagging are not efficient because of the polynomial kernel: training on the CoNLL-2000 chunking data takes 24 hours, and tagging runs at about 20 tokens per second. They further combined eight different single chunkers to improve system performance (from an F(β) rate of 93.48 to 93.91) [10]. Besides, both Tjong Kim Sang [19] (majority voting over five memory-based learners) and Halteren [7] (Weighted Probability Distribution Voting models) presented frameworks that combine several memory-based classifiers; the resulting F(β) rates are 92.50 and 93.32, respectively. Similar to Kudoh's later results, the combination of several classifiers performs better than any individual system.

Although combining multiple classifiers and integrating external resources can enhance chunking performance, the use of multiple classifiers largely increases tagging time as well as complicating the system design. In addition, external resources are not always available in many domains and languages. To remedy this, we present an efficient SVM-based chunker that requires neither external resources nor multiple processing procedures. Both training and testing time are largely reduced compared with previous kernel-method approaches, since we employ a simple linear SVM kernel instead of more complex kernels. Theoretically, the linear kernel is the simplest and most efficient kernel for similarity calculation; other kernel types, such as the polynomial kernel, require more computation time for training and testing. Thus, the linear kernel is a better choice for developing efficient learners.
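To make the efficiency argument concrete, the following is the textbook form of the SVM decision functions (standard SVM theory, not notation from this paper). With a linear kernel the learned model collapses into a single weight vector, so tagging one token costs a single sparse dot product over its active features; with a polynomial kernel the classifier must be evaluated against every support vector:

\[
f_{\mathrm{linear}}(x) = w \cdot x + b, \quad w = \sum_i \alpha_i y_i x_i \qquad \text{(precomputed; cost } O(\text{active features}) \text{ per token)}
\]
\[
f_{\mathrm{poly}}(x) = \sum_{i \in \mathrm{SV}} \alpha_i y_i (x_i \cdot x + 1)^d + b \qquad \text{(cost } O(|\mathrm{SV}| \times \text{active features}) \text{ per token)}
\]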
Another observation is that many tagging errors occur on words that are unknown to the training set. When an unknown word occurs during testing, the lexical-related features, such as unigrams, are missing, and the chunk class of the unknown word is determined by the remaining non-lexical features, such as part-of-speech (POS) tags. In general, lexical features are more discriminative than non-lexical features, and even though unknown words can be identified and labeled with POS tags by a POS tagger, in the chunking task such features are too general for guessing the chunk of an unknown word. To solve this, we propose a simple mask method to collect unknown-word examples. These examples are derived from existing training instances by masking some known words. In this way, we provide new training examples that do not contain complete lexical information, and when the learning algorithm is trained with these additional examples, system performance improves. In our experiments, the chunker trained with only known examples reaches an F(β) rate of 93.53, while adding the unknown-word instances raises it to 94.12. With the help of the unknown-word examples and a simple linear kernel for the SVM, our chunker achieves state-of-the-art performance at a satisfactory time cost.

The remainder of this paper is organized as follows. Section 2 discusses the text chunking task. Section 3 explains our SVM chunking model and the feature selection techniques. The proposed mask method is discussed in Section 4. Experimental results on a benchmark corpus (the CoNLL-2000 shared task) and another large corpus are shown in Section 5. Concluding remarks and future work are given in Section 6.

2 Chunking tasks

Chunking is generally considered a form of shallow parsing [1]. In 1995, Ramshaw and Marcus used transformation-based error-driven learning to derive transformation rules for tagging Noun Phrase (NP) chunks, regarding the task of finding chunks in a text as a sequence of word-tag classifications. Ramshaw and Marcus defined the B-I-O tag representation for the chunking task: the I tag denotes a word inside a chunk, the O tag denotes a word outside any chunk, and the B tag denotes the first word of a noun phrase that immediately follows another noun phrase, marking the boundary between two adjacent chunks. For example, the phrase structure of the sentence "The luxury auto maker last year sold 1,214 cars in the U.S." is:

(NP The luxury auto maker) (NP last year) (VP sold) (NP 1,214 cars) (PP in) (NP the U.S.)

All of the words in the same bracket belong to the same chunk; for example, the words "the U.S." form a noun phrase.
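As a minimal illustration of the B-I-O convention just described (for the NP-chunk case only; the word/tag pairs below are worked out by hand for this example sentence and are not taken from the corpus), the example sentence would be encoded as:

```python
# B-I-O encoding of the example sentence for the NP-chunk task:
# I = inside a noun phrase, O = outside any noun phrase,
# B = first word of an NP that immediately follows another NP
# ("last year" directly follows "The luxury auto maker").
example = [
    ("The", "I"), ("luxury", "I"), ("auto", "I"), ("maker", "I"),
    ("last", "B"), ("year", "I"),
    ("sold", "O"),
    ("1,214", "I"), ("cars", "I"),
    ("in", "O"),
    ("the", "I"), ("U.S.", "I"),
]
```

For the full CoNLL-2000 task, the tags additionally carry the chunk type (e.g., B-NP, I-VP), so that all chunk types can be recovered from the tag sequence.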

3 SVM-based chunking model

This section describes the proposed chunking system, which is based on an SVM with a linear kernel. The SVM learns to predict new items by training on vector representations of the examples. The system handles two kinds of instances, known-word and unknown-word instances; we describe the known part in this section and leave the unknown part to Section 4.

3.1 Example representation

One of the advantages of the linear SVM kernel is that it can handle high-dimensional data effectively [9], since it operates only on the "active" features rather than the complete set of dimensions. We can therefore impose richer feature types upon each training example to enhance system performance. As reported in [3][6], richer feature sets have been shown to be more effective than simple ones. In general, contextual information is used as the seed feature type [2][10]; the other features can then be derived from the surrounding words. For example, let Wi be the current word, Wi-1 the word before Wi, and Wi+1 the word after Wi. To classify Wi, we use the words and features surrounding Wi. For example, when we tag the word "police", the context words are:

the (Wi-2) price (Wi-1) police (Wi) could (Wi+1) help (Wi+2)

If we include two words on each side of the current word, both the preceding two words and the following two words are considered, so a vector consists of five sections, one for each position of the contextual window. In this paper, we set the context window size to 2, which coincides with previous research [10][20].

A good feature set not only eases the search for important or discriminative examples, but also increases classification accuracy. The SVM chunker proposed in [10] encoded only unigram, POS tag, and chunk information, but required a polynomial kernel to perform well. According to [3], high-quality POS tagging performance can be achieved by employing a richer feature set, such as orthographic flags, affixes, and N-grams, to represent the meaning and relations of words. Therefore, we adopt more feature types while using a linear kernel to reduce training time:

- Lexical information (unigram and bigram). The lexical feature is the most commonly used feature in NLP, since the word itself carries certain meanings. For example, the words "Mr." or "Mrs." are good lexical cues indicating the start of a noun phrase, while "was" or "were" usually precede a verb phrase. Bigram lexical information is generally considered more informative than unigram lexicons. In the vector representation, each unigram/bigram dimension indicates the presence or absence of that unigram/bigram. The unigram/bigram lexical dictionary is the set of frequent unigrams/bigrams in the training data.

- POS tag information (Uni-POS, Bi-POS, and Tri-POS). POS tags are another informative feature in NLP, since they capture more abstract, syntactic relations. The tag set contains 48 different POS tags. We adopt Uni-POS, Bi-POS, and Tri-POS tags. For the above example, the Bi-POS features of the word "police" include "NN(price) NN(police)" and "NN(police) MD(could)". Similarly, the Tri-POS features include "DT(the) NN(price) NN(police)", "NN(price) NN(police) MD(could)", and "NN(police) MD(could) VB(help)".

- Affix (2- to 4-letter prefixes and suffixes). The affix feature is widely used for guessing an unseen word from the training examples [6]. For this feature type, we extract prefixes and suffixes of different lengths from the head and tail of a word, selecting 2-, 3-, and 4-letter strings from the beginning and the end of a given word. For example, the affix features of "police" are the prefixes "po", "pol", and "poli" and the suffixes "ce", "ice", and "lice".

- Token feature. The token feature is a language-dependent feature type commonly used in named entity recognition. It recognizes specific symbols and punctuation, such as "$" or "%". For example, the word "1990s" matches the token category "Year decade" (see Table 1). Each term is assigned to a token class via the pre-defined word category mapping.

Table 1. Token feature category list

Feature description               Example text
1-digit number                    3
2-digit number                    30
4-digit number                    2004
Year decade                       2000s
Only digits                       1234
Number contains one slash         3/4
Number contains two slashes       2004/8/10
Number contains money             $199
Number contains percent           100%
Number contains hyphen            1-2
Number contains comma             19,999
Number contains period            3.141
Number contains colon             08:00
Number contains alpha and slash   1/10th
All capital word                  NCU
Capital period (only one)         M.
Capital periods (more than one)   I.B.M.
Alpha contains money              US$
Alpha and periods                 Mr.
Capital word                      Taiwan
Number and alpha                  F-16
Initial capitalization            Mr., Jason, Jones
Inner capitalization              WordNet
All lower case                    am, is, are
Others                            3\/4

- Previous chunk information (uni-chunk and bi-chunk). This feature type is generated from the chunk tags already assigned to the previous words. Most chunking systems use it to capture the relation between the previous chunks and the current word. In this paper, we include both uni-chunk and bi-chunk features in our feature set.

- Possible chunk classes. For the current word to be tagged, we look up the chunk tags it received in the training data and use its set of possible chunk classes as a feature.

A sketch of this feature extraction is given below.
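This is a minimal sketch only, not the authors' implementation: frequency thresholds and the token-category lookup are omitted, the exact n-gram positions are simplified, and all function and feature names are illustrative.

```python
def affixes(word):
    """2-, 3-, and 4-letter prefixes and suffixes of the current word."""
    feats = []
    for n in (2, 3, 4):
        if len(word) > n:
            feats += ["pre=" + word[:n], "suf=" + word[-n:]]
    return feats

def extract_features(words, pos_tags, chunk_history, i, possible_chunks):
    """Binary (presence/absence) string features for the word at position i.

    words, pos_tags : tokens and POS tags of the sentence
    chunk_history   : chunk tags already assigned to positions 0..i-1
    possible_chunks : dict mapping a word to the chunk tags it received
                      in the training data (the possible chunk class dictionary)
    """
    get = lambda seq, j: seq[j] if 0 <= j < len(seq) else "<pad>"
    feats = []
    # Unigrams and bigrams in a context window of size 2.
    for off in range(-2, 3):
        feats.append("w%+d=%s" % (off, get(words, i + off)))
    for off in range(-2, 2):
        feats.append("bi%+d=%s_%s" % (off, get(words, i + off), get(words, i + off + 1)))
    # Uni-, bi-, and tri-POS features around position i.
    for off in range(-2, 3):
        feats.append("p%+d=%s" % (off, get(pos_tags, i + off)))
    for off in range(-1, 1):
        feats.append("bp%+d=%s_%s" % (off, get(pos_tags, i + off), get(pos_tags, i + off + 1)))
    for off in range(-2, 1):
        feats.append("tp%+d=%s_%s_%s" % (off, get(pos_tags, i + off),
                                         get(pos_tags, i + off + 1), get(pos_tags, i + off + 2)))
    # Affixes of the current word.
    feats.extend(affixes(words[i]))
    # Previous chunk tags (uni-chunk and bi-chunk).
    feats.append("c-1=" + get(chunk_history, i - 1))
    feats.append("c-2-1=%s_%s" % (get(chunk_history, i - 2), get(chunk_history, i - 1)))
    # Possible chunk classes of the current word, collected from the training data.
    for tag in possible_chunks.get(words[i], ()):
        feats.append("possible=" + tag)
    return feats
```

Each string feature is then mapped to one dimension of a sparse binary vector via the feature dictionary F; only the dimensions whose features appear in the dictionary are set.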

3.2 SVM Learning Algorithm

We use the SVM as the classification algorithm, since it has been shown to perform well on text classification problems [9]. All of the training and testing data are converted into vectors as described above. In this paper, we use SVMlight [9] as the learning algorithm and keep its default parameter settings. Since the SVM is a binary classifier, we have to decompose the multi-class chunk tagging problem into several binary problems; here we adopt the one-against-all scheme.
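The paper trains SVMlight with its default parameters; the sketch below only illustrates the one-against-all decomposition, using scikit-learn's LinearSVC as a stand-in linear-kernel SVM (an assumption for illustration, not the authors' setup).

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_all(X, y):
    """Train one binary linear SVM per chunk tag (one-against-all)."""
    models = {}
    for tag in sorted(set(y)):
        binary = np.where(np.asarray(y) == tag, 1, -1)  # this tag vs. the rest
        models[tag] = LinearSVC().fit(X, binary)        # linear kernel, default parameters
    return models

def predict_tags(models, X):
    """For each example, pick the tag whose classifier gives the largest margin."""
    tags = list(models)
    scores = np.column_stack([models[t].decision_function(X) for t in tags])
    return [tags[j] for j in scores.argmax(axis=1)]
```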

4 The mask method

In the real world, training data is insufficient: only a subset of the vocabulary that appears in the testing data is seen in training. During testing, if a term is an unknown word (or one of its context words is unknown), then its unigram, bigram, and possible chunk class features are set to zero, because the training data provides no information about the term. In this case, the chunk class of the word is mainly determined by non-lexical features; there is no lexical information when the term is unknown.

The most common way to solve the unknown word problem is to use different feature sets for unknown words and to divide the training data into several parts to collect unknown-word examples [2][6]. However, the feature sets for known and unknown words are often arranged heuristically, which becomes difficult when many feature types are involved. Besides, such approaches extract only the unknown-word examples and miss the instances whose contextual words are unknown. To avoid these problems and let the chunker handle unknown words and unknown contexts, we change the view of the unknown word tagging problem: our solution is to derive additional examples from the training set using a "mask" technique. Fig. 1 lists the proposed mask algorithm.

Suppose we have derived the lexicon-related feature dictionaries from the training data, including the unigram dictionary (UD), the bigram dictionary (BD), and the possible chunk class dictionary (ID). For each training part i, we rebuild these dictionaries from all training parts except part i, obtaining UDi, BDi, and IDi, and generate a new dictionary set Fi by replacing UD, BD, and ID in F with UDi, BDi, and IDi, respectively. Technically, we create a mask mi of length |F| in which a bit is set if the corresponding lexicon is in Fi and cleared otherwise. We then generate new vectors for all examples by taking the logical AND with the mask mi; thus, items that appear only in part i are treated as unknown words. The process is repeated k times, and a total of (k+1)*n example vectors is generated (n is the original number of training words).

For example, suppose the unigram dictionary of the training data contains six lexicons, A to F. If the newly created dictionary Fi contains the lexicons B, C, D, and E, then the mask is (0, 1, 1, 1, 1, 0). Given an example with vector (1, 0, 0, 0, 0, 0), taking the logical AND of the example vector with the mask yields the new vector (0, 0, 0, 0, 0, 0); that is, the lexicon A in this example is regarded as an unknown word.

The mask technique transforms known lexical features into unknown lexical features and adds k times as much training material derived from the original training set. Thus, it is not necessary to prepare additional training material to obtain unknown-word examples; the original data is effectively reused. As outlined in Fig. 1, new examples are generated by mapping into the masked feature dictionary set Fi. This is quite different from previous approaches, which employed separate feature sets for unknown words. The main purpose of the proposed method is to emulate examples that do not contain lexical information, since chunking errors often occur due to the lack of lexical features. Traditionally, the learners are given sufficient and complete lexical information, so the trained models cannot generalize to examples with incomplete lexical information. By including incomplete examples, the learner can take the effect of unknown words into account and adjust the weights on the lexical-related features.

Step 0. Let F be the feature dictionary constructed in Section 3, with the lexical-related
        dictionaries UD (the unigram dictionary), BD (the bigram dictionary), and
        ID (the known label dictionary), where UD ⊆ F, BD ⊆ F, and ID ⊆ F.
Step 1. Divide the training data into k (k=2) parts: T = T_1 ∪ T_2 ∪ ... ∪ T_k.
Step 2. For each part i, mask T_i by computing T' = T − T_i:
        2.1 Generate the lexical-related dictionaries of T': UD_i, BD_i, and ID_i
            (the unigram, bigram, and known label dictionaries of T');
            F_i is created from F by replacing UD/BD/ID with UD_i/BD_i/ID_i.
        2.2 Create a mask m_i of length |F| where a bit is set for a lexicon in F_i
            and cleared if the lexicon is not in F_i.
        2.3 For each training example v_j represented by the feature dictionary F,
            compute v_j' = v_j AND m_i.
        2.4 Output S_i = v_1' ∪ v_2' ∪ ... ∪ v_N'.
Step 3. Repeat Step 2 until all parts have been processed.

Fig. 1. The mask method for generating incomplete-information examples
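A compact sketch of the masking procedure in Fig. 1 (a simplification that assumes each example is a set of string features as in the sketch of Section 3.1, with masking restricted to the lexicon-related features; names are illustrative, not from the paper):

```python
def is_lexical(feature):
    """Lexicon-related features: unigrams, bigrams, and possible chunk classes."""
    prefix = feature.split("=", 1)[0]
    return prefix.startswith(("w", "bi", "possible"))

def lexicon_without_part(parts, skip):
    """Collect the lexical entries from every training part except part `skip`
    (this plays the role of the reduced dictionaries UD_i, BD_i, ID_i)."""
    lexicon = set()
    for i, part in enumerate(parts):
        if i != skip:
            for features, _tag in part:
                lexicon.update(f for f in features if is_lexical(f))
    return lexicon

def mask_method(parts):
    """Yield the original examples plus one masked copy of every example per part,
    i.e. (k+1)*n examples in total. Masking drops each lexical feature that is
    unknown outside part i -- the logical AND with the mask m_i."""
    training = [example for part in parts for example in part]
    yield from training
    for i in range(len(parts)):
        known = lexicon_without_part(parts, skip=i)
        for features, tag in training:
            masked = {f for f in features if not is_lexical(f) or f in known}
            yield masked, tag
```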

5 Experimental Results

The performance of a chunking system is usually measured by three rates: recall, precision, and F(β). CoNLL released a perl-script evaluator (http://lcg-www.uia.ac.be/conll2000/chunking/conlleval.txt) that computes these three rates for a chunking result, and we use it to evaluate chunking performance in the following experiments. To test our chunking system, we used two data sets: one is the CoNLL-2000 chunking task [18]; the other uses Wall Street Journal (WSJ) sections 00-19 as the training parts and sections 20-24 as the testing parts for the large-scale experiments. In this way, we can fairly compare our chunker with other chunking systems.

There are three parameters in our chunking system: the context window size, the frequency threshold for each feature dictionary, and the number of division parts for the unknown-word examples (k). We set the first two parameters to 2, as in previous chunking systems [10]. Since the training time of SVMs grows substantially with the number of input examples [9], we set k to two for all of the following experiments.
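For reference, the F rate reported throughout is the standard F-measure computed by the CoNLL evaluation script, with β = 1 so that precision and recall are weighted equally:

\[
F_{\beta} = \frac{(\beta^{2}+1)\cdot \mathrm{Precision}\cdot \mathrm{Recall}}{\beta^{2}\cdot \mathrm{Precision}+\mathrm{Recall}},
\qquad
F_{\beta=1} = \frac{2\cdot \mathrm{Precision}\cdot \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}.
\]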

5.1 Experimental results

The first test (standard) is performed under a strict constraint, i.e., all of the settings coincide with the CoNLL-2000 shared task, so the use of external resources or other components is not allowed. We report two results: one model is trained with complete lexical information only, and the other is trained with both complete and incomplete examples (i.e., using the mask method). The performance comparison is shown in Table 3, where the latter model achieves the best system performance. The training times are 28 minutes and 2.7 hours, respectively. The total tagging time for the testing data is 52 seconds; in other words, the tagging speed is about 960-1000 tokens per second. Compared with the voted-SVMs and voted-perceptrons (about 24 hours), our system is better in both accuracy and efficiency. Although the Winnow algorithm is more efficient in training (12 minutes) than ours, our chunker reaches better performance (94.12 vs. 93.57 in F(β) rate). The overall chunking results with incomplete examples included are listed in Table 2.

Table 2. Chunking results for each chunk type with standard and non-standard tests

Standard Test        Recall    Precision   F(β=1)
ADJP                 71.92     81.61       76.46
ADVP                 81.64     83.27       82.45
CONJP                55.56     45.45       50.00
INTJP                50.00     100.00      66.67
LST                   0.00       0.00       0.00
NP                   94.51     94.67       94.59
PP                   98.27     96.87       97.57
PRT                  79.25     76.36       77.78
SBAR                 86.54     87.86       87.19
VP                   94.57     94.10       94.34
All                  94.12     94.13       94.12

Non-Standard Test    Recall    Precision   F(β=1)
ADJP                 73.97     83.72       78.55
ADVP                 81.87     82.83       82.35
CONJP                55.56     45.45       50.00
INTJP                50.00     100.00      66.67
LST                   0.00       0.00       0.00
NP                   94.49     94.65       94.57
PP                   98.46     96.87       97.66
PRT                  85.85     71.65       78.11
SBAR                 88.79     88.62       88.70
VP                   94.72     94.19       94.46
All                  94.30     94.13       94.21

In the second test (non-standard), chunking systems may make use of external resources to enhance performance. We therefore employ the advanced POS tagger proposed by Giménez & Márquez [6] instead of the original Brill tagger. Our chunking system uses rich POS N-gram features to extract syntactic information, so the more accurate the POS tagger, the higher the performance the chunker can reach. In our experiments, the SVM-tagger achieves 97.20 token accuracy on the CoNLL-2000 test data, compared with 96.86 for the original Brill tagger. It is worth noting that the SVM-tagger was trained on WSJ sections 00-18 and validated on sections 19-21; since section 20 was used only for validation, it was not part of the tagger's training data.

Table 3. Chunking performance of some related systems

Chunking systems                                  Recall   Precision   F(β)
SVM + incomplete examples (this paper)            94.12    94.13       94.12
Voted-SVMs (Kudo & Matsumoto, 2001)               93.89    93.92       93.91
Voted-perceptrons (Carreras & Marquez, 2003)      94.19    93.29       93.74
Generalized Winnow (Zhang et al., 2002)           93.60    93.54       93.57
SVM (this paper)                                  93.56    93.50       93.53
Regularized Winnow (Zhang et al., 2001)           93.49    93.53       93.51

The improvement of the mask method

The improvement obtained with the mask method can be seen in Table 4. The inclusion of incomplete examples as training instances improves the chunking performance from 93.53 to 94.12 for the standard test and from 93.73 to 94.21 for the non-standard test. As listed in Table 4, the percentage of unknown words in the testing set is about 7.10%, and the improvement on unknown words is from 92.78 to 93.73 (92.80 to 93.22 for the non-standard test). The mask method thus improves not only unknown-word tagging but also known-word tagging, in both the standard and non-standard tests.

Table 4. The improvement obtained by the mask method

CoNLL-2000   Percentage   Standard         Non-standard
Unknown      7.10%        92.78 → 93.73    92.80 → 93.22
Known        92.89%       94.62 → 95.08    94.79 → 95.22
Total        100.00%      93.53 → 94.12    93.73 → 94.21

Impact on POS taggers

As discussed above, chunking performance can be improved by employing a better POS tagger. We extend our work to compare the influence of various POS taggers: the Brill [2], SVM [6], and Tree [15] taggers. The performance of each POS tagger on the CoNLL-2000 chunking and large-scale data sets is listed in Table 5. Note that these taggers are trained from various sources. The SVM-tagger achieves the best result on the CoNLL-2000 data, while the Tree-tagger performs best on the large-scale data set. Table 6 shows the chunking performance with the various POS taggers. In this experiment, four levels of POS tagging are employed: chunking with the Brill, Tree, and SVM taggers, and with the gold (hand-annotated) POS tags. The first and last results give the lower and upper bounds of the chunking performance. Interestingly, when our chunker is combined with the Tree-tagger, it achieves better results than with the SVM-tagger, although the Tree-tagger does not outperform the SVM-tagger in POS tagging. However, since the training data of the Tree-tagger covers various parts of the WSJ, and some of the CoNLL-2000 testing data may be included in it, we use the SVM-tagger for the large-scale experiment.

Table 5. Results of the POS taggers on the CoNLL-2000 and large-scale test sets

               CoNLL-2000   Large-scale
Brill-Tagger   96.86        N/A
Tree-Tagger    97.16        97.24
SVM-Tagger     97.20        97.09

Table 6. Comparison of the chunking performance using various POS taggers

                          Recall   Precision   F(β)
Chunking + Brill-tagger   94.12    94.13       94.12
Chunking + Tree-tagger    94.39    94.41       94.40
Chunking + SVM-tagger     94.30    94.13       94.21
Chunking + Gold-tagger    94.76    94.60       94.68

Large-scale experiments

For the large-scale experiments, we use the SVM-tagger to generate POS tags. Table 7 gives the chunking results for each phrase type. The total training times of the large-scale experiments are 25 hours with the unknown-word (masked) part and 4 hours without it; both training times increase to more than nine times those of the CoNLL-2000 experiments. However, the total tagging time for the large-scale data is only 4 minutes, i.e., a tagging speed of about 978 tokens per second. Fig. 2 shows the system performance for different training set sizes. The lower bound of our chunker is a 93.02 F(β) rate using 0.07 million training examples, while the upper bound of 94.69 is reached using all (~1 million) training examples. Compared with the lower bound of 90.6 and the upper bound of 93.25 reported for the HMM-based chunker by Molina & Pla [12], our chunking system is a better choice than the HMM-based method.

[Figure: F(β) rate plotted against the size of the training data, from 0.1 to 1 million examples]

Fig. 2. Evolution of the system performance using different training set sizes

Table 7. Chunking results for each chunk type on the large-scale data set

Chunk Type   Recall   Precision   F(β)
ADJP         76.90    84.16       80.37
ADVP         84.60    84.05       84.33
CONJP        66.15    72.88       69.35
INTJP        41.38    66.67       51.06
LST          50.00    90.91       64.52
NP           95.25    95.28       95.27
PP           98.56    96.91       97.73
PRT          79.13    75.82       77.44
SBAR         90.40    90.05       90.23
UCP           0.00     0.00        0.00
VP           94.54    94.83       94.69
All          94.74    94.63       94.69

6 Conclusion

An efficient, high-performance text chunker is a key component for many real-world NLP applications, and unknown words in the testing examples are one of the main sources of chunking errors. This paper proposes a mask-based method to derive incomplete training examples and thereby improve chunking performance. Its advantage is that no additional training data is needed to obtain unknown-word training examples: such examples are derived by masking each part of the training data in turn. The proposed mask method improves performance not only on known words but also on unknown words. In the experiments, our chunking system performs better than the other systems in both the standard and non-standard tests. Moreover, the mask method can be combined with other learners, such as Winnow and voted-perceptrons. In terms of efficiency, by using a linear SVM kernel with a richer feature set and the mask method, the learning time is less than 2.7 hours for the CoNLL-2000 training data and the turn-around time on the testing data is 52 seconds. In the large-scale test, the performance is also better than that of the HMM-based chunking system. An online demonstration of our chunker can be found at http://140.115.155.87/bcbb/chunking.htm.

References

1. Abney, S.: Parsing by chunks. In: Principle-Based Parsing: Computation and Psycholinguistics (1991) 257-278.
2. Brill, E.: Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging. Computational Linguistics (1995) 21(4):543-565.
3. Carreras, X. and Marquez, L.: Phrase recognition by filtering and ranking with perceptrons. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing (2003).
4. Carreras, X. and Marquez, L.: Introduction to the CoNLL-2004 shared task: semantic role labeling. In: Proceedings of the Conference on Natural Language Learning (2004) 89-97.
5. Charniak, E.: A maximum-entropy-inspired parser. In: Proceedings of ANLP-NAACL (2000) 132-139.
6. Giménez, J. and Márquez, L.: Fast and accurate part-of-speech tagging: the SVM approach revisited. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing (2003) 158-165.
7. Halteren, H.: Chunking with WPDV models. In: Proceedings of the Conference on Natural Language Learning (2000) 154-156.
8. Halteren, H., Zavrel, J., and Daelemans, W.: Improving accuracy in word class tagging through the combination of machine learning systems. Computational Linguistics (2001) 27(2):199-229.
9. Joachims, T.: A statistical learning model of text classification with support vector machines. In: Proceedings of the 24th ACM SIGIR Conference on Research and Development in Information Retrieval (2001) 128-136.
10. Kudoh, T. and Matsumoto, Y.: Chunking with support vector machines. In: Proceedings of the 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (2001).
11. Li, X. and Roth, D.: Exploring evidence for shallow parsing. In: Proceedings of the Conference on Natural Language Learning (2001) 127-132.
12. Molina, A. and Pla, F.: Shallow parsing using specialized HMMs. Journal of Machine Learning Research (2002) 2:595-613.
13. Park, S. B. and Zhang, B. T.: Co-trained support vector machines for large scale unstructured document classification using unlabeled data and syntactic information. Information Processing and Management (2004) 40:421-439.
14. Ramshaw, L. A. and Marcus, M. P.: Text chunking using transformation-based learning. In: Proceedings of the 3rd Workshop on Very Large Corpora (1995) 82-94.
15. Schmid, H.: Probabilistic part-of-speech tagging using decision trees. In: Proceedings of the International Conference on New Methods in Language Processing (1994) 44-49.
16. Suzuki, J., Taira, H., Sasaki, Y., and Maeda, E.: Question classification using HDAG kernel. In: Workshop on Multilingual Summarization and Question Answering (2003) 61-68.
17. Tjong Kim Sang, E. F.: Transforming a chunker to a parser. In: Computational Linguistics in the Netherlands (2000) 177-188.
18. Tjong Kim Sang, E. F. and Buchholz, S.: Introduction to the CoNLL-2000 shared task: chunking. In: Proceedings of the Conference on Natural Language Learning (2000) 127-132.
19. Tjong Kim Sang, E. F.: Memory-based shallow parsing. Journal of Machine Learning Research (2002) 2:559-594.
20. Zhang, T., Damerau, F., and Johnson, D.: Text chunking based on a generalized Winnow. Journal of Machine Learning Research (2002) 2:615-637.