Noun Phrase chunking with APL2
Suresh Manandhar* and Enrique Alfonseca*+
* University of York
+ Universidad Autónoma de Madrid
{suresh, enrique}@cs.york.ac.uk
Main topic: Computational Linguistics

The identification of phrases in a sentence can be useful as a pre-processing step before attempting full parsing. There is already much literature about finding simple non-recursive non-overlapping Noun Phrases. We have modified the learning paradigm CLOG [Kazakov and Manandhar, 1998] to produce transformation lists, and we arrived at several interesting conclusions about Noun Phrase chunking. IBM APL2 was used to build a prototype that was later rewritten in C++ for performance reasons.
1. Introduction

Definitions
A parser or syntax analyser is a program that discovers the internal structure of a sentence, and finds the role each constituent has. The output of a parser is a tree structured in the same way as the sentence, like the following:

Sentence
 ├─ NP (Subject): John
 └─ VP
     ├─ V: loves
     └─ NP (Object): Mary

Figure 1. Parse tree example.

Because of the high ambiguity of natural languages, there are usually many grammatical possibilities for the construction of the tree, and finding the correct parse tree can become a bottleneck in natural language processing (NLP) applications. Therefore, many applications use shallow parsers to identify the constituents in a rough way. We can find these partial parsers in Information Extraction, Information Retrieval and Text Summarisation systems, apart from their obvious application as pre-processors before full parsers. Text chunking is defined as the identification of simple non-recursive non-overlapping phrases in texts. We define a Noun Phrase (NP) as a constituent whose head is a noun, or something that takes the role of a noun, and a base Noun Phrase as a non-recursive non-overlapping Noun Phrase. Hence, a base Noun Phrase includes the determiners and adjectives, but it cannot include other NPs acting as modifiers. A particular case of text chunking is NP chunking, the identification and bracketing of simple noun phrases. For this task we shall use a representation borrowed from [Ramshaw and Marcus, 1995]. Instead of bracketing the Noun Phrases in the text, every word in the corpus is labelled with one of the three following labels: I, if the word is inside a base NP; O, if the word is outside; and B, if the word is at the beginning of an NP and the previous word is inside a different noun phrase.
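The IOB scheme can be made concrete with a short function that converts a sequence of IOB labels back into bracketed text. This is an illustrative Python sketch written for this discussion, not part of the paper's APL2 or C++ implementations:

```python
def iob_to_brackets(words, labels):
    """Turn parallel word/IOB-label sequences into a bracketed sentence."""
    out, in_np = [], False
    for word, label in zip(words, labels):
        if label == "O":
            if in_np:
                out[-1] += "]"        # close the NP that just ended
                in_np = False
            out.append(word)
        elif label == "B" or (label == "I" and not in_np):
            if in_np:                 # B: the previous word ends an adjacent NP
                out[-1] += "]"
            out.append("[" + word)    # open a new base NP
            in_np = True
        else:                         # I inside an already-open NP
            out.append(word)
    if in_np:
        out[-1] += "]"                # close an NP that ends the sentence
    return " ".join(out)
```

Applied to the labelling of the example sentence in the next paragraphs, it reproduces the bracketing shown there.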
For example, in the sentence In a large company that means many hundreds of complaints we can find these four base NPs:

In [a large company] [that] means [many hundreds] of [complaints] .

The bracketing above is equivalent to the following IOB labelling:

In/O a/I large/I company/I that/B means/O many/I hundreds/I of/O complaints/I .

Note that the word that has been labelled as B, because the previous word is inside a different Noun Phrase. If we had labelled it as I, the equivalent bracketing would be

In [a large company that] means [many hundreds] of [complaints] .
Finally, note that the last base Noun Phrase is indeed a prepositional complement of the previous one, so there is one complex Noun Phrase: In [a large company] [that] means [[many hundreds] of [complaints]]
There are several other methods for representing the bracketing of Noun Phrases, e.g. always using the label B at the beginning of a Noun Phrase, or labelling the starting words with an opening bracket and the ending words with a closing bracket. [Tjong, 1999] claims that the representation does not affect the results, although [Muñoz, 1999] obtained better results with the Opening/Closing brackets. Apart from that, every word in a language can be classified according to the function it performs. For example, nouns are usually complements of verbs; adjectives are modifiers of nouns; determiners and prepositions are function words, etc. According to the function they perform, words can be classified in classes called parts-of-speech (p-o-s). Knowing the part-of-speech of every word in a text is a pre-requisite before we can start the syntax analysis. Of course, it is possible to build NP chunkers by hand, as Noun Phrases usually follow relatively simple grammars. However, the application described in this document addresses this problem from the point of view of automatically learning how to chunk a text using supervised training. This allows us to easily port the system across discourse domains and languages. Furthermore, bracketed texts are easily available, and they can be used for guiding the training. Our point of view is in fact rather similar to previous approaches. We have built, both in APL2 and C++, an NP chunker that is at the same level as previous approaches. This paper starts with a quick overview of the state-of-the-art in this field. Section 3 describes our approach. Finally, sections 4 and 5 show the results and the conclusions.
2. Related work In [Tjong, 1999a] we can find a rather thorough comparison of the best NP chunkers reported so far. All of them have been trained on sections 15-18 of the Wall Street Journal corpus, and tested on section 20. Table 3, in section 4, lists the references and the accuracy obtained by them. The best accuracy is 92.8 %, obtained by [Muñoz, 1999], using a system based on neural networks. [Veenstra, 1998] and [Tjong and Veenstra, 1999] use memory-based learning techniques. In particular, [Tjong and Veenstra, 1999] stores, for every word in the corpus, a feature representation of the word, its context, and the correct IOB label. When labelling a word in a new text, it looks for the most similar feature case and assigns the word that label. [Ramshaw and Marcus, 1995], one of the first approaches to NP chunking, produces Transformation Lists for bracketing the Noun Phrases, with a training algorithm based on Brill’s [Brill, 1995]. [Argamon, 1998] describes another memory-based method, MBSL (Memory-based Sequence Learning). Each sentence is represented with a sequence of part-of-speech labels, and the program has to bracket it. Each candidate sequence of p-o-s labels is scored depending on the number of times that it was seen as an NP in the training corpus, and the one with the best score is chosen.
Finally, [Cardie and Pierce, 1998]’s system Empire induces a grammar for base Noun Phrases, where the terminal symbols are parts-of-speech. In terms of training speed, [Ramshaw and Marcus, 1995] is probably the slowest system, as it takes several days to generate a transformation list for processing the corpus. Cardie’s Empire needs only a handful to a few hundred passes for pruning. But the fastest algorithm for training is probably Argamon’s MBSL, because it just needs one single pass over all the training data. After the training is done, Empire is the quickest in bracketing new texts, which is done in linear time. The Transformation-Based bracketer [Ramshaw and Marcus, 1995] can be equally fast if the transformation list is compiled into a single finite-state transducer.
3. Error-driven transformation-based learning

Transformation lists
Together with decision trees, rule systems are a classical method for classifying instances, tagging them with labels. We describe here the way a Transformation List (TL) works to classify several items into separate classes. A Transformation List is a sequence of rules that apply successively to the set of instances, possibly changing their labels. The most general rules are located at the beginning of the list. For example, if we want to label numbers in a text as CD (cardinals) and ORD (ordinals), we can use a transformation list like the following:
1. If the current word has a digit, label as CD
2. If, labelled as CD, it finishes in th, label as ORD
3. If, labelled as CD, it finishes in st, label as ORD
4. If, labelled as CD, it finishes in nd, label as ORD
5. If, labelled as CD, it finishes in rd, label as ORD
6. etc.
The effect of applying each rule to this hypothetical sentence is the following: 23rd May 1999 – Researchers at Hull reported the finding of the 1st species of anaerobic bacteria that survives at temperatures of 1000 degrees.
23rd/CD May 1999/CD – Researchers at Hull reported the finding of the 1st/CD species of anaerobic bacteria that survives at temperatures of 1000/CD degrees.
23rd/CD May 1999/CD – Researchers at Hull reported the finding of the 1st/ORD species of anaerobic bacteria that survives at temperatures of 1000/CD degrees.
23rd/ORD May 1999/CD – Researchers at Hull reported the finding of the 1st/ORD species of anaerobic bacteria that survives at temperatures of 1000/CD degrees.
One characteristic of transformation lists is that the effect of applying one rule can affect the behaviour of the following rules. Making use of this fact, [Brill, 1995] showed that a transformation list is more expressive than a decision tree.
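The cardinal/ordinal example can be sketched in code. This is a hypothetical Python illustration (the rule representation and names are ours, not the paper's); note how rules sweep the text in order, so the ordinal rules can only fire on words the first rule has already labelled CD:

```python
# Each rule is (condition on word and current label, new label).
RULES = [
    (lambda w, l: any(c.isdigit() for c in w), "CD"),
    (lambda w, l: l == "CD" and w.endswith("th"), "ORD"),
    (lambda w, l: l == "CD" and w.endswith("st"), "ORD"),
    (lambda w, l: l == "CD" and w.endswith("nd"), "ORD"),
    (lambda w, l: l == "CD" and w.endswith("rd"), "ORD"),
]

def apply_transformation_list(words, rules=RULES):
    labels = [None] * len(words)       # all words start unlabelled
    for cond, new_label in rules:      # each rule sweeps the whole text
        for i, w in enumerate(words):
            if cond(w, labels[i]):
                labels[i] = new_label
    return labels
```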
TL-CLOG
Our algorithm has been inspired by two learning paradigms: error-driven learning of transformation lists [Brill, 1995] and CLOG (example-driven learning of decision lists) [Kazakov and Manandhar, 1998]. Brill's algorithm learns transformation lists using an exhaustive-search approach, which makes it really slow. One of the first NP chunkers reported [Ramshaw and Marcus, 1995] used Brill's algorithm,
and it took a very long time to train. On the other hand, CLOG is a Decision List learner that considerably improved the efficiency of the previous system FOIDL [Mooney and Califf, 1995] by only examining the rules relevant to one incorrect example at a time. This made us think that Brill's algorithm could be sped up in the same way. The advantage of speeding up a learning procedure is that it can then search a greater space containing more expressive rules. The pseudocode of TL-CLOG is the following:
[Pseudocode of TL-CLOG; the listing did not survive extraction. In outline, the main loop finds the first IOB tagging error in the corpus, builds the most specific rule that corrects it, generates every generalisation of that rule, scores each generalisation over the whole corpus, applies the best one, and repeats until no errors remain or no rule yields an improvement.]

The search performed by this algorithm is still rather slow and exhaustive. However, the fact that we always consider only the rules affecting the current tagging error allows us to consider more complex rules.
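The loop just described can be sketched as follows. This is a minimal Python illustration under our own assumptions about the data structures (each word is a dict with its p-o-s tag, gold label and current label); the helper functions are passed in as parameters and stand for the corpus-specific machinery, not the authors' code:

```python
def train(corpus, most_specific_rule, generalisations, gain, apply_rule):
    """Error-driven learning of a transformation list, TL-CLOG style."""
    learned = []
    while True:
        # Find the first word whose current label disagrees with the gold label.
        error = next((w for sent in corpus for w in sent
                      if w["cur"] != w["gold"]), None)
        if error is None:
            break                                    # corpus fully corrected
        # Generalise the most specific correcting rule and score each candidate.
        candidates = list(generalisations(most_specific_rule(corpus, error)))
        best = max(candidates, key=lambda r: gain(corpus, r))
        if gain(corpus, best) <= 0:
            break                                    # no rule improves the tagging
        apply_rule(corpus, best)
        learned.append(best)
    return learned
```

In this sketch the error-driven character of CLOG is visible: candidate rules are only generated from the context of one concrete error, rather than by exhaustive search over the whole rule space.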
Walkthrough example
Consider the following sentence:

Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
The following table shows the words in the sentence (column 1), together with their parts-of-speech (column 2). NNP means "proper noun", VBZ is "third-person singular present-tense verb", NN is "common noun", IN is a preposition, DT is a determiner and VBG is a gerund. The last two columns show the correct IOB labelling, and the labelling produced by a simple labeller used during the initialisation. This initial IOB tagging just considers that every noun, pronoun, adjective and determiner is a whole Noun Phrase by itself. With this labelling, approximately 68.3 % of the IOB tags are correct. Finally, the greyed areas in the table are the four base NPs in the sentence: Mr. Vinken, chairman, Elsevier N.V., and the Dutch publishing group. The correct bracketing for the sentence is

[Mr. Vinken] is [chairman] of [Elsevier N.V.] , [the Dutch publishing group] .

while the bracketing proposed initially is

[Mr.] [Vinken] is [chairman] of [Elsevier] [N.V.] , [the] [Dutch] publishing [group] .
Word         P-o-s   Correct   Initial
Mr.          NNP     I         I
Vinken       NNP     I         B
is           VBZ     O         O
chairman     NN      I         I
of           IN      O         O
Elsevier     NNP     I         I
N.V.         NNP     I         B
,            ,       O         O
the          DT      I         I
Dutch        NNP     I         B
publishing   VBG     I         O
group        NN      I         I
.            .       O         O

Table 1. Part-of-speech tags, correct IOB labels and initial IOB labels for the example sentence.
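The initialisation step described above can be sketched as follows. This is an illustrative Python fragment; the exact set of tags treated as "noun, pronoun, adjective or determiner" is our assumption, not taken from the paper:

```python
# Assumed tag set for words that form a one-word NP on their own.
NP_TAGS = {"NN", "NNS", "NNP", "NNPS", "PRP", "JJ", "DT"}

def initial_labels(pos_tags):
    """Label every noun/pronoun/adjective/determiner as its own NP."""
    labels, prev_in_np = [], False
    for tag in pos_tags:
        if tag in NP_TAGS:
            # B when the previous word already closed an NP right before it.
            labels.append("B" if prev_in_np else "I")
            prev_in_np = True
        else:
            labels.append("O")
            prev_in_np = False
    return labels
```

On the example sentence this produces exactly the "Initial" column of the table above.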
The main loop in the program looks for an error in the current IOB tagging of the corpus. The first error that is identified is the word Vinken, which is tagged as B and should be tagged as I. The algorithm must now produce the most specific rule that fixes this problem, which is:
If, in the context of the current word,
  there is no second word to the left (sentence boundary),
  the first word to the left has part-of-speech NNP and label I,
  the current word has part-of-speech NNP and label B,
  the first word to the right has part-of-speech VBZ and label O,
  and the second word to the right has part-of-speech NN and label I,
then change the label of the current word to I.

Figure 2. The most specific rule that corrects the tagging error of the word Vinken.
We can read it in the following way: if, in the context of the current word, the five preconditions are true, then the action will trigger. For the sentence we are working with, this rule would only trigger on the second word. Afterwards, this rule will be generalised in nearly every possible way, by removing constraints from the five preconditions. We thus obtain a set of possible rules, and we apply each one of them to the training corpus. Amongst those rules, the one that corrects more tagging errors will be chosen. The gain function used is the number of tagging errors the rule corrects minus the number of new errors it introduces.
For example, a good candidate is a generalisation that keeps only a few of the preconditions of the most specific rule:

[Figure 3. A candidate generalisation of the rule in Figure 2, which also corrects the tagging of Vinken; the figure itself did not survive extraction.]
Once the program has found an error, the context-extraction function (lines 11-12 of the listing) returns the words surrounding it. [The APL2 listings in this section did not survive extraction.]
The rule-generation function (line 13) studies an example of a word incorrectly tagged, and generates a possible rule to handle it. This rule will be highly specialised, i.e. it will only trigger when it finds a context exactly equal to the context of the erroneous word. The previous subsection explains the kinds of rules our system handles. In the example, the most specific rule (using a context of two words at each side) that corrects the tagging error of word number 2 is the one in Figure 2. The data structure chosen for a rule is a 7-element vector with the following information:
• First item: the restrictions on the part-of-speech tag and the IOB label of the second word to the left.
• Second item: the restrictions on the part-of-speech tag and the IOB label of the first word to the left.
• Third item: the restrictions on the part-of-speech tag and the IOB label of the current word.
• Fourth item: the restrictions on the part-of-speech tag and the IOB label of the first word to the right.
• Fifth item: the restrictions on the part-of-speech tag and the IOB label of the second word to the right.
• Sixth item: the action that must be taken in case the rule triggers.
• Seventh item: a number in base two, from 0 to 1023. From the ten preconditions, this number chooses which ones are to be considered. To get all the generalisations of the rule, we just try all the possible numbers from 0 to 1023.
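The role of the seventh item can be sketched as follows (an illustrative Python fragment; the atomic preconditions are represented abstractly here):

```python
def generalisations(preconditions):
    """Enumerate all generalisations of a most specific rule.

    A most specific rule carries ten atomic preconditions (a p-o-s test
    and an IOB-label test for each of the five context positions).  A
    mask from 0 to 1023 keeps a precondition when its bit is set, and
    replaces it with a "don't care" (None) otherwise.
    """
    assert len(preconditions) == 10
    for mask in range(1024):
        yield tuple(p if (mask >> k) & 1 else None
                    for k, p in enumerate(preconditions))
```

Mask 0 yields the fully general rule (no preconditions at all), and mask 1023 yields the original most specific rule.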
Afterwards, the generalisation function (line 15) relaxes the preconditions of the rule generated by the previous function, and returns an array with all its possible generalisations. Amongst those rules, the best one will be chosen; the rule in Figure 3 above is chosen to correct this example. The function Improvement calculates the improvement that a rule would produce over the whole corpus. It does so by calling a sentence-level function, which computes the improvement over a single sentence; a sentence-level variant of the context-extraction function receives a single sentence as its argument. Finally, a matching function returns 1 if a rule is triggered by a given context.
Finally, in line 21, the rule-application function changes the IOB labels in the training corpus by using the best rule found in this step. The loop goes on until there are no more errors to correct in the training corpus, or until no rule can be found to correct the remaining errors.
4. Results

The APL2 program was used as a prototype during development, and the final version was rewritten in C++ for speed. The system was trained on sections 15-18 of the Wall Street Journal taken from the Penn Treebank [Marcus, 1993], and the test set was section 20 of the same corpus. These training and test corpora are the same ones used by the other papers reported herein, and they are publicly available. The texts are newswire articles taken from the Wall Street Journal, which have been annotated by hand. The output was scored using two different programs, one of them programmed by us, and the other, evalb, made by Satoshi Sekine and Michael John Collins [Sekine and Collins, evalb]. Both evaluators returned the same values. Two metrics were used:
• Recall is the fraction of the correct base noun phrases in the test corpus that were found.
• Precision is the fraction of the bracketed noun phrases that were correct.

F-measure is a metric that combines precision and recall, defined as

F(β) = (β² + 1)·P·R / (β²·P + R),

where P is the precision and R is the recall. Taking β=1 we give the same weight to P and R. For β=0.5, recall is half as important as precision, and for β=2, recall is twice as important as precision. In a preliminary experiment, we trained and tested the system using the training and test corpora taken from version 2 of the Penn Treebank. We focused on the NP chunking task itself, i.e., on how good the results could be provided that the texts were correctly annotated with part-of-speech tags. In that experiment, we obtained a recall of 94.20 % and a precision of 95.83 %. Hence, the F-measure was 95.01. Those results were obtained after post-pruning the Transformation List, removing the rules that applied only once in the training corpus; this slightly improved the accuracy on the test set. Afterwards, we realised that every other accuracy reported had been obtained by training and testing on data tagged by Eric Brill's automatic part-of-speech tagger [Brill, 1995]. Because that tagger introduces some errors, the task of finding the base Noun Phrases is harder, and the results we obtained were a recall of 91 % and a precision of 91 %; the F-measure was 91. Table 3 shows a comparison of performance with other algorithms, taking β=1 in every case. All the other algorithms were trained and tested, using the same data sets, by Erik Tjong Kim Sang [Tjong, 1999a].
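The F-measure definition above translates directly into a small helper, useful for checking the reported figures (an illustrative Python fragment):

```python
def f_measure(precision, recall, beta=1.0):
    # F(beta) = (beta^2 + 1) * P * R / (beta^2 * P + R)
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)
```

With the preliminary figures above, f_measure(95.83, 94.20) is approximately 95.01, matching the reported F-measure.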
[Table 3. Precision, recall and F(β=1) of TL-CLOG and of the systems compared by [Tjong, 1999a]: [Ramshaw and Marcus, 1995], [Veenstra, 1998], [Argamon et al, 1998], [Cardie and Pierce, 1998], [Muñoz et al, 1999] and [Tjong and Veenstra, 1999]. The individual figures did not survive extraction; the best reported accuracy is 92.8 %, obtained by [Muñoz et al, 1999].]
Table 4 shows the results of testing the generated Transformation List on several other corpora (using the correct part-of-speech tags in every case). As can be expected, there is a slight loss in accuracy. However, this loss is small, so we can affirm that the list is rather portable across discourse domains. Needless to say, we could improve these metrics by training the system on each corpus separately.
[Table 4. Precision, recall and F(β=1) of the Transformation List, trained on the Wall Street Journal, when tested on several other corpora with correct part-of-speech tags. The figures, and the short discussion that followed them, did not survive extraction.]
5. Discussion and conclusions

We have built, both in APL2 and C++, a new machine learning algorithm to detect base noun phrases, which achieved results comparable to others reported so far. We have used the IOB annotations, which are also fairly standard in the field, although [Tjong and Veenstra, 1999] showed empirically that the format chosen is not really significant in terms of accuracy. As can be seen, the APL2 workspace is rather small and easy to modify for testing purposes, and it took one person less than one day to rewrite it in C++. Our algorithm, when training, is slow. It took the C++ version four days to train, using a PC with a Pentium III at 450 MHz, on the 8827 sentences. Together with [Ramshaw and Marcus, 1995], it is probably among the slowest to train. For every rule that is learnt, 64 candidate rules have to be applied to the training corpus, so that the best one can be chosen. Therefore, if the corpus contains N sentences with an average of M words per sentence, the complexity of generating one rule is O(64·M·N) = O(M·N). The worst case would occur if every word were incorrectly tagged, and every new rule repaired just one error in the training corpus. In that case, we would need M·N rules to correct the whole corpus, and the overall complexity for generating the entire Transformation List would be O((M·N)²). We can expect, however, that this situation is very unlikely to happen. This algorithm does not attempt to overcome the fact that [Ramshaw and Marcus, 1995]'s TL is rather slow to train. Instead, the aim is to use a more directed search through the space of rules so that we can use a more expressive rule space. Therefore, although TL-CLOG has the same training complexity as TL, it can generate rules with up to ten preconditions. Finally, the effort needed to apply a rule to a text, again with N sentences and M words per sentence, is O(M·N). Therefore, if the Transformation List contains R rules, the complexity of bracketing the Noun Phrase chunks is O(N·M·R). Furthermore, in theory it is possible to compile the Transformation List into a finite-state automaton, which can be applied to a document in linear time. As a concluding remark, we noted that the training corpus used contained many errors. Firstly, because the part-of-speech tags had been produced by Brill's tagger [Brill, 1995], approximately 4 % of them were erroneous. Our chunker relies only on the p-o-s tags for its predictions (it does not use any lexical information at all) and it is very sensitive to those tagging errors. Secondly, human experts had bracketed the chunks, and we detected a number of bracketing mistakes as well.
Conclusions
We arrive at the following conclusions:
1. Our system attained an F-measure similar to that of other systems that do not use lexical information (the words themselves). The results obtained by TL-CLOG are below those of [Argamon, 1998] and [Veenstra, 1998], but above [Cardie and Pierce, 1998], and also above the results of [Muñoz et al, 1999]
and [Ramshaw and Marcus, 1995] in the versions of their systems that rely only on p-o-s tags. It is an obvious future direction to test TL-CLOG using lexical information (the words themselves).
2. When we used the correctly-tagged corpus, the F-measure attained by our system was 95.01, far above the best reported so far, and four points higher than the F-measure using automatically generated part-of-speech labels. Therefore, TL-CLOG was really sensitive to errors in the part-of-speech tagging. It would be interesting to discover whether lexical information is used by automatic chunkers as a means to overcome p-o-s tagging errors, or whether it really provides information not encoded in the correct p-o-s tags, by allowing TL-CLOG to use lexical information and then training and testing it on both data sets.
Future work
As we stated above, NP chunking is a useful pre-processing stage before attempting a full parse. The obvious next stage is the identification of whole Noun Phrases, and finding their internal structure. In the example sentence above, Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group,
there are two complete Noun Phrases:
• Mr. Vinken, the subject of the sentence, and
• chairman of Elsevier N.V., the Dutch publishing group, with the following structure:

Noun Phrase
 ├─ noun: chairman
 └─ Prepositional phrase
     ├─ Prep: of
     └─ NP
         ├─ NP: Elsevier N.V.
         └─ NP (adjunct): the Dutch publishing group

Figure 4. Internal structure of a complex Noun Phrase.
Discovering the complex structure of every Noun Phrase in a text has been defined by Erik Tjong Kim Sang [Tjong, 1999b] as NP bracketing. It might be interesting to study the portability of this algorithm across languages. We have not performed any experiment yet. However, it is likely that TL-CLOG can be directly used with other European languages like French, Spanish or German, because the words in the base-Noun Phrases are always contiguous.
Acknowledgements
We would like to thank Eric Brill for making his tagger publicly available, Michael Collins and Satoshi Sekine for evalb, the scoring program, and Erik Tjong Kim Sang for providing his comparison of NP chunking algorithms.
Bibliography [Argamon et al, 1998] Shlomo Argamon and Ido Dagan and Yuval Krymolowski, 1998. A Memory-Based Approach to Learning Shallow Natural Language Patterns. Proceedings of the 17th International Conference on Computational Linguistics (COLING-ACL'98).
Suresh Manandhar and Enrique Alfonseca
[Brill, 1995] Eric Brill, 1995. Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging. Computational Linguistics, Dec. '95. [Cardie and Pierce, 1998] Claire Cardie and David Pierce, 1998. Error-driven pruning of Treebank grammars for base noun phrase identification. Proceedings of COLING-ACL'98. [Kazakov and Manandhar, 1998] Dimitar Kazakov and Suresh Manandhar, 1998. A Hybrid Approach to Word Segmentation. Proceedings of the 8th International Workshop on Inductive Logic Programming (ILP-98). Springer-Verlag. http://www.cs.york.ac.uk/isg/papers/suresh.manandhar/segmentation.ps.gz [Marcus, 1993] Mitchell P. Marcus, Beatrice Santorini and M. A. Marcinkiewicz, 1993. Building a Large Annotated Corpus of English: the Penn Treebank. Computational Linguistics, vol. 19, num. 2, pp. 313-330. June 1993. [Mooney and Califf, 1995] R. J. Mooney and M. E. Califf, 1995. Induction of first-order decision lists: Results on learning the past tense of English verbs. Journal of Artificial Intelligence Research, 1995. file://ftp.cs.utexas.edu/pub/mooney/papers/foidl-jair-95.ps.gz [Muñoz et al, 1999] Marcia Muñoz, Vasin Punyakanok, Dan Roth and Dav Zimak, 1999. A Learning Approach to Shallow Parsing. Proceedings of EMNLP-WVLC'99, Association for Computational Linguistics. [Ramshaw and Marcus, 1995] Lance A. Ramshaw and Mitchell P. Marcus, 1995. Text Chunking Using Transformation-Based Learning. Proceedings of the Third ACL Workshop on Very Large Corpora, Natural Language Processing Using Very Large Corpora. Kluwer, 1998. Originally appeared in WVLC95, 82-94. [Sekine and Collins, evalb] Satoshi Sekine and Michael John Collins, evalb. http://cs.nyu.edu/cs/projects/proteus/evalb [Tjong, 1999] Erik F. Tjong Kim Sang, 1999. NP Chunking. http://lcg-www.uia.ac.be/\~erikt/research/np-chunking.html [Tjong, 1999b] Erik F. Tjong Kim Sang, 1999. Noun Phrase Detection by Repeated Chunking. 
Talk presented at the NP Identification session of the CoNLL-99 workshop, Bergen, Norway, 1999. [Tjong and Veenstra, 1999] Erik F. Tjong Kim Sang and Jorn Veenstra, 1999. Representing Text Chunks. Proceedings of EACL'99, Association for Computational Linguistics. [Tjong, 2000] Erik F. Tjong Kim Sang, 2000. Noun Phrase Representation by System Combination. Proceedings of ANLP-NAACL 2000, Seattle, Washington, USA. Morgan Kaufmann Publishers, 2000.
[Veenstra, 1998] Jorn Veenstra, 1998. Fast NP chunking using memory-based learning techniques. In F. Verdenius and W. van den Broek eds., BENELEARN-98: Proceedings of the Eighth Belgian-Dutch Conference on Machine Learning, ATO-DLO, Wageningen, report 352, 1998.