A Neural Network Approach to Part-of-Speech Tagging*

Nuno C. Marques** ([email protected])
Gabriel Pereira Lopes ([email protected])

Universidade Nova de Lisboa - Faculdade de Ciências e Tecnologia
Departamento de Informática, Grupo de Língua Natural
2825 Monte de Caparica, Portugal
http://www-ia.di.fct.unl.pt/~nmm/artigos.html
Abstract

Neural networks are one of the most efficient techniques for learning from scarce data. This property is very useful when trying to build a part-of-speech tagger: available part-of-speech taggers need huge amounts of hand tagged text, and for Portuguese no such corpus is available. In this paper we propose a neural network that is apparently capable of overcoming the huge training corpus problem. Distinct network topologies are applied to the problem of learning the parameters of a part-of-speech tagger from a very small Portuguese training corpus and from a subset of the Susanne Corpus, and the experiments carried out are discussed. The results obtained point to an accuracy above 97%, starting with a hand tagged training corpus of approximately 15,000 words.
1. Introduction
The application potential of textual corpora increases when the corpora are annotated. The first logical level of annotation is usually part-of-speech tagging: at this level the text is no longer seen as a mere sequence of strings, but as a sequence of linguistic entities with some natural meaning. The annotated text can then be used to introduce further types of annotation (usually by means of syntactic parsing [Marcus et al., 1993], [Marcken, 1990]), or may be directly (or indirectly) used to collect statistics for different kinds of applications. Working at the word tagging level has enabled applications such as speech synthesis [Church et al., 1993], clustering [Pereira et al., 1993] and even computational lexicography [Manning, 1993]. The success of this kind of technique is certainly due to its intrinsic capability of assigning a sequence of part-of-speech tags to any sequence of words with high levels of precision, using quite modest computer resources. Despite this, part-of-speech taggers are not yet as widely available as they should be, especially for languages other than English. The main problem with currently available part-of-speech taggers is the lack of tagged corpora: almost every tagger needs huge amounts of hand tagged text. In this paper we propose a neural network which is able to overcome the huge training corpus problem. We start by describing, from a computational perspective, the general topology of the system. The unix scripts used [Church, 1994] are presented, as well as an overview of three proposed neural network topologies to solve this problem. In section 3 these networks are compared using the Portuguese Radiobrás Corpus; the best performing network is also tested on the Susanne Corpus1 [Sampson, 1995]. Schmid [Schmid, 1994] presents a neural network approach to part-of-speech tagging.
Although a 96.22% performance is reported, the neural network model presented there is more complex than the ones presented here, and a corpus of 100,000 words was used to train the neural net. Schmid argues that neural net based tagging achieves better results when the size of the training corpus is small; in this paper we argue in the same direction. Nakamura et al. [Nakamura et al., 1990] have also done related work, but their goal was the prediction of the next word in the input text, not corpus tagging (a precision of 86.9% was obtained).
2. System Description
2.1 Neural Networks
The main processing principle of neural nets is their capability to distribute activation patterns (learned from a training set) across their links via a learning algorithm. This is done in a way similar to the basic mechanism of the human brain. The similarity, however, ends here. The human brain is a living organ, capable of dynamically changing the strength of its connections, while an artificial neural net (at least in the models used in this work) is a parallel algorithm that, once training is complete, cannot change any more: it can only classify its input vectors in a fixed and deterministic fashion. The neural network topologies were tested and trained using a general purpose neural network simulator: the Stuttgart Neural Network Simulator (SNNS)2 [SNNS, 1994]. This simulator supports the automatic definition of neural net topologies (the specification of how the constituent units connect to each other). Neural units can be of three types: input units, hidden units or output units. Connection links are unidirectional, although recurrent links are supported3. Several algorithms for learning from a training set are available; in this work only standard backpropagation and momentum backpropagation were used. The input values are read directly from a pattern file4. Output unit values can be written to a result file (when the network is used to tag text), or can receive their values from a pattern file (when the network is being trained). A pattern file can contain several vectors of input values and associated output values (an input vector is a series of values that is presented to all input units at a time). The SNNS also supports a dynamic mode: the input vectors (and the input units as well) are ordered into sets, and the output of the net is computed for a sequence of sets. That sequence can change according to a particular step (a parameter specifying the number of sets to jump over in order to perform the next iteration). This feature is used to implement the n-gram model. The hidden units can both receive their values from the previous units and send their values to the next units. Each unit in an SNNS neural net has two characteristic functions: the activation function and the output function.

* Work partially supported by project Corpus 1 funded by JNICT under contract number PLUS/C/LIN/805/93.
** Work supported by PhD scholarship JNICT-PRAXIS XXI/BD/2909/94.
1 The SUSANNE Corpus is a freely available, annotated English subset of the Brown corpus (ftp://ota.ox.ac.uk/pub/ota/public/susanne). It is supplied by the University of Sussex.
The activation function specifies the way the input links of a given unit are combined into a unique value. The output function specifies further changes to that value before passing it to the other units in the network. All neural networks presented in this paper use the logistic activation function [SNNS, 1994].

2.2 The Neural Network Tagger Topology
The corpus was processed so that we could use the SNNS simulator to tag the text. These procedures were implemented using simple unix text processing commands, the awk programming language [Church, 1994] and a previously developed tool called classifier. The following unix tools were developed5:

classifier_795: Responsible for the tokenization and for the assignment of the lexical probability vectors to each word in a corpus file. This program is usually called from the scripts evaluate_tagger.x and train_tagger.x.

build_dic_795.x: Builds the dictionary using the tagged corpus as a basis. Numbers occurring in the text are ignored (because classifier handles them automatically). Words starting with upper case are converted to lower case; the only exception to this rule are words tagged as proper nouns. This script optionally adds one of two types of lexical probability vector to each entry: a vector containing the probabilities p(w|t) or p(t|w)6 for each tag.

rand_split.x: Splits the corpus randomly. Given an input file, this script partitions it into two disjoint files. Sentences7 belonging to each of the output files are selected randomly according to a pre-determined probability (supplied as a parameter).

create_tdnn: Translates a training or testing set file (as generated by classifier) into an SNNS dynamic pattern file.

evaluate_tagger.x: Tags a test corpus with a previously trained network and compares the results with the hand-assigned tags, reporting the tagging accuracy.

train_tagger.x: Prepares a train file for training a neural network.
2 The Stuttgart Neural Network Simulator package is fully available from the University of Stuttgart and can be obtained by anonymous ftp from host ftp.informatik.uni-stuttgart.de (129.69.211.2) in the subdirectory /pub/SNNS.
3 Neural networks can be feed-forward, where an input vector is passed from an input layer (set of units) to the next layer of units until it finally reaches the set of output units (there are no loops: each unit is used only once), or recurrent. In recurrent nets the input is passed to a hidden layer and can then be passed backwards, until it reaches the same unit (there are loops).
4 The file containing the set of activation patterns of all input and output units of a neural network.
5 Please see http://www-ia.di.fct.unl.pt/~nmm/Software for more information.
6 The probability of a word w given a tag t and the probability of a tag t given a word w, respectively. The first measure was used in [Merialdo, 1994] and the second in [Schmid, 1994].
7 The tagged corpus contains one sentence per line.
unigram_tagger.x: Gives the baseline, unigram tagging, acquired by assigning each word in a corpus the most frequent tag for that word.

In Figure 1 we illustrate the tagging process.

2.3 The Feed-Forward Network Topologies
The first topology used for solving the part-of-speech tagging problem was a simple feed-forward neural net [Haykin, 1994] having only input and output units. The input units were divided into two sets of context units, and each output unit represents one of the tags. A one-to-one relation is established between each value in the lexical probability vector (acquired from the dictionary) and each input unit in each set. A simple bigram model was implemented using these two sets: the first set receives the probability vector of the word we want to classify, and the second set the vector of the next word in the sentence (the context word). The network is fully connected: each input unit is connected to all output units. In a second topology we tried to increase the neural network's discriminative power: a layer of hidden neurons was added, all the input units are connected to all the hidden units, and the hidden units are now the only connection to the output units. In these networks each unit is associated with a part-of-speech tag ti. The value acquired from the lexical probability vector for the word wk, supplied by the dictionary, is assigned to the associated input unit. Word wk can be represented in the first or in the second set of input units, depending on whether it is the word we wish to classify or the next word in the sentence. For training purposes, the output units are assigned the value 1 or the value 0, according to the part-of-speech category the words were tagged with in the corpus: 1 if the neuron is related to the tag assigned to the word, and 0 otherwise.
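As an illustrative sketch of the first topology (the tag set size, the random initialisation and the learning rate are assumptions of ours, not the SNNS configuration actually used), the network amounts to a single layer over two concatenated lexical probability vectors, trainable with the delta rule since there is no hidden layer:

```python
import numpy as np

N_TAGS = 35  # assumed tag set size (Radiobras-style tag set)

rng = np.random.default_rng(0)
# Fully connected weights: two input sets (word + context word) -> output layer.
W = rng.normal(scale=0.1, size=(2 * N_TAGS, N_TAGS))
b = np.zeros(N_TAGS)

def logistic(x):
    """Logistic activation function, as used by all networks in the paper."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(p_word, p_next):
    """Output activations for a word, given its lexical probability vector
    and the vector of the next word in the sentence (the bigram context)."""
    x = np.concatenate([p_word, p_next])
    return logistic(x @ W + b)

def train_step(p_word, p_next, gold_tag, lr=0.1):
    """One gradient step: target is 1 for the gold tag, 0 for all others."""
    global W, b
    x = np.concatenate([p_word, p_next])
    y = logistic(x @ W + b)
    target = np.zeros(N_TAGS)
    target[gold_tag] = 1.0
    delta = (target - y) * y * (1.0 - y)  # error times logistic derivative
    W += lr * np.outer(x, delta)
    b += lr * delta

def tag(p_word, p_next):
    """Predicted tag index: the output unit with the highest activation."""
    return int(np.argmax(forward(p_word, p_next)))
```

The hidden-layer variant would insert one more weight matrix and activation between the input and output layers; the training target (one-hot over the tag set) stays the same.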
[Figure 1: The tagging process, connecting the Corpus, rand_split.x, the Train Set and Test Set, build_dic_795.x, the Open and Closed Lexicons, classifier_795, create_tdnn, the Pattern File, SNNS, train_tagger.x, the resulting Neural Network, evaluate_tagger.x, unigram_tagger.x and the Result File.]
2.4 The Elman Network Topology
In a third topology an Elman neural network [Elman, 1990] was used. We replaced the context word's input layer of the second network by a recurrent layer: each unit was connected to itself by a link of weight 1 (an identity link) and to all the hidden units. Each hidden unit was connected to one context unit using an identity link. The main idea of this topology is to supply the net with a short-term memory, so that the context of the last word seen can be parameterized by the learning process. This network is very similar to the one used by Schmid [Schmid, 1994]. The training of this kind of network is done without using the recurrent links; they are only used after a full iteration of the learning algorithm, when the values of the contextual units are updated. This way, standard backpropagation and momentum backpropagation can be applied as described before. The neural net was trained in a way similar to the previous two networks: the lexical probability vector of the word to be classified is used as the value of the input units, and all the output units are set to 0, except the one associated with the tag present in the corpus, which receives the value 1. The momentum and standard backpropagation training algorithms were also used, with the same parameters as before.
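A minimal sketch of the forward pass of this Elman-style topology follows; the dimensions and random initialisation are illustrative assumptions, and training of the weights is omitted. The context units simply hold a copy of the previous hidden state, via the identity links described above:

```python
import numpy as np

N_TAGS, N_HIDDEN = 35, 35  # assumed sizes, for illustration only

rng = np.random.default_rng(1)
W_in = rng.normal(scale=0.1, size=(N_TAGS, N_HIDDEN))     # input -> hidden
W_ctx = rng.normal(scale=0.1, size=(N_HIDDEN, N_HIDDEN))  # context -> hidden
W_out = rng.normal(scale=0.1, size=(N_HIDDEN, N_TAGS))    # hidden -> output

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def tag_sentence(prob_vectors):
    """Tag a sentence, given the lexical probability vector of each word.
    The context vector is the previous hidden state (identity links of
    weight 1), giving the net its short-term memory."""
    context = np.zeros(N_HIDDEN)
    tags = []
    for p in prob_vectors:
        hidden = logistic(p @ W_in + context @ W_ctx)
        output = logistic(hidden @ W_out)
        tags.append(int(np.argmax(output)))
        context = hidden.copy()  # recurrent links updated after the step
    return tags
```

Because the context is updated only after each step, each weight update during training sees the context as a fixed input, which is why standard and momentum backpropagation apply unchanged.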
3. Results
3.1 Comparing the Three Topologies Using the Radiobrás Corpus
The Radiobrás Corpus was used to evaluate and compare the performance of the three networks [Villavicencio et al., 1995], [Villavicencio, 1995]. The Radiobrás Corpus is a small hand tagged news corpus, based on a news bulletin distributed on the Internet through electronic mail messages by the science and technology editor of “Agência Brasil”. 652 sentences from this bulletin were hand tagged using 35 distinct tags; the resulting corpus has 19,141 tagged words. Using the build_dic_795.x script it was possible to build a lexicon containing, for each word in the corpus, its possible associated tags and their occurrence frequencies. If we tag our corpus with the most frequent POS tag for each word (max_t p(t|w)) we get a tagging accuracy baseline8. The precision obtained for this operation was 78.3% when we used an open dictionary (induced from a 14,385 word corpus) and 92.8% when we used a closed dictionary. The SNNS neural network simulator pattern files were generated using a tagged corpus and the previous lexicon. Each training set was used for training several neural network models, and each testing file was then used to evaluate the trained tagger. This way the testing sets were tagged, and the precision of the tagger could be found simply by comparing the tagger's output with the corpus tags. If we use the entire corpus to acquire each word's lexical probability vector (the closed dictionary assumption), a test set of 79 sentences (containing 2,229 words, 30.37% of them ambiguous) and the first network, we get the following results: with training set 1 (22 sentences, 662 words) we get a tagging accuracy of 92.7%; with training set 2 (112 sentences, 3,362 words) the tagging accuracy rises to 95.6%; and with training set 3 (538 sentences, 15,861 words) we get a tagging accuracy of 96.4%. The second and third neural networks give worse results: using training set 3, they give tagging accuracies of 96.1% and 92.2%, respectively.
The three neural networks were also tested using an open dictionary (only the training corpus was used to determine the probability vectors in the lexicon). The results for the three networks using training set 3 were 87.5%, 82.3% and 86.3%. The results for the 1-layer network and for the Elman network are summarised in Figure 2. The bottom line illustrates the behaviour of the 1-layer network when evaluated with different training corpus sizes but without complete convergence on the training corpus; complete convergence would require long training periods. These results show that a training corpus of 3,362 hand tagged words enables precision results over 95%. Moreover, the best results are obtained with the 1-layer network. For a more extensive description of these results please see [Marques and Lopes, 1996]. Subsequent results using the lexical probability p(ti|wi) have given even better figures: a precision of 97.7% (88.7% with the open lexicon) was achieved with a 14,385 word training corpus using the 1-layer network, and a precision of 97.5% (88.4% with the open lexicon) with the two layer network.

8 This kind of tagging is also called unigram tagging, because no contextual information is used during tagging.
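The unigram baseline used above (assigning each word its most frequent tag, max_t p(t|w)) can be sketched as follows; the toy corpus is invented purely for illustration:

```python
from collections import Counter, defaultdict

def build_lexicon(tagged_corpus):
    """Count (word, tag) occurrences over a hand tagged corpus, in the
    spirit of build_dic_795.x (words lower-cased, one Counter per word)."""
    counts = defaultdict(Counter)
    for sentence in tagged_corpus:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return counts

def unigram_tag(lexicon, sentence):
    """Assign each word its most frequent tag: argmax_t p(t|w)."""
    return [lexicon[w.lower()].most_common(1)[0][0] for w in sentence]

# Toy hand tagged corpus: "casa" is ambiguous between noun and verb.
corpus = [[("a", "DET"), ("casa", "N")],
          [("a", "DET"), ("casa", "V")],
          [("casa", "N")]]
lex = build_lexicon(corpus)
print(unigram_tag(lex, ["a", "casa"]))  # -> ['DET', 'N']
```

Since counts of t given w are proportional to p(t|w), taking the most common tag per word is exactly the max_t p(t|w) baseline; an open-dictionary variant would build the lexicon from the training portion only and need a fallback for unseen words.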
[Figure 2: Variation of the precision with the training corpus size (0, 662, 3,362 and 15,861 words), for three configurations: the 1-layer net after 120 iterations, the 1-layer net after 5,000 to 10,000 iterations, and the Elman net after 5,000 to 10,000 iterations.]
3.2 Results with the Susanne Corpus
It is difficult to compare these results with other methods, because we are working with a different language and different tag sets. To address this, we are now starting to evaluate our method using the Susanne Corpus to train the feed-forward 1-layer network. The 426 tags present in the original SUSANNE tagset have been mapped to a smaller tagset of 37 POS tags. This was done in order to increase the number of occurrences of each tag (some of the original tags occurred only once in the whole corpus). Distinct unambiguous tags in the SUSANNE tagset (such as the tags MCn and MCr, denoting an Arabic numeral and a Roman numeral) are joined into the same POS class (in this case, numeral). The Susanne Corpus contains a total of 142,524 tagged words. The rand_split.x script was used to divide this corpus into a training corpus of 135,214 words and a test corpus of 7,310 words. Once again an open and a closed dictionary were generated, from the whole corpus (closed dictionary) and from the training set (open dictionary). These two dictionaries were used to tag the test corpus, building two test sets: an open dictionary test set and a closed dictionary test set. Tagging the closed test set with a unigram tagger achieves a 90.1% tagging accuracy; with the open dictionary, the unigram tagger achieves 85.3%. Several experiments have been performed using this corpus. Given the results achieved with the Radiobrás Corpus, only the one-layer feed-forward neural network was used. Training was performed using the momentum backpropagation algorithm until convergence, and with standard backpropagation after that (until the variation of the summed squared error over an evaluation set was small enough). Two training sets were used in our experiments: one with only 5,281 tagged words (the first 200 sentences in the corpus) and the other with 30,851 tagged words (the first 1,200 sentences in the corpus).
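The tagset reduction described above can be sketched as a simple lookup table. Only the MCn/MCr → numeral merge is taken from the text; the treatment of unlisted tags (kept unchanged) is an assumption, since a real mapping would cover all 426 fine-grained tags:

```python
# Map fine-grained SUSANNE tags onto a coarse POS class.
# Only the MCn/MCr -> "numeral" merge comes from the paper; a full
# mapping would enumerate all 426 original tags.
COARSE = {
    "MCn": "numeral",  # Arabic numeral
    "MCr": "numeral",  # Roman numeral
}

def coarsen(tag, table=COARSE):
    """Return the coarse POS class for a SUSANNE tag; tags without an
    entry are kept unchanged (an assumption of this sketch)."""
    return table.get(tag, tag)

print([coarsen(t) for t in ["MCn", "MCr", "NN1"]])  # -> ['numeral', 'numeral', 'NN1']
```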
The neural network trained with the 30,851 word corpus achieved a precision of 94.8% (91.0% for the open dictionary). The network trained with the 5,281 word corpus achieved a precision of 94.2% (90.6% for the open dictionary), a value only 0.6% less accurate than that achieved with a much larger corpus. In Figure 3 we present the graphs of the summed squared error over the test set, acquired during the training process.
[Figure 3: Summed squared error over the test set during the training process (panels a and b).]
numbers and some other textual information that usually decrease tagging accuracy. This can be achieved by pre-processing the text before classification.
6. Acknowledgements
We would like to thank Aline Villavicencio and Fábio Villavicencio for supplying us with the tagged corpus and for the many fruitful discussions through electronic mail. We would also like to thank the people involved in the SNNS project for making their software freely available, and Geoffrey Sampson and the University of Sussex for making the Susanne Corpus available.
7. References
E. Brill. A Simple Rule-Based Part-of-Speech Tagger. In Proceedings of the DARPA Speech and Language Workshop, pages 112-116. 1992.
E. Brill. Unsupervised Learning of Disambiguation Rules for Part-of-Speech Tagging. In Proceedings of the Very Large Corpora Workshop. 1995.
K. W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. Computational Linguistics, 19(1):1-25, March 1993.
K. W. Church. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. In Proceedings of the Second ACL Conference on Applied Natural Language Processing. 1988.
K. W. Church. Unix for Poets. Notes of a course from the European Summer School on Language and Speech Communication, Corpus Based Methods. July 1994.
D. Cutting, J. Kupiec, J. Pedersen and P. Sibun. A Practical Part-of-Speech Tagger. In Proceedings of the Third ACL Conference on Applied Natural Language Processing, pages 133-140, Trento, Italy. 1992.
Carl G. de Marcken. Parsing the LOB Corpus. In Proceedings of the 28th Annual Meeting of the ACL, pages 243-251. 1990.
Evangelos Dermatas and George Kokkinakis. Automatic Stochastic Tagging of Natural Language Texts. Computational Linguistics, 21(2):137-162. 1995.
Steven J. DeRose. Grammatical Category Disambiguation by Statistical Optimisation. Computational Linguistics, 14(1):31-39. 1988.
J. L. Elman. Finding Structure in Time. Cognitive Science, 14:179-211. 1990.
D. Elworthy. Does Baum-Welch Re-estimation Help Taggers? In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 597-600. 1994.
Simon Haykin. Neural Networks: A Comprehensive Foundation. Macmillan College Publishing Company, Inc. 1994.
F. Jelinek, R. Mercer and S. Roukos. Principles of Lexical Language Modelling for Speech Recognition. In S. Furui and M. M. Sondhi, editors, Advances in Speech Signal Processing, pages 651-700. Marcel Dekker. 1995.
André Kempe. Probabilistic Tagging with Feature Structures. In Proceedings of the International Conference on Computational Linguistics - COLING. 1994.
Christopher Manning. Automatic Acquisition of a Large Subcategorization Dictionary from Corpora. In Proceedings of the 31st Annual Meeting of the ACL, pages 234-242. 1993.
Mitchell P. Marcus, Beatrice Santorini and Mary Ann Marcinkiewicz. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-329. 1993.
Nuno Marques and Gabriel Pereira Lopes. Using Neural Nets for Portuguese Part-of-Speech Tagging. In Proceedings of the 5th International Conference on the Cognitive Science of Natural Language Processing. Dublin City University, September 2-5, Ireland. 1996.
B. Merialdo. Tagging English with a Probabilistic Model. Computational Linguistics, 20(2):155-171. 1994.
M. Nakamura, K. Maruyama, T. Kawabata and K. Shikano. Neural Network Approach to Word Category Prediction for English Texts. In Proceedings of the International Conference on Computational Linguistics - COLING. 1990.
Fernando Pereira, Naftali Tishby and Lillian Lee. Distributional Clustering of English Words. In Proceedings of the 31st Annual Meeting of the ACL. 1993.
G. Sampson. English for the Computer. Oxford University Press. 1995.
H. Schmid. Part-of-Speech Tagging with Neural Networks. In Proceedings of the International Conference on Computational Linguistics - COLING. 1994.
University of Stuttgart - Institute for Parallel and Distributed High Performance Systems (IPVR). User Manual of the Stuttgart Neural Network Simulator. Report No. 3/94. 1994.
A. Villavicencio, N. Marques, G. Lopes and F. Villavicencio. Part-of-Speech Tagging for Portuguese Texts. In Jacques Wainer and Ariadne Carvalho, editors, Advances in Artificial Intelligence: Proceedings of the XII Brazilian Symposium on Artificial Intelligence, Lecture Notes in Artificial Intelligence 991, pages 323-332, Campinas, October 11-13. Springer Verlag. 1995.
A. Villavicencio. Assessing a Part-of-Speech Tagger for Portuguese. Master's thesis, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brasil, September 1995 (in Portuguese).
David Yarowsky. Unsupervised Word Sense Disambiguation Rivalling Supervised Methods. In Proceedings of the 33rd Annual Meeting of the ACL. 1995.