Optimisation of Polish Tagger Parameters

Grzegorz Godlewski and Maciej Piasecki

Wrocław University of Technology, Institute of Applied Informatics, Wybrzeże Wyspiańskiego 27, Wrocław, Poland
Abstract. The large tagset of the IPI PAN Corpus of Polish enforced a modular architecture of the Polish tagger called TaKIPI. The architecture introduces several parameters, for learning and tagging, that are difficult to adjust properly by hand. In this paper a method of optimising the parameter values based on a Genetic Algorithm is presented. A chromosome is a set of parameter values, a specimen is a tagger together with its learning process, and the fitness function is a test of the tagger's accuracy. The optimisation process is presented and the achieved improvement of +1.15% in the accuracy of the tagger (12.92% error reduction) is discussed.
1 Introduction
The tagset of the IPI PAN Corpus of Polish (further IPIC) [1] is very large. There are 4179 theoretically possible tags, but only 1642 of them occur in the manually disambiguated part of 885 669 tokens. These numbers result in data sparseness, which was the main cause of the low accuracy of the statistical tagger of Dębowski [2] constructed on the basis of IPIC. However, a positional IPIC tag is a sequence of symbols describing different morpho-syntactic features of a token. Thus, if a tag is a sequence, we can try to identify the proper values for its subsequences in several steps. The quite good accuracy of our tagger [10] has been achieved due to its multi-module architecture introducing several phases of partial disambiguation. This architecture is set up by several parameters. The relation between the parameter values and the final accuracy is hard to grasp: a slight change in the value of one of them can cause a significant change in accuracy. The current result of the tagger was achieved by adjusting the parameters manually, but we believe it can be significantly better. The goal of the present work is to optimise the parameters automatically. As the parameters are rather heterogeneous, we have chosen the Genetic Algorithm (GA) as the basis. In Sec. 2 and Sec. 3 the architecture of the tagger and the learning process are presented, respectively. The possible parameters and their optimisation are discussed in Sec. 4 and Sec. 5.
Acknowledgement. This work was financed by the Ministry of Education and Science project No 3 T11C 018 29.
2 Architecture
The architecture of our tagger, called TaKIPI (a Polish acronym for The Tagger of IPIC), is presented in Fig. 1.
Fig. 1. The architecture of the tagger.
The process starts with morpho-syntactic processing of the input text (Reader). Division into tokens (tokenisation) and assignment of tags are done mainly by the morphological analyser Morfeusz [13], but first all strings between two spaces are presented to the Abbreviation Recogniser, implemented as a transducer. If a string is recognised, its full description, potentially ambiguous, i.e. a set of tags, is taken from the dictionary of abbreviations, and the token is not further processed by Morfeusz. The Pre-sentencer, a simple set of rules derived from [2], recognises sentence boundaries, but in the case of a recognised dot-ended abbreviation, the Pre-sentencer postpones the decision. The final decision is made by the Sentencer on the basis of the tagging results, e.g. in this phase it is already decided whether "im." is¹: "im:ppron3 '.':interp", or an abbreviation of a form of imię (name).
¹ In the examples, we present only selected attributes of tags, mostly only the class.
The rest of TaKIPI works on chunks of the input text roughly corresponding to sentences² as defined by the Pre-sentencer. This main loop is not presented in Fig. 1 (for the sake of clarity). If an occurrence of an abbreviation postpones the final decision, then a block of several sentences is processed in one iteration. Following the approach of [4], we apply hand-written rules (see Sec. 3) before the application of other classifiers. The rules can delete some tags. Initial probabilities for tags are calculated by the Unigram Classifier on the basis of frequencies of ⟨token, tag⟩ pairs stored in the unigram dictionary. The probabilities for pairs not observed in the learning data, but possible according to the morphological analysis, are calculated by smoothing (inspired by [8]), where $w_k$ is a token, $t_i$ one of its possible tags, and $K$ is the number of possible tags:

$$p(t_i \mid w_k) = \frac{freq(t_i \mid w_k) + \lambda}{freq(w_k) + \lambda K}, \quad \text{where } \lambda = (K - 1)/K \qquad (1)$$
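A minimal sketch of the smoothed unigram probability of Eq. (1) is given below; the dictionary contents and the function name are illustrative only, not TaKIPI's actual data or API.

```python
# Hypothetical <token, tag> frequencies collected from the learning data.
UNIGRAM_FREQ = {
    ("mamy", "fin"): 120,    # illustrative count for "mamy" as a finite verb form
}

def unigram_probability(token, tag, possible_tags):
    """p(t_i | w_k) with the additive smoothing of Eq. (1)."""
    k = len(possible_tags)                   # K: tags allowed by morphological analysis
    lam = (k - 1) / k if k > 1 else 0.0      # lambda = (K - 1) / K
    freq_pair = UNIGRAM_FREQ.get((token, tag), 0)
    freq_token = sum(UNIGRAM_FREQ.get((token, t), 0) for t in possible_tags)
    return (freq_pair + lam) / (freq_token + lam * k)

# A pair unseen in the learning data still receives a small non-zero probability:
print(unigram_probability("mamy", "subst", ["fin", "subst"]))   # ~0.004
print(unigram_probability("mamy", "fin",   ["fin", "subst"]))   # ~0.996
```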
The core of the tagging process is divided in our architecture into several subsequent phases, corresponding to the serial combination of [4]. During each phase some tags can be deleted according to the performed partial disambiguation. The set of all possible tag attributes is divided into several layers. Attributes of the same layer are disambiguated during the same phase of tagging. The definitions of the layers and their order are stored in the sequence of masks of layers. During each phase, tags are distinguished only on the basis of the values of the attributes of the corresponding layer. Moreover, a token can be ambiguous in some layers and non-ambiguous in others. A subset of tags of the given token such that all its members have identical values of the attributes of the given layer is called a package of tags. During each phase, the tagger chooses the best package according to the current probabilities of tags and eliminates all packages except the best one (Package Cut-off). Each phase of tagging begins with subsequent application of classifiers to each token in the sentence. More than one classifier can be applied to a token, as often happens. Only tokens that are ambiguous with respect to the current layer are processed. The only constraint is that a classifier should update the probabilities of tags; the way of calculating the probabilities is free. In the present version of TaKIPI there are three layers: grammatical class; number and gender; and case. The other grammatical categories of IPIC are mostly dependent on the above. The only exceptions are:
– aspect in the case of non-past forms of verbs in present tense and third person that are described by Morfeusz as ambiguous in aspect (in each case the base form is the same), e.g. razi (dazzles, or offends), pozostaje (stays, remains), napotyka (encounters),
– accentability and post-prepositionality in the case of personal pronouns in third person, e.g. on (he), possessing four different combinations of values {akc, nakc} × {praep, npraep}.
² This is done due to the assumptions underlying the manual disambiguation.
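The notion of a package of tags and the Package Cut-off step can be pictured with the following sketch; the token representation (a list of tag dictionaries with attribute values and a probability) is an assumption made for illustration, not TaKIPI's internal data structure.

```python
from collections import defaultdict

def group_into_packages(tags, layer_attrs):
    """A package = all tags of a token sharing the same values of the layer attributes."""
    packages = defaultdict(list)
    for tag in tags:
        key = tuple(tag["attrs"].get(a) for a in layer_attrs)
        packages[key].append(tag)
    return packages

def package_cutoff(tags, layer_attrs):
    """Package Cut-off: keep only the package containing the most probable tag."""
    packages = group_into_packages(tags, layer_attrs)
    return max(packages.values(), key=lambda pkg: max(t["p"] for t in pkg))

# Example: on the first layer only the grammatical class is disambiguated.
tags = [
    {"attrs": {"class": "subst", "case": "nom"}, "p": 0.2},
    {"attrs": {"class": "subst", "case": "acc"}, "p": 0.3},
    {"attrs": {"class": "adj",   "case": "nom"}, "p": 0.4},
]
print(package_cutoff(tags, ["class"]))   # only the 'adj' package survives
```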
We leave these attributes undisambiguated in these cases. Another significant simplification assumed in TaKIPI is that we do not try to distinguish substantives (nouns) from gerunds on any basis other than number, gender and case. Tokens ambiguous between subst and ger that are not disambiguated on the basis of these three attributes are assigned two tags at the end. Except for the cases described above, TaKIPI returns one tag per token. Detailed statistics are presented in Sec. 6. In the present version of TaKIPI only classifiers based on the algorithm of Induction of Decision Trees (DT) called C4.5r8 [11], where "r8" is the 8th code release [12], are applied. However, the DT classifiers have been converted to classifiers returning the probability of the positive decision (the selected value for some attribute) and of the negative decision (a smoothed non-zero probability of the other values). In the application of DTs to tagging and their use as probabilistic classifiers we follow the main line of [8]. For each leaf of a DT, the probability of its decision is calculated on the basis of the number of examples attached to this leaf during tree construction. The probability is smoothed according to the algorithm presented in [8] ($t$ is a decision, $t|X$ the examples with the decision $t$ in the given leaf, $\|X\|$ all examples in the given DT, $K$ the number of possible decisions in the given DT):

$$p(t \mid X) = \frac{f(t \mid X) + \lambda}{\|X\| + \lambda K}, \quad \text{where } \lambda = (K - 1)/K \qquad (2)$$
As in [8], instead of building one big classifier for a phase, we decided to decompose the problem further into classes of ambiguity [8]. Each class corresponds to one of the many possible combinations of values of the layer attributes, e.g. there is only one attribute in the first layer, namely grammatical class, and the different classes of ambiguity on the first layer are the different combinations of grammatical classes observed in tokens of the learning data. Examples of classes of ambiguity are: {adj, subst}, {adj, conj, qub, subst}, or {sg, m1, m2, m3} (of the second layer: number singular, but all masculine genders possible). The number of examples for different classes of ambiguity varies to a large extent. For some classes, e.g. {gen, acc} (the third layer, case), there are thousands of examples, for some only a few, e.g. {acc, voc}. Following [8], we apply a kind of backing-off technique, in which an inheritance relation between ambiguity classes is defined. The inheritance relation is simply the set inclusion relation between the sets of values defining ambiguity classes, e.g. the 'superclass' {adj, fin, subst} is in the inheritance relation with {adj, fin}, {adj, subst}, etc. Construction of a DT for a particular ambiguity class supports an accurate choice of the components of the learning vectors for the DT. We deliver to the given DT information specific to the given linguistic problem, e.g. concerning the distinction between nominative and genitive case. Typically, 'superclasses' have fewer examples than their 'subclasses', but the merged class is the sum of both. From the linguistic point of view, the choice of a learning vector for merged classes is not so directly related to the problem, but there are still often some common morpho-syntactic features of the examples belonging to the merged classes.
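The backing-off idea can be sketched as follows: the inheritance relation is plain set inclusion, so a token whose ambiguity class is not supported is handled by its closest supported superclasses. The class names below are illustrative, not the full inventory used in TaKIPI.

```python
# Hypothetical set of supported ambiguity classes of the first layer.
SUPPORTED_CLASSES = [
    frozenset({"adj", "subst"}),
    frozenset({"adj", "fin", "subst"}),
    frozenset({"adj", "conj", "qub", "subst"}),
]

def matching_classes(ambiguity_class, supported=SUPPORTED_CLASSES):
    """Inheritance level 0: the exact class if it is supported, otherwise all
    supported superclasses at the minimal distance in the hierarchy."""
    ac = frozenset(ambiguity_class)
    if ac in supported:
        return [ac]
    sups = [s for s in supported if ac < s]          # set inclusion = inheritance
    if not sups:
        return []
    d = min(len(s) for s in sups)                    # smallest number of extra values
    return [s for s in sups if len(s) == d]

print(matching_classes({"adj", "fin"}))      # -> [frozenset({'adj', 'fin', 'subst'})]
print(matching_classes({"adj", "subst"}))    # exact supported class
```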
Thus, TaKIPI works on the basis of a collection of DTs divided into groups assigned to layers. The Classifiers Manager selects the proper set of DTs for each token which is ambiguous according to the current layer. Each DT multiplies the probabilities of tags in the token by the probability of its decision. Next, after processing all tokens, the tags with the lowest probability are deleted, according to the Classifier Cut-Level parameter. This cycle is repeated a fixed number of times (iterations). After the iterations, only the package with the best maximal probability is left by Package Cut-off. At the end of the loop, the probabilities in each token (only in the winning package) are normalised. The process is repeated for each layer.
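A schematic sketch of one such phase is given below. The token and classifier structures are assumptions made for illustration, and the exact semantics of Cut-Level (treated here as a threshold relative to the best tag) is a guess, not a confirmed detail of TaKIPI; the package-cutoff step corresponds to the earlier sketch.

```python
def is_ambiguous(token, layer_attrs):
    keys = {tuple(t["attrs"].get(a) for a in layer_attrs) for t in token["tags"]}
    return len(keys) > 1

def keep_best_package(tags, layer_attrs):
    """Package Cut-off: keep the package containing the most probable tag."""
    key = lambda t: tuple(t["attrs"].get(a) for a in layer_attrs)
    best_key = key(max(tags, key=lambda t: t["p"]))
    return [t for t in tags if key(t) == best_key]

def run_phase(sentence, layer_attrs, classifiers_for, cut_level=0.1, iterations=1):
    for _ in range(iterations):
        # Classifiers Manager: apply every matching classifier to every ambiguous token.
        for token in sentence:
            if not is_ambiguous(token, layer_attrs):
                continue
            for clf in classifiers_for(token, layer_attrs):
                for tag in token["tags"]:
                    tag["p"] *= clf(tag, token, sentence)
        # Classifier Cut-Level: delete the tags with the lowest probabilities.
        for token in sentence:
            best = max(t["p"] for t in token["tags"])
            token["tags"] = [t for t in token["tags"] if t["p"] >= cut_level * best]
    # Package Cut-off and normalisation of the winning package.
    for token in sentence:
        token["tags"] = keep_best_package(token["tags"], layer_attrs)
        total = sum(t["p"] for t in token["tags"])
        for t in token["tags"]:
            t["p"] /= total
```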
3 Learning
For the description of the hand-written rules and the learning examples the same language has been used, namely the language JOSKIPI [10]. The applied hand-written rules come from [10] as well. As noted above, the manually disambiguated part of IPIC (further called Learning IPIC, or LIPIC) consists of only 885 669 tokens, including punctuation and 11 576 tokens unknown to Morfeusz. On the basis of an analysis of a part of LIPIC of about 655 000 tokens, we identified all possible ambiguity classes for all layers. We selected some ambiguity classes, called supported classes, that are sufficiently supported by examples (a heuristic criterion of having about 100 examples) or are necessary according to the inheritance structure, i.e. because of the lack of a linguistically reasonable superclass. Sets of learning examples are generated only for supported classes, but according to the inheritance hierarchy each token belonging to one of the non-supported classes is the source of learning examples added to the sets of its superclasses. We tested several depths of inheritance, finally choosing the depth equal to 0 (see Sec. 4) as giving the best results. A learning example is a sequence of values produced by a sequence of operators defined in JOSKIPI for the given ambiguity class. Operators are functions taking the state of the context of some token and returning a value. There are three main classes of operators constructed from JOSKIPI primitives [10]:
1. simple operators — read some tag attribute and return a set of values (singletons for non-ambiguous attributes),
2. test operators — evaluate some logical condition and return a boolean value, a string (some token), or a set of strings (an ambiguous base form),
3. conditional operators — an operator plus a test operator; return the empty value when the test is not fulfilled, otherwise the value of the operator.
Simple operators can read any attribute of any token. The token can be either specified by a distance from the centre or found by the JOSKIPI operators llook and rlook according to some logical test. A test can be a simple equality test, a relation between sets of values, a test of the possibility of morpho-syntactic agreement, or the fulfilment of some constructed complex condition in some part of the context (the test can pertain to any part of a sentence). A complex condition
can utilise a mechanism of variables over positions of tokens. An example of a complex operator used in learning is a test checking whether there is some potential subject of the sentence somewhere to the left of the token being disambiguated. Operators do not cross sentence boundaries. It is worth noticing that similar operators are used as premises of the hand-written rules. Generation of learning examples for the first layer is done in one go for all ambiguity classes of this layer by sequentially setting the centre of the context at subsequent ambiguous tokens and applying the operators of the appropriate ambiguity classes. For the next layers the process is identical, except for the initial preparation of the learning data. During tagging, the DTs (or generally classifiers) of the second layer are applied to tokens that are already partially disambiguated. During learning, we have to create a similar situation. This is achieved by learning partial taggers for subsequences of layers up to the 'full tagger'. Before the preparation of learning examples for the k-th layer, a partial tagger for k − 1 layers is applied and the attributes of all k − 1 layers are disambiguated. This gradual learning appeared to be superior in comparison to an 'ideal' disambiguation based on the manual disambiguation of LIPIC. Construction of DTs by the C4.5 algorithm is completely independent from example generation, and is done by application of the original C4.5 software [12].
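The three operator kinds described above can be pictured as functions from a (sentence, position) context to a value. The following is only a Python illustration of this idea, not JOSKIPI itself, which is a dedicated language described in [10].

```python
def simple(attr, offset=0):
    """Read an attribute of the token at a given distance from the centre;
    returns a set of values (a singleton when the attribute is unambiguous)."""
    def op(sentence, pos):
        i = pos + offset
        if 0 <= i < len(sentence):
            return {t["attrs"].get(attr) for t in sentence[i]["tags"]}
        return set()
    return op

def test(predicate):
    """Evaluate a logical condition on the context; returns a boolean."""
    return lambda sentence, pos: predicate(sentence, pos)

def conditional(op, tst):
    """Return the operator's value only when the test holds, otherwise None."""
    return lambda sentence, pos: op(sentence, pos) if tst(sentence, pos) else None

def learning_example(operators, sentence, pos):
    """A learning example: the sequence of values of the operators chosen
    for the ambiguity class of the token at position `pos`."""
    return [op(sentence, pos) for op in operators]
```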
4 Variations and Parameters
During learning of the DTs we tested several values of the pruning confidence level [12], achieving the best results with 90%. It means that the number of pruned branches is small. A sequence of operators for each DT has been chosen on the basis of a heuristic analysis of results. Many important variants of the tagger's architecture were tested. The level of inheritance between ambiguity classes varied from 2 to 0. It expresses how many levels up the hierarchy we go looking for superclasses matching the given token. The value 0 means that only the classes on the first matching level are taken for the given token (i.e. the exact class if it exists, otherwise all classes at the minimal distance in the hierarchy). Values greater than 0 enlarge the set of superclasses applied. The value 0 during learning (Trees Inheritance Level in learning) and tagging (Trees Inheritance Level in tagging) resulted in the best accuracy: during tagging each token is classified by DTs learned from similar contexts. We also tried to apply a mechanism of iterative improvement. Learning sequences of operators were extended with additional operators reading the attributes of the winning tags in the context. DTs were applied several times during one phase, in several iterations. In these versions cut-off and normalisation were applied after each iteration. However, the results were lower. Probably, the number of different combinations of values increased and the situations in the context did not match the learned examples often enough.
5 Genetic Algorithm
The basis is a standard Genetic Algorithm (GA), e.g. [3]. A specimen is a tagger together with its learning process. A chromosome is a set of parameters specifying the selected important features of learning and tagging. On the basis of the empirical observations and knowledge about the problem (see Sec. 4), we used the GA to optimise the following parameters:
– Pruning Confidence Level of the trees, belonging to ⟨1, 100⟩,
– Cut-Level, belonging to ⟨0.001, 1.000⟩,
– Classifiers Iterations, belonging to ⟨1, 10⟩,
– Trees Inheritance Level in tagging, belonging to ⟨0, 2⟩,
– Trees Inheritance Level in learning, belonging to ⟨0, 2⟩.
The fitness function is the tagger's accuracy measured in tests on LIPIC, during which the tagger parameters are set to the values from a chromosome and the tagger has previously been learned according to the corresponding learning parameters. Thus, testing a single specimen requires going through the whole learning and tagging cycle, and therefore we had to limit the size of the population to 10 specimens. The number of the algorithm's loops was set to 100 and the mutation probability was set to 0.02. Additionally, Gray coding and homogeneous (uniform) crossover were implemented. As the offspring selection method, we have chosen elite selection. The values of the parameters are coded and decoded using the following formulas [7]:

$$(b - a) \cdot 10^k \le 2^{l_i} - 1 \qquad (3)$$

$$l = \sum_{i=1}^{m} l_i \qquad (4)$$

$$p_i = a + (01001\ldots01)_{10} \cdot \frac{b - a}{2^{l_i} - 1} \qquad (5)$$

where $a$ and $b$ define the range of the values, $k$ defines the precision level of the values, $l_i$ is the number of bits required to code the $i$-th value, $l$ is the length of the chromosome, $m$ is the number of the algorithm's parameters, and $p_i$ is the value of the $i$-th parameter decoded from the chromosome.
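The encoding and decoding of Eqs. (3)-(5) can be sketched as follows: each parameter gets the smallest number of bits $l_i$ satisfying Eq. (3), and a bit slice of the chromosome is decoded back into the interval ⟨a, b⟩. Gray decoding is included because the paper uses Gray coding; the function names and the example bit string are illustrative.

```python
import math

def bits_for(a: float, b: float, k: int) -> int:
    """Smallest l_i with (b - a) * 10^k <= 2^l_i - 1   (Eq. 3)."""
    return math.ceil(math.log2((b - a) * 10**k + 1))

def gray_to_int(bits: str) -> int:
    """Convert a Gray-coded bit string to an ordinary integer."""
    value, prev = 0, 0
    for g in bits:
        prev ^= int(g)                     # b_i = b_{i-1} XOR g_i
        value = (value << 1) | prev
    return value

def decode(bits: str, a: float, b: float) -> float:
    """p_i = a + value * (b - a) / (2^l_i - 1)   (Eq. 5)."""
    li = len(bits)
    return a + gray_to_int(bits) * (b - a) / (2**li - 1)

# Example: Cut-Level in <0.001, 1.000> with 3-digit precision needs 10 bits,
# since (1.000 - 0.001) * 10^3 = 999 <= 2^10 - 1 = 1023.
li = bits_for(0.001, 1.000, 3)
print(li, decode("0110001010", 0.001, 1.000))
```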
6 Evaluation and Conclusions
The reference point for the accuracy achieved with the genetic algorithm is an earlier ten-fold test on LIPIC performed with the empirically chosen parameters: ⟨Pruning Confidence Level = 90%, Classifiers Iterations = 1, Cut-Level = 0.1, Trees Inheritance Level in tagging = 0, Trees Inheritance Level in learning = 0⟩. Next, we performed a ten-fold test of the tagger with the parameters set to the best values obtained from the GA: ⟨Pruning Confidence Level = 99%, Classifiers Iterations = 1, Cut-Level = 0.384, Trees Inheritance Level in tagging = 0, Trees Inheritance Level in learning = 1⟩. The results are presented in Tab. 1.
layer                all tokens  all max.  all min.  ambiguous  amb. max.  amb. min.
Manually adjusted parameters
1 (POS)                   99.18     99.23     99.12      94.32      94.61      93.92
2 (POS, nmb, gnd)         97.12     97.25     96.99      91.10      91.40      90.75
Automatically adjusted parameters
1 (POS)                   99.19     99.25     99.14      94.39      94.77      94.05
2 (POS, nmb, gnd)         97.50     97.40     97.59      92.25      92.47      92.03

Table 1. Accuracy of the tagger with manually and automatically adjusted parameters
The algorithm confirmed that the number of iterations set to one gives the best score of the tagger, and thus the mechanism of iterative improvement seems to be ineffective. Also, a Trees Inheritance Level in learning greater than zero lowered the accuracy of the tagger. In this paper we treat Cut-Level as a single parameter; however, it seems more accurate to introduce three different parameters, one for each layer. The true power of our tagger lies in the operators. Taking into account the compound structure of the operators, which are written manually, in a new approach we are planning to use Evolutionary Programming (EP) to generate them automatically.
References

1. Przepiórkowski, A.: The IPI PAN Corpus: Preliminary Version. Institute of Computer Science PAS (2004)
2. Dębowski, Ł.: Trigram morphosyntactic tagger for Polish. In: Kłopotek, M.A., Wierzchoń, S.T., Trojanowski, K. (eds.): Intelligent Information Processing and Web Mining. Proceedings of the International IIS:IIPWM'04 Conference held in Zakopane, Poland, May 17-20, 2004. Springer Verlag (2004) 409–413
3. Goldberg, D.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley (1989)
4. Hajič, J., Krbec, P., Květoň, P., Oliva, K., Petkevič, V.: Serial combination of rules and statistics: A case study in Czech tagging. In: Proceedings of the 39th Annual Meeting of the ACL, Morgan Kaufmann Publishers (2001) 260–267
5. Kłopotek, M.A., Wierzchoń, S.T., Trojanowski, K. (eds.): Intelligent Information Processing and Web Mining. Proceedings of the International IIS:IIPWM'05 Conference held in Gdańsk, Poland, June, 2005. Advances in Soft Computing. Springer, Berlin (2005)
6. Kłopotek, M.A., Wierzchoń, S.T., Trojanowski, K. (eds.): Intelligent Information Processing and Web Mining. Proceedings of the International IIS:IIPWM'06 Conference held in Zakopane, Poland, June, 2006. Advances in Soft Computing. Springer, Berlin (2006)
7. Kwaśnicka, H. (ed.): Zeszyt Naukowy Sztuczna Inteligencja Nr 1. Algorytmy ewolucyjne – przykłady zastosowań [Evolutionary algorithms – examples of applications]. Prace Naukowe Wydziałowego Zakładu Informatyki Politechniki Wrocławskiej, Koło Sztucznej Inteligencji Cjant. Oficyna Wydawnicza PWr, Wrocław (2002)
8. Màrquez, L.: Part-of-speech Tagging: A Machine Learning Approach based on Decision Trees. PhD thesis, Universitat Politècnica de Catalunya (1999)
9. Piasecki, M., Gaweł, B.: A rule-based tagger for Polish based on Genetic Algorithm. [5]
10. Piasecki, M., Godlewski, G.: Reductionistic, Tree and Rule Based Tagger for Polish. [6]
11. Quinlan, J.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo (1993)
12. Quinlan, R.: Ross Quinlan's Personal Homepage. http://www.rulequest.com/Personal/c4.5r8.tar.gz (2005)
13. Woliński, M.: Morfeusz — a practical tool for the morphological analysis of Polish. [6]