Multiple Model Text Normalization for the Polish Language

Łukasz Brocki, Krzysztof Marasek, and Danijel Koržinek

Polish-Japanese Institute of Information Technology, Warsaw, Poland

Abstract. The following paper describes a text normalization program for the Polish language. The program is based on a combination of rule-based and statistical approaches to text normalization. The scope of all words modelled by this solution was divided in three ways: by grammar features, by word lemmas and by the words themselves. Each word in the lexicon was assigned a suitable element from each of these domains. Finally, three n-gram models operating in the domains of grammar classes, word lemmas and individual words were combined using weights adjusted by an evolution strategy to obtain the final solution. The tool is also capable of producing grammar tags on words to aid further language model creation.

1 Introduction

In the field of Natural Language Processing there has always been a grave demand for large quantities of textual data [2, 3]. This is especially true for Automatic Speech Recognition (ASR) and Machine Translation (MT). To make such software as effective as possible, texts need to undergo several stages of preparation, one of the most essential being normalization. During the development of Language Models (LM), which are often used in ASR and MT tasks, corpora of over 100 million words are frequently utilized. Manual processing of such gigantic amounts of text, even by a large team of people, would be at best ineffective and expensive, if not simply impossible. The only feasible solution is the use of computer programs which can perform the same task in a reasonable amount of time.

Text normalization is the process of converting any abbreviations, numbers and special symbols into corresponding word sequences. The procedure must produce texts consisting exclusively of words from a given language. In particular, normalization is responsible for:

1. expansion of abbreviations in the text into their full form,
2. expansion of any numbers (e.g. Arabic, Roman, fractions) into their appropriate spoken form,
3. expansion of various forms of dates, hours, enumerations and articles in contracts and legal documents into their proper word sequences.
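To make these responsibilities concrete, the following is a minimal, hypothetical sketch of rule-driven expansion in Python. It is not the authors' implementation; the abbreviation table, the digit lexicon and the example sentence are illustrative assumptions.

```python
import re

# Illustrative expansion rules (hypothetical, not the authors' rule set).
ABBREVIATIONS = {
    "np.": "na przykład",   # "e.g."
    "dr": "doktor",         # base (nominative) form only; inflection is ignored here
    "km": "kilometrów",     # one possible inflected form of "kilometre"
}

# Spoken forms of the digits 0-9 in the nominative case (illustrative).
DIGITS = ["zero", "jeden", "dwa", "trzy", "cztery",
          "pięć", "sześć", "siedem", "osiem", "dziewięć"]

def expand_token(token: str) -> str:
    """Expand a single token into words, if a rule applies."""
    if token in ABBREVIATIONS:
        return ABBREVIATIONS[token]
    if re.fullmatch(r"\d+", token):
        # Naive digit-by-digit spell-out; a real normalizer produces
        # properly inflected cardinal or ordinal forms instead.
        return " ".join(DIGITS[int(d)] for d in token)
    return token

def normalize(text: str) -> str:
    return " ".join(expand_token(t) for t in text.split())

print(normalize("dr Kowalski przebiegł 5 km"))
# -> "doktor Kowalski przebiegł pięć kilometrów"
```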


This task, although seemingly simple, is in fact quite complicated, especially in languages like Polish, which, for example, has 7 cases and 8 gender forms for nouns and adjectives, with additional dimensions for other word classes. That is why most abbreviations have multiple possible expansions and each number notation has more than a dozen possible outcomes. It is worth noting that in Polish the dictionary of number-related words alone contains almost 1000 entries. This is because each number can be declined by both case and gender/number, which in Polish most often changes the suffix of a word, thus producing a completely new word (if unique letter sequences are treated as unique words). The number of possible outcomes of normalizing a word sequence containing abbreviations and numbers grows exponentially with the number of words that need to be expanded. It is also worth mentioning that sentences in Polish follow a strict grammatical structure, and adjoining words in a sequence have to agree grammatically. Example: Wypadek był na sto dziewięćdziesiątym piątym kilometrze autostrady. (Translation: The accident happened on the one hundred and ninety-fifth kilometer of the highway.)

The authors needed a program to normalize texts in Polish in order to prepare corpora for training the language models used in Automatic Speech Recognition and Machine Translation. This work describes the technical aspects of the text normalization tool designed specifically for the Polish language.
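The combinatorial blow-up described above can be illustrated with a short sketch. This is a simplified illustration, not the authors' code; the inflected forms listed for "dr", "5" and "km" are a small illustrative subset of the real Polish paradigms.

```python
from itertools import product

# A few inflected expansions per token (illustrative subset of the full paradigms).
CANDIDATES = {
    "dr": ["doktor", "doktora", "doktorowi", "doktorem", "doktorze"],
    "5":  ["pięć", "pięciu", "pięcioma"],
    "km": ["kilometr", "kilometra", "kilometrów", "kilometrze"],
}

def expansions(tokens):
    """Yield every possible expansion of a token sequence."""
    options = [CANDIDATES.get(t, [t]) for t in tokens]
    for combo in product(*options):
        yield " ".join(combo)

tokens = ["dr", "Nowak", "mieszka", "5", "km", "dalej"]
variants = list(expansions(tokens))
# 5 * 3 * 4 = 60 candidates for just three expandable tokens; a language
# model is needed to pick the grammatically coherent one.
print(len(variants))
```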

2 Text Acquisition

The first step in building the software was the acquisition of a large quantity of textual data. The authors managed to obtain text corpora surpassing 1 billion words in size. They originated from various online newspapers (∼48%), specialized newspapers (∼9%), legal documents (∼18%), Wikipedia (∼9%), parliament transcripts (∼9%), as well as radio, TV, Usenet, subtitles, etc. Most of the data was gathered directly from Internet sources using custom software. It was later balanced during the training phase to avoid overfitting to particular domains and styles of speech. Most of these texts (excluding the finished corpora) were acquired over a span of several months. The task workload is estimated at around 6-12 man-months.

3 Text Preparation

All the acquired texts were initially processed to remove any unnecessary data such as tags, formatting and words out of context (e.g. values from tables or contents of ads). This processing was done by a single person over a span of about 2 months. Following that, texts suspected of containing too much garbage were removed from the dataset. This was done using a program that counted the words found in a Polish spelling word list; the discarded texts contained either a large number of typos or non-Polish words. Finally, all the data was gathered and saved in a simple and manageable text format. It is worth noting, however, that at that point the most important and most difficult task, text normalization, had still not been performed.

The authors decided to use both rule-based mechanisms and statistical language models in the text normalization software. Paradoxically, to build even the simplest statistical language model for use in this tool, already normalized texts were required. To that end, a small group of linguistics students was trained to create a small, balanced, manually normalized corpus, which was later used to build the initial language model for the first iteration of the text normalization software. A collection of carefully chosen texts amounting to 2.5 million words was used for this manual corpus. A group of 8 people performed this task over a span of 4 months.

The word list was split into several smaller sub-lists. Each sub-list was assigned to two independent linguistics students who did not know each other and had no way of communicating. The results were then merged by a program that also generated a list of all inconsistencies between the two annotations. This list of inconsistencies was finally analyzed and corrected by a third, independent person, holding a PhD in linguistics and the most experienced member of the group. This stage took another 4 months to complete. The final result was a dictionary of the most frequent Polish words, their grammatical descriptions and lemmas.

In Polish, many unique letter sequences can be derived from more than one word lemma. That is why it was necessary to re-evaluate the whole normalized corpus once again in order to disambiguate all the words given their actual context. This was made easier thanks to a program that looked for such words within the corpus and allowed for appropriate alterations. This stage took about a month. The outcome of the 9 months of work by the whole group was a balanced, manually normalized text corpus with disambiguated word lemmas and grammatical features. A dictionary of the most frequent words, abbreviations and all common number forms, with lemmas and grammar features, was also created.
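As a rough illustration of the garbage-filtering step mentioned above (counting words against a Polish spelling word list), the following sketch keeps a document only if a sufficient fraction of its tokens appears in the list. The 0.8 threshold and the file names are assumptions for illustration, not values reported by the authors.

```python
import re

def load_wordlist(path: str) -> set:
    """Load a Polish spelling word list, one word per line (assumed format)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def looks_clean(text: str, wordlist: set, threshold: float = 0.8) -> bool:
    """Accept a document if enough of its alphabetic tokens are known Polish words.

    The 0.8 threshold is an assumption; the paper does not report the exact
    criterion used to discard noisy texts.
    """
    tokens = re.findall(r"[^\W\d_]+", text)
    if not tokens:
        return False
    known = sum(1 for t in tokens if t.lower() in wordlist)
    return known / len(tokens) >= threshold

# Hypothetical usage:
# wl = load_wordlist("polish_wordlist.txt")
# corpus = [doc for doc in raw_documents if looks_clean(doc, wl)]
```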

4 Synthetic Texts

After a few iterations of the software it was observed that most errors appeared in sequences that occur least frequently in the training set. These were sequences with a generally clear grammatical structure, but the text corpus used was too small for the language models to reflect that structure. A common technique for improving statistical language models is to generate synthetic texts with a previously established grammatical structure [13]. Synthetic texts were generated to contain context-free sequences of several words from domains including numbers (with various units of measurement), dates and times. All the words contained within the synthetic texts were also placed in a separate dictionary. These texts obviously did not require any normalization or disambiguation, because both were already handled during the generation process. The texts were continually generated and added throughout the second half of the project. The final list contained around 3 million words.
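A sketch of how such synthetic, already-normalized sequences could be generated is shown below. This is an illustration under the assumption of simple quantity-plus-unit templates; the word lists are illustrative and do not reproduce the authors' generator, which handled full number, date and time paradigms.

```python
import random

# Illustrative, already-normalized building blocks (a tiny subset of the real forms).
# Only 2-4 are used here because numbers from five upward govern the genitive
# plural in Polish (e.g. "pięć kilometrów"), which a real generator must handle.
NUMBERS = ["dwa", "trzy", "cztery"]
UNITS = ["kilometry", "kilogramy", "litry", "metry"]  # plural nominative forms

def synth_sentence() -> str:
    """Generate one context-free, already-normalized quantity phrase."""
    return f"{random.choice(NUMBERS)} {random.choice(UNITS)}"

def synth_corpus(n_sentences: int) -> list:
    """Generate a small synthetic corpus; the real generator also covered
    dates, times and many grammatical cases."""
    return [synth_sentence() for _ in range(n_sentences)]

for line in synth_corpus(3):
    print(line)
```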

5 Development of Language Models

Many experiments were carried out in order to create the best working language model for use in the text normalization software. The best configuration found is described in this section.

The word lexicon contained 39362 words, 14631 word lemmas and 603 grammar classes. Given that the word context during normalization of Polish texts often exceeds 5 words (e.g. while expanding long numbers), it was established that the best course of action would be to create long-range n-gram models. A model with a range of n=3 was used for the individual words, n=5 for word lemmas and n=7 for grammar classes. Since the manually normalized corpus was rather small, the experiments showed that expanding the range of the individual word model did not drastically improve its performance (as witnessed by perplexity measures). A different result was observed with the word lemma and grammar models, however. Because both the number of lemmas and the number of grammar classes are considerably smaller than the number of individual words, these units were better covered by the text corpus, which allowed for increasing the context length of their language models. The ratio 3-5-7 was established as optimal for the given corpus, dictionary and domain; larger contexts did not decrease the perplexity significantly. It is worth noting that the 7-gram range for grammar classes has a significant advantage over lower ranges, because in Polish a word at the start of the sentence can determine the case for the entire sentence and thus affect the morphology of words even at its end.

To develop the language model, only the normalized corpus of 2.5 million words and the synthetic corpus of 3 million words were used. The training data consisted of 10 collections, each containing texts from a certain domain. Synthetic texts comprised 4 different domains and the manually normalized texts covered 6 different domains amounting to 2 million words; 250 thousand words each were chosen from the manual corpus for the testing and development sets. All the models were linearly interpolated. A (µ + λ) Evolution Strategy [4, 5] that minimized the perplexity of the final model (consisting of 3 smaller models: word, lemma and grammar) on the development set was used. The Evolution Strategy optimized hundreds of parameters, specifically:

1. weights of the 30 text domain sets (10 parameters for each model),
2. linear interpolation weights for all n-grams in all models; the weights depended on the frequency of occurrence of a given n-gram, with 5 frequency ranges,
3. linear interpolation weights for the word, lemma and grammar class models (combining the smaller models into one larger model).

After preparing all the data and tools in the previous stages, it takes about a week to generate a new language model. The best model trained on the normalized data achieved a perplexity of 376 on an independent test set (around 400 thousand words). Perplexity [2] is a common benchmark for estimating the quality of language modeling; lower values are generally preferable, although they obviously depend on both the effectiveness of the model and the complexity of the test data. In our experiments, the value is not particularly small, but it is worth noting that the test set contained texts from varying domains and not a single domain, as is the case in the majority of domain-specific experiments, where state-of-the-art models achieve values well below 100.
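The combination of the three models and the (µ + λ) Evolution Strategy can be outlined as follows. This is a simplified, hypothetical sketch: the probability functions, mutation scale and population sizes are placeholders, and the per-domain and per-frequency-range weights that the real system also optimized are omitted here.

```python
import math
import random

def interpolated_logprob(sentence, lm_word, lm_lemma, lm_gram, w):
    """Log-probability of a token sequence under a linear interpolation of the
    word (n=3), lemma (n=5) and grammar-class (n=7) models. The lm_* arguments
    are assumed callables returning P(token_i | history) in their own domain."""
    logp = 0.0
    for i in range(len(sentence)):
        p = (w[0] * lm_word(sentence, i)
             + w[1] * lm_lemma(sentence, i)
             + w[2] * lm_gram(sentence, i))
        logp += math.log(max(p, 1e-12))  # floor to avoid log(0)
    return logp

def perplexity(dev_set, lms, w):
    """Perplexity of the interpolated model on a development set."""
    n_tokens = sum(len(s) for s in dev_set)
    total = sum(interpolated_logprob(s, *lms, w) for s in dev_set)
    return math.exp(-total / n_tokens)

def evolve_weights(dev_set, lms, mu=5, lam=20, generations=50, sigma=0.05):
    """(mu + lambda) Evolution Strategy minimizing development-set perplexity."""
    def renorm(w):
        s = sum(max(x, 1e-6) for x in w)
        return [max(x, 1e-6) / s for x in w]

    population = [renorm([random.random() for _ in range(3)]) for _ in range(mu)]
    for _ in range(generations):
        offspring = [renorm([x + random.gauss(0.0, sigma)
                             for x in random.choice(population)])
                     for _ in range(lam)]
        # (mu + lambda) selection: parents and offspring compete together.
        population = sorted(population + offspring,
                            key=lambda w: perplexity(dev_set, lms, w))[:mu]
    return population[0]
```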

6 Text Normalizer Architecture

The most important components of the text normalization software are the decoder, the language model and a set of expansion rules. The expansion rules are used in the expansion of commonly used abbreviations and of written date and number forms. A synchronous Viterbi-style decoder that generated a list of hypotheses ordered by the scores retrieved from the language model was used. Each time the text contained a word sequence that could be expanded, all the possible expansions were fed into the decoder. Because the expansion of long numbers or of some abbreviations requires several words to be added at once, hypotheses of varying lengths may end up competing against each other. This was remedied by normalizing the hypotheses' probabilities by their lengths. Such a normalization is equivalent to adding a heuristic component, as commonly used in asynchronous decoders like A*. The decoding process is generally quite fast, but word sequences that contain many abbreviations and numbers can severely slow it down. For this reason, the maximum number of hypotheses was set to 2000.

Over 1500 different abbreviation expansion rules were created manually, including a number of algorithms for parsing date and hour formats, converting Roman numerals to Arabic ones, parsing real numbers, and improving the robustness of the normalizer when source texts were missing decimal separators (marked by commas in Polish) or had them replaced by spaces or periods.
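A schematic of the decoding strategy described above (synchronous, beam-limited, with hypothesis scores normalized by length so that expansions of different lengths can compete) might look like the sketch below. The `expansions` rule function and the `lm_logprob` scorer are assumed interfaces, not the authors' actual API; the beam size of 2000 is the one quoted in the text.

```python
MAX_HYPOTHESES = 2000  # beam limit quoted in the text

def normalize_text(tokens, expansions, lm_logprob):
    """Synchronous beam decoding over possible expansions.

    expansions(token) -> list of word sequences (each a list of words);
    lm_logprob(words) -> log-probability of a word sequence under the
    combined language model. Both are assumed interfaces.
    """
    beam = [([], 0.0)]  # (expanded words so far, accumulated log-probability)
    for token in tokens:
        candidates = expansions(token) or [[token]]
        new_beam = []
        for words, _ in beam:
            for expansion in candidates:
                extended = words + list(expansion)
                # Rescoring the whole prefix keeps the sketch simple; a real
                # decoder would extend the score incrementally.
                new_beam.append((extended, lm_logprob(extended)))
        # Length normalization lets hypotheses of different lengths compete,
        # similarly to the heuristic term of asynchronous decoders such as A*.
        new_beam.sort(key=lambda h: h[1] / max(len(h[0]), 1), reverse=True)
        beam = new_beam[:MAX_HYPOTHESES]
    return beam[0][0]
```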

7 Experiment Results

Because normalized fragments sometimes expand into sequences of several words, a simple word error rate seemed inappropriate for this purpose. Instead, a fragment error rate was evaluated. To that end, a special test set was chosen from independent data coming from the same sources as the training data. It was processed by the normalizer and then manually corrected by a linguist. Each fragment that needed normalization was analyzed by the expert and marked either as correct or incorrect. The test set contained 1845 normalized fragments, 1632 of which were normalized correctly and 213 incorrectly, giving 88.5% accuracy.
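For completeness, the reported figure can be reproduced directly from the fragment counts quoted above; the minimal computation is shown here.

```python
correct, incorrect = 1632, 213        # fragment judgments reported in the text
total = correct + incorrect           # 1845 normalized fragments
accuracy = correct / total
print(f"fragment accuracy: {accuracy:.1%}")  # -> 88.5%
```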

8 Conclusion

This work described the development of software used for the normalization of texts in the Polish language. The development took considerable effort, seeing as it was necessary to manually build a training corpus, a dictionary and a set of rules for abbreviation expansion. The project was assisted by 8 students of linguistics. The result of this work is a program that is able to normalize a 100 million word corpus on a modern computer in 2-3 days.

This work represents one of the few efforts in normalization of large quantities of textual data for Polish [1] and arguably the first used for ASR and MT purposes. The results show that it is possible to create a reasonably accurate working system across domains. It is worth noting, however, that normalization of domain-independent data can be problematic. For example, systems trained on news data tend to produce many errors when used on legal documents and vice versa. More experiments on domain constraints and adaptation are necessary.

Acknowledgments

The work was sponsored by a research grant from the Polish Ministry of Science, no. N516 519439.

References

1. Graliński, F., Jassem, K., Wagner, A., Wypych, M.: Text Normalization as a Special Case of Machine Translation. Proceedings of the International Multiconference on Computer Science and Information Technology, Volume 1, Wisła, Poland, 2006
2. Jelinek, F.: Statistical Methods for Speech Recognition. MIT Press, Cambridge, MA, 1998
3. Allison, B., Guthrie, D., Guthrie, L.: Another Look at the Data Sparsity Problem. Text, Speech and Dialogue, Lecture Notes in Computer Science, 2006
4. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. Springer-Verlag, 1994
5. Michalewicz, Z., Fogel, D.B.: How to Solve It: Modern Heuristics. Springer-Verlag, 1999
6. Przepiórkowski, A.: Korpus IPI PAN. Wersja wstępna / The IPI PAN Corpus: Preliminary Version. IPI PAN, Warszawa, 2004
7. Savary, A., Rabiega-Wiśniewska, J., Woliński, M.: Inflection of Polish Multi-Word Proper Names with Morfeusz and Multiflex. Aspects of Natural Language Processing, Vol. 5070, Springer, 2009, pp. 111-141
8. http://sgjp.pl/morfeusz/
9. Bilmes, J.A., Kirchhoff, K.: Factored Language Models and Generalized Parallel Backoff. In Proceedings of HLT/NAACL, 2003, pp. 4-6
10. Chen, S.F., Goodman, J.T.: An Empirical Study of Smoothing Techniques for Language Modeling. Computer Speech and Language, 13(4):359-393, 1999
11. Katz, S.M.: Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(3):400-401, 1987
12. Kneser, R., Ney, H.: Improved Backing-off for n-gram Language Modeling. In International Conference on Acoustics, Speech and Signal Processing, pp. 181-184, 1995
13. Chung, G., Seneff, S., Wang, C.: Automatic Induction of Language Model Data for a Spoken Dialogue System. 6th SIGdial Workshop on Discourse and Dialogue, Lisbon, Portugal, September 2-3, 2005
