Word Clustering for a Slovak Class-Based Language Model

HLÁDEK Daniel, STAŠ Ján, JUHÁR Jozef
Technical University of Košice, Faculty of Electrical Engineering and Informatics, Department of Electronics and Multimedia Communications, Park Komenského 13, 042 00 Košice, Slovak Republic, [daniel.hladek, jan.stas, jozef.juhar]@tuke.sk
Abstract – This paper proposes a method for designing a better language model for a highly inflectional language using a limited amount of training data. The model is enriched with information about the grammar of the language. This information is extracted by a word-clustering function and used to create a class-based language model that describes relations between the created word clusters. The word-clustering function is heuristically designed and takes the morphological structure of words into account. The class-based models are then interpolated with the baseline language model. Evaluation of the models shows a significant decrease of the perplexity.

Keywords: word clustering, class-based language model, suffix identification

I. INTRODUCTION

Language modeling is an important part of automatic speech recognition. The biggest problem of n-gram language models is data sparsity: the training set does not contain enough data to correctly estimate the probability of a word based on its history using the most common maximum likelihood method. It is therefore always necessary to use a technique that estimates a probability also for a word and history that were not seen in the training corpus. This problem is even bigger in the case of a highly inflectional language [1, 2] with rich morphology and non-mandatory sentence word order, such as Slovak. A large vocabulary means that the number of word sequences required in the training corpus is even higher. Most of the existing approaches [3] are designed for languages similar to English and have limited use for this kind of language.

This paper is structured as follows. The first section gives a brief introduction to the state-of-the-art methods of language modeling and of dealing with the data sparsity problem. In the next section, a methodology for composing a class-based language model using linear interpolation is introduced. Then, a basic evaluation of the proposed method against the baseline language model is performed. The conclusion summarizes the whole approach and outlines future directions of the research.
II. STATE OF THE ART

A. Language Model Smoothing
The classical way of dealing with missing word sequences in the training corpus, the back-off scheme [4], is not sufficient for efficiently estimating the probability of a given word sequence, and the probabilities of n-grams in the language model have to be further adjusted. Language model smoothing techniques move part of the probability mass from the events that were seen in the training corpus to the events that were unseen. Common methods are based on adjusting the counts of n-grams, such as the Witten-Bell [5] or modified Kneser-Ney [6] algorithms. The problem of this approach is that these methods are designed for languages that are not very morphologically rich. As is shown in [7, 8], this kind of smoothing does not bring the expected positive effect for highly inflectional languages with large vocabularies.
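To make the idea of moving probability mass concrete, the following sketch implements interpolated Witten-Bell smoothing for a bigram model. This is an illustration of the general technique, not the implementation used in this paper; the class name and the data layout are our own.

```python
from collections import Counter, defaultdict

class WittenBellBigram:
    """Interpolated Witten-Bell smoothing for a bigram model:
    P(w | h) = (c(h, w) + T(h) * P_uni(w)) / (c(h) + T(h)),
    where T(h) is the number of distinct word types seen after history h."""

    def __init__(self, sentences):
        self.bigrams = Counter()
        self.unigrams = Counter()
        self.history = Counter()            # c(h)
        self.followers = defaultdict(set)   # h -> distinct types seen after h
        for sent in sentences:
            tokens = ["<s>"] + sent + ["</s>"]
            for prev, word in zip(tokens, tokens[1:]):
                self.bigrams[(prev, word)] += 1
                self.unigrams[word] += 1
                self.history[prev] += 1
                self.followers[prev].add(word)
        self.total = sum(self.unigrams.values())
        self.vocab = len(self.unigrams)

    def p_uni(self, word):
        # Witten-Bell recursion at the unigram level with a uniform lower
        # distribution; this reduces to add-one smoothing over the vocabulary.
        return (self.unigrams[word] + 1.0) / (self.total + self.vocab)

    def prob(self, word, prev):
        t = len(self.followers.get(prev, ()))
        if t == 0:                          # unseen history: back off fully
            return self.p_uni(word)
        return (self.bigrams[(prev, word)] + t * self.p_uni(word)) / \
               (self.history[prev] + t)
```

On unseen histories the estimate backs off to the smoothed unigram distribution, which is how probability mass reaches events never observed in the training corpus.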
B. Linear Interpolation
Another common approach to estimating a language model from sparse data is linear interpolation, also called Jelinek-Mercer smoothing [9]. This method allows a combination of multiple independent sources of knowledge into one, which is then used to compose the final language model. In the case of a trigram language model, this approach can calculate the final probability as a linear combination of unigram, bigram and trigram maximum likelihood estimates. Linear interpolation is not the only method of combining multiple knowledge sources; other possible approaches are maximum entropy [10], log-linear interpolation [11] or generalized linear interpolation [12]. In the case of classical linear interpolation, the final probability is calculated as a linear combination of two sources $P_1$ and $P_2$ according to the equation:

$P = \lambda P_1 + (1 - \lambda) P_2$   (1)

The interpolation parameter $\lambda$ can be set empirically, or can be calculated by one of the optimization methods, e.g. by the expectation-maximization algorithm. The coefficient $\lambda$ has to be chosen such that the final language model composed from the training corpus best fits the target domain, represented by the testing corpus.
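A minimal sketch of Eq. 1 in Python follows, together with a simple grid search over $\lambda$ on held-out data. The grid search stands in for the expectation-maximization algorithm mentioned above, and the two model arguments are assumed to be callables returning P(word | history).

```python
import math

def interpolate(p1, p2, lam):
    """Eq. 1: linear combination of two probability estimates."""
    return lam * p1 + (1.0 - lam) * p2

def tune_lambda(model1, model2, heldout, steps=19):
    """Choose the lambda that minimizes perplexity on held-out data.
    model1 and model2 are assumed callables returning P(word | history);
    heldout is a list of (word, history) pairs from the target domain."""
    best_lam, best_ppl = 0.5, float("inf")
    for i in range(1, steps + 1):
        lam = i / (steps + 1)               # lambda in the open interval (0, 1)
        log_prob = sum(
            math.log(interpolate(model1(w, h), model2(w, h), lam))
            for w, h in heldout)
        ppl = math.exp(-log_prob / len(heldout))
        if ppl < best_ppl:
            best_lam, best_ppl = lam, ppl
    return best_lam, best_ppl
```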
C. Class-Based Language Models
As another approach to overcoming the problem of missing training data, class-based language models were proposed [13]. For a word-based n-gram language model, there is a probability value for each n-gram, as well as a back-off weight for lower-order n-grams. For the class-based model, on the other hand, a whole set of words is reduced to a single class and the class-based model describes statistical properties of that class. This approach offers the ability to group words into classes and work with a class as if it were a single word in the language model [14]. The advantage is that class-based models take into account dependencies of words not included in the training corpus. The same classical smoothing methods that were presented above can be used for a class-based language model as well. The probability of a word conditioned on its history, $P(w_i \mid w_{i-1} \dots w_{i-n+1})$, in the class-based language model can be described by the equation [13]:

$P(w_i \mid w_{i-1} \dots w_{i-n+1}) = P(c_i \mid c_{i-1} \dots c_{i-n+1}) \, P(w_i \mid c_i)$   (2)

where $P(c_i \mid c_{i-1} \dots c_{i-n+1})$ is the probability of the class $c_i$ to which the word $w_i$ belongs, based on the class history. In this equation, the probability of a word $w$ given its history of (n−1) words is calculated as the product of the class-history probability and the word-class probability $P(w_i \mid c_i)$.
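Eq. 2 translates directly into code once the component distributions are available. In this sketch the class n-gram model, the word-class distribution and the word-to-class mapping are assumed inputs (they can be estimated as described in the next subsection):

```python
def class_based_prob(word, history, word2class, p_class, p_word_given_class):
    """Eq. 2: P(w_i | history) = P(c_i | class history) * P(w_i | c_i).

    word2class         -- dict mapping each word to its class
    p_class            -- callable returning P(class | class_history)
    p_word_given_class -- callable returning P(word | class)
    """
    c = word2class[word]
    class_history = tuple(word2class[w] for w in history)
    return p_class(c, class_history) * p_word_given_class(word, c)
```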
D. Estimation of Class-Based Language Models
As described in [15], when using maximum likelihood estimation, the class n-gram probability can be calculated in the same way as in word-based language models:

$P(c_i \mid c_{i-1} \dots c_{i-n+1}) = \dfrac{C(c_{i-n+1} \dots c_i)}{C(c_{i-n+1} \dots c_{i-1})}$   (3)

where $C(c_{i-n+1} \dots c_i)$ is the count of the sequence of classes in the training corpus and $C(c_{i-n+1} \dots c_{i-1})$ is the count of the history of the class $c_i$ in the training corpus. The word-class probability can be estimated as the ratio of the word count $C(w)$ and the total class count $C(c)$:

$P(w_i \mid c_i) = \dfrac{C(w)}{C(c)}$   (4)

III. WORD CLUSTERING FUNCTION

In order to construct a class-based model, a function that assigns the words in the training corpus to classes is required. This function should take into account as much information about the grammar of the target language as possible. In the Slovak language, a single word can have many forms, and the declension of a word carries information about its grammatical function in the sentence. For this purpose, a suffix extraction method has been designed. The suffix can then be used as a class for the word, so that words with the same suffix and the same grammatical function belong to the same class.

The rules for creating word forms are very complicated and there are many exceptions. Some methods for the extraction of morphological features are described in [16, 17, 18]. The statistical approach seems to be a feasible way of obtaining a list of the most common suffixes that can later be used for identification in the word. The first thing necessary is a list of suffixes. This list can be obtained by studying a dictionary of words, or a simple statistical count-based analysis can be used:
1. a dictionary of the most common words in the language has been obtained;
2. from each word longer than 6 characters, suffixes of length 2, 3 and 4 characters have been extracted;
3. the number of occurrences of each extracted suffix has been calculated;
4. a threshold has been chosen and suffixes with a count higher than the threshold have been added to the list of all suffixes.

Once the list of the most common suffixes is created, it is possible to easily identify the stem and suffix of a word using the suffix subtraction method:
1. if the word is shorter than 5 characters, no suffix can be extracted;
2. if the word is longer than 5 characters, the word ending of length n = 5 is examined. If it is in the list of the most common suffixes, it is the result. If the ending of length n is not in the list, the algorithm decrements n and repeats while n > 1;
3. if no suffix has been identified, the word is considered a class by itself.

Figure 1: Class-Based Language Model Back-Off
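As an illustration, a minimal Python sketch of the two procedures above follows. It is not the authors' implementation: the count threshold is invented for the example, and since the extraction step collects endings of 2–4 characters while the subtraction step is described as starting at n = 5, the maximum ending length is left as a parameter.

```python
from collections import Counter

def build_suffix_list(words, threshold=100):
    """Count-based suffix extraction: collect endings of length 2, 3 and 4
    from words longer than 6 characters and keep the frequent ones.
    The threshold value is illustrative, not taken from the paper."""
    counts = Counter()
    for word in words:
        if len(word) > 6:
            for n in (2, 3, 4):
                counts[word[-n:]] += 1
    return {suffix for suffix, c in counts.items() if c > threshold}

def split_by_suffix(word, suffixes, max_len=4):
    """Suffix subtraction: try the longest candidate ending first and
    shorten it while n > 1. Returns (stem, suffix), or None when the word
    is too short or no known suffix matches (the word then forms a class
    by itself)."""
    if len(word) < 5:
        return None
    for n in range(min(max_len, len(word) - 1), 1, -1):
        if word[-n:] in suffixes:
            return word[:-n], word[-n:]
    return None
```

For example, with a suffix list containing the ending "ého", split_by_suffix("pekného", {"ého"}) returns the pair ("pekn", "ého").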
A disadvantage of this method is that it is statistically based, so it is not always precise and some of the suffixes found might not be grammatically correct. Also, it is not able to identify a suffix in every case, because some words are simply too short to be correctly split using the subtraction method.

IV. THE PROPOSED CLASS-BASED MODEL

The class-based language model utilizing grammatical features [19] consists of two basic parts that are put together using linear interpolation:
1. a word-based language model;
2. a class-based language model constructed using the word clustering function.
TABLE 1. Class-Based Language Models Perplexity

Word clustering method   Unigram count   Bigram count   Trigram count   PPL
1                                  924        262,271       3,277,939   266
2                               37,704      4,767,125      11,616,339    62
3                               44,692      4,126,071       9,491,228    68
4                               65,072      6,636,168      12,533,250    42
5                               91,171     10,121,937      13,470,470    23
Baseline                       329,690     13,052,574      13,034,227    40
The first part of this language model can be created using classical language modeling methods from the training corpus. To create the class-based model, the training corpus has to be processed by the word clustering function presented above, replacing every word by its corresponding class. From this processed training corpus, a class-based model can be built. During this process, a word-class probability function has to be estimated as in Eq. 4. This function expresses the probability distribution of the words in each class. The last step is to determine the interpolation parameter $\lambda$, which should be set to a value close to (but lower than) 1. If Eq. 1 and Eq. 2 are taken together, the final equation for the proposed language model is:

$P(w \mid h) = \lambda P_w(w \mid h) + (1 - \lambda) \, P_g(g(w) \mid g(h)) \, P(w \mid g(w))$   (5)

where $P_w$ is the probability returned by the word-based model and $P_g$ is the probability returned by the class-based model with the word-clustering function $g$ that utilizes information about the grammar of the language. Looking at this equation, the proposed language model consists of the following components:
• a vocabulary $V$ that contains the list of words known to the language model;
• a word-based language model constructed from the training corpus that returns the word-history probability $P(w \mid h)$;
• a word clustering function $g(w)$ that maps words to classes;
• a class-based language model $P_g(g(w) \mid g(h))$ created from the training corpus processed by the word clustering function $g(w)$;
• a word-class probability function that assigns the probability of occurrence of a word in the given class, $P(w \mid g(w))$;
• an interpolation constant from the interval (0, 1) that expresses the weight of the word-based language model.

V. EVALUATION

The usual metric for the evaluation of a language model is perplexity (denoted PPL). This measure expresses the weighted average number of choices that the language model has to make when calculating the probability of a given test text. A higher perplexity means that the language model does not fit the testing set very well; a lower perplexity means that the prediction of the testing set is good.
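The following sketch ties the pieces together: the interpolated probability of Eq. 5 and a perplexity computation over a test text. All component functions are assumed callables with the signatures shown; this is an illustration under those assumptions, not the evaluation code used for the experiments below.

```python
import math

def interpolated_prob(word, history, lam, p_word, p_class, p_word_in_class, g):
    """Eq. 5: lam * P_w(w | h) + (1 - lam) * P_g(g(w) | g(h)) * P(w | g(w))."""
    class_history = tuple(g(w) for w in history)
    return (lam * p_word(word, history)
            + (1.0 - lam) * p_class(g(word), class_history)
            * p_word_in_class(word, g(word)))

def perplexity(tokens, prob, order=3):
    """PPL = exp(-(1/N) * sum_i log P(w_i | h_i)), where h_i is the
    preceding order-1 words and prob is a callable P(word | history)."""
    log_prob = 0.0
    for i, word in enumerate(tokens):
        history = tuple(tokens[max(0, i - order + 1):i])
        log_prob += math.log(prob(word, history))
    return math.exp(-log_prob / len(tokens))
```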
The focus of the experiments was on the word clustering function. As mentioned above, the main problem of the word-clustering function described here is that it is not able to assign a class to every word. In some cases, this feature can be taken as an advantage: if a word is very common and unambiguous, it can serve better as a feature for estimating the probability of a following word, and it is better not to put it into a class. The question is which words should be put into classes and which ones should be taken as they are. Several variations of the word-clustering function using suffix subtraction have been set up:
• Method 1: A list of 625 suffixes gathered by hand has been used.
• Method 2: A list of 7,545 suffixes has been collected using the method described above. Words shorter than 5 characters have been marked as non-separable.
• Method 3: The same list of suffixes has been used, and words longer than 7 characters had a suffix-based class assigned. Words where no suffix could be found were assigned a morphological class from the morphological dictionary from [20, 21]. A morphological class was assigned only to those words that have the same morphological tag in all contexts. Words that were shorter than 7 characters and did not have a morphological tag were considered a class on their own.
• Method 4: The same as Method 2, but words longer than 7 characters have been marked as non-separable.
• Method 5: The same as Method 3, but using an additional list of 70k common words that were also marked as non-separable.

Each method has been used to process the training corpus. The training corpus consists of data gathered from the web [22] and of data from the Slovak Ministry of Justice. The processed corpus has then been used to construct a class-based language model. The class-based language models have been evaluated for perplexity and the results are displayed in Tab. 1.

TABLE 2. Interpolated Language Models Perplexity

Word clustering method   PPL   PPL decrease
1                         37        7.5 %
2                         30         25 %
3                         28         30 %
4                         25       37.5 %
5                         23       42.5 %
The constructed class-based language models have then been interpolated with the baseline language model according to Eq. 5. The results of the evaluation of these language models are given in Tab. 2.
VI. CONCLUSIONS

Most of the class-based language models have a higher perplexity than the baseline language model. The first class-based language model, using the word-clustering method with the 625 hand-gathered suffixes (Method 1), has a very high perplexity. A higher count of non-separable words lowers the perplexity, and the final class-based model built with Method 5 has a lower perplexity than the baseline language model. Interpolating the class-based language models that use the word-clustering function with the baseline language model has always caused a significant decrease of the perplexity. This set of experiments has shown that a word-clustering function that takes the grammatical structure of the language into account can be used to build a language model with a much better prediction capability.

Future research should focus on finding the best set of non-separable words, which might bring an even higher decrease of the language model perplexity. Language models built using this methodology should then be evaluated in the real-world task of continuous speech recognition [23]. This approach should be applicable to other languages similar to Slovak, where the grammatical function of a word can be identified from its form.

ACKNOWLEDGMENTS

The research presented in this paper was supported by the Ministry of Education under the research project MŠ SR 3928/2010-11 (34%), the Slovak Research and Development Agency under the research project APVV-0369-07 (33%) and the Research and Development Operational Program funded by the ERDF under the project ITMS-26220220141 (33%).

REFERENCES

[1] J. Nouza and J. Drabkova, “Combining lexical and morphological knowledge in language model for inflectional (Czech) language”, Proc. of ICSLP 2002, pp. 705–708, 2002.
[2] J. Nouza, J. Zdansky, P. Cerva and J. Silovsky, “Challenges in speech processing of Slavic languages (Case studies in speech recognition of Czech and Slovak)”, in Development of Multimodal Interfaces: Active Listening and Synchrony, LNCS 5967, Springer-Verlag, Heidelberg, pp. 225–241, 2010.
[3] D. Vergyri, K. Kirchhoff, K. Duh and A. Stolcke, “Morphology-based language modeling for Arabic speech recognition”, Proc. of ICSLP 2004, pp. 2245–2248, 2004.
[4] S. Katz, “Estimation of probabilities from sparse data for the language model component of a speech recognizer”, IEEE Transactions on Acoustics, Speech and Signal Processing, 35(3), pp. 400–401, 1987.
[5] S. F. Chen and J. Goodman, “An empirical study of smoothing techniques for language modeling”, Computer Speech and Language, 13(4), pp. 359–393, 1999.
[6] R. Kneser and H. Ney, “Improved backing-off for m-gram language modeling”, Proc. of ICASSP 1995, pp. 181–184, 1995.
[7] J. Juhár, J. Staš and D. Hládek, “Recent progress in development of language model for Slovak large vocabulary continuous speech recognition”, in C. Volosencu (Ed.): New Technologies – Trends, Innovations and Research, (to be published), 2012.
[8] J. Nouza and T. Nouza, “A voice dictation system for a million-word Czech vocabulary”, Proc. of ICCCT 2004, pp. 149–152, 2004.
[9] F. Jelinek and R. L. Mercer, “Interpolated estimation of Markov source parameters from sparse data”, Pattern Recognition in Practice, pp. 381–397, 1980.
[10] A. Berger, S. Della Pietra and V. Della Pietra, “A maximum entropy approach to natural language processing”, Computational Linguistics, 22(1), pp. 39–71, 1996.
[11] D. Klakow, “Log-linear interpolation of language models”, Proc. of ICSLP 1998, paper 0522, 1998.
[12] B. J. Hsu, “Generalized linear interpolation of language models”, Proc. of the IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2007, pp. 136–140, 2007.
[13] P. Brown, V. Della Pietra, P. deSouza, J. Lai and R. Mercer, “Class-based n-gram models of natural language”, Computational Linguistics, 18(4), pp. 467–479, 1992.
[14] Y. Su, “Bayesian class-based language models”, Proc. of ICASSP 2011, pp. 5564–5567, 2011.
[15] D. Jurafsky and J. H. Martin, “Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition (2nd Edition)”, Prentice Hall, Pearson Education, New Jersey, 2009.
[16] M. Creutz and K. Lagus, “Unsupervised models for morpheme segmentation and morphology learning”, ACM Transactions on Speech and Language Processing, 4(1), pp. 1–34, 2007.
[17] A. Ghaoui, F. Yvon, C. Mokbel and G. Chollet, “On the use of morphological constraints in n-gram statistical language model”, Proc. of INTERSPEECH 2005, pp. 1281–1284, 2005.
[18] J. Goldsmith, “Unsupervised learning of the morphology of a natural language”, Computational Linguistics, 27(2), pp. 153–198, 2001.
[19] G. Maltese, P. Bravetti, H. Crépy, B. J. Grainger, M. Herzog and F. Palou, “Combining word- and class-based language models: A comparative study in several languages using automatic and manual word-clustering techniques”, Proc. of EUROSPEECH 2001, pp. 21–24, 2001.
[20] SNK, “Slovak national corpus”, 2007. URL: http://korpus.juls.savba.sk/
[21] A. Horák, L. Gianitsová, M. Šimková, M. Šmotlák and R. Garabík, “Slovak national corpus”, in P. Sojka et al. (Eds.): Text, Speech and Dialogue, TSD 2004, pp. 115–162, 2004.
[22] D. Hládek and J. Staš, “Text mining and processing for corpora creation in Slovak language”, Journal of Computer Science and Control Systems, 3(1), pp. 65–68, 2010.
[23] M. Rusko, J. Juhár, M. Trnka, J. Staš, S. Darjaa, D. Hládek, M. Cernák, M. Papco, R. Sabo, M. Pleva, M. Ritomský and M. Lojka, “Slovak automatic transcription and dictation system for the judicial domain”, Proc. of the 5th Language and Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, pp. 365–369, 2011.