Markov Models for Written Language Identification
Dat Tran and Dharmendra Sharma
School of Information Sciences and Engineering, University of Canberra, Canberra, Australia
[email protected]

Abstract—The paper presents a Markov chain-based method for automatic written language identification. Given a training document in a specific language, each word can be represented as a Markov chain of letters. Treating the entire training document as a set of Markov chains, the set of initial and transition probabilities can be calculated; this set is referred to as a Markov model for that language. Given an unknown language string, the maximum likelihood decision rule is used to identify the language. Experimental results showed that the proposed method achieved a lower error rate and faster identification than the current n-gram method.
I. INTRODUCTION
Web pages and email messages are now available in several languages other than English. Identifying the language of such electronic documents for further action and analysis has become an important task in natural language processing, information retrieval and multimedia information processing. It is difficult for a person to identify the languages of a large number of documents written in several languages. Therefore a written language identification method should be used to develop an automatic identification system. Current methods for language identification are based on the n-gram approach [2, 8, 10]. For instance, the trigram-based method analyzes a text document in a given language as a set of trigrams, i.e. sequences of three letters [2, 10]. The probability of a given trigram is the ratio of its frequency to the sum of the frequencies of all the trigrams. The probability set of all the trigrams in a specific language, obtained from the training document, is stored. In order to identify the language of an unknown string, the string is also divided into a sequence of trigrams and the probability of the sequence is calculated for each language using the probability set of that language. The unknown string is then assigned to the language that has the maximum probability. The trigram-based system in [2] achieved good performance on test strings of about 50 to 700 words, using profiles of the 400 most frequent n-grams. Other methods for language identification [4, 7, 9] include unique letter combinations, the short word method, and ASCII codes of character sequences.
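To make the trigram approach above concrete, here is a minimal sketch. It is not the exact system of [2]: the function names, the floor probability for unseen trigrams, and the toy training strings are our own choices for illustration.

```python
import math
from collections import Counter

def trigrams(text):
    """Split normalized text into overlapping three-letter sequences, per word."""
    words = text.lower().split()
    return [w[i:i + 3] for w in words for i in range(len(w) - 2)]

def train_trigram_model(training_text):
    """Probability of a trigram = its frequency / total trigram frequency."""
    counts = Counter(trigrams(training_text))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def identify_trigram(models, text, floor=1e-9):
    """Assign the string to the language with the maximum (log-)probability.
    Unseen trigrams get a small floor probability (our addition, for robustness)."""
    def score(model):
        return sum(math.log(model.get(g, floor)) for g in trigrams(text))
    return max(models, key=lambda lang: score(models[lang]))

# Toy usage with two tiny training "documents".
models = {
    "english": train_trigram_model("the quick brown fox jumps over the lazy dog"),
    "spanish": train_trigram_model("el veloz zorro marron salta sobre el perro perezoso"),
}
print(identify_trigram(models, "the lazy brown dog"))  # expected: english
```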
The main principles for a language identification system are that it should be fast enough for real-time processing, efficient, require minimal storage, and be robust against textual errors. Based on these principles, a Markov chain-based method is proposed for language identification in this paper. The occurrences of letters in a word can be regarded as a stochastic process, and hence the word can be represented as a Markov chain in which letters are states. The occurrence of the first letter in the word is characterized by the initial probability of the Markov chain, and the occurrence of each subsequent letter given its preceding letter is characterized by a transition probability. Given a text document in a specific language as a training set, the initial and transition probabilities for all Markov chains representing all words in the document are calculated, and the set of those probabilities is regarded as a Markov model for that language. In order to identify the language of an unknown string, the maximum likelihood decision rule is used. Words in the string are regarded as Markov chains, and for each language model built in the training session, the initial and transition probabilities taken from the language model are used to calculate the probability of the unknown string under that language. The unknown string is then assigned to the language that has the maximum probability. Experiments were performed on a set of 7 languages: English, French, German, Indonesian, Italian, Norwegian and Spanish. There were 3000 distinct words in each training set and a further 5000 strings in each test set. Results showed that the proposed Markov chain-based method achieved a lower identification error rate than the trigram-based method. Moreover, the Markov chain-based system is much faster than the trigram-based system at identification, since the number of letters is smaller than the number of trigrams in any language. The rest of the paper is organized as follows. Section II presents the Markov language modeling method. The training and identification algorithms are presented in Section III. The proposed method is tested and compared with the trigram method in Section IV. Finally, Section V concludes the presented work.
II. MARKOV LANGUAGE MODEL

A. Markov Chain Representation
Let $X = \{X^{(1)}, X^{(2)}, \ldots, X^{(L)}\}$ be a set of $L$ random variable sequences, where $X^{(k)} = \{X_1^{(k)}, X_2^{(k)}, \ldots, X_{T_k}^{(k)}\}$ is a sequence of $T_k$ random variables, $k = 1, 2, \ldots, L$ and $T_k > 0$. Let $V = \{V_1, V_2, \ldots, V_M\}$ be the set of $M$ states in a Markov chain. Consider the conditional probabilities

$P(X_t^{(k)} = x_t^{(k)} \mid X_{t-1}^{(k)} = x_{t-1}^{(k)}, \ldots, X_1^{(k)} = x_1^{(k)})$    (1)

where $x_t^{(k)}$, $k = 1, 2, \ldots, L$ and $t = 1, 2, \ldots, T_k$, are values taken by the corresponding variables $X_t^{(k)}$. These probabilities are too complex to calculate in full, so the Markov assumption is applied to reduce the complexity:

$P(X_t^{(k)} = x_t^{(k)} \mid X_{t-1}^{(k)} = x_{t-1}^{(k)}, \ldots, X_1^{(k)} = x_1^{(k)}) = P(X_t^{(k)} = x_t^{(k)} \mid X_{t-1}^{(k)} = x_{t-1}^{(k)})$    (2)

where $k = 1, 2, \ldots, L$ and $t = 1, 2, \ldots, T_k$. This means that the event at time $t$ depends only on the immediately preceding event at time $t - 1$. A stochastic process satisfying the Markov assumption is called a Markov process. In order to restrict the variables $X_t^{(k)}$ to values $x_t^{(k)}$ in the finite set $V$, the time-invariant assumption is applied:

$P(X_1^{(k)} = x_1^{(k)}) = P(X_1^{(k)} = V_i)$    (3)

$P(X_t^{(k)} = x_t^{(k)} \mid X_{t-1}^{(k)} = x_{t-1}^{(k)}) = P(X_t^{(k)} = V_j \mid X_{t-1}^{(k)} = V_i)$    (4)

where $k = 1, \ldots, L$, $t = 1, \ldots, T_k$, $i = 1, \ldots, M$ and $j = 1, \ldots, M$. Such a Markov process is called a Markov chain. Figure 1 shows Markov chains having three states t, o and n, which represent English words such as no, on, to, not, ton, too, tot, toot and noon, the French word non, the German words tot and toon, and the Spanish words no and tono.
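To make the chain representation concrete, the following minimal sketch (the helper name word_to_chain is ours, not from the paper) maps a word to the two quantities the model will need: its first letter and its letter-to-letter transitions.

```python
def word_to_chain(word):
    """Represent a word as a Markov chain observation:
    its first letter plus the sequence of letter-to-letter transitions."""
    word = word.lower()
    initial = word[0]
    transitions = list(zip(word, word[1:]))  # (previous letter, current letter) pairs
    return initial, transitions

print(word_to_chain("noon"))  # ('n', [('n', 'o'), ('o', 'o'), ('o', 'n')])
```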
B. Markov Language Model

Define the following parameters:

$q = [q(i)], \quad q(i) = P(X_1^{(k)} = V_i)$    (5)

$p = [p(i, j)], \quad p(i, j) = P(X_t^{(k)} = V_j \mid X_{t-1}^{(k)} = V_i)$    (6)
where $k = 1, \ldots, L$ ($L$ is the number of words in the training document), $t = 1, \ldots, T_k$ ($T_k$ is the word length), and $i, j = 1, \ldots, M$ ($M$ is the number of alphabetical letters). The set $\lambda = (q, p)$ is called a Markov language model; it represents the words in the training document as Markov chains. A method to calculate the model set $\lambda = (q, p)$ is presented as follows. The Markov model $\lambda$ is built to represent the sequence of states $x$, so we seek the $\lambda$ that maximises the probability $P(X = x \mid \lambda)$. To do so, we first express $P(X = x \mid \lambda)$ as a function of the model $\lambda = (q, p)$ and then set its derivative to zero. Writing $q(x_1) = P(X_1 = x_1 \mid \lambda)$ and $p(x_{t-1}, x_t) = P(X_t = x_t \mid X_{t-1} = x_{t-1}, \lambda)$, we have

$P(X = x \mid \lambda) = \prod_{k=1}^{L} \Big[ q(x_1^{(k)}) \prod_{t=2}^{T_k} p(x_{t-1}^{(k)}, x_t^{(k)}) \Big]$    (7)
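For illustration, consider the two-word string "no on" and made-up model values $q(n) = 0.4$, $p(n, o) = 0.9$, $q(o) = 0.2$ and $p(o, n) = 0.3$. Equation (7) factorises the likelihood word by word:

$P(\text{"no on"} \mid \lambda) = \big[q(n)\, p(n, o)\big] \cdot \big[q(o)\, p(o, n)\big] = (0.4 \times 0.9)(0.2 \times 0.3) = 0.36 \times 0.06 = 0.0216$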
Applying the time-invariant assumption in (3) and (4), and using (5) and (6), we can rewrite (7) as

$P(X = x \mid \lambda) = \prod_{i=1}^{M} [q(i)]^{n_i} \prod_{i=1}^{M} \prod_{j=1}^{M} [p(i, j)]^{n_{ij}}$    (8)
where $n_i$ denotes the number of values $x_1^{(k)} = V_i$ and $n_{ij}$ denotes the number of pairs $(x_{t-1}^{(k)} = V_i, x_t^{(k)} = V_j)$ observed in the sequences $X^{(k)}$. It can be seen that

$\sum_{i=1}^{M} n_i = L$    (9)
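For example, for a training document containing the $L = 3$ words no, on and to over the states $\{t, o, n\}$ of Figure 1, the counts are $n_n = n_o = n_t = 1$ (so $\sum_i n_i = 3 = L$, as in (9)), and $n_{n,o} = n_{o,n} = n_{t,o} = 1$ with all other $n_{ij} = 0$, so (8) reduces to $P = q(n)\, q(o)\, q(t)\, p(n, o)\, p(o, n)\, p(t, o)$.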
The probability in (8) can be rewritten as

$\log P(X = x \mid \lambda) = \sum_{i=1}^{M} n_i \log q(i) + \sum_{i=1}^{M} \sum_{j=1}^{M} n_{ij} \log p(i, j)$    (10)

Figure 1. Markov chains having 3 states t, o and n represent English words such as no, on, to, not, ton, too, tot, toot and noon, the French word non, German words tot and toon, and Spanish words no and tono.
Since $\sum_{i=1}^{M} q(i) = 1$ and $\sum_{j=1}^{M} p(i, j) = 1$, the Lagrangian method is applied to maximise the probability in (10) over $\lambda$, using the following Lagrangian:
$F(q(i), p(i, j), a, b_i) = \sum_{i=1}^{M} n_i \log q(i) + a \Big(1 - \sum_{i=1}^{M} q(i)\Big) + \sum_{i=1}^{M} \sum_{j=1}^{M} n_{ij} \log p(i, j) + \sum_{i=1}^{M} b_i \Big(1 - \sum_{j=1}^{M} p(i, j)\Big)$    (11)
where $a$ and $b_i$ are Lagrangian multipliers. Setting the derivatives of $F$ to zero gives $\partial F / \partial q(i) = n_i / q(i) - a = 0$, so $q(i) = n_i / a$, and the constraint $\sum_{i=1}^{M} q(i) = 1$ fixes $a = \sum_{s=1}^{M} n_s$; the same argument applied to $p(i, j)$ with multiplier $b_i$ gives $b_i = \sum_{s=1}^{M} n_{is}$. Hence

$q(i) = \dfrac{n_i}{\sum_{s=1}^{M} n_s}, \qquad p(i, j) = \dfrac{n_{ij}}{\sum_{s=1}^{M} n_{is}}$    (12)
Applying the equations in (12) to Markov chains of alphabetical letters, the initial probabilities $q(\text{letter } x)$ and the transition probabilities $p(\text{letter } x \rightarrow \text{letter } y)$ for a language can be interpreted as

$q(\text{letter } x) = \dfrac{\text{number of occurrences of } x \text{ as the first letter}}{\text{number of words}}$

$p(\text{letter } x \rightarrow \text{letter } y) = \dfrac{\text{number of pairs } (x, y)}{\sum_{z \in \text{letter set}} \text{number of pairs } (x, z)}$

The equations in (12) are used to determine the Markov language models from the training text document.
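The counting interpretation above translates directly into code. Below is a minimal sketch, not the authors' implementation: the function name is ours, and words are assumed to be already normalized (preprocessing is shown in Section III).

```python
from collections import Counter, defaultdict

def train_markov_model(words):
    """Estimate a Markov language model lambda = (q, p) from a list of words.
    Following (12): q(x) = (#words starting with letter x) / (#words), and
    p(x, y) = (#pairs xy) / (#pairs starting with x)."""
    initial_counts = Counter()                 # n_i: words starting with letter i
    transition_counts = defaultdict(Counter)   # n_ij: pairs (i followed by j)
    for word in words:
        initial_counts[word[0]] += 1
        for prev, curr in zip(word, word[1:]):
            transition_counts[prev][curr] += 1
    num_words = sum(initial_counts.values())   # L = sum_s n_s
    q = {x: n / num_words for x, n in initial_counts.items()}
    p = {x: {y: n / sum(row.values()) for y, n in row.items()}
         for x, row in transition_counts.items()}
    return q, p

# The Figure 1 vocabulary as a toy training set.
q, p = train_markov_model(["no", "on", "to", "not", "ton", "too", "tot", "toot", "noon"])
print(round(q["n"], 3), round(p["n"]["o"], 3))  # q(n) = 0.333, p(n -> o) = 1.0
```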
III. LANGUAGE IDENTIFICATION ALGORITHMS

A. Training Algorithm

Given a training language document, it is first preprocessed to remove all special characters and punctuation marks such as commas, colons, semicolons, quotation marks, full stops, exclamation marks, question marks, signs, etc. The next step is to convert all characters into lower case. The initial and transition probabilities are then calculated. The algorithm can be summarized as follows (see the counting sketch at the end of Section II).

• Using the training sets of all languages to be identified, determine a common letter set containing M alphabetical letters.
• For each training language set, do the following:
  - Remove all special characters and convert all letters into lower case to obtain the set of words X.
  - Using all words in the set X, calculate the initial probabilities and the transition probabilities according to (12).
  - Save all the probability values to a set λ and regard this set as the language model.
• Save the letter set for identification purposes.

B. Identification Algorithm

Given an unknown language string, it is preprocessed in the same way as the training documents. For each language model, the probability of the unknown string given the model is calculated, and the maximum likelihood decision rule is used to identify the language. The algorithm can be summarized as follows (a code sketch follows the list).

• Read all the language models and the letter set obtained from the training session.
• Given an unknown language string, preprocess it to remove all special characters and convert all letters into lower case, obtaining the set of words X.
• For each language model, calculate the probability of the word set X using (10).
• The unknown string is then classified to the language that has the maximum probability.
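A minimal sketch of the identification algorithm, pairing with the training sketch at the end of Section II. The log form of (10) is used to avoid numerical underflow; the floor probability for unseen letters and transitions is our addition (the paper does not specify how unseen events are handled), and the simple a-z preprocessing is a simplification of the paper's 52-letter set, which also covers accented characters.

```python
import math
import re

def preprocess(text):
    """Strip punctuation and special characters, lower-case, split into words.
    (Simplified: only a-z is kept here.)"""
    return re.sub(r"[^a-z\s]", " ", text.lower()).split()

def log_probability(model, words, floor=1e-9):
    """Log form of (10): sum of log initial and log transition probabilities."""
    q, p = model
    total = 0.0
    for w in words:
        total += math.log(q.get(w[0], floor))
        for prev, curr in zip(w, w[1:]):
            total += math.log(p.get(prev, {}).get(curr, floor))
    return total

def identify(models, text):
    """Maximum likelihood decision rule over all trained language models."""
    words = preprocess(text)
    return max(models, key=lambda lang: log_probability(models[lang], words))
```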
IV. EXPERIMENTAL RESULTS

Seven languages written in the Roman alphabet were used in our experiments: English, French, German, Indonesian, Italian, Norwegian and Spanish. Language documents were collected at random from Web pages on the Internet. For each language, a set of 3000 distinct words was collected for training the language model. The minimum word length was set to 3 so that the training sets could also be used for the current trigram method. A set of 52 distinct letters was extracted from the 7 language training data sets. The training algorithm was used to train the 7 language models. For identification, we used the identification algorithm with different lengths for the test strings: 1, 3, 5, 10, 15, 20, 25, 30, 40 and 50 words. For each string length, a set of 5000 test strings was collected from Web pages other than those used for collecting the training data sets.
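For completeness, a small evaluation helper in the spirit of the experiment above; it reuses identify from the previous sketch, and the labeled test set is hypothetical.

```python
def error_rate(models, labeled_strings):
    """Identification error (in %) over a labeled test set.
    labeled_strings: list of (text, true_language) pairs."""
    wrong = sum(1 for text, lang in labeled_strings
                if identify(models, text) != lang)
    return 100.0 * wrong / len(labeled_strings)
```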
Figure 2. Identification error rates (in %) vs number of words in test strings for the trigram-based method and the Markov chain-based method. Number of test strings = 5000.
The proposed Markov chain-based method was compared with the current trigram-based method. Experimental results are presented in Figure 2. Both methods achieved lower identification error rates with longer test strings. When the string length was 40 words or longer, the error rates were approximately zero. However, the proposed Markov chain-based method outperformed the trigram-based method when the string length was between 3 and 20 words. Table I shows the confusion matrix for the proposed Markov chain-based method. The number of test strings was 5000 and the string length for test strings was set to 10. The identification results for English, French, German, Indonesian, Italian, Norwegian and Spanish were 99.4%, 98.7%, 98.7%, 98.1%, 93.7%, 97.7% and 96.5% respectively. Most of the misclassified strings were identified as English strings.

TABLE I. CONFUSION MATRIX FOR THE MARKOV CHAIN-BASED METHOD. NUMBER OF TEST STRINGS = 5000, STRING LENGTH = 10 WORDS
Number of test strings for the Markov chain-based method (rows: actual language; columns: identified language):

              English  French  German  Indonesian  Italian  Norwegian  Spanish
English          4968       0       0          24        7          0        0
French             42    4937       2           1       15          0        3
German             38      13    4937           0        6          6        0
Indonesian         60       0       5        4904        6          1        4
Italian           193       1      89          13     4683          6       16
Norwegian         109       0       1           1        5       4884        0
Spanish            47      11       1          58        0         58     4825
Table II shows the confusion matrix for the trigram-based method. The number of test strings and the string length for test strings were the same as those in Table I. The identification results for English, French, German, Indonesian, Italian, Norwegian and Spanish were 93.5%, 84.2%, 98.3%, 98.6%, 79.4%, 99.1% and 88.3% respectively.

TABLE II. CONFUSION MATRIX FOR THE TRIGRAM-BASED METHOD. NUMBER OF TEST STRINGS = 5000, STRING LENGTH = 10 WORDS

Number of test strings for the trigram-based method (rows: actual language; columns: identified language):

              English  French  German  Indonesian  Italian  Norwegian  Spanish
English          4674       0     259           0        0         67        0
French            513    4210     210           8        5         30       24
German             60       0    4913           7        0         18        2
Indonesian         33       1      16        4928        0         21        1
Italian           538     260      74          13     3971         26      118
Norwegian          19       0      25           0        2       4954        0
Spanish           250      75      55          39      149         19     4413
V. CONCLUSION

A Markov chain-based method for automatic language identification has been presented in this paper. The occurrence of letters in a word is regarded as a stochastic process, so a word can be represented as a Markov chain in which letters are states. The initial and transition probabilities for the Markov chains representing the words in a training document are calculated, and the set of these probabilities is regarded as a Markov language model. Experiments were performed on the seven languages English, French, German, Indonesian, Italian, Norwegian and Spanish, using 3000 words per language to train the models and 5000 strings per string length for evaluation. The results showed that the proposed Markov chain-based method outperformed the trigram-based method when the string length was between 3 and 20 words. Moreover, the number of letters in any language is always much smaller than the number of trigrams (only 52 letters were found across the seven languages), so identification with the Markov chain-based method is very fast compared with the trigram-based method. Note that the Markov language model is different from the bigram method: given a bigram (i.e., a string of two letters), its probability in the bigram method is the ratio of its frequency to the sum of the frequencies of all bigrams, whereas in the Markov language model only the bigrams that share the same first letter are considered.
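The distinction can be written out explicitly (our notation). With $\text{count}(xy)$ the frequency of the letter pair $xy$ in the training data:

$P_{\text{bigram}}(xy) = \dfrac{\text{count}(xy)}{\sum_{u,v} \text{count}(uv)}, \qquad p_{\text{Markov}}(x \rightarrow y) = \dfrac{\text{count}(xy)}{\sum_{z} \text{count}(xz)}$

For example, with made-up counts, if the training data contains only the pairs no three times, on twice and to five times, then $P_{\text{bigram}}(no) = 3/10$, whereas $p_{\text{Markov}}(n \rightarrow o) = 3/3 = 1$, because no is the only pair beginning with n.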
REFERENCES

[1] V. Castelli and L. D. Bergman, editors, Image Databases. Wiley, New York, 2002.
[2] W. B. Cavnar and J. M. Trenkle, "N-gram-based text categorization", Proc. 3rd Annual Symposium on Document Analysis and Information Retrieval, pp. 161-175, 1994.
[3] R. A. Cole, J. Mariani, H. Uszkoreit, G. B. Varile, A. Zaenen and A. Zampolli, editors, Survey of the State of the Art in Human Language Technology. Cambridge University Press, 1998.
[4] E. Giguet, "Multilingual text tokenization in natural language diagnosis", Proc. 4th International Conference on Artificial Intelligence, Australia, 1996.
[5] V. J. Hodge and J. Austin, "A comparison of a novel neural spell checker and standard spell checking algorithms", Pattern Recognition Letters, vol. 35, pp. 2571-2580, 2002.
[6] T. Joachims, Learning to Classify Text Using Support Vector Machines. Kluwer, Boston, 2002.
[7] S. Johnson, "Solving the problem of language recognition", Technical report, School of Computer Studies, University of Leeds, 1993.
[8] Y. K. Muthusamy and A. L. Spitz, "Automatic language identification", in Survey of the State of the Art in Human Language Technology, R. A. Cole, J. Mariani, H. Uszkoreit, G. B. Varile, A. Zaenen and A. Zampolli, editors. Cambridge University Press, 1998.
[9] T. D. Pham and D. Tran, "VQ-based written language identification", Proc. 7th International Symposium on Signal Processing and its Applications, Paris, France, vol. I, pp. 513-516, 2003.
[10] J. C. Schmitt, "Trigram-based method of language identification", U.S. Patent 5062143, October 1991.
[11] J. R. Ullman, "A binary n-gram technique for automatic correction of substitution, deletion, insertion and reversal errors in words", The Computer Journal, vol. 20, no. 2, pp. 141-147, 1977.