Markov Models for Written Language Identification

Dat Tran and Dharmendra Sharma
School of Information Sciences and Engineering
University of Canberra, Canberra, Australia
[email protected]

Abstract—This paper presents a Markov chain-based method for automatic written language identification. Given a training document in a specific language, each word can be represented as a Markov chain of letters. From the entire training document, regarded as a set of Markov chains, the set of initial and transition probabilities can be calculated and is referred to as a Markov model for that language. Given a string in an unknown language, the maximum likelihood decision rule is used to identify the language. Experimental results show that the proposed method achieves a lower error rate and faster identification than the current n-gram method.

I. INTRODUCTION

Web pages and email messages are now available in many languages other than English. Identifying the language of such electronic documents for further action and analysis has become an important task in natural language processing, information retrieval and multimedia information processing. It is difficult for a person to identify the languages of a large number of documents written in several languages, so a written language identification method is needed to build an automatic identification system. Current methods for language identification are based on the n-gram approach [2, 8, 10]. For instance, the trigram-based method analyzes a text document in a given language as a set of trigrams, i.e. sequences of three letters [2, 10]. The probability of a given trigram is the ratio of its frequency to the sum of the frequencies of all trigrams. The probabilities of all trigrams in a specific language, obtained from the training document, are stored. To identify the language of an unknown string, the string is also divided into a sequence of trigrams, and the probability of the sequence is calculated for each language using the probability set of that language. The unknown string is then assigned to the language with the maximum probability. The trigram-based system [2] achieved good performance when test strings were about 50 to 700 words long, using the 400 most frequent n-grams. Other methods for language identification [4, 7, 9] include unique letter combinations, the short word method, and ASCII codes of character sequences.
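As a concrete illustration of this baseline, the sketch below estimates trigram probabilities from a training text and scores an unknown string. It is a minimal reading of the approach described above, not the implementation of the cited systems: the function names are ours, and add-one smoothing is our assumption to keep unseen trigrams from zeroing out a score.

```python
import math
from collections import Counter

def train_trigram_model(text: str) -> Counter:
    """Count all letter trigrams (sequences of three characters) in a text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def trigram_log_prob(string: str, model: Counter) -> float:
    """Log-probability of a string under a trigram frequency model.

    Each trigram's probability is its frequency over the total trigram
    count; add-one smoothing (an assumption of this sketch) handles
    trigrams unseen in training.
    """
    total = sum(model.values())
    vocab = len(model) + 1
    return sum(math.log((model[string[i:i + 3]] + 1) / (total + vocab))
               for i in range(len(string) - 2))

# The unknown string is assigned to the language with maximum probability.
models = {"en": train_trigram_model("the quick brown fox jumps over the lazy dog"),
          "fr": train_trigram_model("le vif renard brun saute par dessus le chien")}
print(max(models, key=lambda lang: trigram_log_prob("the brown dog", models[lang])))
```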

The main principles for a language identification system are that it should be fast enough for real-time processing, efficient, require minimal storage, and be robust against textual errors. Based on these principles, a Markov chain-based method for language identification is proposed in this paper. The occurrence of letters in a word can be regarded as a stochastic process, and hence the word can be represented as a Markov chain in which the letters are states. The occurrence of the first letter in the word is characterized by the initial probability of the Markov chain, and the occurrence of each subsequent letter, given its preceding letter, is characterized by a transition probability. Given a text document in a specific language as a training set, the initial and transition probabilities for all Markov chains representing all words in the document are calculated, and the set of those probabilities is regarded as a Markov model for that language. To identify the language of an unknown string, the maximum likelihood decision rule is used: the words in the string are regarded as Markov chains, and for each language model built in the training session, the initial and transition probabilities taken from that model are used to calculate the probability of the unknown string under that language. The unknown string is then assigned to the language with the maximum probability. Experiments were performed on a set of 7 languages: English, French, German, Indonesian, Italian, Norwegian and Spanish. There were 3000 distinct words in each training set and another 5000 strings in each test set. Results show that the proposed Markov chain-based method achieves a lower identification error rate than the trigram-based method. Moreover, the Markov chain-based system is much faster at identification than the trigram-based system, since the number of letters is smaller than the number of trigrams in any language. The rest of the paper is organized as follows. Section II presents the Markov language modeling method. The training and identification algorithms are presented in Section III. The proposed method is tested and compared with the trigram method in Section IV. Finally, Section V concludes the paper.

II. MARKOV LANGUAGE MODEL

A. Markov Chain Representation

Let $X = \{X^{(1)}, X^{(2)}, \ldots, X^{(L)}\}$ be a set of $L$ random variable sequences, where $X^{(k)} = \{X_1^{(k)}, X_2^{(k)}, \ldots, X_{T_k}^{(k)}\}$ is a sequence of $T_k$ random variables, $k = 1, 2, \ldots, L$ and $T_k > 0$. Let $V = \{V_1, V_2, \ldots, V_M\}$ be the set of $M$ states in a Markov chain. Consider the conditional probabilities

$$P(X_t^{(k)} = x_t^{(k)} \mid X_{t-1}^{(k)} = x_{t-1}^{(k)}, \ldots, X_1^{(k)} = x_1^{(k)}) \quad (1)$$

where $x_t^{(k)}$, $k = 1, 2, \ldots, L$ and $t = 1, 2, \ldots, T_k$, are values taken by the corresponding variables $X_t^{(k)}$. These probabilities are too complicated to calculate in full, so the Markov assumption is applied to reduce the complexity:

$$P(X_t^{(k)} = x_t^{(k)} \mid X_{t-1}^{(k)} = x_{t-1}^{(k)}, \ldots, X_1^{(k)} = x_1^{(k)}) = P(X_t^{(k)} = x_t^{(k)} \mid X_{t-1}^{(k)} = x_{t-1}^{(k)}) \quad (2)$$

where $k = 1, 2, \ldots, L$ and $t = 1, 2, \ldots, T_k$. This means that the event at time $t$ depends only on the immediately preceding event at time $t - 1$. A stochastic process satisfying the Markov assumption is called a Markov process. To restrict the variables $X_t^{(k)}$ to values $x_t^{(k)}$ in the finite set $V$, the time-invariant assumption is applied:

$$P(X_1^{(k)} = x_1^{(k)}) = P(X_1^{(k)} = V_i) \quad (3)$$

$$P(X_t^{(k)} = x_t^{(k)} \mid X_{t-1}^{(k)} = x_{t-1}^{(k)}) = P(X_t^{(k)} = V_j \mid X_{t-1}^{(k)} = V_i) \quad (4)$$

where $k = 1, \ldots, L$, $t = 1, \ldots, T_k$, $i = 1, \ldots, M$ and $j = 1, \ldots, M$. Such a Markov process is called a Markov chain. Figure 1 shows Markov chains having three states t, o and n, which represent English words such as no, on, to, not, ton, too, tot, toot and noon, the French word non, the German words tot and toon, and the Spanish words no and tono.

[Figure 1. Markov chains having 3 states t, o and n, representing English words such as no, on, to, not, ton, too, tot, toot and noon, the French word non, the German words tot and toon, and the Spanish words no and tono.]
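As a small worked example of this representation, the word too corresponds to the state sequence t → o → o, so its probability under a model factorizes into one initial probability and two transition probabilities:

```latex
P(\text{too} \mid \lambda) = q(\mathrm{t}) \, p(\mathrm{t}, \mathrm{o}) \, p(\mathrm{o}, \mathrm{o})
```

For instance, if $q(\mathrm{t}) = 0.3$, $p(\mathrm{t}, \mathrm{o}) = 0.4$ and $p(\mathrm{o}, \mathrm{o}) = 0.1$ (illustrative values, not estimated from any corpus), the probability is $0.012$.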

B. Markov Language Model

Define the following parameters:

$$q = [q(i)], \qquad q(i) = P(X_1^{(k)} = V_i) \quad (5)$$

$$p = [p(i, j)], \qquad p(i, j) = P(X_t^{(k)} = V_j \mid X_{t-1}^{(k)} = V_i) \quad (6)$$

where $k = 1, \ldots, L$ ($L$ is the number of words in the training document), $t = 1, \ldots, T_k$ ($T_k$ is the word length), and $i, j = 1, \ldots, M$ ($M$ is the number of alphabetical letters). The set $\lambda = (q, p)$ is called a Markov language model; it represents the words in the training document as Markov chains. A method to calculate the model set $\lambda = (q, p)$ is presented as follows. The Markov model $\lambda$ is built to represent the sequence of states $x$, so we seek the $\lambda$ that maximises the probability $P(X = x \mid \lambda)$. To do so, we first express $P(X = x \mid \lambda)$ as a function of the model $\lambda = (q, p)$ and then set its derivative to zero. Let $q(x_1) = P(X_1 = x_1 \mid \lambda)$ and $p(x_{t-1}, x_t) = P(X_t = x_t \mid X_{t-1} = x_{t-1}, \lambda)$; we have

$$P(X = x \mid \lambda) = \prod_{k=1}^{L} q(x_1^{(k)}) \prod_{t=2}^{T_k} p(x_{t-1}^{(k)}, x_t^{(k)}) \quad (7)$$
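In implementation terms, (7) is usually evaluated in the log domain to avoid numerical underflow on long strings. The sketch below is our illustration, not a prescription from the paper: it assumes $q$ and $p$ are stored as dictionaries and that a small probability floor stands in for events unseen in training.

```python
import math

def log_likelihood(words, q, p, floor=1e-10):
    """Log of equation (7): one initial log-probability per word plus a
    log transition probability for each adjacent letter pair.

    `q` maps a letter to its initial probability; `p` maps a letter pair
    (previous, current) to its transition probability. `floor` is an
    assumption of this sketch for probabilities unseen in training.
    """
    score = 0.0
    for word in words:
        score += math.log(q.get(word[0], floor))
        for prev, cur in zip(word, word[1:]):
            score += math.log(p.get((prev, cur), floor))
    return score
```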

Applying the time-invariant assumption in (3) and (4), and using (5) and (6), we can rewrite (7) as

$$P(X = x \mid \lambda) = \prod_{i=1}^{M} [q(i)]^{n_i} \prod_{i=1}^{M} \prod_{j=1}^{M} [p(i, j)]^{n_{ij}} \quad (8)$$

where $n_i$ denotes the number of values $x_1^{(k)} = V_i$ and $n_{ij}$ denotes the number of pairs $(x_{t-1}^{(k)} = V_i, x_t^{(k)} = V_j)$ observed in the sequences $X^{(k)}$. It can be seen that

$$\sum_{i=1}^{M} n_i = L \qquad \text{and} \qquad \sum_{i=1}^{M} \sum_{j=1}^{M} n_{ij} = \sum_{k=1}^{L} (T_k - 1) \quad (9)$$
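To make the counts concrete, consider a toy training set of three words, no, on and too (our example, not from the paper). Then $L = 3$, the first-letter counts are $n_{\mathrm{n}} = 1$, $n_{\mathrm{o}} = 1$ and $n_{\mathrm{t}} = 1$, and the pair counts are $n_{\mathrm{no}} = 1$, $n_{\mathrm{on}} = 1$, $n_{\mathrm{to}} = 1$ and $n_{\mathrm{oo}} = 1$. Both identities in (9) hold: the $n_i$ sum to $3 = L$, and the $n_{ij}$ sum to $4 = (2-1) + (2-1) + (3-1)$.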

Taking logarithms, the probability in (8) can be rewritten as

$$\log P(X = x \mid \lambda) = \sum_{i=1}^{M} n_i \log q(i) + \sum_{i=1}^{M} \sum_{j=1}^{M} n_{ij} \log p(i, j) \quad (10)$$

Since $\sum_{i=1}^{M} q(i) = 1$ and $\sum_{j=1}^{M} p(i, j) = 1$, the Lagrangian method is applied to maximise the probability in (10) over $\lambda$, using the following Lagrangian:

$$F(q(i), p(i, j), a, b_i) = \sum_{i=1}^{M} n_i \log q(i) + a\Big(1 - \sum_{i=1}^{M} q(i)\Big) + \sum_{i=1}^{M} \sum_{j=1}^{M} n_{ij} \log p(i, j) + \sum_{i=1}^{M} b_i \Big(1 - \sum_{j=1}^{M} p(i, j)\Big) \quad (11)$$


where $a$ and $b_i$ are Lagrange multipliers. Setting the derivatives of $F$ to zero gives $\partial F / \partial q(i) = n_i / q(i) - a = 0$ and $\partial F / \partial p(i, j) = n_{ij} / p(i, j) - b_i = 0$; eliminating $a$ and $b_i$ with the constraints $\sum_{i=1}^{M} q(i) = 1$ and $\sum_{j=1}^{M} p(i, j) = 1$ gives

$$q(i) = \frac{n_i}{\sum_{s=1}^{M} n_s}, \qquad p(i, j) = \frac{n_{ij}}{\sum_{s=1}^{M} n_{is}} \quad (12)$$

Applying the equations in (12) to Markov chains of alphabetical letters, the initial probabilities $q(\text{letter } x)$ and the transition probabilities $p(\text{letter } x \to \text{letter } y)$ for a language can be interpreted as

$$q(\text{letter } x) = \frac{\text{number of occurrences of } x \text{ as the first letter}}{\text{number of words}}$$

$$p(\text{letter } x \to \text{letter } y) = \frac{\text{number of pairs } (x, y)}{\sum_{z \in \text{letter set}} \text{number of pairs } (x, z)}$$

The equations in (12) are used to determine the Markov language models from the training text document.
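A direct transcription of (12) into code might look as follows (a minimal sketch; the dictionary-based representation and function name are ours). It counts first letters and adjacent letter pairs over a list of training words and normalises the counts, producing the $(q, p)$ structures consumed by the earlier scoring sketch:

```python
from collections import Counter

def estimate_model(words):
    """Maximum likelihood estimates of equation (12).

    Returns (q, p): q maps a letter to its initial probability, and
    p maps a (previous, current) letter pair to its transition
    probability, normalised per previous letter.
    """
    first = Counter(w[0] for w in words)           # n_i: first-letter counts
    pairs = Counter((a, b) for w in words
                    for a, b in zip(w, w[1:]))     # n_ij: letter-pair counts
    n_words = sum(first.values())                  # equals L by (9)
    out_total = Counter()                          # sum over s of n_is
    for (a, _b), n in pairs.items():
        out_total[a] += n
    q = {letter: n / n_words for letter, n in first.items()}
    p = {(a, b): n / out_total[a] for (a, b), n in pairs.items()}
    return q, p
```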

III. LANGUAGE IDENTIFICATION ALGORITHMS

A. Training Algorithm

Given a training language document, it is first preprocessed to remove all special characters and punctuation marks such as commas, colons, semicolons, quotes, full stops, exclamation marks, question marks, signs, etc. The next step is to convert all characters to lowercase. The initial and transition probabilities are then calculated. The algorithm can be summarized as follows (a preprocessing sketch in code follows the list):

• Using the training sets of all languages to be identified, determine a common letter set containing M alphabetical letters, and save the letter set for identification purposes.

• For each training language set, remove all special characters and convert all letters to lowercase to obtain the set of words X.

• Using all words in the set X, calculate the initial and transition probabilities according to (12).

• Save all the probability values to a set λ and regard this set as the language model for that language.
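The preprocessing step can be sketched as follows. This is an illustration under our own assumptions about what counts as a special character; the paper determines the letter set from the training data rather than fixing it in advance, and the minimum word length of 3 comes from Section IV.

```python
import re

def preprocess(text):
    """Lowercase a document and keep only alphabetic words.

    Punctuation, digits and other special characters are treated as word
    separators (an assumption of this sketch; the accented-letter list
    below is illustrative, not the paper's letter set).
    """
    words = re.split(r"[^a-zàâäáéèêëïîíìóòôöùûüúçñæøåß]+", text.lower())
    return [w for w in words if len(w) >= 3]  # minimum word length used in Section IV
```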

B. Identification Algorithm

Given an unknown language string, it is preprocessed in the same way as the training documents. For each language model, the probability of the unknown string given the model is calculated, and the maximum likelihood decision rule is used to identify the language. The algorithm can be summarized as follows (see the sketch after this list):

• Read all the language models and the letter set obtained from the training session.

• Preprocess the unknown string: remove all special characters and convert all letters to lowercase to obtain the set of words X.

• For each language model, calculate the probability of the word set X using (10).

• Classify the unknown string to the language that has the maximum probability.
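Putting the pieces together, identification is a log-likelihood argmax over the trained models. The sketch below reuses the hypothetical preprocess, estimate_model and log_likelihood helpers from the earlier sketches:

```python
def identify(string, models):
    """Maximum likelihood decision rule: return the language whose model
    assigns the preprocessed string the highest log-likelihood."""
    words = preprocess(string)
    return max(models, key=lambda lang: log_likelihood(words, *models[lang]))

# `training_documents` (assumed given) maps each language name to its
# training text; each model is the (q, p) pair from estimate_model.
models = {lang: estimate_model(preprocess(doc))
          for lang, doc in training_documents.items()}
print(identify("el zorro marron salta sobre el perro", models))
```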

IV. EXPERIMENTAL RESULTS

Seven Roman-alphabet languages were used in our experiments: English, French, German, Indonesian, Italian, Norwegian and Spanish. Language documents were collected from randomly chosen Web pages. For each language, a set of 3000 distinct words was collected for training the language model. The minimum word length was set to 3 so that the training sets could also be used for the current trigram method. A set of 52 distinct letters was extracted from the 7 language training data sets. The training algorithm was used to train the 7 language models. For identification, we used the identification algorithm with different lengths for the test strings: 1, 3, 5, 10, 15, 20, 25, 30, 40 and 50 words. For each string length, a set of 5000 test strings was collected from Web pages other than those used for collecting the training data sets.

[Figure 2. Identification error rates (in %) vs. number of words in test strings for the trigram-based method and the Markov chain-based method. Number of test strings = 5000.]

The proposed Markov chain-based method was compared with the current trigram-based method. Experimental results are presented in Figure 2. Both methods achieved lower identification error rates with longer test strings. When the string length was 40 words or longer, the error rates were close to zero. However, the proposed Markov chain-based method outperformed the trigram-based method when the string length was between 3 and 20 words. Table I shows the confusion matrix for the proposed Markov chain-based method. The number of test strings was 5000 and the string length was set to 10 words. The identification rates for English, French, German, Indonesian, Italian, Norwegian and Spanish were 99.4%, 98.7%, 98.7%, 98.1%, 93.7%, 97.7% and 96.5%, respectively. Most misclassified strings were identified as English.

TABLE I. CONFUSION MATRIX FOR THE MARKOV CHAIN-BASED METHOD. NUMBER OF TEST STRINGS = 5000, STRING LENGTH = 10 WORDS

              English   French   German   Indonesian   Italian   Norwegian   Spanish
English         4968        0        ?            ?         ?           ?         ?
French            42     4937        ?            ?         ?           ?         ?
German            38       13     4937            0         6           6         0
Indonesian        60        0        5         4904         6           1         4
Italian          193        1        ?            ?      4683           ?         ?
Norwegian        109        0        1            1         5        4884         0
Spanish           58       47       11            1        58           0      4825

("?" denotes a cell value that could not be recovered.)


Table II shows the confusion matrix for the trigram-based method. The number of test strings and the string length were the same as in Table I. The identification rates for English, French, German, Indonesian, Italian, Norwegian and Spanish were 93.5%, 84.2%, 98.3%, 98.6%, 79.4%, 99.1% and 88.3%, respectively.

TABLE II. CONFUSION MATRIX FOR THE TRIGRAM-BASED METHOD. NUMBER OF TEST STRINGS = 5000, STRING LENGTH = 10 WORDS

              English   French   German   Indonesian   Italian   Norwegian   Spanish
English         4674        0      259            0         0          67         0
French           513     4210      210            8         5          30        24
German            60        0     4913            7         0          18         2
Indonesian        33        1       16         4928         0          21         1
Italian          538      260       74           13      3971          26       118
Norwegian         19        0       25            0         2        4954         0
Spanish          250       75       55           39       149          19      4413
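For clarity, the per-language identification rates quoted above are simply the diagonal of the confusion matrix divided by the number of test strings per language. A small helper (ours, assuming the matrix is stored as a dictionary of rows) makes this explicit:

```python
def identification_rates(matrix, n_strings=5000):
    """Per-language rate: correctly identified strings over test strings."""
    return {lang: row[lang] / n_strings for lang, row in matrix.items()}

# e.g. {"Italian": {"Italian": 3971, ...}, ...} gives Italian: 3971/5000 = 79.4%
```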

V. CONCLUSION

A Markov chain-based method for automatic written language identification has been presented in this paper. The occurrence of letters in a word is regarded as a stochastic process, so a word can be represented as a Markov chain in which the letters are states. The initial and transition probabilities for the Markov chains representing the words in a training document are calculated, and the set of these probabilities is regarded as a Markov language model. Experiments were performed on seven languages (English, French, German, Indonesian, Italian, Norwegian and Spanish) using 3000 words per language to train the models and 5000 strings per language for evaluation. The results showed that the proposed Markov chain-based method outperformed the trigram-based method when the string length was between 3 and 20 words. Moreover, the number of letters in any language is always much smaller than the number of trigrams (only 52 letters were found across the seven languages), so identification with the Markov chain-based method is much faster than with the trigram-based method. Note that the Markov language model differs from the bigram method: given a bigram (a string of two letters), its probability in the bigram method is the ratio of its frequency to the sum of the frequencies of all bigrams, whereas in the Markov language model the frequency is normalised only over the bigrams that share the same first letter.
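In symbols, with $n_{xy}$ denoting the frequency of the letter pair $(x, y)$, the distinction is:

```latex
P_{\text{bigram}}(x, y) = \frac{n_{xy}}{\sum_{u}\sum_{v} n_{uv}},
\qquad
p(x \to y) = \frac{n_{xy}}{\sum_{z} n_{xz}}.
```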

REFERENCES

[1] V. Castelli and L.D. Bergman, editors, Image Databases. Wiley, New York, 2002.
[2] W.B. Cavnar and J.M. Trenkle, "N-gram-based text categorization", Proc. 3rd Annual Symposium on Document Analysis and Information Retrieval, pp. 161-175, 1994.
[3] R.A. Cole, J. Mariani, H. Uszkoreit, G.B. Varile, A. Zaenen and A. Zampolli, editors, Survey of the State of the Art in Human Language Technology. Cambridge University Press, 1998.
[4] E. Giguet, "Multilingual text tokenization in natural language diagnosis", Proceedings of the 4th International Conference on Artificial Intelligence, Australia, 1996.
[5] V.J. Hodge and J. Austin, "A comparison of a novel neural spell checker and standard spell checking algorithms", Pattern Recognition Letters, vol. 35, pp. 2571-2580, 2002.
[6] T. Joachims, Learning to Classify Text Using Support Vector Machines. Kluwer, Boston, 2002.
[7] S. Johnson, "Solving the problem of language recognition", Technical report, School of Computer Studies, University of Leeds, 1993.
[8] Y.K. Muthusamy and A.L. Spitz, "Automatic language identification", in Survey of the State of the Art in Human Language Technology, R.A. Cole, J. Mariani, H. Uszkoreit, G.B. Varile, A. Zaenen and A. Zampolli, editors. Cambridge University Press, 1998.
[9] T.D. Pham and D. Tran, "VQ-based written language identification", Proc. of the Seventh International Symposium on Signal Processing and its Applications, Paris, France, vol. I, pp. 513-516, 2003.
[10] J.C. Schmitt, "Trigram-based method of language identification", U.S. Patent 5062143, October 1991.
[11] J.R. Ullman, "A binary n-gram technique for automatic correction of substitution, deletion, insertion and reversal errors in words", The Computer Journal, vol. 20, no. 2, pp. 141-147, 1977.