
The 17th National Computer Conference (Informatics in the Service of the Guests of the Merciful), King Abdulaziz University, Madinah (Safar 1425 AH / April 2004)

A Lexicon Based System with Multiple HMMs to Recognise Typewritten and Handwritten Arabic Words

M S Khorsheed
Computer Engineering Dept., King Saud University, P O Box 51178, Riyadh 11543, Saudi Arabia. Email: [email protected]

ABSTRACT. A new method to recognise words in Arabic handwritten manuscripts is presented. The method injects the spectral features extracted from an input word image into a group of previously trained word models, each of which is a single hidden Markov model. The likelihood of the input pattern is calculated against each model, and the pattern is assigned to the model with the highest probability. The corpus includes sophisticated computer-generated fonts as well as handwritten scripts. Recognition results of the proposed method are compared with those of a template-based system.

1. Introduction

Arabic script presents challenges because its orthography is cursive and letter shapes are context sensitive. Previous research on Arabic script recognition has tended to use an optical character recognition approach that depends on segmenting words [12]. The word is first segmented into either primitives [2] or characters [1]; features are then extracted from the segments, and the word is finally classified by comparing the features with a model. Another technique [10] is to decompose the word skeleton into small strokes (pieces of characters), which are then transformed into a sequence of observations and fed to a hidden Markov model (HMM) [6]. The Viterbi algorithm [4] is used to find the best path through the model, which gives the recognition result.

These studies have confirmed the difficulties in attempting to segment Arabic words into individual characters. These difficulties probably arise from context sensitivity, co-articulation effects, and stylistic features such as overlaps. Methods based on finding stroke paths or contours of the word are overly sensitive to noise.

This paper implements a global approach to recognising cursive Arabic typewritten and handwritten words. Each word in the vocabulary is represented by a distinct HMM. The word models are trained using the Fourier feature sets extracted from the training samples belonging to those words. It is not necessary to segment the word, nor to find the stroke paths or contours of the word. The recognition of an unknown word is based on computing the likelihood of its image against each word model and assigning it to the word model with the highest probability.

2. The Fourier Feature Set

Given a word image, the proposed method obtains a feature set that is invariant to the Poincaré group of transformations: dilation, translation and rotation. A two-step process was applied to transform the word into a two-dimensional Fourier spectrum [9]. The first step constructed a normalised polar map of the word, which is invariant under dilation and rotation.


The second step applied the two-dimensional Fourier transform to the polar image. The resulting feature set tolerates changes in size, position and rotation.

In [8], the recognition system was based on template matching: each word was represented by a template comprising a set of Fourier coefficients. The matching process measured the Euclidean distance between the input pattern and each of the word templates, and assigned the input pattern to the word with the minimum distance. The system performed well when applied to computer-generated fonts, but poorly when word samples were taken from more complex fonts, typewritten and handwritten: the templates were not robust enough to capture the features of all the participating fonts. This was due to averaging the Fourier spectrum magnitudes that formed the templates; the averaging produced a Fourier spectrum magnitude set that was severely distorted with respect to each individual font. This motivated the proposed scheme, in which a learning method trains each word model on spectral features extracted from each font, enabling the model to efficiently recognise word samples from all the fonts it was trained on.
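As an illustration of the two-step transform, the following Python sketch resamples a binary word image onto a centroid-centred, radius-normalised polar grid and takes the 2-D FFT magnitude. The function name, sampling resolutions and normalisation details are our assumptions; the exact procedure of [9] may differ.

```python
import numpy as np

def polar_fourier_features(img, n_r=64, n_theta=64):
    """Map a binary word image to a Fourier spectrum magnitude.

    An illustrative sketch of the two-step transform described above:
    (1) resample the image onto a normalised polar grid centred on the
    word's centroid, (2) take the 2-D FFT magnitude.
    """
    ys, xs = np.nonzero(img)                  # foreground pixels
    cy, cx = ys.mean(), xs.mean()             # centroid = polar origin
    r_max = np.hypot(ys - cy, xs - cx).max()  # radius normalisation: dilation invariance
    polar = np.zeros((n_r, n_theta))
    for i in range(n_r):
        for j in range(n_theta):
            r = (i + 0.5) / n_r * r_max
            t = 2 * np.pi * j / n_theta
            y = int(round(cy + r * np.sin(t)))
            x = int(round(cx + r * np.cos(t)))
            if 0 <= y < img.shape[0] and 0 <= x < img.shape[1]:
                polar[i, j] = img[y, x]
    # A rotation of the word becomes a cyclic shift along the theta axis;
    # the FFT magnitude is invariant to such shifts.
    return np.abs(np.fft.fftshift(np.fft.fft2(polar)))
```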

Fig. (1): The Fourier spectrum decomposition: (a) WRD [3], and (b) the proposed scheme.

3. Feature Encoding

Unlike the speech signal, which is one-dimensional, the resulting Fourier spectrum is a two-dimensional signal. This is prohibitive for the subsequent vector quantisation (VQ) technique [5] and feature extraction operations. In [3], the Fourier spectrum was divided into 32 wedges and 32 sectors, resulting in a 64-dimensional vector, Figure (1-a). Since the word image is a real signal, half of the Fourier spectrum is sufficient to reconstruct the entire image; hence, only this half is divided into six sectors, as shown in Figure (1-b). Each sector is a 30° wedge-shaped sampling window. The Fourier coefficients belonging to each sector represent the contributions of a separate direction group within the Fourier spectrum. Dividing the Fourier spectrum has the advantage of reducing the number of samples required and reducing the effect of small variations.

The number of Fourier coefficients varies from one sector to another. Using a single VQ for all sectors of the Fourier spectrum does not preserve the sequential characteristics of these sectors. This can be remedied by treating the Fourier spectrum as a sequence of sectors, where each sector is represented by a separate VQ codebook. This approach is referred to as segmental VQ, and it is widely implemented in speech recognition [5]. As in a single vector quantiser, each set of S successive codebook symbols represents one class; here S = 6. The discriminant score of a class equals the average of the distance measures obtained from the successive vector quantisers.

Assume I word samples as a training set, where each sample has S Fourier spectrum sectors a^s, s = 1, 2, ..., S. There are S codebooks, each containing M vectors â_m^s, m = 1, 2, ..., M and s = 1, 2, ..., S. The average distance measure of assigning a word sample to one class of S successive symbols is

$$
D_M = \frac{1}{S} \sum_{s=1}^{S} \left\{ \frac{1}{I} \sum_{i=1}^{I} \min_{1 \le m \le M} \left[ d(\hat{a}_m^s, a_i^s) \right] \right\}
\tag{1}
$$

where d(â_m^s, a_i^s) is the Euclidean distance between the codebook vector â_m^s and an input vector a_i^s. This increases the complexity of codebook storage by a factor of S, while preserving the same computational complexity as a single-codebook vector quantiser.
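As an illustration of this scoring, the following sketch computes D_M directly from the per-sector codebooks. The function name and data layout are our assumptions, not the paper's.

```python
import numpy as np

def average_distance(codebooks, samples):
    """Average distance D_M of Eq. (1).

    codebooks: list of S arrays, each of shape (M, d_s): one codebook
               per sector (d_s may vary from sector to sector).
    samples:   list of I word samples, each a list of S sector vectors.
    """
    S, I = len(codebooks), len(samples)
    total = 0.0
    for s in range(S):
        sector_sum = 0.0
        for sample in samples:
            # Euclidean distance to every codebook vector; keep the minimum.
            sector_sum += np.linalg.norm(codebooks[s] - sample[s], axis=1).min()
        total += sector_sum / I          # average over the I samples
    return total / S                     # average over the S sectors
```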

4. HMM Implementation

HMMs are a powerful tool in the field of signal processing. They have been used successfully in speech recognition [5]. More recently, the application of HMMs has been extended to word recognition [7]. This is due to the similarities between speech and written words: both involve co-articulation, which suggests processing symbols with ambiguous boundaries and variations in appearance.

Consider a system that at any instant of time may be in only one of N states. The system moves from one state to another according to a set of probabilities associated with each state, the state transition probabilities A. These transitions take place at regularly spaced, discrete points in time. The physical output of the system while it is in a given state may be modelled by estimating the observation symbol probabilities B. The word sample is thus transformed into a finite sequence of observations.

The system, shown in Figure (2), is designed using multiple HMMs, where a separate model represents each word in the lexicon; such a system is suitable for small to medium size vocabularies. The system operates in two phases: training and recognition. In the training phase, the HMMs are built using a labelled training set. In the recognition phase, an input word observation sequence is used to compute a probability score for each word model, and the word model with the highest score is selected as the recognised word; a sketch of this decision rule follows Figure (2).

Fig. (2): The block diagram of the HMM-based word recognition system.
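To make the recognition phase concrete, the following sketch scores an observation sequence against every word model with the scaled forward algorithm [5] and returns the best-scoring word. It is an illustrative sketch assuming discrete observation symbols and word models stored as (pi, A, B) NumPy arrays; the function names are ours, not from the paper.

```python
import numpy as np

def forward_log_prob(pi, A, B, obs):
    """Log P(O | lambda) for a discrete HMM, via the scaled forward
    algorithm [5]. pi: (N,) initial probabilities, A: (N, N) transition
    matrix, B: (N, M) symbol probabilities, obs: a sequence of symbol
    indices (here six, one per Fourier sector)."""
    alpha = pi * B[:, obs[0]]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = (alpha @ A) * B[:, obs[t]]
        scale = alpha.sum()          # rescale to avoid underflow
        log_p += np.log(scale)
        alpha = alpha / scale
    return log_p

def recognise(word_models, obs):
    """Score obs against every word model and return the lexicon word
    whose model attains the highest likelihood."""
    scores = {w: forward_log_prob(pi, A, B, obs)
              for w, (pi, A, B) in word_models.items()}
    return max(scores, key=scores.get)
```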

Three factors affect the determination of the optimum HMM for each lexicon word: the model structure, the estimation of the model parameters, and the training observation sequences. The model structure implemented here is a left-to-right HMM, which proceeds sequentially through the states starting from state number one. The model can be generalised to any number of states, though accurately estimating the model parameters becomes difficult if the number of states per model grows too large. In this paper, the number of states N equals 6. This constrained serial model allows only one transition from a state to its successor. It can be represented mathematically as

$$
a_{ij} = 0 \quad \text{if } i > j \text{ or } i < j - 1, \qquad i, j = 1, 2, \ldots, N
$$
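For N = 6, this constraint makes the transition matrix upper bidiagonal: only the self-transition a_ii and the one-step forward transition a_{i,i+1} of each state may be non-zero, and row normalisation forces a_66 = 1 in the final state. The matrix below is our illustration, implied by the constraint rather than reproduced from the paper:

$$
A = \begin{pmatrix}
a_{11} & a_{12} & 0 & 0 & 0 & 0 \\
0 & a_{22} & a_{23} & 0 & 0 & 0 \\
0 & 0 & a_{33} & a_{34} & 0 & 0 \\
0 & 0 & 0 & a_{44} & a_{45} & 0 \\
0 & 0 & 0 & 0 & a_{55} & a_{56} \\
0 & 0 & 0 & 0 & 0 & a_{66}
\end{pmatrix}
$$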

The estimation of the model parameters requires tuning only the A and B parameters. The initial state probability π_i is binary: 1 when i = 1 and 0 otherwise. The training observation sequences should reflect the diversity of fonts the system is meant to recognise, since the recognition system aims to obtain font-independent models. The observation sequence O actually consists of several independent sequences O^k, k = 1, 2, ..., K, where O^k is a training sequence belonging to one of the participating fonts and K is the total number of training samples. The likelihood of multiple observation sequences is handled in a two-step process. The first step calculates P(O^k | λ) for each sequence. The second step maximises the product of all the probabilities

$$
P = \prod_{k=1}^{K} P(O^k \mid \lambda)
\tag{2}
$$
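Since the product in Eq. (2) underflows rapidly for realistic K, implementations usually work with the sum of log-likelihoods instead. A minimal sketch, where the scorer argument could be the forward-algorithm function sketched earlier (names are illustrative):

```python
def total_log_likelihood(log_prob, model, sequences):
    """Log of the product in Eq. (2): summing per-sequence
    log-likelihoods avoids the underflow that multiplying K raw
    probabilities would cause. `log_prob` is any scorer returning
    log P(O^k | lambda), e.g. forward_log_prob sketched earlier,
    and model = (pi, A, B)."""
    return sum(log_prob(*model, obs) for obs in sequences)
```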

The Baum-Welch training procedure [5] is guaranteed to reach only a local maximum; a different starting point for the model parameter values could yield models with higher or lower values of P. Thus, the state transition probabilities A and the observation symbol probabilities B are assigned different randomly selected values, followed by a normalisation process to satisfy the constraints

$$
\sum_{j=1}^{N} a_{ij} = 1, \qquad i = 1, 2, \ldots, N
$$

$$
\sum_{k=1}^{M} b_j(k) = 1, \qquad j = 1, 2, \ldots, N
$$

where M represents the number of symbols.
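One plausible way to realise this random initialisation, honouring the left-to-right structure and the normalisation constraints above, is sketched below. This is an illustration, not the paper's exact procedure.

```python
import numpy as np

def random_left_to_right_init(N=6, M=128, rng=None):
    """Randomly initialise (pi, A, B) subject to the constraints above.

    An illustrative sketch: pi places all mass on state 1, A is kept
    upper bidiagonal to honour the left-to-right structure, and every
    row of A and B is normalised to sum to one.
    """
    rng = rng or np.random.default_rng()
    pi = np.zeros(N)
    pi[0] = 1.0                          # pi_i = 1 iff i = 1
    A = np.zeros((N, N))
    for i in range(N - 1):
        A[i, i:i + 2] = rng.random(2)    # self-loop and one-step forward only
    A[N - 1, N - 1] = 1.0                # the final state loops on itself
    A /= A.sum(axis=1, keepdims=True)
    B = rng.random((N, M))
    B /= B.sum(axis=1, keepdims=True)
    return pi, A, B
```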

Fig. (3): Four Arabic fonts: (a) Simplified, (b) Traditional, (c) Andalus, and (d) Thuluth.

Fig. (4): Two handwritten font samples extracted from an Arabic manuscript.

5. Recognition Results

To assess the performance of the proposed method, it was applied to two different cases. In each case, the vector quantiser was trained using a set of Fourier spectrum magnitude vectors.

In case one, C1, the training vectors were obtained using all word samples representing four computer-generated fonts: Simplified Arabic, Andalus, Arabic Traditional and Thuluth, see Figure (3). The training set contains 10,440 vectors. In case two, C2, two of the computer-generated fonts were replaced with two handwritten scripts. Both handwritten styles belong to a manuscript entitled "Jamharat Annasab Li Ibn Alkalbi" by "Hisham Abu Almunthir Ibn Mohammad Alkalbi". The manuscript was first written during the Abbasid caliphate (c. 750-1543 CE); for more details about this manuscript, refer to [11]. Figure (4) shows samples of those handwritten fonts.

Applying the K-means clustering algorithm [5], six successive vector quantisers were generated, each with a separate codebook. While running the algorithm, the average distortion performance criterion, D_M of Eq. (1), was monitored. Figure (5) shows the plots of D_M versus M, on a log scale, for M = 64, 128 and 256. One can notice a large decrease in the average distortion when the number of clusters in each codebook is increased from M = 64 to M = 128; this significant difference justifies the extra computation required by the larger codebook. The difference in average distortion between M = 128 and M = 256 is small, however, so the same justification does not apply there. This is also reflected in the performance of the HMM-based recognition system.
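A plain K-means pass of the kind described above can be sketched as follows; one such codebook would be built per sector, and averaging the per-sector distortions gives D_M. The function name, iteration count and stopping rule are our assumptions.

```python
import numpy as np

def kmeans_codebook(vectors, M, iters=50, rng=None):
    """Build one sector codebook with plain K-means [5].

    An illustrative sketch: `vectors` is an (n, d) array of Fourier
    magnitude vectors for one sector and M is the codebook size.
    Returns the codebook and the average distortion of the final
    assignment, which corresponds to the quantity plotted in Figure (5).
    """
    rng = rng or np.random.default_rng()
    vectors = np.asarray(vectors, dtype=float)
    codebook = vectors[rng.choice(len(vectors), size=M, replace=False)]
    for _ in range(iters):
        # Distance from every vector to every codebook entry.
        dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned vectors.
        for m in range(M):
            if np.any(labels == m):
                codebook[m] = vectors[labels == m].mean(axis=0)
    distortion = dists[np.arange(len(vectors)), labels].mean()
    return codebook, distortion
```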

Fig. (5): Plots of the average distortion D_M versus codebook size M (log scale, M = 64, 128, 256), for Case I and Case II.

Fig. (6): Recognition rates of typewritten word samples belonging to C1 versus the number of words in the lexicon (10 to 145), for the template-based recogniser and for C1(33%,128), C1(66%,128), C1(33%,256) and C1(66%,256).

5.1. C1 Sample Results

More than 1,700 samples representing 145 words were used to assess the performance of the HMM/VQ word recogniser. The samples were rendered at random rotation angles ranging from 0 to 2π, at random sizes ranging between 18 pt and 48 pt, and at random translations of up to twice the size of the sampled word.

Four experiments were performed. Using six 128-symbol codebooks, the first experiment assigned each of the six Fourier spectrum sectors to one symbol, so the word image was transformed into a six-observation sequence. Each word model was then trained using observation sequences randomly selected from the four fonts. This experiment used 33% of the data set to train the word models and is referred to as C1(33%,128). The second experiment used the same number of symbols per codebook and 66% of the data set for training; it is referred to as C1(66%,128). In the third experiment, each codebook was partitioned into 256 symbols; like the first experiment, 33% of the data set was used for training, and the experiment is referred to as C1(33%,256). The fourth experiment is similar except that 66% of the data set was used for training; it is referred to as C1(66%,256).
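The train/test protocol of these experiments can be illustrated by a simple random split per lexicon word. This is an assumed helper, not code from the paper.

```python
import numpy as np

def split_train_test(samples_by_word, train_frac=0.33, rng=None):
    """Randomly split each word's observation sequences into training
    and test sets, mixing the participating fonts, in the spirit of
    the C1(33%, .) and C1(66%, .) experiments."""
    rng = rng or np.random.default_rng()
    train, test = {}, {}
    for word, seqs in samples_by_word.items():
        order = rng.permutation(len(seqs))
        cut = int(train_frac * len(seqs))
        train[word] = [seqs[i] for i in order[:cut]]
        test[word] = [seqs[i] for i in order[cut:]]
    return train, test
```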

Fig. (7): The effects of codebook size (128 vs. 256 clusters) and training set size on the recognition rate for C1.

Fig. (8): Three-dimensional depictions of the confusion matrices for the computer-generated fonts, subjected to 252,300 recognition tests: (a) the template-based recogniser; (b) the HMM-based recogniser, with 66% of the data set used for training.

Figure (6) shows the recognition results of the four experiments together with the performance of the template-based recogniser described in [8]. Two remarks can be drawn from this figure. First, at least a 7% increase in the recognition rate was achieved when using the HMMs; this is because, unlike the word templates, which were distorted by averaging across different writing styles, the word models adjust their parameters to learn the features extracted from each font. Second, the recognition rate grows with the amount of observation sequences used to train the word models: an increase of 9% was achieved when the size of the training set was doubled from 33% to 66%, see Figure (7).

Fig. (9): Recognition rates of word samples belonging to C2 versus the number of words in the lexicon (10 to 140), for the template-based recogniser, C2(128 clusters) and C2(256 clusters).

Figure (8) shows the confusion matrices of the computer-generated fonts in two different situations: the template-based recogniser of [8] and the system proposed here.

Table (1): Recognition rates (%) of word samples in four fonts, representing the 145-word lexicon.

Font                 Recognition System    Top-1   Top-5   Top-10
HW (1)               Template-Based          66      82      89
                     HMM 128 Clusters        77      87      90
                     HMM 256 Clusters        80      86      89
HW (2)               Template-Based          80      86      92
                     HMM 128 Clusters        87      93      96
                     HMM 256 Clusters        90      94      96
Simplified Arabic    Template-Based          88      96      98
                     HMM 128 Clusters        90      95      96
                     HMM 256 Clusters        89      95      98
Arabic Traditional   Template-Based          95      99      99
                     HMM 128 Clusters        88      96      98
                     HMM 256 Clusters        90      95      96

5.2. C2 Sample Results

Two experiments were performed. The first used 128 symbols per codebook, whereas the second used 256. Both experiments used 44% of the data set for training the word models. Figure (9) illustrates the recognition results of the two experiments together with the performance of the template-based recogniser described in [8].

Table (1) shows the recognition rates of each font separately, comparing the two experiments with the template-based recogniser. The results are based on 174,435 recognition tests. The first column shows the rate at which a word in a given font was recognised as the first choice (i.e. the model with the highest probability), and the other two columns give the recognition rates when the correct word appears among the first five and first ten choices, respectively.

6. Conclusion

A method for recognising cursive words in Arabic manuscripts has been presented. The method was designed using multiple HMMs, where each word in the lexicon was represented by a separate HMM.

The word models were trained using the Fourier spectra of the word samples. The Fourier spectrum of a word sample was divided into six wedge-shaped sectors, and each sector was assigned to one codebook symbol; this transformed the word image into a six-observation sequence. To accommodate the varying sector sizes, sequential VQ was implemented, requiring six codebooks, one for each sector group. Part of the data set was used for training the word models, while the rest was used for assessing the recognition system's performance. The system achieved a higher recognition rate than the template-based recogniser described in [8].

REFERENCES

[1] Amin, A. and Mari, J., Machine Recognition and Correction of Printed Arabic Texts, IEEE Trans. on Systems, Man and Cybernetics, vol. 19, no. 5, pp. 1300-1306, 1989.
[2] Parhami, B. and Taraghi, M., Automatic Recognition of Printed Farsi Texts, Pattern Recognition, vol. 14, no. 6, pp. 395-403, 1981.
[3] Casasent, D. and Sharma, V., Fourier-Transform Feature Space Studies, Proc. of the International Society for Optical Engineers (SPIE), vol. 449, pp. 2-8, 1983.
[4] Forney, G., The Viterbi Algorithm, Proc. of the IEEE, vol. 61, no. 3, pp. 268-278, 1973.
[5] Rabiner, L. and Juang, B., Fundamentals of Speech Recognition, Prentice Hall, 1993.
[6] Rabiner, L., A Tutorial on HMM and Selected Applications in Speech Recognition, Proc. of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.
[7] Chen, M., Kundu, A. and Zhou, J., Off-line Handwritten Word Recognition Using an HMM Type Stochastic Network, IEEE Trans. on PAMI, vol. 16, no. 5, pp. 481-496, 1994.
[8] Khorsheed, M. S. and Clocksin, W. F., Multi-font Arabic Word Recognition Using Spectral Features, Proc. of the 15th International Conference on Pattern Recognition, vol. 4, pp. 543-546, Spain, 2000.
[9] Khorsheed, M. S. and Clocksin, W. F., Spectral Features for Arabic Word Recognition, Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2000.
[10] Khorsheed, M. S. and Clocksin, W. F., Structural Features of Cursive Arabic Script, Proc. of the 10th British Machine Vision Conference, vol. 2, pp. 422-431, 1999.
[11] Khorsheed, M. S., Automatic Recognition of Words in Arabic Manuscripts, PhD Dissertation, University of Cambridge, 2000.
[12] Khorsheed, M. S., Off-line Arabic Character Recognition - A Review, Pattern Analysis & Applications, vol. 5, no. 1, pp. 31-45, 2002.


A System for the Automatic Recognition of Typewritten and Handwritten Arabic Words Based on Hidden Markov Models

Mohammad Khorsheed
Computer Engineering Department, King Saud University, Riyadh, Saudi Arabia

ABSTRACT. This research presents a new method for recognising the Arabic words found in handwritten manuscripts. Spectral features are extracted from the word image and then injected into previously trained word models. Each word in the lexicon is represented by an independent Markov model. The likelihood of any new sample is computed against all the word models, and the word whose model obtains the highest score is selected. The main part of the corpus contains computer-generated fonts in addition to handwritten scripts.
