2010 International Conference on Pattern Recognition

Shape Code based Word-image Matching for Retrieval of Indian Multi-lingual Documents

Arundhati Tarafdar, Ranju Mondal, Srikanta Pal, Umapada Pal
Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata-700108, India
Email: [email protected]

Fumitaka Kimura
Graduate School of Engineering, Mie University, Kurimamachiya-cho, Tsu, Japan
Email: [email protected]

Abstract—Retrieving information from document images is a challenging problem. In this paper we propose a shape code based word-image matching (word-spotting) technique for retrieval of multilingual documents written in Indian languages. Here, each query word image to be searched is represented by a primitive shape code using (i) zonal information of extreme points, (ii) a vertical shape based feature, (iii) crossing count (with respect to vertical bar position), (iv) loop shape and position, (v) background information, etc. Each candidate word of the document (a word having aspect ratio and topological features similar to those of the query word) is coded in the same way. Then, an inexact string matching technique is used to measure the similarity between the primitive codes generated from the query word image and from each candidate word of the document in which the query image is to be searched. Based on the similarity score, we retrieve the documents where the query image is found. Experimental results on Bangla, Devnagari and Gurumukhi script document image databases confirm the feasibility and efficiency of our proposed approach.

Keywords- Document image processing; word spotting; document image retrieval; shape code; Indian script document image.

1051-4651/10 $26.00 © 2010 IEEE DOI 10.1109/ICPR.2010.490

I. INTRODUCTION

Nowadays many digital documents in different scripts are available on the internet in image format. To retrieve documents from these textual images, in this paper we propose a shape code based word spotting technique for Indian multi-script documents. In recent years, researchers have made many attempts in this area for non-Indian languages. Manmatha et al. [2] described a method of direct template matching by XOR and used the SLH algorithm, which models the distortion as an affine transformation. Rath and Manmatha [3] used the vertical projection profile and the upper and lower boundary projection profiles, and applied Dynamic Time Warping for matching words. Tan et al. [4] discussed a method which uses 3 different codes to encode the extreme positions of the vertical bar pattern of a word and, depending on these codes, forms a feature vector for matching. Lu et al. [6] proposed a technique to retrieve document images by a word shape coding scheme based on topological shape features including character ascenders/descenders, character holes, character water reservoirs, etc. Although there are many algorithms for word-spotting in non-Indian languages [1-6], not much effort has been devoted to Indian languages [10]. In this paper we propose a shape code based word-image matching (word-spotting) technique for retrieval of multi-script Indian documents. In the proposed technique, each query word image to be searched is represented by a code using zonal information of extreme points, a vertical shape based feature, crossing count, loop shape and position, and some background information. The document image to be searched is segmented into lines and words. The words that have characteristics similar to the query word are selected as candidate words. An inexact string matching technique is used between the primitive codes generated from the query word image and each candidate word to retrieve the documents.

II. PROPERTIES OF BANGLA, DEVNAGARI AND GURUMUKHI SCRIPT

Bangla and Devnagari [7,8] are oriental scripts descended from the Brahmi script. These are the two most popular scripts in India. In both scripts, the writing style is from left to right and there is no concept of upper/lower case. Both scripts have about fifty basic characters. Basic characters of Bangla and Devnagari scripts are shown in Fig. 1. Gurumukhi script is used mainly in the Punjab state of India. Modern Gurumukhi has forty-one consonants (vianjan), nine vowel symbols (lāga mātrā), two symbols for nasal sounds (bindī and tippī), and one symbol that duplicates the sound of any consonant (addak).

Figure 1. Basic characters of (a) Bangla and (b) Devnagari

Vowels in these scripts generally take a modified shape in most words and are called modifiers or allographs. Modifiers generally do not disturb the shape of basic characters in the middle zone of a line. If the shape is disturbed in the middle zone, we call the resultant shape a compound character. Vowel modifiers of Bangla and Devnagari scripts are shown in Fig. 2.

Figure 2. Vowel modifiers of Bangla and Devnagari

A text line in such scripts can be partitioned into three zones. The upper zone denotes the portion above the headline, the middle zone denotes the portion between headline and baseline, and the lower zone is the portion below the baseline. The imaginary line separating the middle and lower zones is called the baseline. Different zones of a Devnagari line are shown in Fig. 3. A horizontal histogram technique is used to find the headline and baseline [8].

Figure 3. Different zones of a Devnagari line

III. PRE-PROCESSING AND FEATURE EXTRACTION

First, the digitized gray images are converted to two-tone images using a histogram based thresholding technique. Next, the skew of a document image is corrected and the image is segmented into lines and the lines into words [7]. For line separation, a run-length based smearing algorithm followed by a horizontal histogram is used [8]. For word separation, we statistically analyze the space between two consecutive characters and between two consecutive words of the document. Finally, based on this space information, words are segmented [8]. To get information about the query word image in a document, two steps are necessary: feature extraction and image location detection. Feature extraction computes some information from the word image for easy searching. Image location detection gives the position of the query word in a document. Now we describe the features and their coding used in our word-matching scheme. We propose six word shape coding schemes based on different features; they are detailed as follows.

A. Coding based on extreme points

Since in Devnagari, Bangla and Gurumukhi text the characters in a word are connected through the headline, we detect and delete the headline portion of the word to make the characters of a word ‘more disconnected’, as shown in Fig. 4(b). To remove the headline, we count the number of consecutive object pixels below and above the headline for each column. If the count does not exceed 2*stroke width, we remove those pixels. After deletion of the headline we segment each word into two parts horizontally at the middle row between headline and baseline. The two segmented parts of Fig. 4(b) are shown in Fig. 5. Now, from each disconnected component of the upper and lower part, we find the extreme points. For a component lying in the upper (lower) part, we compute upper (lower) extreme points using profile information from the top (bottom). The extreme points of the upper and lower parts of Fig. 4(b) are marked by gray dots in Fig. 5(a) and (b), respectively.

Now six coding values are employed according to the position of the extreme points. We use code 1 if an extreme point lies above the headline, code 2 if it lies on the headline portion, code 3 (4) if it lies in the upper (lower) half of the middle zone, code 5 if it lies on the baseline and finally code 6 if it lies in the lower zone (for the different zones see Fig. 3). For example, the code obtained from the Bangla word shown in Fig. 4(a) is 5234512415515351533353. Here components are used from left to right to get positional information for coding. Since there are 22 extreme points in Fig. 5 (12 upper and 10 lower extreme points), the length of this coded string is 22.

Figure 4. (a) A Bangla word (b) The word after removing headline

Figure 5. Local extreme points of (a) upper and (b) lower portions of a word shown in Fig. 4(a). Extreme points are marked by gray dots.

B. Vertical shape based coding

In this coding scheme, we find the positions (column values) of the vertical-line-like structures in a word, as shown in Fig. 6(a). Based on these positional values the coding is done using the following formula.

Code value = (10 * column value) / word width

For example, the coding value of the word shown in Fig. 4(a) is 1245577. Since the word shown in Fig. 4(a) has 7 vertical lines, the length of this code is 7.

C. Crossing based coding

In this coding scheme, we segment the portion of the image between two consecutive vertical lines into four equal parts by dividing it at three column positions. We then find the number of crossings in each of these three columns (the crossings for the 2nd and 3rd vertical lines are shown in Fig. 6(b)). The coding is done based on these crossing values. We put the crossing count values sequentially in the code with respect to the corresponding column positions.

Figure 6. (a) Extracted vertical lines from the busy zone of Fig. 4(b) (b) Example of division of distance between two consecutive vertical lines into four equal parts.
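The extreme-point, vertical-line and crossing codes of this section can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the parameters `head_top`, `head_bottom`, `baseline` and `tol`, the use of integer division in the vertical-line code, and the reading of a "crossing" as one run of object pixels met while scanning down a column are all assumptions made for illustration. Extreme points and vertical-line columns are taken as already detected.

```python
def zone_code(y, head_top, head_bottom, baseline, tol=1):
    """Map the row y of an extreme point (rows grow downward) to a code 1-6.

    head_top/head_bottom bound the headline rows and tol is the tolerance
    for "on the baseline"; all of these are hypothetical parameters.
    """
    if y < head_top:
        return 1                          # above the headline (upper zone)
    if y <= head_bottom:
        return 2                          # on the headline portion
    if abs(y - baseline) <= tol:
        return 5                          # on the baseline
    if y > baseline:
        return 6                          # lower zone
    mid = (head_bottom + baseline) / 2
    return 3 if y < mid else 4            # upper / lower half of middle zone


def vertical_line_code(columns, word_width):
    """One digit per vertical line: (10 * column) / word width, truncated."""
    return "".join(str((10 * c) // word_width) for c in columns)


def column_crossings(image, col):
    """Count runs of object pixels (value 1) met while scanning down a column."""
    count, prev = 0, 0
    for row in image:
        if row[col] and not prev:
            count += 1
        prev = row[col]
    return count


def crossing_code(image, left, right):
    """Crossing counts at the three columns that split [left, right] in four."""
    cols = [left + (right - left) * k // 4 for k in (1, 2, 3)]
    return "".join(str(column_crossings(image, c)) for c in cols)


# Toy binary image (1 = object pixel); columns 0 and 8 act as two vertical lines.
demo = [
    [1, 0, 1, 0, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 1, 0, 1, 0, 1],
    [1, 0, 1, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 1, 0, 1, 0, 1],
    [1, 0, 1, 0, 0, 0, 1, 0, 1],
]
print(crossing_code(demo, 0, 8))               # 322
print(vertical_line_code([20, 95, 180], 200))  # 149
```

The column positions and the toy image are invented values, not data from the paper. Truncating to a single digit keeps each code one character long as long as a column position is smaller than the word width.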


D. Coding based on loop position

The number and position of loops vary from word to word. We include this information in our coding scheme. We find the centre of gravity (C.G.) of each loop and normalize the y co-ordinate of the C.G. with respect to the width of the word using the following formula to get the code.

Code value = floor[10 * (y co-ordinate of C.G. / word width)]
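As a minimal sketch of this formula (the function name `loop_position_code` is hypothetical, and the loop centroids are assumed to be already extracted), the code string can be computed as follows. The input values are the centroid co-ordinates and word width quoted in the paper's worked example for Fig. 7.

```python
def loop_position_code(cg_coords, word_width):
    """One digit per loop, left to right: floor(10 * coordinate / word width)."""
    return "".join(str((10 * c) // word_width) for c in cg_coords)


# Centroid co-ordinates 11, 43, 55, 79, 104 and word width 140 (Fig. 7 example).
print(loop_position_code([11, 43, 55, 79, 104], 140))  # 03357
```

Integer division is used here in place of an explicit floor on a floating-point quotient; for non-negative integer inputs the two give the same result.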

For example, the five loops of Fig. 7(a) are shown in Fig. 7(b) and they are coded as 03357. The y co-ordinates of the C.G. of the five loops, computed from left to right, are 11, 43, 55, 79 and 104, respectively, and the width of the word shown in Fig. 7(a) is 140. Using these y co-ordinates and the word width (140) in the above formula, we get the code 03357.

Figure 7. (a) Inverted image (b) Extracted loops (c) Filled-up loops

E. Coding based on loop shape

Although the loop size (height, width, etc.) of a particular character of a font is fixed, the sequence of loop sizes from left to right may differ from one word to another. We use this information for coding. We compute the height and width of all loops found in a word, and the heights (widths) of all such loops are normalized with respect to the maximum height (width) found among them using the following formula.

Code value = floor[10 * (loop height (width) / maximum loop height (width))]

In Fig. 7(b) there are five loops and their heights are 6, 3, 3, 3, 3 from left to right, respectively; they are coded as 84444. The loop widths in Fig. 7(b) are 5, 5, 5, 11, 12 from left to right, respectively, and so they are coded as 33389. These example codes are reproduced when the denominator is taken as the maximum height (width) plus one, as in the background component formula of Section F: for instance, floor(10*6/7) = 8 and floor(10*12/13) = 9. Finally, after concatenating these two codes (height code and width code), we get the final code for the loop shape.

F. Coding based on background components

The background can provide much useful information for our word spotting scheme. Run length information of the background portion between two consecutive characters is used for this purpose. We first detect the background between two characters. Such background portions are marked in black in Fig. 8. Here we have ten components, but we do not consider the leftmost and the rightmost component. We also do not consider loops as background components. Thus the word shown in Fig. 8 has 8 background components. We apply horizontal scanning to each background component to find the maximum (max_i) and minimum (min_i) runs among all the runs in the upper part of the middle zone. We also find the maximum width (m_max) among all max_i values found in a word image to normalize the code. Based on this maximum and minimum run information of the background components, the code is calculated as follows. For each component we get two code values: 10*max_i/(m_max+1) and 10*min_i/(m_max+1). So, if a word has N background components, the length of the code string of the word is 2*N.

Figure 8. Inverted busy zone portion of Fig. 4(a). Here loops are not considered as background.

IV. INEXACT FEATURE MATCHING

For faster retrieval, we first find some candidate words and matching is done only on these words. So, the matching process is subdivided into two steps: (i) selection of candidate words and (ii) matching the query with the candidates.

A. Selection of Candidates

We use three properties for candidate word selection: (i) aspect ratio, (ii) number of loops and (iii) number of background components in a word. Words that are similar to the query word in terms of these properties are selected as candidates.

B. Matching

Based on the different features described above, each candidate word image is described by integer codes. The word-searching problem can then be stated as finding a particular sequence/subsequence in the codes of the candidate and query words. The similarity of two codes, say X and Y, of two words can be computed by dynamic programming, first by finding the length of the Longest Common Subsequence (LCS) [9] and then by finding the Edit Distance (ED) between them. If the LCS length between the query word and the candidate word exceeds a threshold value, we find the Edit Distance between the portions of the two words containing the LCS, that is, the minimum number of operations needed to transform one string into the other. Here an operation is defined as an insertion, deletion or substitution of a single code, or a transposition of two codes, as in the Damerau–Levenshtein distance [11]. If any two coding strings out of six (obtained from the six features discussed in Section III) of a candidate word match those of the query word, we say that the candidate word is a correct retrieval of that query word.
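The inexact matching procedure of Section IV can be sketched in Python as below. This is an illustrative reading, not the authors' code: the thresholds `min_lcs_ratio` and `max_edits` are invented, the edit distance is the optimal string alignment variant of Damerau-Levenshtein, and, as a simplification, it is computed on the whole code strings rather than only on the portions containing the LCS.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (row-wise DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def osa_distance(a, b):
    """Edit distance with insertion, deletion, substitution and transposition
    of adjacent symbols (optimal string alignment form of Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def codes_match(query, cand, min_lcs_ratio=0.8, max_edits=2):
    """Two code strings match if the LCS is long enough and the edit distance
    is small enough. Both thresholds are hypothetical values."""
    if not query or not cand:
        return False
    if lcs_length(query, cand) < min_lcs_ratio * len(query):
        return False
    return osa_distance(query, cand) <= max_edits


def is_retrieved(query_codes, cand_codes, min_features=2):
    """Accept a candidate if at least min_features of its code strings match
    the corresponding query code strings (two out of six in the text)."""
    hits = sum(codes_match(q, c) for q, c in zip(query_codes, cand_codes))
    return hits >= min_features


print(osa_distance("03357", "03375"))  # 1, a single transposition
```

Checking the LCS length first lets clearly dissimilar code strings be rejected before the edit distance is evaluated, which mirrors the two-stage order described in the text.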

V. RESULTS AND DISCUSSIONS

For this experiment we considered a variety of printed documents (e.g. books, newspapers, magazines) of Bangla, Devnagari and Gurumukhi scripts. We computed our results using 90 document pages (40 Bangla, 30 Devnagari, and 20 Gurumukhi). Total number of words in these document pages was 11000 (4000 Bangla, 3000 Devnagari and 2000 Gurumukhi). We used 100 words (50 Bangla, 35 Devnagari and 15 Gurumukhi) of different lengths as query words. To give an idea of the word spotting results of the proposed system, Fig. 9(b) shows one Devnagari image in which the words that match the query word (shown in Fig. 9(a)) are marked in gray.

Figure 9. (a) Devnagari query word (b) Example of a Devnagari text image. The words of the image that match the query word by the proposed approach are marked in gray.

The minor percentage of unsuccessful and false retrievals is mainly due to improper segmentation of words. Sometimes, due to noise, the loop information varies; as a result the corresponding background information also changes and some errors are generated. Distributions of word retrieval results for query words of various lengths in the different scripts are given in Table I. We obtained an overall retrieval accuracy of 95.82%, 96.08% and 92.70% in Bangla, Devnagari and Gurumukhi scripts, respectively.

TABLE I. DISTRIBUTION OF WORD RETRIEVAL RESULTS FOR BANGLA, DEVNAGARI AND GURUMUKHI SCRIPTS

Script Language   Length of Query Word   Successful Retrieval %   False Retrieval %
Bangla            …                      91.50                    2.50
Bangla            …                      93.00                    1.75
Bangla            …                      96.00                    0.75
Bangla            …                      99.10                    0.51
Bangla            =6                     99.50                    0.06
Bangla            Overall Performance    95.82                    1.11
Devnagari         …                      93.20                    3.00
Devnagari         …                      95.00                    1.80
Devnagari         …                      95.00                    0.89
Devnagari         …                      98.30                    0.45
Devnagari         =6                     98.90                    0.01
Devnagari         Overall Performance    96.08                    1.23
Gurumukhi         …                      89.70                    3.10
Gurumukhi         …                      91.50                    2.90
Gurumukhi         …                      91.30                    1.50
Gurumukhi         …                      94.00                    1.30
Gurumukhi         =6                     97.00                    0.80
Gurumukhi         Overall Performance    92.70                    1.90

VI. CONCLUSIONS AND FUTURE WORK

In this paper we have proposed a shape code based word-image matching technique for the retrieval of documents written in Indian languages. Each query word image is searched based on the shape code obtained from (i) zonal information of the extreme points, (ii) the vertical bar code feature, (iii) crossing count information, (iv) loop shape and position, (v) background component information, etc. From the experiments on Bangla, Devnagari and Gurumukhi documents we have obtained encouraging results. In the future, we plan to use more robust features (e.g. contour based information of background portions) to improve the retrieval results and to analyze the errors in detail.

VII. REFERENCES

[1] Lin-Lin Li, “Extraction of Textual Information from Images for Information Retrieval”, Ph.D. thesis, National University of Singapore, October 2009.
[2] R. Manmatha, C. Han and E. M. Riseman, “Word Spotting: A New Approach to Indexing Handwriting”, In Proc. Computer Vision Pattern Recognition, pp. 631-637, 1996.
[3] Toni M. Rath and R. Manmatha, “Word Image Matching Using Dynamic Time Warping”, In Proc. Computer Vision Pattern Recognition, pp. 521-527, 2003.
[4] C. L. Tan, W. Huang, S. Y. Sung, Z. Yu and Y. Xu, “Text Retrieval from Document Images based on Word Shape Analysis”, Applied Intelligence, vol. 18, no. 3, pp. 257-270, 2003.
[5] Y. Lu and C. L. Tan, “Information Retrieval in Document Image Databases”, IEEE Trans. on Know. & Data Engineering, vol. 16, no. 11, pp. 1398-1410, 2004.
[6] S. Lu, L. Li and C. L. Tan, “Document Image Retrieval through Word Shape Coding”, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 1913-1918, 2008.
[7] B. B. Chaudhuri and U. Pal, “An OCR system to read two Indian language scripts: Bangla and Devnagari (Hindi)”, In Proc. 4th International Conference on Document Analysis and Recognition, pp. 1011–1015, 1997.
[8] B. B. Chaudhuri and U. Pal, “A complete printed Bangla OCR system”, Pattern Recognition, vol. 31, pp. 531–549, 1997.
[9] T. H. Cormen, C. E. Leiserson, R. L. Rivest and C. Stein, “Introduction to Algorithms”, The MIT Press, 2001.
[10] A. Balasubramanian, M. Meshesha and C. V. Jawahar, “Retrieval from Document Image Collections”, In Proc. International Workshop on Document Analysis Systems, pp. 1-12, 2006.
[11] http://www.levenshtein.net/index.html