A New Invariant Algorithm for Recognition of Alphabets of Multilingual Documents G. Hemantha Kumar, *P. Shivakumara, S. Noushath, V. N Manjunath Aradhya Department of Studies in Computer Science, University of Mysore, Mysore – 570006, Karnataka, *Email: [email protected]

Abstract OCR is used to translate human-readable characters into machine-readable codes, and it provides a solution for processing large volumes of data automatically in a wide variety of scientific and business applications. Thus OCR is an important component of any document image analysis and recognition system. In this paper, we propose a new algorithm, invariant to image transformations such as Rotation, Scaling and Translation (RST), to recognize the alphabets of different languages such as Kannada, Telugu, Tamil, Malayalam, Persian, English upper case, English lower case and numerals. The method maps the spatial coordinates of the image into the polar domain, and features are extracted by counting the number of black pixels and white pixels in each ring. These features are invariant to RST; however, the accuracy of this method alone is low. To improve accuracy we also consider a method based on distance measures such as Euclidean and city-block distance, which extracts features by estimating the distance between the first black pixel and the centroid, and between the last black pixel and the centroid, in 8 directions. However, this method is not invariant to rotation and scaling [16]. The experimental results reveal that the method based on distance measures gives 100% accuracy with fewer computations and less time than the method based on rings formed on the image in polar coordinates, for both thinned and thick characters. Keywords: Polar coordinates, Rings construction, Invariant feature, Distance measure, Character recognition.

1. Introduction Many OCR systems have been developed for recognizing different languages through character recognition. However, in some situations, such as border areas of states or countries and places where different communities meet, a document may contain several languages. In such situations the document may be in a language the user is less fluent in, or a single document may contain more than one language. To recognize such documents, the existing methods first segment the different languages present in the document. In addition, these methods fail to segment the characters when a word contains characters of different languages. Hence, we feel that there is considerable scope for recognizing multilingual documents through character recognition by developing a single feature extraction scheme. To deal with multilingual documents we introduce the novel concept of developing a single OCR, rather than different OCRs through segmentation, for identifying different languages through character recognition. The organization of the paper is as follows. Related literature is provided in section 2. Motivation for the work is presented in section 3. Proposed methodologies are presented in section 4. Experimental results are reported in section 5. A comparative study of the proposed methods is given in section 6. Conclusion is given in section 7.

2. Related Literature Many researchers have addressed this problem in the literature in different contexts. The literature related to our work is given below. A technique to identify different script lines from multi-script documents is proposed in [7, 8]. In that work, the authors address the problem of developing an automatic technique for the identification of printed Roman, Chinese, Arabic, Devanagari and Bangla text lines from a single document. The proposed method uses script characteristics, shape-based features, and statistical features obtained from the concept of water overflow from a reservoir. However, the method works only at the text line level, not at the word or character level. In addition, the method fails to identify the text lines of south Indian language documents, since the structure of those text lines is almost similar.

The concept of a water reservoir for segmenting touching numerals is proposed in [10]. The authors developed a new technique for the automatic segmentation of unconstrained handwritten connected numerals. The scheme is mainly based on features obtained from the water reservoir concept. A reservoir is a metaphor for the region where numerals touch; it is obtained by considering the accumulation of water poured from the top or from the bottom of the numerals. By analyzing the reservoir boundary, the touching position and the topological features of the touching pattern, the best cutting point is determined. No normalization or thinning mechanism is used in this scheme. This principle has been used for the recognition of characters of different languages, but the method fails to identify characters as their number increases, and it achieves an accuracy of about 94.8%.

A technique for retrieving imaged document text without OCR is proposed in [3]. Documents are segmented into character objects, and image features, namely the Vertical Traverse Density (VTD) and Horizontal Traverse Density (HTD), are extracted. An n-gram based document vector is constructed for each document based on these features. The method is language independent and is particularly useful if document images are of similar fonts and resolution, such as in a newspaper corpus. A handwritten character string recognition system for Japanese mail address reading, with a very large vocabulary, is proposed in [2]. The address phrases are recognized as a whole because there is no extra space between words. The lexicon contains 111,349 address phrases, which are stored in a tree structure. In recognition, the text line image is matched with the lexicon entries to obtain a reliable segmentation and retrieve valid address phrases. This method works only for Japanese address documents of this kind. The recognition of distorted Kannada characters is addressed in [11]. The Hamming distance concept and the Viterbi algorithm were used for recognition. The method works only for minor distortions and fails when the number of characters is increased. A technique for OCR error detection and correction for a highly inflectional language (Bangla, the second-most popular language in India and fifth-most popular in the world) is presented in [9]. The technique is based on morphological parsing using two separate lexicons of root words and suffixes; it is limited to Bangla characters only. A modified region decomposition method and an optimal depth decision tree for the recognition of non-uniform sized characters are addressed in [6]. The method uses a standard sized rectangle that circumscribes standard sized characters; the rectangle is interpreted as a two-dimensional structure of nine parts, which are called bricks. The method is limited to printed Kannada characters only and, in addition, is found to be computationally expensive. An on-line character recognition method presented in [5] uses both directional features, otherwise known as off-line features, and direction-change features, which are designed as on-line features. The method works for on-line character recognition but fails for Kannada characters. A survey of feature extraction methods for character recognition is given in [1]. The authors give an overview of feature extraction methods for off-line recognition of segmented (isolated) characters. The feature extraction methods are discussed in terms of invariance properties, reconstructability, and the expected distortions and variability of the characters. The problem of choosing an appropriate feature extraction method for a given application is also discussed. We found that no algorithms are reported in that paper for recognizing the characters of south Indian languages. A survey of on-line and off-line handwriting recognition is given in [12]. The authors describe the nature of handwritten language, how it is transduced into electronic data, and the basic concepts behind written language recognition algorithms; the methods covered work only for English characters. A cursive script recognition method using wildcards and multiple experts is proposed in [15]. The method employs "wildcards" to represent missing letter candidates, and multiple experts are used to represent different aspects of handwriting. Each expert evaluates the closeness of a match and indicates its confidence, and explanation experts determine the degree to which the word alternative under consideration explains extraneous letter candidates. However, these algorithms are limited to English characters. A prototype extraction method for document-specific OCR systems is presented in [13]. The method automatically generates training samples from unsegmented text images and the corresponding transcripts, and is based on new algorithms for estimating character widths, character locations, and match and non-match probabilities from unsegmented text. In order to exploit this method in an operational environment, it must be augmented with a reliable font change detector; most applications also require some form of contextual correction based on linguistic properties.

3. Motivation for the Work The literature survey reveals that a considerable amount of research has gone into character recognition, and yet it remains an open research problem in computer science. Most of the reported algorithms deal with documents containing a single language, not documents containing different languages. We have also noticed that no algorithms have addressed the problem of segmenting south Indian script lines. Furthermore, it is found from the literature that there are algorithms to recognize documents containing text lines and words of different scripts, but no algorithms have been reported so far to recognize documents that contain mixed characters within a text line. In addition, [7] proposed that to develop a multi-script Optical Character Recognition system, it is necessary to separate the different script forms before feeding them to the corresponding script recognizers, since the development of a hybrid OCR for multiple scripts is more difficult than separate OCRs for individual scripts [7]. This is true because a segmentation step is involved in that process. This is the motivation for taking up this work, which is an attempt towards the development of a single OCR for the recognition of characters of different languages.

4. Proposed Methodology In this section we present a method based on a polar transformation, which maps the spatial coordinates of the character image into polar coordinates through X = R * cos(θ), Y = R * sin(θ), where R is the radius of the boundary circle (ref. Step 1 and Step 2 of Algorithm 1), θ = tan⁻¹(y/x), and x and y are the Cartesian coordinates of the corresponding black pixel of the character image. In this way we obtain the transformed image. For this transformed image we again obtain the radius of a boundary circle using Step 1 and Step 2 of Algorithm 1; this is taken as the radius r. Using this r, the proposed method draws five concentric circles of radii r/4, r/2, 3r/4, 4r/5 and r on the transformed image. From these circles a total of 15 features is extracted for each transformed image:

For the circle of radius r/4: S1 = B / N, S2 = W / N, S3 = B / Nb
For the circle of radius r/2: S4 = B / N, S5 = W / N, S6 = B / Nb
For the circle of radius 3r/4: S7 = B / N, S8 = W / N, S9 = B / Nb
For the circle of radius 4r/5: S10 = B / N, S11 = W / N, S12 = B / Nb
For the circle of radius r (outermost ring): S13 = B / N, S14 = W / N, S15 = B / Nb

where B is the number of black pixels in the corresponding circle of the transformed image, W is the number of white pixels in the corresponding circle, N is the total number of pixels (both black and white) in the corresponding circle, and Nb is the total number of black pixels present in the transformed image. These features are invariant to rotation and scaling transformations. To achieve scaling invariance we convert a scaled-up or scaled-down character into one standard form: if the character is scaled up relative to the standard form, the method brings it down to the standard form, and if it is scaled down, the method uses a normalization technique to bring it up to the standard form. These normalized character images are the input to the proposed method, after which the procedure continues with feature extraction and recognition.

Algorithm 1: Polar Transformation
Input: Character image
Output: Features S1 to S15
Method:
Step 1: For each black pixel in the character
  Step 1.1: Compute Cx = (ΣXi) / n, where the sum runs over i = 1 to n and n is the total number of black pixels present in the character
  Step 1.2: Compute Cy = (ΣYi) / n
For end
Step 1.3: Centroid = (Cx, Cy), where Cx and Cy are the coordinates of the centroid
Step 2: For each pixel
  Step 2.1: Find Xmin, Xmax, Ymin and Ymax of the character
  Step 2.2: Compute Diffx = ABS(Xmax – Xmin) and Diffy = ABS(Ymax – Ymin)
For end
Step 2.3: Compute Maxdiff = Max(Diffx, Diffy), the maximum of Diffx and Diffy
Step 2.4: Radius r = Maxdiff
Step 2.5: Draw a circle using the radius r
Step 3: The outer ring is drawn with radius r
Step 4: For each black pixel in the character, compute Xi = R * cos(θ) and Yi = R * sin(θ), where θ = tan⁻¹(y/x), and x and y are the Cartesian coordinates of the i-th black pixel
For end
Step 5: For each black pixel in the polar coordinates
  Step 5.1: Draw the boundary circle for the transformed image using Step 1, Step 2 and Step 3 of Algorithm 1
For end
Step 6: Fit a circle of radius r/4
Step 7: Fit a circle of radius r/2
Step 8: Fit a circle of radius 3r/4
Step 9: Fit a circle of radius 4r/5
Step 10: Fit a circle of radius r
Step 11: Compute the feature values S1 to S15 as defined above
Polar Transformation ends.
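To make the feature extraction concrete, the following is a minimal Python sketch of the ring-based features of Algorithm 1. It is an illustration under stated assumptions, not the authors' implementation: the character is assumed to be a binary NumPy array with black (foreground) pixels equal to 1, each pixel's radius about the centroid is computed directly instead of rendering a separate transformed image, the five circles are treated as annular rings, and the helper name ring_features is ours.

```python
import numpy as np

def ring_features(char_img):
    """Sketch of Algorithm 1: the 15 ring features S1..S15 of a binary character.

    char_img : 2-D NumPy array, black (foreground) pixels == 1.
    """
    ys, xs = np.nonzero(char_img)          # coordinates of the n black pixels
    n = len(xs)
    if n == 0:
        return [0.0] * 15
    cx, cy = xs.mean(), ys.mean()          # Step 1: centroid (Cx, Cy)

    # Step 2: bounding extent of the character and radius of the boundary circle
    diff_x = xs.max() - xs.min()
    diff_y = ys.max() - ys.min()
    r = float(max(diff_x, diff_y))         # Radius r = Maxdiff

    # Step 4: radial coordinate of every black pixel about the centroid
    rho = np.hypot(xs - cx, ys - cy)

    # Steps 6-11: five concentric rings of radii r/4, r/2, 3r/4, 4r/5 and r
    radii = [r / 4, r / 2, 3 * r / 4, 4 * r / 5, r]
    features = []
    inner = 0.0
    for outer in radii:
        b = np.count_nonzero((rho > inner) & (rho <= outer))   # B: black pixels in ring
        # N: total pixels in the ring, approximated here by its area in pixels
        n_total = max(np.pi * (outer ** 2 - inner ** 2), 1.0)
        w = n_total - b                                          # W: white pixels in ring
        features += [b / n_total, w / n_total, b / n]            # B/N, W/N, B/Nb
        inner = outer
    return features
```

For a binary character image img (values 0/1), ring_features(img) returns the 15-element vector [S1, ..., S15] used in the recognition step.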

4.1 For Thick Characters In this section we apply Algorithm 1 to thick characters, and the features are extracted for each thick character. We scanned a total of 291 characters to recognize, which include alphabets of south Indian languages such as Kannada, Telugu, Tamil and Malayalam. The procedure is depicted in Fig. 1 (a)–(d).

Fig. 1 (a)   Fig. 1 (b)   Fig. 1 (c)   Fig. 1 (d)

Fig. 1 (a) represents a character in Cartesian coordinates, and Fig. 1 (b) shows the circle fitted to the character in Cartesian coordinates. Fig. 1 (c) is the result of the polar transformation, and Fig. 1 (d) shows the circles fitted to the transformed image to extract features.

4.2 For Thinned Characters In this section we employ Algorithm 1 to extract features for thinned characters. The thinned characters are obtained by thinning the characters with the algorithm given in [4].
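The rule-based thinning algorithm of [4] is not reproduced here; purely as a stand-in for experimentation, a generic morphological skeletonization from scikit-image yields a comparable one-pixel-wide character, as sketched below (the helper name thin_character is ours and this is not the method used in [4]).

```python
import numpy as np
from skimage.morphology import skeletonize

def thin_character(char_img):
    """Return a one-pixel-wide version of a binary character image.

    char_img : 2-D array, foreground (black) pixels == 1.
    Note: skeletonize is a generic thinning routine, used here only as a
    stand-in for the rotation-invariant rule-based algorithm of [4].
    """
    return skeletonize(char_img.astype(bool)).astype(np.uint8)
```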

Fig. 2 (a)   Fig. 2 (b)   Fig. 2 (c)   Fig. 2 (d)

Fig. 2 (a) represents a thinned character image in Cartesian coordinates, and Fig. 2 (b) shows the circle drawn in Cartesian coordinates. Fig. 2 (c) shows the result of the polar transformation, and Fig. 2 (d) shows the concentric circles drawn to extract features.

5. Experimental Results In this section we consider a data set of different languages, namely Kannada, Telugu, Tamil, Malayalam, Persian alphanumeric characters, English uppercase letters, English lowercase letters and numerals. The total number of characters is 291, and the data were drawn from the same source for the purpose of experimentation. The recognition accuracy and the number of computations involved for the distance measure based method [16] and the polar transformation based method, on both the thick and thinned data sets, are tabulated in Table 1 and Table 2. The accuracy is calculated over the 291 characters; a method that identifies all 291 characters correctly achieves 100% accuracy. Fig. 3 and Fig. 4 show different rotations of a Malayalam character of both the thick and thinned type.
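The paper does not spell out the matching step between an input character's features and the stored ones, beyond the linear and binary searches over stored nodes reported in Section 6. A simple scheme consistent with the reported accuracy computation is nearest-template matching over the 15 features, sketched below; the function names, the template store and the use of Euclidean distance for matching are our assumptions.

```python
import numpy as np

def recognize(features, templates, labels):
    """Nearest-template matching on the 15 ring features (an assumed scheme).

    features  : 1-D array of 15 values for the unknown character
    templates : (num_templates, 15) array of stored feature vectors
    labels    : list of character labels, one per template row
    """
    dists = np.linalg.norm(templates - features, axis=1)   # Euclidean distance
    return labels[int(np.argmin(dists))]

def accuracy(samples, truths, templates, labels):
    """Fraction of test characters (e.g. the 291 used here) recognized correctly."""
    hits = sum(recognize(f, templates, labels) == t
               for f, t in zip(samples, truths))
    return hits / len(truths)
```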

Fig. 3 Malayalam Character with Different Rotations: 0°, 10°, 20°, 35°.

Fig. 4 Malayalam Thinned Character with Different Rotations: 0°, 10°, 20°, 35°.

The features (S1 to S15) of the polar transformation based method for different rotations (ref. Fig. 3 and Fig. 4) and scalings (ref. Fig. 13) of the thick and thinned character images are represented as line graphs in Fig. 5 to Fig. 19. From these graphs we observe that the method based on polar transformations is invariant to rotation and scaling for both the thick and thinned data sets.
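This invariance observation can be checked numerically: rotate a character image, recompute the 15 features, and compare the two vectors. The sketch below assumes the ring_features helper from Section 4 and uses SciPy's rotate purely for illustration; the angles correspond to the rotations shown in Fig. 3 and Fig. 4.

```python
import numpy as np
from scipy.ndimage import rotate

def invariance_gap(char_img, angle_degrees):
    """Largest absolute difference between the 15 features of a character and
    of its rotated copy; values near zero indicate rotation invariance.
    Assumes the ring_features sketch from Section 4."""
    rotated = (rotate(char_img, angle_degrees, reshape=True, order=0) > 0).astype(np.uint8)
    f_original = np.array(ring_features(char_img))
    f_rotated = np.array(ring_features(rotated))
    return float(np.max(np.abs(f_original - f_rotated)))

# e.g. invariance_gap(img, 35) for the 35-degree rotation of Fig. 8 / Fig. 12
```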

[Fig. 5 to Fig. 19 plot the feature values S1 to S15 as line graphs; only the figure captions are reproduced here.]
Fig. 5 0° Thick Character - PT
Fig. 6 10° Thick Character - PT
Fig. 7 20° Thick Character - PT
Fig. 8 35° Thick Character - PT
Fig. 9 0° Thinned Characters - PT
Fig. 10 10° Thinned Characters - PT
Fig. 11 20° Thinned Characters - PT
Fig. 12 35° Thinned Characters - PT
Fig. 13 (a), (b), (c)
Fig. 14 100 dpi Thick Character - PT
Fig. 15 200 dpi Thick Character - PT
Fig. 16 400 dpi Thick Character - PT
Fig. 17 100 dpi Thinned Character - PT
Fig. 18 200 dpi Thinned Character - PT
Fig. 19 400 dpi Thinned Character - PT
Note: PT denotes Polar Transformations

6. Comparative Study In order to compare the performance of the proposed methods we have considered accuracy, number of computations, time using linear search tree, time using binary search tree, nodes visited using linear search, nodes visited using binary search, and invariance property as decision parameters.

Table 1 Experimental Results for the Thick Data Set Using All the Methods

Method                  Linear Search              Binary Search              Accuracy   Computations
                        Time       Nodes Visited   Time       Nodes Visited
Euclidean Distance      2.07 min   42486           1.53 min   11250           100%       37910
City-Block Distance     2.05 min   41636           1.55 min   11001           98%        37910
Polar Transformation    5.05 min   27615           4.16 min   7537            65%        81480

Table 2 Experimental Results for the Thinned Data Set Using All the Methods

Method                  Linear Search              Binary Search              Accuracy   Computations
                        Time        Nodes Visited  Time        Nodes Visited
Euclidean Distance      28.31 min   42486          24.43 min   11250          100%       46147
City-Block Distance     27.12 min   41236          25.13 min   10900          97%        46147
Polar Transformation    30.31 min   28890          28.06 min   7674           67%        36375

Table 3 Comparative Study of the Proposed Methods for Both Thick and Thinned Data Sets

Parameter                          Name of the Method
Accuracy                           Euclidean Distance
Computations                       Euclidean Distance and City-Block Distance
Time (Linear Search)               City-Block Distance
Time (Binary Search)               City-Block Distance
Nodes Visited (Linear Search)      Polar Transformation
Nodes Visited (Binary Search)      Polar Transformation
Invariance (Rotation)              Polar Transformation
Invariance (Scaling)               Polar Transformation

From Table 3 we observe that, in terms of accuracy, the method based on the Euclidean distance measure gives 100% recognition and is therefore better than the polar transformation based method. In terms of the number of computations, the methods based on distance measures are better than the polar transformation based method. In terms of time, for both the linear search tree and the binary search tree, the method based on city-block distance is better than the other methods. In terms of nodes visited, the method based on polar transformation is better than the other methods, and in terms of invariance, the polar transformation based method is better than the distance measure based methods. Similar conclusions are drawn for the thinned data set.

7. Conclusion We have presented an invariant algorithm for the recognition of the alphabets of different languages. The proposed method is compared with the method based on distance measures [16]. Based on the experimental results, we find that the method based on distance measures is better than the polar transformation based method in terms of the number of computations and accuracy. However, the method based on polar transformations is better than the distance measure based method in terms of the invariance property.

Hence there is scope for the further development of a new algorithm which gives better accuracy as well as invariance to image transformations.

References
[1] Anil K. Jain et al., 'Feature Extraction Methods for Character Recognition – A Survey'. Pattern Recognition, Vol. 29, No. 4, pp. 641-662, 1996.
[2] Cheng-Lin Liu et al., 'Lexicon-Driven Segmentation and Recognition of Handwritten Character Strings for Japanese Address Reading'. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 11, 2002.
[3] Chew Lim Tan et al., 'Imaged Document Text Retrieval Without OCR'. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 6, 2002.
[4] Maher Ahmed and Rabab Ward, 'A Rotation Invariant Rule-Based Thinning Algorithm for Character Recognition'. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 12, December 2002.
[5] Masayoshi Okamoto and Kazuhiko Yamamoto, 'On-line Handwriting Character Recognition Using Direction-Change Features that Consider Imaginary Strokes'. Journal of the Pattern Recognition Society, Vol. 32, pp. 1115-1128, 1999.
[6] Nagabhushan P. and Radhika M. Pai, 'Modified Region Decomposition Method and Optimal Depth Decision Tree in the Recognition of Non-uniform Sized Characters – An Experimentation with Kannada Characters'. Journal of the Pattern Recognition Society, Vol. 20, pp. 1467-1475, 1999.
[7] Pal U. and Chaudhuri B. B., 'Identification of Different Script Lines from Multi-script Documents'. Image and Vision Computing, Vol. 20, pp. 945-954, 2002.
[8] Pal U. and Chaudhuri B. B., 'Machine-printed and Hand-written Text Lines Identification'. Pattern Recognition Letters, Vol. 22, pp. 431-441, 2001.
[9] Pal U. et al., 'OCR Error Correction of an Inflectional Indian Language Using Morphological Parsing'. Journal of Information Science and Engineering, Vol. 16, pp. 903-922, 2000.
[10] Pal U. et al., 'Touching Numeral Segmentation Using Water Reservoir Concept'. Pattern Recognition Letters, Vol. 24, pp. 261-272, 2003.
[11] Rammohan T. R. and Chatterji B. N., 'Recognition of Distorted Kannada Characters'. Institution of Electronics and Communication Engineers, Vol. 30, No. 6, pp. 223-225, 1984.
[12] Rejean Plamondon and Sargur N. Srihari, 'On-Line and Off-Line Handwriting Recognition: A Comprehensive Survey'. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 1, January 2000.
[13] Yihong Xu and George Nagy, 'Prototype Extraction and Adaptive OCR'. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 21, No. 12, 1999.
[14] Rafael C. Gonzalez and Richard E. Woods, 'Digital Image Processing'. Pearson Education, 2003.
[15] Hennig A. and Sherkat N., 'Cursive Script Recognition Using Wildcards and Multiple Experts'. Pattern Analysis and Applications, Vol. 4, pp. 51-60, 2001.
[16] Hemantha Kumar G. et al., 'Feature Extraction for Alphanumeric Symbols Recognition: An Approach Based on Distance Measures'. Proceedings of the 1st Indian International Conference on Artificial Intelligence (IICAI-03), Hyderabad, India, December 18-20, 2003.