Thai OCR: A Neural Network Application - TENCON '96 ... - IEEE Xplore

1996 IEEE TENCON - Digital Signal Processing Applications

Thai OCR: A Neural Network Application
Chularat Tanprasert and Thaweesak Koanantakool
National Electronics and Computer Technology Center (NECTEC), National Science and Technology Development Agency, Ministry of Science, Technology and Environment, Bangkok, THAILAND 10400

Thai characters to be more difficult. To be practical, Thai OCR should also be able to recognize English. The total number of characters to be recognized for Thai and English OCR together would then be 78 + 26 + 26 + 10 = 140 characters. Since this number is quite large, it is hard to obtain a good recognition engine for it, which makes the problem a challenging one. In this paper, however, we focus only on printed Thai characters. Mixed English and Thai OCR is currently under research and development at NECTEC, Thailand.

ABSTRACT: Thai Optical Character Recognition (Thai OCR) is one of the most desirable computer applications in Thailand at present. Much research has therefore been conducted on the topic. Although several techniques have been proposed for solving the problem, none seems to produce a satisfactory practical result. Many limiting factors are encountered, such as the inconsistency of the scanning process, incomplete and/or noisy original documents, and the shift and position variance of the recognition technique. We propose to apply artificial neural networks (ANNs) together with some pre-processing and post-processing techniques to solve the Thai OCR problem. The experimental results confirm that ANNs are a very suitable technique for developing Thai OCR software. The recognition rate on real documents in the training fonts is about 90% - 95%. This leads to a possible implementation in the production-quality OCR software that the NECTEC software technology laboratory is working on. The details of all processes are explained in the paper.

II. Thai OCR Processes
In a real application, Thai OCR software accepts a document image as input and produces a text file as output. We are therefore required to develop not only the recognition process but also the pre-processing and post-processing parts. The processes of Thai OCR are shown in Figure 3. The pre-processing step adapts the input image into a normalized shape. It consists of noise cleaning, page alignment, line and character segmentation, thinning, and normalization. These techniques transform the input document image into a set of normalized, isolated Thai characters, which are used to train and/or test the ANNs. The ANNs are trained to learn all 78 Thai characters and, if training succeeds, the resulting set of weights is used for the testing process. After obtaining the result from the ANNs, we may apply post-processing to increase the recognition rate of the text files by applying rules for combining Thai words and a Thai dictionary. We emphasize the pre-processing stage in this paper. Pre-processing is one of the most significant processes of the Thai OCR software. We apply several techniques to adapt and adjust the raw input images so that they are suitable for the ANNs. The input document is scanned into an input image in the “tif” format, which is transformed into a bitmap, i.e. (0,1) format. The noise cleaning technique is then performed to erase unwanted “on” pixels from the image. If the input document does not pass upright through the scanner, or some mistake just before scanning causes the image to be skewed, the alignment process is used to adjust the direction of the page. Later, the line and character segmentation

I. Introduction
In recent years, Artificial Neural Networks (ANNs) have been applied to various kinds of pattern recognition. The mission of ANNs is to try to simulate human behavior. Reading is one of the most interesting activities that humans can learn to do and that computers still cannot do well, especially in the Thai language. Many reports of character recognition by ANNs have been published for languages such as English [1], Korean [2], Arabic [3], and Chinese [4]. However, only a few studies have been conducted for Thai OCR, for example [5] and [6], which applied the statistical pattern recognition approach to find a solution for Thai OCR. The results were successful with some limitations. Thai is a sophisticated language in its visual representation. Thai letters are composed of curves, zigzags, and circles. Some of them look very similar to each other, such as n fil s ) , v cu v c11 U M W d d d H, fl R fl BI. All 78 Thai characters are illustrated in Figure 1. Thai is written from left to right, with lines filling the page from top to bottom. In each line, Thai characters can be composed in four levels, depending on the type of character being written (see Figure 2). The multilevel composition causes the recognition process of

0-7803-3679-8/96/$5.00 © 1996 IEEE

algorithms are applied to the input image at this stage to separate the characters from one another. We thus obtain a set of isolated Thai characters in different sizes. The details of the alignment and the line and character segmentation algorithms are described in [7]. Since the ANN must receive an input image of a fixed size, a normalization process is required to bring each isolated character to the same image size. There are two major issues to consider at this stage: the suitable fixed size of the input matrix for the ANN, and the method of transforming the original character into the normalized one. From our study of the characteristics of printed Thai characters, we found that a matrix of size 8 x 23 can fit any Thai character. However, as can be seen in Figure 1, some Thai characters are fat (la a), thin (111 a), without a tail (0 6u n), or with a tail (9 “II H). Therefore, only some parts of the 8 x 23 matrix are used by each character. Furthermore, an 8 x 23 matrix is quite a large input for ANNs and would waste a lot of computation time and memory. The original input images have been transformed into the shape of an 8 x 23 matrix, and the horizontal histogram of all Thai characters at this input size is illustrated in Figure 4. The Y axis represents the average number of “on” pixels in each input row for all 78 Thai characters in ten Thai fonts and eight different point sizes. The X axis represents the vertical level of the input image, which has a maximum of 23. Figure 4 shows that the last 13 rows have only a few “on” pixels, so it may be better to cut those rows out of the image matrix in order to save computation time and space. We are interested in experimenting with input images of size 8 x 8, 8 x 12, and 8 x 16. In the experiments, we transformed the original input images into those input sizes and then trained and tested them with ANNs.
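The row histogram described above can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the function names `row_histogram` and `rows_to_keep`, the threshold `min_avg_on`, and the toy data are hypothetical and do not come from the paper, which does not say how the rows to cut were selected.

```python
import numpy as np

def row_histogram(images):
    """Average number of "on" pixels per row over a stack of binary images.

    images: array of shape (n_images, n_rows, n_cols) holding 0/1 values.
    """
    images = np.asarray(images)
    # Count "on" pixels in each row, then average over all images.
    return images.sum(axis=2).mean(axis=0)

def rows_to_keep(hist, min_avg_on=0.5):
    """Indices of rows whose average "on" count reaches the threshold.

    The threshold is a hypothetical choice made for this sketch.
    """
    return np.flatnonzero(hist >= min_avg_on)

# Toy data: two 23-row by 8-column characters that only use the top 10 rows.
chars = np.zeros((2, 23, 8), dtype=np.uint8)
chars[0, 0:10, 2:6] = 1
chars[1, 2:8, 1:7] = 1

hist = row_histogram(chars)   # one value per row, 23 values in total
keep = rows_to_keep(hist)     # rows that carry enough "on" pixels
cropped = chars[:, keep, :]   # image stack without the nearly empty rows
```

With real data, `hist` would correspond to the curve in Figure 4, and the choice of threshold would decide how many trailing rows are dropped.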
The second issue is how to map the original input to the fixed-size input matrix. For a small input matrix, we may transform the input in two ways. Technique I stretches the input to fill both the width and the height of the new matrix. Technique II scales the input to fit the frame while preserving the aspect ratio of the original character. Examples of these two techniques are shown in Figure 5. We performed experiments with both techniques. The neural network used in our experiments is the multi-layered perceptron combined with the back-propagation learning algorithm. It has been applied to a few Thai pattern recognition problems [8][9] with promising performance. We first prepare the training patterns and then forward them to

be trained by the ANN; if the learning converges, we save the weights and use them in testing. Finally, we apply the post-processing step to fix some ambiguous characters. This part will be emphasized in our future research on Thai OCR software.
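The two normalization techniques described in this section (Technique I, which fills the frame, and Technique II, which preserves the aspect ratio) can be sketched as follows. The paper does not state the resampling method or where Technique II places the scaled character inside the frame, so nearest-neighbour resampling and centring are assumptions of this sketch, and all function names are hypothetical.

```python
import numpy as np

def resize_nearest(img, out_rows, out_cols):
    """Nearest-neighbour resampling of a 2-D binary image (assumed method)."""
    img = np.asarray(img)
    rows, cols = img.shape
    # Map every output row/column back to a source row/column.
    r_idx = (np.arange(out_rows) * rows) // out_rows
    c_idx = (np.arange(out_cols) * cols) // out_cols
    return img[np.ix_(r_idx, c_idx)]

def technique_1(img, size=8):
    """Technique I: stretch the character to fill the whole size x size frame."""
    return resize_nearest(img, size, size)

def technique_2(img, size=8):
    """Technique II: scale to fit the frame while keeping the aspect ratio.

    Centring the scaled character in the frame is an assumption.
    """
    img = np.asarray(img)
    rows, cols = img.shape
    scale = size / max(rows, cols)
    new_rows = max(1, round(rows * scale))
    new_cols = max(1, round(cols * scale))
    scaled = resize_nearest(img, new_rows, new_cols)
    frame = np.zeros((size, size), dtype=img.dtype)
    top = (size - new_rows) // 2
    left = (size - new_cols) // 2
    frame[top:top + new_rows, left:left + new_cols] = scaled
    return frame

# A tall, thin 16 x 4 character (all pixels "on" for simplicity).
tall = np.ones((16, 4), dtype=np.uint8)
full = technique_1(tall)   # fills the whole 8 x 8 frame
fit = technique_2(tall)    # becomes an 8 x 2 strip centred in the frame
```

The toy example shows the practical difference: Technique I makes thin and fat characters look equally wide, while Technique II keeps their proportions at the cost of leaving part of the frame empty.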

III. Experimental Results
The Thai OCR software at the NECTEC software technology laboratory is developed under Microsoft Windows. We used SNNS (Stuttgart Neural Network Simulator) [10] on a 90 MHz Pentium to simulate the neural network model. The experiments start with the creation of input documents, which are transformed into training patterns. We train the 78 Thai characters shown in Figure 1 in ten different fonts and eight different sizes for each font. An example Thai sentence in each font is illustrated in Figure 5. The input documents are created in sizes from 8 points to 22 points, in normal, italic, bold, and bold italic styles. The total number of characters in each font is 2,496. The documents are scanned and saved as image files, and then passed through the alignment process and the line and character segmentation process. Isolated Thai characters are obtained at this stage, normalized to fixed-size inputs, and forwarded to SNNS for training. The multi-layered perceptron network with the back-propagation learning algorithm is applied in our experiments. The number of input neurons depends on the size of the input matrix, and the number of hidden neurons also depends on each network. The total number of output neurons is 78. We experimented with a binary encoding for the output neurons, but it gave worse results than using one output neuron per character. In the training process, we also applied a validation process to control the stopping point of training. We consider four sizes of input matrix: 8x23, 8x16, 8x12, and 8x8. We divided the input into two groups with five fonts in each group, except for the input size 8x23, which we divided into three groups with three fonts in each group because of the memory limitation of our computer system.
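As a small worked example of the figures in this section, the sketch below reproduces the per-font pattern count and contrasts the two output encodings that were compared. The paper does not specify the exact binary code, so writing the class index in base 2 is an assumption, and the function names are hypothetical.

```python
import math
import numpy as np

N_CLASSES = 78   # Thai characters shown in Figure 1
N_SIZES = 8      # point sizes per font
N_STYLES = 4     # normal, italic, bold, bold italic

# 78 characters x 8 sizes x 4 styles matches the 2,496 characters per font
# stated in the text.
patterns_per_font = N_CLASSES * N_SIZES * N_STYLES

def one_hot(label, n_classes=N_CLASSES):
    """One output neuron per character: exactly one of 78 targets is 1."""
    target = np.zeros(n_classes, dtype=np.float32)
    target[label] = 1.0
    return target

def binary_code(label, n_classes=N_CLASSES):
    """Binary output encoding: the class index written in base 2.

    This particular code is an assumption of the sketch.
    """
    n_bits = math.ceil(math.log2(n_classes))   # 7 bits are enough for 78 classes
    bits = [(label >> b) & 1 for b in reversed(range(n_bits))]
    return np.array(bits, dtype=np.float32)

def decode_one_hot(outputs):
    """Pick the character whose output neuron responds most strongly."""
    return int(np.argmax(outputs))
```

The binary code needs only 7 output neurons instead of 78, but a single wrong bit maps to a different character, which may be one reason the one-neuron-per-character representation performed better in the experiments.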
The ten Thai fonts in our experiments are AngsanaUPC (A), BrowalliaUPC (B), CordiaUPC (C), DilleniaUPC (D), EucrosiaUPC (E), FreesiaUPC (F), IrisUPC (I), JasmineUPC (J), SV Busaba (S), and TS Burrirum (T). Each set of patterns is named using the first letter of each font's name. The recognition rates are shown in Table 1. The recognition rates of all sizes for the training patterns are very close, at around 99%, and for

Neural Networks, pp. 3101-3106, Perth, Australia, 1995.
[3] I.S.I. Abuhaiba and S.A. Mahmoud, “Recognition of handwritten cursive Arabic characters,” PAMI, Vol. 16, No. 6, pp. 664-672, June 1994.
[4] D.S. Yeung, “A neural network recognition system for handwritten Chinese character using structure approach,” Proceeding of the World Congress on Computational Intelligence, Vol. 7, pp. 4353-4358, USA, June 1994.
[5] Chom Kimpan and Somsak Walairacht, “Thai Characters Recognition,” Proceedings of the Symposium on Natural Language Processing in Thailand, March 1993, pp. 196-276.
[6] Pipat Hiranvanichakorn and Monlada Boonsuwan, “Recognition of Thai Characters,” Proceedings of the Symposium on Natural Language Processing in Thailand, March 1993, pp. 123-166.
[7] Chularat Tanprasert, Wasin Sinthupinyo, and Premnath Dubey, “On the Printed Thai Optical Character Recognition Software Project,” 7th NECTEC (The National Electronics and Computer Technology Center) Annual Conference, Bangkok, Thailand, May 1996. (Paper is in Thai.)
[8] Chularat Khunasaraphan and Chidchanok Lursinsap, “Simulated Light Sensitive Model for Thai Handwritten Alphabets Recognition,” Artificial Neural Networks in Engineering (ANNIE), U.S.A., November 1993.
[9] Thitipong and Chularat Tanprasert, “Variable Simulated Light Sensitive Model for Handwritten Thai Digit Recognition,” The Second Symposium on Natural Language Processing, Bangkok, Thailand, August 1995.
[10] SNNS (Stuttgart Neural Network Simulator), University of Stuttgart, Institute for Parallel and Distributed High Performance Systems (IPVR), User Manual, Version 3.2, Report No. 3/94.

testing patterns, the recognition rate is almost the same. In this case, we conclude that we should use the input size 8x8, because it gives the same level of recognition rate with the smallest input matrix, so we can reduce the time and space complexity of the whole process. Even though we selected an 8x8 input matrix for the ANN, the original isolated character can be transformed into the normalized one in two ways, technique I and technique II, as described in the previous section. The recognition rates of these two techniques for input size 8x8 are also shown in Table 1. The recognition rates of the two transforming techniques are close to each other; only the recognition rates on testing patterns seem to be slightly better for technique II. By examining the mean square error (MSE) of both techniques, we found that the MSE of technique II is lower for both the training and validating sets. The different initial sets of weights may be the reason; more research on this topic is required. In addition, we applied both transforming techniques to a real test document. It contains every Thai character, and the total number of characters is about 1540. The performance on each tested font is shown in Table 2. Transformation technique I gives a slightly higher recognition rate for most of the ten fonts. At this point, we believe that both techniques give almost the same level of recognition performance. More research results are required in order to decide which transforming technique to apply permanently in this software.
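The two measures compared in this section, recognition rate and mean square error, can be sketched as below. The paper does not give their formulas (and SNNS may normalize its error differently), so the definitions here are the common textbook ones and should be read as assumptions. The total of 12,480 test patterns is likewise an inference (five fonts times 2,496 characters per font), not a figure stated in the paper.

```python
import numpy as np

def recognition_rate(outputs, labels):
    """Percentage of patterns whose strongest output neuron matches the label."""
    predicted = np.argmax(np.asarray(outputs), axis=1)
    return 100.0 * float(np.mean(predicted == np.asarray(labels)))

def mean_square_error(outputs, targets):
    """Mean of squared differences over all patterns and output neurons."""
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.mean((outputs - targets) ** 2))

# Toy example with 3 classes and 4 patterns.
labels = [0, 1, 2, 1]
targets = np.eye(3)[labels]
outputs = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.6, 0.1, 0.3],   # misclassified: the true class is 2
    [0.1, 0.8, 0.1],
])
rate = recognition_rate(outputs, labels)
mse = mean_square_error(outputs, targets)

# Consistency check against Table 1, assuming 5 fonts x 2,496 patterns per set:
# 12,372 recognized patterns out of 12,480 gives the reported 99.13%.
assumed_total = 5 * 2496
table_rate = round(100 * 12372 / assumed_total, 2)
```

In the toy example three of the four patterns are recognized, so `rate` is 75.0. A lower MSE with a similar recognition rate, as reported for technique II, means the outputs sit closer to their targets without changing which neuron wins.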

IV. Conclusion
The experiments have illustrated that the artificial neural network concept can be applied successfully to the Thai optical character recognition problem. Many factors affect the performance of the developed Thai OCR software. We can conclude at this moment that an input matrix of size 8x8 gives a better result than the other cases. The recognition rate of the software on real Thai documents is quite high, as shown in the experimental part. However, other kinds of pre-processing and neural network models may be tested in future research to obtain a better recognition rate in our current Thai OCR software.

V. References

[1] A. Rajawelu, M.T. Husavi, and M.V. Shirvakar, “A neural network approach to character recognition,” IEEE Transactions on Neural Networks, Vol. 2, pp. 307-393, 1989.
[2] Kee Chul Jung, Sang Kyoon Kim and Hang Joon Kim, “Recognition-based Segmentation of On-line Cursive Korean Characters,” Proceeding of the International Conference on

Figure 1. Set of Thai alphabets (consonants, vowels, tone marks, and special symbols).

Figure 2. The four levels (level 1 to level 4) in a Thai sentence.

Figure 3. Thai OCR processes. Preprocessing: page alignment, segmentation, normalization, thinning. Postprocessing: Thai dictionary.


Figure 5. Outputs from the two transformation techniques (panels: original, normalized character with Technique I, normalized character with Technique II).

Figure 4. Histogram of the average number of "on" pixels in each row of the input matrix of size 8 x 23 (X axis: row of input matrix).

Figure 5. Example of a Thai sentence in the 10 Thai fonts used in the experiments: AngsanaUPC, BrowalliaUPC, CordiaUPC, DilleniaUPC, EucrosiaUPC, FreesiaUPC, IrisUPC, JasmineUPC, SB Busaba, TS Burrirum.


Table 1. Recognition rates of ANNs that use different input sizes.
Columns: Input Size; Training Set; MSE(T); Validating Set; MSE(V); Testing Set; # Recognized Patterns; % of Recognition.
Training Set IJBST; MSE(T) 0.01838; Validating Set ACDEF.
Training Set ACDEF; MSE(T) 0.01684; Validating Set IJBST.
MSE(V) 0.32194; Testing Sets ACDEF, IJBST; # Recognized Patterns 10030, 12372; % of Recognition 80.36, 99.13.
Input Size 8x16; ACDEF; 0.46599; 12414; 99.47.

Table 2. Recognition rates of the two transforming techniques.
Columns: Fonts; % of Recognition Rate of Technique I (ACDEF's weight set, IJBST's weight set); % of Recognition Rate of Technique II (ACDEF's weight set, IJBST's weight set).
