Project report on
Nepali Optical Character Recognition
NIRAJAN PANT
Department of Computer Science and Engineering
Kathmandu University
Dhulikhel, Nepal
A report submitted in fulfillment of the 6-credit course, Master of Technology in Information Technology
Supervisor
Asst. Prof. Dr. Bal Krishna Bal
Department of Computer Science and Engineering
Kathmandu University
December 2015
Abstract

This work is an extension of my 3-credit project work, in which I studied the literature on Devanagari OCR and the challenges in implementing it. In this report, I present a unique approach to Nepali Optical Character Recognition (NOcr). NOcr uses OCR techniques specific to recognition of the Nepali language. Performing segmentation and recognition on Nepali text image documents is more difficult than on Latin text because of its cursive nature and features such as modifiers and the header line. To improve the performance and accuracy of Nepali character recognition, NOcr implements a hybrid approach: a combination of a holistic approach and a dissection approach guided by recognition. NOcr is trained to recognize whole words as well as basic and compound characters. A two-phase recognition model is implemented: in phase 1, the text image document is segmented into words and word-level recognition is performed; in phase 2, character-level segmentation and recognition are performed on poorly recognized words. Both recognition models have been implemented and trained in this work. The error in character-level segmentation was found to be 11.61%. The two factors that affected character-level segmentation are broken characters (which result in over-segmentation) and shadow characters (which are the result of under-segmentation). The accuracy of the word recognition module was found to be 81.44%.
Contents

Chapter 1 Proposed Framework
  1.1 Overview
  1.2 Training
    1.2.1 Data Set Generation
    1.2.2 Feature Extraction
  1.3 Recognition
    1.3.1 Line and Word Segmentation
    1.3.2 Character Segmentation
Chapter 2 Hypothesis
Chapter 3 Test and Result
  3.1 Segmentation
  3.2 Recognition
    3.2.1 Word Level Recognition
    3.2.2 Character Level Recognition
Chapter 4 Future Work
Chapter 5 Conclusion
Bibliography
List of Figures

Figure 1: Proposed Framework of Nepali OCR
Figure 2: Steps of DataSet Generation
Figure 3: Feature Extraction
Figure 4: Nepali text words as blobs
Figure 5: Blob detection – words in improper order
Figure 6: Blob detection – words arranged in proper order
Figure 7: Character Segmentation
List of Tables

Table 1: Character Segmentation Results
Table 2: Word Recognition Results
Chapter 1 Proposed Framework

1.1 Overview
Research on Optical Character Recognition suggests that the segmentation process cannot achieve full accuracy because of noise, touching characters, compound characters, variation in typefaces, and many similar-looking characters. In the domain of non-Latin scripts, the presence of language-specific constructs such as the dika (Devanagari header line), modifiers (South-East Asian scripts), writing order, or irregular word spacing (Arabic and Chinese) requires different approaches to segmentation (Agrawal, Ma, & Doermann, 2010). Devanagari script also possesses constructs of its own that differ entirely from those of Latin script. The most practiced character dissection method for Devanagari works by removing the header line (dika) and separating the lower and upper modifiers. This makes it easy to extract the basic characters but increases the complexity of extracting the modifiers: the modifiers get broken, and it is difficult to note their position in the sequence of segmented characters and to restore their original shape.

[Figure 1: Proposed Framework of Nepali OCR. Components: Nepali Text Files, Tokenizer, Generated Tokens, WordList Generator, Compound Character Separator, Image Data Set Generator, Image DataSet, Feature Extractor, Training Feature Set, Input Image, Word Segmentation, Recognizer, Calculate Classification Confidence (Conf > Th → Post Processor; Conf < Th → Segment word into Character/Ligature). *Conf – Confidence, **Th – Threshold Value]

To minimize the overhead of character-level segmentation and the errors due to inaccurate dissection, I have proposed a hybrid approach that combines the holistic method and the dissection method. Holistic methods follow a two-step scheme: 1) the first step performs feature extraction, and 2) the second step performs global recognition by comparing the representation of the unknown word with those of the references stored in the lexicon (Chaudhuri & Pal, 1997). Kompalli et al. (Kompalli, Setlur, & Govindaraju, 2009) have also proposed a novel graph-based recognition-driven segmentation methodology for Devanagari script OCR using a hypothesize-and-test paradigm, which is promising and inspiring work for a hybrid approach to OCR. Bag and Harit have likewise highlighted the need for new approaches, because the problems are unique to Devanagari and Bangla and hence the solutions adopted by OCR systems for other scripts cannot be directly adapted to these scripts (Bag & Harit, 2013).

The proposed framework uses two-phase recognition:

Phase 1: Segment the input text image into words and recognize the words using the holistic approach. Measure the confidence of classification. If the confidence is lower than a threshold, go to Phase 2.

Phase 2: Words that are poorly classified in Phase 1 are segmented into characters using projection profiles. The segmentation results are characters and compound characters (conjuncts, shadow characters, consonant-consonant-vowel combinations, consonant-vowel combinations, and characters including diacritics). These characters are then classified. Depending on the classification confidence of these characters, further dissection of some segments or merging of two or more segments may be applied; here the dissection is guided by recognition.

The general framework is given in Figure 1. It consists of two main parts – training and recognition:
1.2 Training
The main goal of the training phase is to prepare the OCR application for recognizing Nepali text. Training takes raw Nepali text data as input and outputs a dataset of words and a dataset of ligatures (compound characters), where each item is described by a feature vector. The training phase consists of two main steps:
1. Generating a dataset of images for the possible words and ligatures (compound characters) of the Nepali language to be used by the application.
2. Extracting features that describe each word and ligature in the dataset generated by the previous step.
1.2.1 Data Set Generation
Data is a primary need for machine learning projects: the algorithms are trained on curated, well-organized datasets. Creating a suitable training dataset is therefore an important and necessary part of this work. This work requires a dataset of Nepali word images, character images, and compound character images, which is used for training. The second type of data required is a text image dataset, which is used for testing the proposed OCR system. To the best of my knowledge, no such dataset is maintained for the Nepali language to date; this is the main constraint for this research. Thus, before developing the system, it was necessary to prepare the training dataset and the test dataset.

There are various ways of generating a training dataset and a test dataset for an OCR system. One approach is to create the dataset by analyzing text images, i.e., cropping individual words from the images and creating a database of word images and their respective text, and likewise for the basic-character and compound-character level datasets. This method yields a good variety of data, since we can collect material from paper books, newspapers, and pamphlets in different fonts, but the task is tedious and very time-consuming, and it is not in the scope of this research work. The second approach is to use an automated computer program to generate the necessary training and test datasets. Such a system uses a text corpus of the target language and analyzes the textual data to generate dictionaries of words, basic characters, and compound characters, which are later used to render images representing the corresponding text. This method is fast and takes less effort, so I have used it to generate the required datasets. The steps involved in dataset generation are described below.

1.2.1.1 Text Data Collection: In this project, I am using the text corpus collected by Madan Puraskar Pustakalaya (MPP) under the Bhasha Sanchar Project¹. It includes different types of articles from websites and books, such as BBC news articles, biography, academic magazines, business magazines, criticism, drama, autobiography, fiction, editorials, interviews, law, memoir, poetry, prose, translations, opinions, science, politics, sociology, spiritual, sport, and relationship pieces, a variety of news articles from Gorkhapatra, and articles from the magazine Yubamancha. Along with the text corpus from MPP, I have also collected text data from different news portals (e.g. www.ekantipur.kantipur.com, www.annapurnapost.com, www.nagariknews.com, and www.himalkhabar.com) and online Nepali literature websites (e.g. samakalinsahitya.com, manjari.com). This makes more than 200 MB of textual data (about 2,500 articles). The reason for collecting textual data from various sources is to generate a list of all possible words used in the Nepali language: words used in one area are obviously not used in others (political words, for example, are not written in science articles). Such analysis is necessary to obtain a complete list of words and compound characters, in addition to populating a list of frequent words. A dictionary of words and compound characters is maintained, which is later used for training the classifier and for post-processing.

¹ This corpus has been constructed by the Nelralec / Bhasha Sanchar Project, undertaken by a consortium of the Open University, Madan Puraskar Pustakalaya (मदन पुरस्कार पुस्तकालय), Lancaster University, the University of Göteborg, ELRA (the European Language Resources Association) and Tribhuvan University.

1.2.1.2 Filtering Text Data: The collected text data is not always in the required form; it may contain different kinds of noise. It is therefore necessary to make the data clean and noise-free. Filtering of the text data involves:
- removing non-Devanagari characters, and
- correcting or removing typos (e.g. misplaced modifiers)
Since the scope of this research work is the Nepali language in Devanagari script, any non-Devanagari characters found (except special characters, e.g. punctuation marks) are considered noise and must be removed from the text. Secondly, many typing errors (typos) are found in these articles; these are also noise and must be corrected or eliminated. For this work I have written a TextDataFilter program in C# which takes raw text as input and produces noise-free text data.
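The TextDataFilter program is not listed in this report; a minimal sketch of the non-Devanagari filtering step, assuming the standard Devanagari Unicode block (U+0900–U+097F) plus common punctuation as the allowed set, could look like this:

using System;
using System.Linq;
using System.Text;

public static class TextDataFilter
{
    // Characters outside the Devanagari block that we still keep
    // (whitespace and common punctuation marks).
    private static readonly char[] AllowedExtras =
        { ' ', '\n', '\t', ',', '.', '?', '!', '-', '(', ')', '"' };

    public static string RemoveNonDevanagari(string rawText)
    {
        var sb = new StringBuilder(rawText.Length);
        foreach (char c in rawText)
        {
            // U+0900–U+097F is the Devanagari block (letters, matras,
            // the danda, and Devanagari digits all fall inside it).
            bool isDevanagari = c >= '\u0900' && c <= '\u097F';
            if (isDevanagari || AllowedExtras.Contains(c))
                sb.Append(c);
        }
        return sb.ToString();
    }
}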
[Figure 2: Steps of DataSet Generation – Nepali language text documents; extract words and extract compound characters/ligatures; render images with the selected fonts; apply the degradation methods; store the labeled images (e.g. img_1.jpeg → लाई, img_240.jpeg → य) in databases.]
1.2.1.3 Create Distinct Words List: The text data from the previous step is fed to the WordList Generator, a program written in C#. This program scans for words and maintains a dictionary in the form of (word, frequency) tuples. This word-frequency list is further processed to separate the most frequent words, and at the same time a list of all possible words is created. The number of words extracted for Nepali is over 150,000, of varying length.
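The WordList Generator itself is not reproduced in this report; a minimal sketch of the word-frequency counting it performs, assuming whitespace-delimited tokens from the filtered corpus (the names here are my own), might be:

using System;
using System.Collections.Generic;
using System.Linq;

public static class WordListGenerator
{
    // Builds a (word, frequency) dictionary from filtered corpus text.
    public static Dictionary<string, int> CountWords(string filteredText)
    {
        var frequencies = new Dictionary<string, int>();
        string[] words = filteredText.Split(
            new[] { ' ', '\n', '\t', '\r' },
            StringSplitOptions.RemoveEmptyEntries);
        foreach (string word in words)
        {
            int count;
            frequencies.TryGetValue(word, out count);
            frequencies[word] = count + 1;
        }
        return frequencies;
    }

    // Most frequent words first, e.g. for building a frequent-word list.
    public static IEnumerable<KeyValuePair<string, int>> ByFrequency(
        Dictionary<string, int> frequencies)
    {
        return frequencies.OrderByDescending(kv => kv.Value);
    }
}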
1.2.1.4 Create Basic and Compound Characters List: The filtered text corpus is fed to the CompoundCharacter Separator program, which is also a part of the WordList Generator. It separates basic characters and compound characters following the character combination rules of Devanagari. These rules include:
- consonant-vowel combinations, and
- consonant-consonant combinations
After tokenizing the text corpus into basic characters and compounds, a list of distinct characters and compound characters is determined. From this list we can categorize the characters and compound characters into frequently occurring and rarely occurring ones, and a list of all characters is also populated. These lists are used for generating the character-level training dataset; the frequent-character list may be used to give more emphasis to frequent characters, which should help increase recognition accuracy. The number of basic characters and compound characters extracted for Nepali is over 7,000, of varying length.

1.2.1.5 Image DataSet Generation: This step involves the process of image dataset generation. In order to generate a dataset of words and compound characters (including basic characters), the word list and compound character list populated in the previous steps are fed to the DataSet Generator program written in C#. The following steps are carried out:
- Images for each extracted word and character are rendered using a rendering engine. This involves rendering the text using 5 different Devanagari Unicode fonts, namely Mangal, Arial Unicode MS, Samanata, Kokila, and Adobe Devanagari. Other fonts may also be used to increase the diversity of the data.
- In order for the system to be able to handle input images with noise, the system is trained on degraded images as well as clean images. The degraded images are generated by applying four different image processing operations to the images rendered in the previous step; the degradation models used are threshold, blur, erode, and mean filter. The number of degraded images for each word or character depends on how frequent that word or character is: the more frequent the word, the more degraded images are generated for it.
- The rendered image data is then classified into 6 different classes based on the ratio of image width to height. There is no concrete approach or ratio to differentiate the classes; based on experience, I have set the width-to-height ratio bounds as x ≤ 1, 1 < x ≤ 2, 2 < x ≤ 4, 4 < x ≤ 6, 6 < x ≤ 8, and x > 8, where x is the ratio of width to height of the image being categorized (see the sketch below). By analyzing the images against these threshold ratios, the image data is arranged as (image, label) records in a database, where the label is the text value of the corresponding image. Six different databases are created at the end of this step; these are later used to train multiple classifiers.
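As an illustration of the class assignment described in the last step, a small helper (the name is hypothetical) mapping an image's width-to-height ratio to one of the six classes could be:

// Maps the width/height ratio x of a rendered image to one of the six
// database classes: x<=1, 1<x<=2, 2<x<=4, 4<x<=6, 6<x<=8, x>8.
public static int RatioClass(int width, int height)
{
    double x = (double)width / height;
    if (x <= 1) return 1;
    if (x <= 2) return 2;
    if (x <= 4) return 3;
    if (x <= 6) return 4;
    if (x <= 8) return 5;
    return 6;
}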
1.2.2 Feature Extraction
The second main step of the training phase is to extract a feature vector representing each word and compound character in the dataset. For this, the following steps are performed:
1. Normalize each image to a fixed width and height.
2. Compute the Histogram of Oriented Gradients (HOG) descriptor.

[Figure 3: Feature Extraction – a data set sample passes through image normalization and HOG to produce the HOG descriptor.]
1.2.2.1 Normalization: In the dataset generation step, the whole training dataset was classified into 6 classes based on the ratio of sample width to height. Because the dataset contains word images as well as character and compound character images, the sizes vary greatly, so all image data cannot be normalized to one standard size: the images in Class 1 are much smaller than those in Class 6. To overcome this problem, normalization is performed per class, with a different fixed normalization size (width and height) set for each of the six classes.
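As an illustration of the per-class normalization, a minimal sketch using System.Drawing is given below; the Normalizer class and the size values are placeholders (the actual fixed width × height pairs used in this work are not listed in this report):

using System.Drawing;
using System.Drawing.Drawing2D;

public static class Normalizer
{
    // Fixed normalization size per class. These values are illustrative
    // placeholders, not the sizes fixed in this work.
    private static readonly Size[] ClassSizes =
    {
        new Size(32, 32),   // Class 1
        new Size(64, 32),   // Class 2
        new Size(128, 32),  // Class 3
        new Size(192, 32),  // Class 4
        new Size(256, 32),  // Class 5
        new Size(320, 32)   // Class 6
    };

    public static Bitmap Normalize(Bitmap source, int ratioClass)
    {
        Size target = ClassSizes[ratioClass - 1];
        var normalized = new Bitmap(target.Width, target.Height);
        using (Graphics g = Graphics.FromImage(normalized))
        {
            // Rescale the sample to its class's fixed size.
            g.InterpolationMode = InterpolationMode.HighQualityBicubic;
            g.DrawImage(source, 0, 0, target.Width, target.Height);
        }
        return normalized;
    }
}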
1.2.2.2 Compute HOG Descriptor: Navneet Dalal and Bill Triggs, researchers at the French National Institute for Research in Computer Science and Automation (INRIA), first described HOG descriptors at the 2005 Conference on Computer Vision and Pattern Recognition (CVPR)². HOG is a feature descriptor used in image processing and computer vision which counts the occurrences of gradient orientations in localized portions of an image. The essential idea behind the histogram of oriented gradients descriptor is that local object appearance and shape within an image can be described by the distribution of intensity gradients or edge directions. The HOG descriptor has a few key advantages over other descriptors: since it operates on local cells, it is invariant to geometric and photometric transformations, except for object orientation³.

² Dalal, N., & Triggs, B. (2005, June). Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on (Vol. 1, pp. 886-893). IEEE.
³ https://en.wikipedia.org/wiki/Histogram_of_oriented_gradients Histogram of oriented gradients, Wikipedia.
To extract the HOG features from the dataset, I have used the hog routine implemented in skimage.feature. The routine allows setting the number of orientations, the pixels per cell, and the cells per block. The HOG descriptors are extracted with the following setup:

from skimage.feature import hog

fd = hog(image, orientations=8, pixels_per_cell=(8, 8),
         cells_per_block=(3, 3), visualise=False, normalise=False)
The process of feature extraction is shown in Figure 3.
1.3 Recognition
The recognition part takes as input an image specified by the user through the user interface. Its main task is to recognize any text that occurs in the input image; the recognized text is presented to the user in an editable format. Recognition of the text in an input image is done using the following steps (a schematic sketch follows the list):
1. Segment the page image into lines and words.
2. Describe each unknown word image using the HOG descriptor.
3. Classify each unknown word using a Support Vector Machine.
4. Calculate the confidence of classification.
5. If the classification confidence is lower than acceptable:
   a. Segment the word into characters/ligatures.
   b. Go to step 2.
6. The recognized words/ligatures from the classification step form the output editable text.
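The recognition driver itself is not listed in this report. A schematic sketch of the two-phase control flow is given below; the class name, method names, and stubs are hypothetical stand-ins for the modules described in this chapter, not the actual implementation:

using System.Collections.Generic;
using System.Drawing;

public class TwoPhaseRecognizer
{
    private readonly double threshold; // acceptable classification confidence

    public TwoPhaseRecognizer(double threshold) { this.threshold = threshold; }

    public string RecognizePage(Bitmap page)
    {
        var output = new List<string>();
        foreach (Bitmap word in SegmentWords(page))            // Phase 1: word level
        {
            double confidence;
            string text = ClassifyWord(word, out confidence);  // holistic HOG + SVM
            if (confidence < threshold)
            {
                // Phase 2: fall back to character/ligature level
                var pieces = new List<string>();
                foreach (Bitmap ch in SegmentCharacters(word))
                    pieces.Add(ClassifyCharacter(ch, out confidence));
                text = string.Concat(pieces);
            }
            output.Add(text);
        }
        return string.Join(" ", output);
    }

    // Stubs standing in for the modules described in this chapter.
    private IEnumerable<Bitmap> SegmentWords(Bitmap page) { /* blob detection */ yield break; }
    private IEnumerable<Bitmap> SegmentCharacters(Bitmap word) { /* projection profile */ yield break; }
    private string ClassifyWord(Bitmap word, out double conf) { conf = 0; return ""; }
    private string ClassifyCharacter(Bitmap ch, out double conf) { conf = 0; return ""; }
}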
1.3.1 Line and Word Segmentation
In the literature on Devanagari OCR, most researchers have used the horizontal projection profile and vertical projection profile methods for detecting lines and dissecting words. This approach involves binarizing the text image and then counting the score of black pixels across the whole image. By analyzing the horizontal score of black pixels, the lines are predicted, and by analyzing the vertical score of black pixels inside each predicted line, the words are predicted. While predicting the lines, the discontinuity of the score is used: between two lines (or two words) there is blank space, so the score of black pixels drops to zero there, and that zero score produces the discontinuity in the scores. This approach is easy to implement and use, but with noisy images it does not perform well: the segmentation may produce false lines and words, which then need to be filtered out.
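For reference, a minimal sketch of line detection by horizontal projection on a binarized image (assuming 1 represents a black pixel; the method name is my own) could look like this:

using System.Collections.Generic;

// Horizontal projection profile: for each row, count the black pixels.
// Runs of rows with a non-zero score are text lines; the zero-score gaps
// between them are the discontinuities that separate lines.
public static List<(int top, int bottom)> DetectLines(int[,] binaryImage)
{
    int rows = binaryImage.GetLength(0), cols = binaryImage.GetLength(1);
    var lines = new List<(int top, int bottom)>();
    int lineStart = -1;
    for (int r = 0; r < rows; r++)
    {
        int score = 0;
        for (int c = 0; c < cols; c++) score += binaryImage[r, c];

        if (score > 0 && lineStart < 0) lineStart = r;   // a line begins
        else if (score == 0 && lineStart >= 0)           // the line ends
        {
            lines.Add((lineStart, r - 1));
            lineStart = -1;
        }
    }
    if (lineStart >= 0) lines.Add((lineStart, rows - 1)); // line touching the bottom edge
    return lines;
}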
In this research work, instead of projection profiles, I am using a blob detection based approach for word segmentation. Blobs are bright-on-dark or dark-on-bright regions in an image⁴ ⁵. A question may arise: why blob detection, and is it beneficial? In Devanagari script, each word is a bunch of characters tied to each other by the header line (dika). This property of Devanagari script makes it easy to use blob detection for detecting individual words in a text document: where blob detection would take each bright dot in an image as a star or galaxy, in a Nepali text image each such "star" is one word. Figure 4 shows Nepali words, each word appearing as a separate bright region on the black background.

[Figure 4: Nepali text words as blobs]

Word segmentation using blob detection involves the following steps.

Algorithm: Word Segmentation
Step 1: Preprocessing – blurring, binarization (grayscale and thresholding), and inverting the image
Step 2: Blob detection
Step 3: Sort blobs – first OrderBy x, then ThenBy y
Step 4: Get the average blob height
Step 5: Check for overlapping blobs
Step 6: Reorder the blobs by Rectangle.Y
Step 7: Calculate the sequential difference of y values, e.g. (a, b, c, d) => sequential difference (b − a, c − b, d − c)
Step 8: Calculate the discontinuity of the sequence using the sequential difference
Step 9: Split the list at the points of discontinuity – each resulting list contains the words of one line
Step 10: Sort the blobs in each list – first OrderBy x, then ThenBy y – so that the rectangles representing the words in that line are in proper order
⁴ http://scikit-image.org/docs/dev/auto_examples/plot_blob.html Blob Detection, scikit-image. Accessed: 12/19/2015.
⁵ http://www.aforgenet.com/framework/features/blobs_processing.html Blobs Processing, AForge.NET. Accessed: 12/19/2015.
[Figure 5: Blob detection – words in improper order]

[Figure 6: Blob detection – words arranged in proper order]
Step 1: Preprocessing: This step involves removing the noise from the text image and preparing it for successful word segmentation. The preprocessing operations performed are blurring, binarization, and inversion. Blurring is used to reduce image noise and detail. Applying blurring enhances each word's structure, i.e., it makes the coupling of the characters tighter so that the word resembles a sharper black dot. This also helps when modifiers or characters appear separated from the bunch; mainly, such errors occur with the ukar and with characters like श and भ, where the header line gets disconnected. The process of converting an image from color or grayscale to a binary image is known as binarization. Binarization is important because all the operations involved in segmentation are performed on bi-tonal images. Since the blob detection algorithm I am using detects bright regions on a black background, the image needs to be inverted so that the words can be detected as blobs.

Step 2: Blob Detection: For blob detection I have used the blob processing routines implemented in the AForge.NET framework, which allow counting blobs, filtering them, extracting them, getting their dimensions, ordering them by size, and so on. The sample source code for detecting blobs is given below:

BlobCounter blobCounter = new BlobCounter();
blobCounter.ProcessImage((Bitmap)inputPicBox.Image);
Blob[] blobs = blobCounter.GetObjectsInformation();
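For completeness, the following is a sketch of the Step 1 preprocessing chain using standard AForge.NET filters; the filter parameter values here are illustrative assumptions, not the values used in this work:

using System.Drawing;
using AForge.Imaging.Filters;

public static Bitmap PreprocessForBlobDetection(Bitmap input)
{
    // Grayscale conversion (BT709 luminance weights).
    Bitmap gray = Grayscale.CommonAlgorithms.BT709.Apply(input);

    // Blurring tightens the coupling of characters within a word.
    gray = new GaussianBlur(1.5, 5).Apply(gray);

    // Binarization: fixed threshold on the grayscale image.
    gray = new Threshold(128).Apply(gray);

    // Invert so that words become bright blobs on a dark background.
    return new Invert().Apply(gray);
}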
The output of blob detection on a Nepali text image is shown in Figure 5. Each word is detected as a single blob, but these blobs are not in a proper sequence; a proper sequence means the detected blobs are in the order in which the words appear in the text image. It is therefore necessary to put the blobs into a proper sequence and to cluster the blobs representing lines.
Step 3: Sort blobs – first OrderBy x, then ThenBy y: Each blob is a rectangle surrounding a word. The rectangles are first ordered by x and then by y so that left and right neighboring rectangles come together in the sequence. This step is helpful for detecting overlapping blobs.

Step 4: Get the average word height: In this research work it is assumed that the text document contains text with a uniform font size. The average height of the blobs is calculated to predict the average line height.

Step 5: Check overlapping blobs: This step filters false blobs. Sometimes blobs inside blobs are detected; such blobs are false blobs (false words) and must be eliminated or corrected.

Step 6: Reorder the blobs by Rectangle.Y: After correcting the false blobs, the blobs are reordered by the y value of their rectangles, so that the blobs representing the words of the first line come earliest in the list, followed by the blobs representing the words of the second line, and so on.

Step 7: Calculate the sequential difference of y values: If (a, b, c, d) is a sequence, its sequential difference is the sequence (b − a, c − b, d − c). The sequential difference of the y values of the blobs is calculated and stored in a list.

Step 8: Calculate the discontinuity of the sequence: The discontinuity of the sequence of blobs is calculated using the sequential difference; the discontinuities separate the lines. A point of discontinuity is where seqDiff[j] > avgHeight * 90/100.

Step 9: Split the list at the points of discontinuity: The list of blobs is sliced at the points of discontinuity; each slice contains the words of one line.

Step 10: Sort the blobs in each list from Step 9: The lists from Step 9 separate the words by line, but the words within a line are not yet in proper order, so each list is reordered by x, giving the list of blobs representing the words of that line in proper order. The final output of the algorithm is shown in Figure 6.

The C# code for performing word segmentation using blob detection is given below:
List<Blob> blobList = blobs.OfType<Blob>().ToList(); // convert the blobs array to a list of blobs

// sort blobs by the top edge (Rectangle.Y) of their bounding rectangles
List<Blob> sortBlobs = blobList.OrderBy(b => b.Rectangle.Y).ToList();

// get the average word height
double avgHeight = 0.0;
double aggr = 0.0;
foreach (var blb in sortBlobs)
{
    aggr += blb.Rectangle.Height;
}
avgHeight = aggr / sortBlobs.Count;

// check overlapping blobs
CheckOverlappingBlobs(ref sortBlobs, avgHeight);

// resort blobs
List<Blob> sortedBlobs = sortBlobs.OrderBy(b => b.Rectangle.Y).ToList();

// get the sequential difference of y values: (a,b,c,d) => (b-a, c-b, d-c)
List<int> seqDiff = new List<int>();
for (int i = 0; i < (sortedBlobs.Count - 1); i++)
{
    seqDiff.Add(sortedBlobs[i + 1].Rectangle.Y - sortedBlobs[i].Rectangle.Y);
}

// find the discontinuities of the sequence using the sequential difference
List<int> discontinuity = new List<int>();
discontinuity.Add(0);
for (int j = 0; j < seqDiff.Count; j++)
{
    if (seqDiff[j] > avgHeight * 90 / 100)
    {
        discontinuity.Add(j);
    }
}
discontinuity.Add(sortedBlobs.Count - 1);

// separate the lists: each slice holds the words of one line
List<Blob> sor = new List<Blob>();
for (int i = 0; i < (discontinuity.Count - 1); i++)
{
    List<Blob> myRange;
    if (i == 0)
        myRange = sortedBlobs.GetRange(discontinuity[i], discontinuity[i + 1] - discontinuity[i] + 1);
    else
        myRange = sortedBlobs.GetRange(discontinuity[i] + 1, discontinuity[i + 1] - discontinuity[i]);

    // sort the blobs of this line by x so the words appear left to right
    List<Blob> soBlobs = myRange.OrderBy(b => b.Rectangle.X).ToList();

    sor.AddRange(soBlobs); // each appended slice is an ordered list of the words in one line
}

1.3.2 Character Segmentation
Most OCR systems are per-character systems: they deal with individual characters, not with words or compound characters. In written Nepali, every character in a word is tied to the others by the header line (dika), and it is not easy to separate the characters. Character segmentation down to the basic constituents becomes even more challenging due to the script's properties – the use of modifiers and diacritics, compound characters and ligatures, and typos. By studying the structure of Devanagari script and the use of compound characters, I found that it is better to treat compound characters and ligatures as single characters. The method is inspired by the work of Nazly Sabbour and Faisal Shafait⁶, the ligature-based approach used to implement segmentation-free Arabic and Urdu OCR (Sabbour & Shafait, 2013). On analyzing the Nepali text corpus, I found that about 7,000 compound characters (basic characters, conjuncts, ligatures) are used in Nepali. Although this method also has limitations when multiple characters are connected at places other than the header line, or are broken into multiple pieces by image artifacts, I believe it will improve the recognition. The result of character segmentation is shown in Figure 7.

[Figure 7: Character Segmentation]

The algorithm for character segmentation is given below:

Algorithm: Character Segmentation
Step 1: Input: the list of blobs
Step 2: Apply horizontal projection to the word rectangle: Hpp(word) = {r1, r2, r3, …, rn}, where r1, r2, …, rn are the scores of black pixels in the corresponding rows
Step 3: Find the header line in the word: HLl(word, x, y) is the location of the header line in the word, where x is the upper row lying in the header line and y is the lower row; HLh(word) = y − x is the height of the header line of the word
Step 4: Apply vertical projection to the word rectangle: Vpp(word) = {v1, v2, v3, …, vn}, where v1, v2, …, vn are the scores of black pixels in the corresponding columns
Step 5: Remove the header line: Vpp(word)hr = {v1 − HLh(word), v2 − HLh(word), …, vn − HLh(word)}
Step 6: Detect the cut points CP(word) = ⟨cp1, cp2, …, cpn⟩; cut points are valleys, i.e., the spaces between two characters
Step 7: Perform segmentation
This module takes the blobs (rectangles enclosing the words) as input. First, horizontal projection is applied to the word: the result Hpp(word) = {r1, r2, r3, …, rn} contains the score of black pixels in each row. Hpp(word) is analyzed to detect the header line of the word and to calculate its height HLh(word): the rows whose horizontal projection score is equal (or close) to the maximum score and that are neighbors of each other form the header line. This analysis is performed on the upper half of the word. The location of the header line, HLl(word, x, y), is its position in the word, where x is the upper row belonging to the header line and y is the lower row; the height of the header line is given by HLh(word) = y − x. Then vertical projection is applied to the blob: Vpp(word) = {v1, v2, v3, …, vn} is the list of black-pixel scores per column of the blob.

The method practiced so far to isolate the individual characters in a word of Devanagari script is to remove the header line. I have used the same method, and it works fine for isolating compound characters. The header line is not actually removed; instead, HLh(word) is subtracted from each element of Vpp(word), giving Vpp(word)hr = {v1 − HLh(word), v2 − HLh(word), …, vn − HLh(word)}. If the subtraction makes an element less than zero, that element is set to zero, because no score can be negative. The next task is to find the cut points CP(word) = ⟨cp1, cp2, …, cpn⟩ by analyzing Vpp(word)hr: cut points are the points at which the word can be chopped to isolate the characters, and they are normally the points where an element of Vpp(word)hr equals zero, i.e., the space between two characters. Finally, the cut points are noted and the segmentation is performed; the result is a list CS(TextImage) containing, for each word, its sequence of segmented characters.

⁶ Sabbour, N., & Shafait, F. (2013, February). A segmentation-free approach to Arabic and Urdu OCR. In IS&T/SPIE Electronic Imaging (pp. 86580N). International Society for Optics and Photonics.
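A minimal sketch of Steps 4–6 (vertical projection, header line removal, and cut point detection) is given below, assuming a binarized word image in which 1 represents a black pixel and assuming the header line height has already been computed; the class and method names are my own:

using System.Collections.Generic;

public static class CharacterSegmenter
{
    // Returns the column indices at which the word can be cut into
    // characters/ligatures once the header line score is removed.
    public static List<int> FindCutPoints(int[,] wordImage, int headerLineHeight)
    {
        int rows = wordImage.GetLength(0), cols = wordImage.GetLength(1);
        var cutPoints = new List<int>();
        for (int c = 0; c < cols; c++)
        {
            // Vpp(word): black-pixel score of this column.
            int score = 0;
            for (int r = 0; r < rows; r++) score += wordImage[r, c];

            // Subtract the header line height; clamp at zero since
            // no score can be negative.
            int reduced = score - headerLineHeight;
            if (reduced < 0) reduced = 0;

            // Valleys (zero score after removal) are cut points.
            if (reduced == 0) cutPoints.Add(c);
        }
        return cutPoints;
    }
}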
Chapter 2 Hypothesis

The changes to the general OCR architecture, the addition of the training module, and the proposed two-phase recognition method minimize the character segmentation overhead and errors, and improve the accuracy and performance of Nepali Optical Character Recognition.
Chapter 3 Test and Result

In this section I present an experimental study and testing of the proposed framework. To test the system, various documents were generated and collected. The experimental setup and results of testing the different modules of the system – namely word segmentation, character segmentation, word-level recognition, and compound-character-level recognition – are presented here.
3.1 Segmentation
Line segmentation is found to be accurate (almost 100%) as long as there is a certain amount of spacing between lines. The accuracy of line segmentation is somewhat reduced by the lower modifiers (ukar, ookar, rrikar, and halant) when they are separated from the parent character, and by punctuation marks such as the comma and the dot, so some extra processing is required to correct such errors. The same holds for word segmentation: lower modifiers are detected as separate words when they are separated from the main body of the word, and some Devanagari fonts render lower modifiers slightly separated from the core characters. To eliminate such errors, blurring is adopted as a pre-processing step, but the accuracy is still reduced marginally. When numerals are present in a document, they are taken as separate words because of the spacing between them; to eliminate such errors, spacing between detected words below a significant threshold is treated as a false split and the words are merged.

In this report I have assumed that segmentation using projection profiles yields compound characters, including basic characters and some of the modifiers, and that it is better to use these directly for recognition. While studying the structure of the Nepali language, I found that about 7,000 compound characters are used in Nepali, but most of them occur infrequently. Given a good training dataset, it is reasonable to train a good classifier for their classification, thus avoiding the burden of segmenting compound characters into basic characters and modifiers, reorganizing them into a proper order, and storing information about their position and parent character. The literature shows that one of the main challenges in Devanagari character recognition is the segmentation of conjuncts and compound characters. The accuracy of character segmentation is reduced a little by the upper modifier chandrabindu and the lower modifiers ukar and ookar: occurrences of these modifiers may produce shadow characters, which lower the segmentation accuracy marginally.
The character segmentation results for 10 documents are presented in Table 1.

Table 1: Character Segmentation Results

Document                   |   1   |   2    |   3   |   4   |   5   |   6   |   7   |   8   |   9    |  10
Characters present         |  212  |  118   |  370  |  353  |  324  |  166  |  273  |  289  |   89   |  35
Characters over-segmented  |   5   |   7    |   12  |   11  |   3   |   2   |   3   |   4   |   3    |   1
Characters under-segmented |   14  |   7    |   23  |   9   |   6   |   9   |   18  |   19  |   8    |  14
Error                      | 8.96% | 11.86% | 9.45% | 5.66% | 2.77% | 6.62% | 7.69% | 7.95% | 12.35% | 42.8%
Note: the characters counted in Table 1 are a set of basic characters, with each compound character taken as a single character. The results of the test performed on 10 documents show that most of the errors are due to under-segmentation; the errors due to over-segmentation are less than half of those due to under-segmentation. The average error rate of character segmentation is found to be 11.61%.
3.2 Recognition
3.2.1 Word Level Recognition
Experimental Setup: Training on the whole set of words is not fruitful for testing the system, so to keep the testing process simple I collected a list of 519 frequent words from the analysis of the Nepali text corpus. The training dataset was generated from this word list using the method explained in the Data Set Generation section, and the classifier was trained to recognize these words. To test the word-level recognition module, I generated text image documents using words from the 519-frequent-word list, rendered in varying fonts.

Result: The results of recognition on 10 documents are presented in Table 2.

Table 2: Word Recognition Results

Document                   |  1   |   2    |  3   |   4    |   5    |   6    |   7    |   8    |   9    |  10
Words present              |  25  |   41   |  50  |   38   |   23   |   42   |   19   |   37   |   39   |  47
Correctly recognized words |  19  |   38   |  40  |   33   |   16   |   38   |   14   |   33   |   31   |  36
Accuracy                   | 76%  | 92.68% | 80%  | 86.84% | 69.56% | 90.47% | 73.68% | 89.18% | 79.48% | 76.59%
From the table above we can see that the word-level recognition accuracy ranges from about 70% to 93%. Averaged over the documents, the accuracy of the module is 81.44%.
3.2.2 Character Level Recognition
Experimental Setup: Training on the whole set of characters and compound characters is not feasible when testing the system for research purposes, so to keep the testing process simple I extracted the compound characters and basic characters from the same list of 519 frequent words. The training dataset was generated from these extracted characters, and the classifier was trained to recognize this character set. To test the character-level recognition module, I used the same text images as in the word-level recognition module, but the segmentation method was changed to the character level and the classifier was set up to recognize characters rather than whole words.

Results: The module has been tested on the recognition of 10 documents. The results are not yet promising, which suggests that the character recognition module needs calibration and parameter adjustment.

From the above experiments and results, the proposed system appears promising. Word-level segmentation is nearly 100% accurate. The error in character-level segmentation was found to be 11.61%; the two factors affecting it are broken characters (which result in over-segmentation) and shadow characters (which are the result of under-segmentation). The accuracy of the word recognition module was found to be 81.44%. These recognition rates are promising enough to continue further work on this framework.
Chapter 4 Future Work

The work presented in this report is part of ongoing research, so it does not include all the necessary parts of the research. Although I have completed more than half of the work, several parts of the project remain incomplete and will be completed in the near future. The future work includes:
- training the character recognition module and improving its accuracy,
- calculating the confidence of classification of an unknown word,
- integrating the word recognition module and the character/ligature recognition module to give the proposed model its complete shape, and
- testing and analyzing the performance and accuracy of the proposed OCR system.
Chapter 5 Conclusion

A hybrid approach – a combination of a holistic approach and a dissection technique – for the recognition of Nepali text images is proposed in this report. The proposed framework follows two-phase recognition: in the first phase, word-level recognition is performed, and in the second phase, character/compound-character-level recognition is performed for poorly recognized words. Both models have been implemented and trained in this work. The error in character-level segmentation was found to be 11.61%; the two factors that affected character-level segmentation are broken characters (which result in over-segmentation) and shadow characters (which are the result of under-segmentation). The accuracy of the word recognition module was found to be 81.44%.
Bibliography

[1] M. Agrawal, H. Ma and D. Doermann, "Generalization of Hindi OCR using adaptive segmentation and font files," in Guide to OCR for Indic Scripts, Springer London, pp. 181-207, 2010.
[2] B. B. Chaudhuri and U. Pal, "An OCR system to read two Indian language scripts: Bangla and Devanagari (Hindi)," in Proceedings of the Fourth International Conference on Document Analysis and Recognition, 1997.
[3] S. Kompalli, S. Setlur and V. Govindaraju, "Devanagari OCR using a recognition driven segmentation framework and stochastic language models," IJDAR, 2009.
[4] S. Bag and G. Harit, "A survey on optical character recognition for Bangla and Devanagari scripts," Sadhana, pp. 133-168, 2013.
[5] N. Sabbour and F. Shafait, "A segmentation free approach to Arabic and Urdu OCR," in SPIE Proceedings, 2013.