Dhivehi OCR: Character Recognition of Thaana Script using Machine Generated Text and Tesseract OCR Engine

Ahmed Ibrahim
School of Computer and Security Science
Faculty of Health, Engineering and Science
Edith Cowan University, Australia
[email protected]

ABSTRACT

This paper provides the technical aspects and the context of recognising Dhivehi characters using the Tesseract OCR Engine, a freely available OCR engine with remarkable accuracy and support for multiple languages. The experiments that were conducted showed promising results with 69.46% accuracy and, more importantly, highlighted limitations that are unique to Dhivehi. Detailed discussions of these issues and possible directions for future research are presented.

Keywords: Dhivehi OCR, Thaana Script, Optical Character Recognition, Tesseract OCR.

INTRODUCTION

Optical Character Recognition (OCR) is the process of mimicking humans' ability to read text. There has been a considerable accumulation of knowledge in this area of computer science. It is still in development, albeit being a mature problem domain of pattern recognition and artificial intelligence (Mori, Nishida, & Yamada, 1999). There is currently no published literature related to OCR of Dhivehi text, nor are there any tools in the public domain specifically developed to perform Dhivehi OCR.

With the proliferation of the Internet and mobile connectivity, the demand for digital content continues to grow. There has been a tremendous increase of Internet content in the Dhivehi language over the recent years. The Social Media Phenomenon has been an influential force behind this drive to publish more localised content for the Maldivian audience. However, developments in Dhivehi language technologies have been very slow and often lacking. The absence of such literature and tools for Dhivehi OCR has been the primary motivation for this paper. The objective of this paper is to inform the reader of how a freely available tool, the Tesseract OCR Engine, can be used for Dhivehi OCR. As a result, key problems intrinsic to the recognition of the Dhivehi language were identified so that future research can be conducted to improve this domain.

This paper first presents some background information on OCR and the basic building blocks of a typical system. It also looks at certain characteristics of Dhivehi text that are relevant to the problem, followed by a brief overview of the Tesseract OCR Engine. In the next section, the methodology used to conduct the experiment is explained. The section on findings presents the results, and the discussion section delves into the successes and failures of the experiment in more detail. The paper then concludes with a summary and future directions of research.

BACKGROUND

Optical Character Recognition

Today there are a number of ways of entering data into a computer in contrast to the traditional form of using the keyboard. In order to increase productivity and accuracy, automatic forms of data capture, or Automatic Identification, are used more frequently in a number of applications. Some examples of such Automatic Identification techniques are scanning barcodes from consumer products and reading magnetic stripes from credit cards. These have become trivial because they are seamlessly integrated into our everyday routines. By reducing human intervention, human errors are minimised and efficiency is increased.

Optical Character Recognition, more commonly abbreviated as OCR, is a form of Automatic Identification where a computer system converts images of typewritten text into a standard encoding scheme such as ASCII or Unicode, so that it is editable on the computer. OCR is performed off-line after the text has been written or printed, as opposed to on-line recognition where the computer recognises the characters as they are drawn. Since its accuracy directly depends on the quality of the input document/image, the more it is constrained, the better it performs. Thus, current OCR technologies are still limited when it comes to unconstrained multilingual handwriting (Eikvil, 1993).

According to Mori, Suen, and Yamamoto (1992), the first concepts of OCR date back to even before the age of computers: a patent filed in Germany by Tauschek (Tauschek, 1935) in 1929 and one that was later filed in the U.S. by Handel (Handel, 1933) in 1933. However, the first generation of commercial OCR systems only started appearing in the period of 1960 to 1965 (Eikvil, 1993). Since then, OCR has gone through major developments and significant improvements; however, it is far from being "a solved problem". Different fonts and varying qualities of input documents still introduce a wide array of challenges for OCR (Kae, Smith, & Learned-Miller, 2011).

Components of an OCR System

A typical OCR system is similar to a pipeline of sequential stages, one after the other, each feeding its output to the next. It most commonly involves preprocessing of the input image, feature extraction and classification/recognition (Peng, Cao, Setlur, Govindaraju, & Natarajan, 2013).

The purpose of preprocessing is to clean, prepare and format the image data so that the features required for the classifier can be accurately extracted. As such, preprocessing may include binarization, layout analysis, line segmentation and word/character segmentation. Binarization eases the recognition process (Elagouni, Garcia, Mamalet, & Sébillot, 2013) by converting the image from either a grayscale or colour profile to a bilevel black and white image. This immensely reduces the amount of data that needs to be processed. The layout analysis typically identifies text and non-text regions in a page and also splits multi-column text into columns (Smith, 2009). This text is then further segmented into individual lines and often into words and/or characters. There is ongoing research in this area to overcome issues such as character fragmentation, noise separation, and distinguishing graphics or geometry from text (Eikvil, 1993).

The feature extraction greatly depends on the type of classifier used for identification. Hence, the features may be represented based on structural information of the characters, statistical patterns of input feature vectors, or even a hybrid approach. These features would then be used for structural matching, probabilistic or statistical matching, and classification (Cheng-Lin, Jaeger, & Nakagawa, 2004).

Characteristics of Dhivehi Language

Dhivehi is the native language spoken in Maldives and is also the official language used by all government institutions. Its base vocabulary is Indo-Aryan; however, many of its words have been borrowed or derived from other languages such as Arabic, English, Dravidian and Sinhala (Gair, 2010). The geographically dispersed nature of speakers of Dhivehi throughout the islands of Maldives has resulted in remarkable dialectal differences. The most notable of those are Huvadhoo, Fuammulaku, and Addu, the southernmost atolls in Maldives.

However, the Male' dialect has become more widespread and is typically spoken predominantly to the north of Hadhdhummathi. Hence, the Male' dialect is considered as the standard and is commonly used in education and administration all over the country (Fritz, 2002).

Despite the different dialects, there is only one standard script used today, which is called Thaana. According to Fritz (2002), the earliest form of Dhivehi text, Eveyla Akuru, can be dated back to the 12th century in the writing of the statues found in the National Museum in Male'. More historical artefacts dating back to the 17th century reveal another form of Dhivehi text called Dives Akuru. Documents found in later periods revealed the use of Thaana, which was then used in conjunction with Dives Akuru. By the end of the 19th century, the Thaana script had superseded its predecessors and is now considered as the official script of Dhivehi. The formulation of the Thaana script has had significant influence from Arabic text. In fact, the first nine characters of Thaana have a phenomenal resemblance with the numerals used in Arabic (Gair, 2010).

The main characteristics of Thaana writing are the following.

1. Thaana script consists of 24 basic consonants (Figure 1), 14 Arabic extensions (Figure 2) and 11 vowels (Figure 3).
2. It is written from right-to-left, except for numbers, which are written from left-to-right.
3. It does not have a distinction between upper and lower case characters.
4. A character sound may be produced by placing a vowel or diacritic mark above or below a consonant (Figure 4).
5. A consonant may be written by itself, whereas a vowel can never be written by itself.
6. Consequent to the numerous words borrowed from Arabic, any Arabic word can be accurately written semantically using the Arabic extensions found in the Thaana character set. However, they are often interchanged with regular consonants with similar sounds.
7. Due to the shape of characters and different font designs, machine printed Thaana text can have horizontal overlaps among consonants and also vertical overlaps among consonant-vowel pairs.
8. There is also a great deal of similarity between consonants, between vowels, and also between consonants and vowels.
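For reference, the three character classes in item 1 map directly onto contiguous ranges of the Unicode Thaana block (U+0780–U+07BF). The sketch below enumerates them in Python and composes one consonant-vowel pair as described in items 4 and 5; the endpoint letter names in the comments come from the Unicode character database, not from this paper.

```python
# Thaana occupies Unicode block U+0780-U+07BF. The three classes listed
# above correspond to contiguous code point ranges:
basic_consonants  = [chr(c) for c in range(0x0780, 0x0798)]  # HAA .. CHAVIYANI (24)
arabic_extensions = [chr(c) for c in range(0x0798, 0x07A6)]  # TTAA .. WAAVU (14)
vowel_signs       = [chr(c) for c in range(0x07A6, 0x07B1)]  # ABAFILI .. SUKUN (11)

# A vowel sign never stands alone; a consonant-vowel pair is simply the
# consonant code point followed by the combining vowel sign:
pair = basic_consonants[6] + vowel_signs[0]  # KAAFU + ABAFILI
print(len(basic_consonants), len(arabic_extensions), len(vowel_signs))
```

Treating each such two-code-point sequence as one visual unit is exactly what the later training steps require of the OCR engine.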

Figure 1: Basic Consonants

Figure 2: Arabic Extensions

Figure 3: Vowels

Figure 4: Image of Dhivehi text showing baseline and Consonant/Vowel combinations

Tesseract OCR Engine

Initially developed by Hewlett-Packard (HP) between 1984 and 1994 (Smith, 2007), it is now an open-source OCR engine maintained by Google ("tesseract-ocr", n.d.). It is one of the most accurate OCR engines freely available today and was among the top in the 4th UNLV Annual Test of OCR Accuracy in 1995 (Rice, Jenkins, & Nartker, 1995). Since then, it has gone through major changes and improvements. In addition to the typical Latin-derived character sets, it is now able to recognise more complex multilingual scripts (Smith, Antonova, & Lee, 2009).

The Tesseract OCR Engine is designed to be first trained (Smith, 2007) for the respective language before it can perform OCR. Once the language data has been generated, it can be portably used by anyone wishing to OCR text based on that language. There are already several language data files ("Downloads - tesseract-ocr", n.d.) that have been generated by others. Unfortunately, Dhivehi is not one of them. The next section outlines the methodology used to train the Tesseract OCR Engine to generate the language data file for Dhivehi. The latest version available at the time of writing this paper was 3.02; however, version 3.01 has been used for the purpose of this paper.

METHODOLOGY

The approach used in conducting the experiment to perform Dhivehi OCR using the Tesseract OCR Engine was split into two phases: 1) Training Phase: training of the Tesseract OCR Engine to generate the language data for Dhivehi; and 2) Testing Phase: testing the performance of Dhivehi OCR using the language data.

Training Phase

The training was conducted according to the training guide outlined for a new language ("TrainingTesseract3 - tesseract-ocr", 2013). As per this guide, all mandatory steps were followed. The following outlines these steps in more detail. As per the training guide, the code div was used throughout the training phases as the ISO 639-3 code for Dhivehi.

1. Preparing the training data:

The training character set only consisted of the basic consonants (Figure 1) and vowels (Figure 3). With the exclusion of the Arabic extensions (Figure 2), numerals and other special characters, the remaining characters were still sufficient for the purpose of this experiment. Therefore, a total of 276 character images were generated, which included a combination of every vowel with each consonant, and each consonant by itself. The consonant-vowel pairs were chosen as individual training elements due to the similarities in shape between some vowels and consonants. Figure 5 shows an extract from the training set.

Figure 5: Training Set

The training images were generated using the PIL library in Python. The characters were formatted using Faruma, which is one of the most widely used Unicode fonts for Dhivehi. During this step, the size of the font was crucial for properly identifying the bounding boxes of the characters. In order to determine the most suitable font size, a series of the same images was generated using varying font sizes. By performing the next step, it was then determined that size 20 of the Faruma font was most suitable.
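A minimal sketch of this image generation step with PIL (today maintained as Pillow): the font file name, canvas size and text position here are assumptions rather than the paper's actual values, and the code skips glyph rendering when the Faruma font file is not present.

```python
from PIL import Image, ImageDraw, ImageFont

# Render one consonant-vowel training element (KAAFU + ABAFILI) to an image.
pair = "\u0786\u07a6"

try:
    # Size 20 of the Faruma font was found most suitable; the .ttf path is assumed.
    font = ImageFont.truetype("Faruma.ttf", 20)
except OSError:
    font = None  # font file unavailable; skip drawing the glyphs below

img = Image.new("L", (64, 64), 255)  # white grayscale canvas (size is arbitrary)
if font is not None:
    ImageDraw.Draw(img).text((20, 20), pair, font=font, fill=0)
img.save("div.faruma.exp0.tif")  # lang.font.expN naming is illustrative only
```

Generating one image per consonant-vowel pair this way, over all combinations, yields the 276-image training set described above.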

2. Preparing the box file:

The purpose of generating a box file was basically to identify the boundary of each character, hence the term box. A number of issues were identified during this step. Smaller font sizes between 12 and 18 produced the least number of boxes. However, larger font sizes identified consonants and their vowels as separate characters instead of a pair (Figure 6). Vowels are similar to diacritics that appear over or below the consonants, hence must be treated as a single unit.

Figure 6: Consonant-Vowel Separate

Figure 7: Consonant-Vowel Pair

To remedy the issue of the consonant and vowel being separate, the box file generation command was re-executed with the additional configuration parameter textord_min_linesize, changing its default value from 1.25 to 2.5, and also with page segmentation mode (psm) 6. These configuration options were consistently used throughout the training phases.

Once the box file was generated, a box file editor was used to correctly label each box with the corresponding Unicode consonant or consonant-vowel characters. This can also be done directly on the box file; however, there are a number of tools available which can make this process much simpler.

3. Preparing the feature file:

The feature file *.tr was then generated using the same configuration options used in the second step. This was done by invoking the training mode in Tesseract.

4. Preparing the character set:

The character set indicates all possible characters Tesseract can output. This was done using the box file that was previously generated. This resulted in the unicharset file.

5. Preparing the font style information:

The font_properties file tells Tesseract the styles supported by the trained font. Since Faruma does not have any of these styles, all switches were set to 0.

6. Preparing the shape features' clusters:

To create the cluster prototypes, first the mftraining tool was used to create the inttemp, pffmtable and Microfeat files. This was followed by cntraining, which created the normproto file.

7. Preparing the language data file:

During the final step, the inttemp, pffmtable and normproto files were prefixed with the div code. Finally, all the generated files were then combined to create the language data file, div.traineddata, for Dhivehi.
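The seven steps above follow the public Tesseract 3.0x training guide, and can be summarised as the command sequence below. This is a sketch assembled from that guide: the lang.font.expN file naming is the guide's convention (assumed here), and the textord_min_linesize override would be supplied via a config file, which is omitted.

```python
# Illustrative Tesseract 3.0x training pipeline for the div language data.
base = "div.faruma.exp0"  # lang.font.expN naming convention (assumed)

commands = [
    # Step 2: box file generation, re-run with -psm 6 (a config file setting
    # textord_min_linesize to 2.5 would also be passed; omitted here).
    f"tesseract {base}.tif {base} -psm 6 batch.nochop makebox",
    # Step 3: feature (.tr) file via Tesseract's training mode.
    f"tesseract {base}.tif {base} box.train",
    # Step 4: character set (unicharset) from the corrected box file.
    f"unicharset_extractor {base}.box",
    # Steps 5-6: shape clustering, using the font_properties file.
    f"mftraining -F font_properties -U unicharset {base}.tr",  # inttemp, pffmtable, Microfeat
    f"cntraining {base}.tr",                                   # normproto
    # Step 7: after prefixing the outputs with 'div.', combine into div.traineddata.
    "combine_tessdata div.",
]
for cmd in commands:
    print(cmd)
```

Running these against the labelled training images produces the div.traineddata file used in the testing phase.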

Testing Phase

Following the training, the language data file was then used to perform OCR. However, a separate test dataset was created to measure its performance. This dataset was created using the same approach used to create the training data. The rationale was to ascertain any shortcomings in the initial training approach, especially since there are no previous benchmarks for Dhivehi OCR. Based on this understanding, the previously considered consonant-vowel pairs were then used to generate images of random Dhivehi text. With machine generated text, one significant advantage was to eliminate certain preprocessing tasks such as noise elimination, skew detection and correction. This would not be the case in a real world scenario; however, this enabled a more accurate assessment of the Dhivehi OCR performance using the Tesseract OCR Engine.

FINDINGS

A total of 239 consonant-vowel pairs were tested, and an accuracy of 69.46% was achieved for recognising machine generated Dhivehi text. From these tests, two important observations could be made.

Firstly, the correct identifications and the overall accuracy show great potential for using the Tesseract OCR Engine for the recognition of Dhivehi text. Table 1 shows examples of successful recognition.

Secondly, the Tesseract OCR Engine has mistaken the vowels on top of and below the consonants as different lines, and hence tried to identify them as consonants. As can be seen in Table 2 (b) and (d), the bottom line matches the consonants in the test images (a) and (c) respectively. However, the top lines were falsely recognised as consonants, resulting in extra lines of consonants. This issue is discussed in greater detail in the next section.

Table 1: Successful Identifications of Dhivehi text (test images (a), (c) with OCR text (b), (d))

Table 2: False Identifications of Dhivehi Text (test images (a), (c) with OCR text (b), (d))
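The headline figure above can be cross-checked with a one-line computation. Note that the absolute count of 166 correct pairs is inferred here from the rounded percentage; it is not stated in the paper.

```python
# 239 consonant-vowel pairs were tested at 69.46% accuracy, which implies
# about 166 correct identifications (166/239 = 69.46% after rounding).
total = 239
correct = 166  # inferred, not stated in the paper
accuracy = round(100 * correct / total, 2)
print(accuracy)  # 69.46
```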

DISCUSSION

Based on the observations made, it can be said that Dhivehi OCR is possible without reinventing the wheel, by further developing existing technologies such as the Tesseract OCR Engine. However, despite the significant improvements currently made to support multilingual scripts, there are still critical limitations in its design. This is because Tesseract was, at its core, designed to identify English text; hence every new language introduced to it brings new challenges. The same is true for Dhivehi.

In Dhivehi, the baseline (Figure 4) of the letters is at the upper bounds of the majority of consonants in Thaana script. In English, the baseline is where most of the characters sit, but in Dhivehi it is more like where all consonants are hung. Unlike other right-to-left languages such as Arabic, Dhivehi cannot be written without vowels, because they are not just diacritics, but are semantically essential components in forming words. Thus, it is absolutely necessary to be able to recognise the vowels as well as the consonants.

Based on the observations, the most significant errors were the result of failing to identify the consonant-vowel pairs. This is due to the nature in which vowels are aligned and spaced out on the top and bottom of the consonants. They are sometimes treated as a separate line during segmentation. By altering the configuration parameters, these errors were reduced but not eliminated. With a lack of detailed documentation of configuration parameters and its design, it would be premature to deem this as a limitation of the Tesseract OCR Engine. It is, however, clear that more research needs to be done on improving this aspect to reach a practical level of accuracy. Minor errors, such as the false recognition of one or two consonants or vowels, can be significantly reduced by the integration of other language tools such as word lists and dictionaries.

The objective of this paper was not to create a production level language data file that can be practically used for identifying Dhivehi text, but to explore the suitability of a freely available OCR tool, the Tesseract OCR Engine, so that its potential and limitations could be explored. With this in mind, it was trained to a bare minimum without the extra enhancements already available, such as Dictionary Data and the character ambiguity specifier unicharambigs ("TrainingTesseract3 - tesseract-ocr", 2013). Further investigations need to be done to determine the performance of integrating these options.

CONCLUSION

There is a significant gap in published literature with regard to Dhivehi OCR. This paper has been a small contribution to the vast body of OCR knowledge, but significant within the context of the Dhivehi language. As is evident from the findings, the Tesseract OCR Engine offers promising results in accurately recognising Thaana script. However, there are critical problems that are language specific that must be researched in depth to find viable solutions. One significant area for potential research is in solving the problem of segmentation of consonants and vowels in individual lines. The training and test data were very much constrained to control external influences, hence generating them via code. However, a plethora of problems are yet to be discovered when processing scanned documents of varying fonts or even handwritten text. Thus, future research will be done by expanding the character set to include Arabic extensions and processing scanned documents.

ACKNOWLEDGEMENTS

The author would like to thank Google and all the contributors for continuing support for the Tesseract OCR Engine, without whom this paper would not be possible. The author would also like to thank future researchers of Dhivehi language technologies, which are much needed for the development and sustenance of the Dhivehi language.

REFERENCES

Cheng-Lin, L., Jaeger, S., & Nakagawa, M. (2004). Online recognition of Chinese characters: the state-of-the-art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2), 198-213. doi: 10.1109/TPAMI.2004.1262182

Eikvil, L. (1993). Optical character recognition. citeseer.ist.psu.edu/142042.html

Elagouni, K., Garcia, C., Mamalet, F., & Sébillot, P. (2013). Text recognition in multimedia documents: a study of two neural-based OCRs using and avoiding character segmentation. International Journal on Document Analysis and Recognition (IJDAR), 1-13. doi: 10.1007/s10032-013-0202-7

Fritz, S. (2002). The Dhivehi language: a descriptive and historical grammar of Maldivian and its dialects. Ergon-Verlag.

Gair, J. W. (2010). Concise encyclopedia of languages of the world (K. Brown & S. Ogilvie, Eds.). Access Online via Elsevier.

Google. (2013). TrainingTesseract3 - tesseract-ocr. Retrieved 2013, from http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

Google. (n.d.). tesseract-ocr. Retrieved 2013, from http://code.google.com/p/tesseract-ocr/

Google. (n.d.). Downloads - tesseract-ocr. Retrieved 2013, from http://code.google.com/p/tesseract-ocr/downloads/list

Handel, P. W. (1933). U.S. Patent 1915993.

Kae, A., Smith, D. A., & Learned-Miller, E. (2011). Learning on the fly: a font-free approach toward multilingual OCR. International Journal on Document Analysis and Recognition (IJDAR), 14(3), 289-301. doi: 10.1007/s10032-011-0164-6

Mori, S., Nishida, H., & Yamada, H. (1999). Optical character recognition. John Wiley & Sons, Inc.

Mori, S., Suen, C. Y., & Yamamoto, K. (1992). Historical review of OCR research and development. Proceedings of the IEEE, 80(7), 1029-1058. doi: 10.1109/5.156468

Peng, X., Cao, H., Setlur, S., Govindaraju, V., & Natarajan, P. (2013). Multilingual OCR research and applications: an overview. Paper presented at the Proceedings of the 4th International Workshop on Multilingual OCR, Washington, D.C.

Rice, S. V., Jenkins, F. R., & Nartker, T. A. (1995). The fourth annual test of OCR accuracy. 1995 Annual Report of ISRI, University of Nevada, Las Vegas, 11-50.

Smith, R. (2007). An overview of the Tesseract OCR engine. Paper presented at the Ninth International Conference on Document Analysis and Recognition (ICDAR 2007).

Smith, R., Antonova, D., & Lee, D. S. (2009). Adapting the Tesseract open source OCR engine for multilingual OCR. Paper presented at the Proceedings of the International Workshop on Multilingual OCR, Barcelona, Spain.

Smith, R. W. (2009). Hybrid page layout analysis via tab-stop detection. Paper presented at the 10th International Conference on Document Analysis and Recognition (ICDAR '09).

Tauschek, G. (1935). U.S. Patent 2026329.