Dhivehi OCR: Character Recognition of Thaana Script using Machine Generated Text and Tesseract OCR Engine

Ahmed Ibrahim
School of Computer and Security Science
Faculty of Health, Engineering and Science
Edith Cowan University, Australia
[email protected]
ABSTRACT

This paper presents the technical aspects and context of recognising Dhivehi characters using the Tesseract OCR Engine, a freely available OCR engine with remarkable accuracy and support for multiple languages. The experiments conducted showed promising results, with 69.46% accuracy, and more importantly highlighted limitations that are unique to Dhivehi. Detailed discussions of these issues and possible directions for future research are presented.

Keywords: Dhivehi OCR, Thaana Script, Optical Character Recognition, Tesseract OCR.
INTRODUCTION

Optical Character Recognition (OCR) is the process of mimicking humans' ability to read text. There has been a considerable accumulation of knowledge in this area of computer science. It is still in development, albeit being a mature problem domain of pattern recognition and artificial intelligence (Mori, Nishida, & Yamada, 1999). There is currently no published literature related to OCR of Dhivehi text, nor are there any tools in the public domain specifically developed to perform Dhivehi OCR.

With the proliferation of the Internet and mobile connectivity, the demand for digital content continues to grow. There has been a tremendous increase in Internet content in the Dhivehi language over the recent years. The social media phenomenon has been an influential force behind this drive to publish more localised content for the Maldivian audience. However, developments in Dhivehi language technologies have been very slow and often lacking. The absence of such literature and tools for Dhivehi OCR has been the primary motivation for this paper. The objective of this paper is to inform the reader of how a freely available tool, the Tesseract OCR Engine, can be used for Dhivehi OCR. As a result, key problems intrinsic to the recognition of the Dhivehi language were identified so that future research can be conducted to improve this domain.

This paper first presents some background information on OCR and the basic building blocks of a typical system. It also looks at certain characteristics of Dhivehi text that are relevant to the problem, followed by a brief overview of the Tesseract OCR Engine. The next section explains the methodology used to conduct the experiment. The findings section presents the results, and the discussion section delves into the successes and failures of the experiment in more detail. The paper is then concluded with a summary and future directions of research.
BACKGROUND

Optical Character Recognition

Today there are a number of ways of entering data into a computer, in contrast to the traditional form of using the keyboard. In order to increase productivity and accuracy, automatic forms of data capture, or Automatic Identification, are used more frequently in a number of applications. Some examples of such Automatic Identification techniques are scanning barcodes on consumer products and reading magnetic stripes on credit cards. These have become trivial because they are seamlessly integrated into our everyday routines. By reducing human intervention, human errors are minimised and efficiency is increased.

Optical Character Recognition, more commonly abbreviated as OCR, is a form of Automatic Identification where a computer system converts images of typewritten text into a standard encoding scheme such as ASCII or Unicode, so that it is editable on the computer. OCR is performed off-line after the text has been written or printed, as opposed to on-line recognition where the computer recognises the characters as they are drawn. Since its accuracy directly depends on the quality of the input document/image, the more the input is constrained, the better OCR performs. Thus, current OCR technologies are still limited when it comes to unconstrained multilingual handwriting (Eikvil, 1993).

According to Mori, Suen, and Yamamoto (1992), the first concepts of OCR date back to even before the age of computers: a patent filed in Germany by Tauschek (Tauschek, 1935) in 1929 and one that was later filed in the U.S. by Handel (Handel, 1933) in 1933. However, the first generation of commercial OCR systems only started appearing in the period of 1960 to 1965 (Eikvil, 1993). Since then, OCR has gone through major developments and significant improvements; however, it is far from being "a solved problem". Different fonts and varying qualities of input documents still introduce a wide array of challenges for OCR (Kae, Smith, & Learned-Miller, 2011).

Components of an OCR System

A typical OCR system is similar to a pipeline of sequential stages, one after the other, each feeding its output to the next. It most commonly involves preprocessing of the input image, feature extraction, and classification/recognition (Peng, Cao, Setlur, Govindaraju, & Natarajan, 2013).

The purpose of preprocessing is to clean, prepare and format the image data so that the features required for the classifier can be accurately extracted. As such, preprocessing may include binarization, layout analysis, line segmentation, and word/character segmentation. Binarization eases the recognition process (Elagouni, Garcia, Mamalet, & Sébillot, 2013) by converting the image from either a grayscale or colour profile to a bilevel black and white image. This immensely reduces the amount of data that needs to be processed. Layout analysis typically identifies text and non-text regions in a page and also splits multi-column text into columns (Smith, 2009). The text is then further segmented into individual lines and often into words and/or characters. There is ongoing research in this area to overcome issues such as character fragmentation, noise separation, and distinguishing graphics or geometry from text (Eikvil, 1993).
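To make the binarization step concrete, the following is a minimal sketch, not taken from the paper, of converting a scanned page to a bilevel image using the Pillow (PIL) library that is also used later for generating the training images; the fixed threshold of 128 is an illustrative assumption.

    from PIL import Image

    def binarize(path, threshold=128):
        # Collapse colour to 8-bit grayscale, then map every pixel to black or white.
        # A fixed threshold is used here for simplicity; real systems often use an
        # adaptive method such as Otsu's thresholding.
        grey = Image.open(path).convert("L")
        return grey.point(lambda p: 255 if p > threshold else 0, mode="1")

    # Example usage:
    # binarize("scanned_page.png").save("scanned_page_bw.png")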
Feature extraction greatly depends on the type of classifier used for the identification. Hence, the features may be represented based on structural information of the characters, statistical patterns of input feature vectors, or even a hybrid approach. These features would then be used for structural matching, probabilistic or statistical matching, or classification (Cheng-Lin, Jaeger, & Nakagawa, 2004).

Characteristics of Dhivehi Language

Dhivehi is the native language spoken in the Maldives and is also the official language used by all government institutions. Its base vocabulary is Indo-Aryan; however, many of its words have been borrowed or derived from other languages such as Arabic, English, Dravidian and Sinhala (Gair, 2010). The geographically dispersed nature of speakers of Dhivehi throughout the islands of the Maldives has resulted in remarkable dialectal differences. The most notable of those are the Huvadhoo, Fuammulaku and Addu dialects, spoken in the southernmost atolls of the Maldives. However, the Male' dialect has become more widespread and is typically spoken predominantly to the north of Hadhdhummathi. Hence, the Male' dialect is considered as the standard and is commonly used in education and administration all over the country (Fritz, 2002).

Despite the different dialects, there is only one standard script used today, which is called Thaana. According to Fritz (2002), the earliest form of Dhivehi text, Eveyla Akuru, can be dated back to the 12th century in statues found in the National Museum in Male'. More historical artefacts dating back to the 17th century reveal another form of Dhivehi text called Dives Akuru. Documents found in later periods revealed the use of Thaana, which was then used in conjunction with Dives Akuru. By the end of the 19th century, Thaana had superseded its predecessors and is now considered as the official script of Dhivehi. The formulation of the Thaana script has had significant influence from Arabic text. In fact, the first nine characters of Thaana have a phenomenal resemblance to the numerals used in Arabic (Gair, 2010).
The main characteristics of Thaana writing are the following.

1. Thaana script consists of 24 basic consonants (Figure 1), 14 Arabic extensions (Figure 2) and 11 vowels (Figure 3) (see the enumeration sketch after this list).
2. It is written from right-to-left, except for numbers, which are written from left-to-right.
3. It does not have a distinction between upper and lower case characters.
4. A character sound may be produced by placing a vowel or diacritic mark above or below a consonant (Figure 4).
5. A consonant may be written by itself, whereas a vowel can never be written by itself.
6. Consequent to the numerous words borrowed from Arabic, any Arabic word can be accurately written semantically using the Arabic extensions found in the Thaana character set. However, they are often interchanged with regular consonants with similar sounds.
7. Due to the shape of characters and different font designs, machine printed Thaana text can have horizontal overlaps among consonants and also vertical overlaps among consonant-vowel pairs.
8. There is also a great deal of similarity between consonants, between vowels, and also between consonants and vowels.
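As a companion to item 1, the three character groups can be enumerated directly from the Unicode Thaana block; the short Python sketch below is not part of the original experiment, and the grouping of U+0798 to U+07A5 as the Arabic extensions is an assumption based on the matching count of 14.

    import unicodedata

    # Standard ranges within the Unicode Thaana block (U+0780 to U+07BF).
    BASIC_CONSONANTS  = [chr(c) for c in range(0x0780, 0x0798)]  # 24 characters
    ARABIC_EXTENSIONS = [chr(c) for c in range(0x0798, 0x07A6)]  # 14 characters
    VOWEL_SIGNS       = [chr(c) for c in range(0x07A6, 0x07B1)]  # 11 combining marks (fili)

    for label, group in (("basic consonants", BASIC_CONSONANTS),
                         ("Arabic extensions", ARABIC_EXTENSIONS),
                         ("vowel signs", VOWEL_SIGNS)):
        # Print the count and the official Unicode names of each character.
        print(f"{label}: {len(group)}")
        print(", ".join(unicodedata.name(ch) for ch in group))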
Figure 1: Basic Consonants
Figure 2: Arabic Extensions
Figure 3: Vowels
Figure 4: Image of Dhivehi text showing baseline and Consonant/Vowel combinations

Tesseract OCR Engine

Initially developed by Hewlett-Packard (HP) between 1984 and 1994 (Smith, 2007), Tesseract is now an open-source OCR engine maintained by Google ("tesseract-ocr", n.d.). It is one of the most accurate OCR engines freely available today and was among the top performers in the 4th UNLV Annual Test of OCR Accuracy in 1995 (Rice, Jenkins, & Nartker, 1995). Since then, it has gone through major changes and improvements. In addition to the typical Latin-derived character sets, it is now able to recognise more complex multilingual scripts (Smith, Antonova, & Lee, 2009).

The Tesseract OCR Engine is designed to be first trained (Smith, 2007) for the respective language before it can perform OCR. Once the language data has been generated, it can be portably used by anyone wishing to OCR text based on that language. There are already several language data files ("Downloads - tesseract-ocr", n.d.) that have been generated by others. Unfortunately, Dhivehi is not one of them. The next section outlines the methodology used to train the Tesseract OCR Engine to generate the language data file for Dhivehi. The latest version available at the time of writing this paper was 3.02; however, version 3.01 has been used for the purpose of this paper.

METHODOLOGY

The approach used in conducting the experiment to perform Dhivehi OCR using the Tesseract OCR Engine was split into two phases: 1) Training Phase: training the Tesseract OCR Engine to generate the language data for Dhivehi; and 2) Testing Phase: testing the performance of Dhivehi OCR using the language data.

Training Phase

The training was conducted according to the training guide outlined for a new language ("TrainingTesseract3 - tesseract-ocr", 2013). As per this guide, all mandatory steps were followed, and these steps are outlined in more detail below. As per the training guide, the code div was used throughout the training phases as the ISO 639-3 code for Dhivehi.
1. Preparing the training data:

The training character set only consisted of the basic consonants (Figure 1) and vowels (Figure 3). With the exclusion of the Arabic extensions (Figure 2), numerals and other special characters, the remaining characters were still sufficient for the purpose of this experiment. Therefore, a total of 276 character images were generated, which included a combination of every vowel with each consonant, and each consonant by itself. The consonant-vowel pairs were chosen as individual training elements due to the similarities in shape between some vowels and consonants. Figure 5 shows an extract from the training set.

Figure 5: Training Set

The training images were generated using the PIL library in Python (a sketch of this step follows below). The characters were formatted using Faruma, which is one of the most widely used Unicode fonts for Dhivehi. During this step, the size of the font was crucial for properly identifying the bounding boxes of the characters. In order to determine the most suitable font size, a series of the same images was generated using varying font sizes. By performing the next step, it was then determined that size 20 of the Faruma font was most suitable.
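The generation script itself is not included in the paper; the following is a minimal sketch of how such training images could be produced with Pillow (PIL) and the Faruma font, assuming a local Faruma.ttf file, illustrative output file names, and font size 20 as noted above. Correct placement of the combining vowel signs may additionally require a Pillow build with complex-text (Raqm) support.

    from PIL import Image, ImageDraw, ImageFont

    FONT_PATH = "Faruma.ttf"   # assumed local path to the Faruma font
    FONT_SIZE = 20             # size found most suitable during box-file generation

    CONSONANTS  = [chr(c) for c in range(0x0780, 0x0798)]  # 24 basic consonants
    VOWEL_SIGNS = [chr(c) for c in range(0x07A6, 0x07B1)]  # 11 vowel signs

    def render(text, path, size=(64, 64)):
        # Render one training glyph (a consonant or a consonant-vowel pair)
        # as black text on a white background and save it to disk.
        font = ImageFont.truetype(FONT_PATH, FONT_SIZE)
        image = Image.new("L", size, color=255)
        draw = ImageDraw.Draw(image)
        draw.text((size[0] // 4, size[1] // 4), text, font=font, fill=0)
        image.save(path)

    # Every consonant on its own, plus every consonant-vowel combination.
    for i, consonant in enumerate(CONSONANTS):
        render(consonant, f"train_{i:02d}.png")
        for j, vowel in enumerate(VOWEL_SIGNS):
            render(consonant + vowel, f"train_{i:02d}_{j:02d}.png")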
2. Preparing the box file:

The purpose of generating a box file was basically to identify the boundary of each character, hence the term box. A number of issues were identified during this step. Smaller font sizes between 12 and 18 produced the least number of boxes. However, larger font sizes identified consonants and their vowels as separate characters instead of a pair (Figure 6). Vowels are similar to diacritics that appear over or below the consonants, and hence must be treated as a single unit.

Figure 6: Consonant-Vowel Separate
Figure 7: Consonant-Vowel Pair

To remedy the issue of the consonant and vowel being separate, the box file generation command was re-executed with the additional configuration parameter textord_min_linesize, changing its default value from 1.25 to 2.5, and also with the page segmentation mode (psm) set to 6. These configuration options were consistently used throughout the training phases.

Once the box file was generated, a box file editor was used to correctly label each box with the corresponding consonant or consonant-vowel Unicode characters. This can also be done directly on the box file; however, there are a number of tools available which can make this process much simpler (a sketch of such a relabelling step follows below).
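For reference, a Tesseract box file is plain text with one "glyph left bottom right top page" entry per detected blob. The sketch below is not from the paper; it illustrates the kind of correction described above, namely reading the box file and merging a consonant box with an overlapping vowel box into a single consonant-vowel entry. The horizontal-overlap rule is an illustrative assumption.

    def read_boxes(path):
        # Parse a Tesseract box file: "glyph left bottom right top page" per line.
        boxes = []
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                glyph, left, bottom, right, top, page = line.split()
                boxes.append([glyph, int(left), int(bottom), int(right), int(top), int(page)])
        return boxes

    def overlaps_horizontally(a, b):
        # Treat two boxes as a consonant-vowel pair when their horizontal extents overlap;
        # real data may need a more tolerant rule.
        return not (a[3] < b[1] or b[3] < a[1])

    def merge(consonant, vowel):
        # Combine the two boxes into one entry covering both glyphs.
        return [consonant[0] + vowel[0],
                min(consonant[1], vowel[1]), min(consonant[2], vowel[2]),
                max(consonant[3], vowel[3]), max(consonant[4], vowel[4]),
                consonant[5]]

    def write_boxes(boxes, path):
        with open(path, "w", encoding="utf-8") as handle:
            for glyph, left, bottom, right, top, page in boxes:
                handle.write(f"{glyph} {left} {bottom} {right} {top} {page}\n")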
3. Preparing the feature file:

The feature file (*.tr) was then generated using the same configuration options used in the second step. This was done by invoking the training mode in Tesseract.

4. Preparing the character set:

The character set indicates all possible characters Tesseract can output. This was done using the box file that was previously generated, and resulted in the unicharset file.

5. Preparing the font style information:

The font_properties file tells Tesseract the styles supported by the trained font (italic, bold, fixed, serif and fraktur). Since Faruma does not have any of these, all switches were set to 0.

6. Preparing the shape features' clusters:

To create the cluster prototypes, the mftraining tool was first used to create the inttemp, pffmtable and Microfeat files, followed by cntraining, which created the normproto file.

7. Preparing the language data file:

During the final step, the inttemp, pffmtable and normproto files were prefixed with the div language code. Finally, all the generated files were combined to create the language data file, div.traineddata, for Dhivehi (the whole command sequence is sketched below).
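For reference, the command sequence behind steps 2 to 7 can be scripted. The sketch below is a hedged reconstruction based on the Tesseract 3 training guide cited above, not on commands published in the paper: the div.faruma.exp0 file naming, the div_config configuration file (assumed to carry textord_min_linesize and the page segmentation mode) and the exact flags are assumptions, and they may differ between Tesseract 3.x releases.

    import os
    import subprocess

    LANG, FONT = "div", "faruma"
    BASE = f"{LANG}.{FONT}.exp0"   # assumed naming convention from the training guide

    def run(*cmd):
        # Echo and execute one external training command, stopping on failure.
        print(" ".join(cmd))
        subprocess.run(cmd, check=True)

    # Step 2: generate the box file from the training image.
    run("tesseract", f"{BASE}.tif", BASE, "div_config", "batch.nochop", "makebox")

    # Step 3: generate the feature (.tr) file by invoking Tesseract's training mode.
    run("tesseract", f"{BASE}.tif", BASE, "div_config", "box.train")

    # Step 4: extract the character set into the unicharset file.
    run("unicharset_extractor", f"{BASE}.box")

    # Step 6: cluster the shape features; font_properties is the step-5 file.
    run("mftraining", "-F", "font_properties", "-U", "unicharset", f"{BASE}.tr")
    run("cntraining", f"{BASE}.tr")

    # Step 7: prefix the outputs with the language code and combine them.
    for name in ("unicharset", "inttemp", "pffmtable", "normproto"):
        os.rename(name, f"{LANG}.{name}")
    run("combine_tessdata", f"{LANG}.")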
Testing Phase

Following the training, the language data file was then used to perform OCR. However, a separate test dataset was created to measure its performance. This dataset was created using the same approach used to create the training data. The rationale was to ascertain any shortcomings in the initial training approach, especially since there are no previous benchmarks for Dhivehi OCR.

Based on this understanding, the previously considered consonant-vowel pairs were then used to generate images of random Dhivehi text. With machine generated text, one significant advantage was the ability to eliminate certain preprocessing tasks such as noise elimination, skew detection and correction. This would not be the case in a real-world scenario; however, it enabled a more accurate assessment of the Dhivehi OCR performance using the Tesseract OCR Engine (a sketch of the accuracy measurement follows below).
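The paper does not state exactly how the recognition accuracy reported in the next section was calculated; one straightforward way to score OCR output against known ground truth at the character level is shown in the hedged sketch below, using Python's difflib. The sample strings are placeholders, not test data from the experiment.

    import difflib

    def character_accuracy(ground_truth, ocr_output):
        # Proportion of ground-truth characters reproduced by the OCR output,
        # measured via the longest matching blocks found by difflib.
        matcher = difflib.SequenceMatcher(None, ground_truth, ocr_output)
        matched = sum(block.size for block in matcher.get_matching_blocks())
        return matched / len(ground_truth) if ground_truth else 0.0

    # Placeholder Thaana strings standing in for a rendered test line and its OCR result.
    truth = "\u0783\u07a6\u0796\u07b0\u0796\u07ac"
    ocr = "\u0783\u07a6\u0796\u0796\u07ac"
    print(f"character accuracy: {character_accuracy(truth, ocr):.2%}")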
FINDINGS

A total of 239 consonant-vowel pairs were tested, and an accuracy of 69.46% was achieved for recognising the machine generated Dhivehi text. From these tests, two important observations could be made.

Firstly, the correct identifications and the overall accuracy show great potential for using the Tesseract OCR Engine for the recognition of Dhivehi text. Table 1 shows examples of successful recognition.

Secondly, the Tesseract OCR Engine has mistaken the vowels on top of and below the consonants as different lines, and hence tried to identify them as consonants. As can be seen in Table 2 (b) and (d), the bottom line matches the consonants in the test images (a) and (c) respectively. However, the top lines were falsely recognised as consonants, resulting in extra lines of consonants. This issue is discussed in greater detail in the next section.

Table 1: Successful Identifications of Dhivehi Text (Test Image / OCR Text pairs (a)-(d))
Table 2: False Identifications of Dhivehi Text (Test Image / OCR Text pairs (a)-(d))
DISCUSSION

Based on the observations made, it can be said that Dhivehi OCR is possible without reinventing the wheel, by further developing existing technologies such as the Tesseract OCR Engine. However, despite the significant improvements currently made to support multilingual scripts, there are still critical limitations in its design. This is because Tesseract was, at its core, designed to identify English text; hence every new language introduced to it brings new challenges. The same is true for Dhivehi.

In Dhivehi, the baseline (Figure 4) of the letters is at the upper bounds of the majority of consonants in Thaana script. In English, the baseline is where most of the characters sit, but in Dhivehi it is more like where all consonants are hung. Unlike other right-to-left languages such as Arabic, Dhivehi cannot be written without vowels, because they are not just diacritics but are semantically essential components in forming words. Thus, it is absolutely necessary to be able to recognise the vowels as well as the consonants.

Based on the observations, the most significant errors were the result of failing to identify the consonant-vowel pairs. This is due to the nature in which vowels are aligned and spaced out on the top and bottom of the consonants. They are sometimes treated as a separate line during segmentation. By altering the configuration parameters, these errors were reduced but not eliminated. With a lack of detailed documentation of the configuration parameters and their design, it would be premature to deem this a limitation of the Tesseract OCR Engine. It is, however, clear that more research needs to be done on improving this aspect to reach a practical level of accuracy. Minor errors, such as the false recognition of one or two consonants or vowels, can be significantly reduced by the integration of other language tools such as word lists and dictionaries.

The objective of this paper was not to create a production-level language data file that can be practically used for identifying Dhivehi text, but to explore the suitability of a freely available OCR tool, the Tesseract OCR Engine, so that its potential and limitations could be explored. With this in mind, it was trained to a bare minimum without the extra enhancements already available, such as dictionary data and the character ambiguity specifier, unicharambigs ("TrainingTesseract3 - tesseract-ocr", 2013). Further investigations need to be done to determine the performance of integrating these options.
CONCLUSION

There is a significant gap in published literature with regard to Dhivehi OCR. This paper has been a small contribution to the vast body of OCR knowledge, but a significant one within the context of the Dhivehi language. As is evident from the findings, the Tesseract OCR Engine offers promising results in accurately recognising Thaana script. However, there are critical problems that are language specific and must be researched in depth to find viable solutions. One significant area for potential research is in solving the problem of segmentation of consonants and vowels into individual lines. The training and test data were very much constrained to control external influences, hence generating them via code. However, a plethora of problems are yet to be discovered when processing scanned documents of varying fonts or even handwritten text. Thus, future research will be done by expanding the character set to include Arabic extensions and by using scanned documents.

ACKNOWLEDGEMENTS

The author would like to thank Google and all the contributors for continuing support for the Tesseract OCR Engine, without whom this paper would not be possible. The author would also like to thank future researchers of Dhivehi language technologies, which is much needed for the development and sustenance of the Dhivehi language.

REFERENCES

Cheng-Lin, L., Jaeger, S., & Nakagawa, M. (2004). Online recognition of Chinese characters: the state-of-the-art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2), 198-213. doi: 10.1109/TPAMI.2004.1262182

Eikvil, L. (1993). Optical character recognition. citeseer.ist.psu.edu/142042.html.
Elagouni, K., Garcia, C., Mamalet, F., & Sébillot, P. (2013). Text recognition in multimedia documents: a study of two neural-based OCRs using and avoiding character segmentation. International Journal on Document Analysis and Recognition (IJDAR), 1-13. doi: 10.1007/s10032-013-0202-7

Fritz, S. (2002). The Dhivehi language: a descriptive and historical grammar of Maldivian and its dialects: Ergon-Verlag.

Gair, J. W. (2010). Concise encyclopedia of languages of the world (K. Brown & S. Ogilvie Eds.): Access Online via Elsevier.

Google. (2013). TrainingTesseract3 - tesseract-ocr. Retrieved 2013, from http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

Google. (n.d.). tesseract-ocr. Retrieved 2013, from http://code.google.com/p/tesseract-ocr/

Google. (n.d.). Downloads - tesseract-ocr. Retrieved 2013, from http://code.google.com/p/tesseract-ocr/downloads/list

Handel, P. W. (1933). U.S. Patent 1915993.

Kae, A., Smith, D. A., & Learned-Miller, E. (2011). Learning on the fly: a font-free approach toward multilingual OCR. International Journal on Document Analysis and Recognition (IJDAR), 14(3), 289-301. doi: 10.1007/s10032-011-0164-6

Mori, S., Nishida, H., & Yamada, H. (1999). Optical character recognition: John Wiley & Sons, Inc.

Mori, S., Suen, C. Y., & Yamamoto, K. (1992). Historical review of OCR research and development. Proceedings of the IEEE, 80(7), 1029-1058. doi: 10.1109/5.156468

Peng, X., Cao, H., Setlur, S., Govindaraju, V., & Natarajan, P. (2013). Multilingual OCR research and applications: an overview. Paper presented at the Proceedings of the 4th International Workshop on Multilingual OCR, Washington, D.C.

Rice, S. V., Jenkins, F. R., & Nartker, T. A. (1995). The fourth annual test of OCR accuracy. 1995 Annual Report of ISRI, University of Nevada, Las Vegas, 11-50.

Smith, R. (2007). An overview of the Tesseract OCR engine. Paper presented at the Ninth International Conference on Document Analysis and Recognition, 2007 (ICDAR 2007).

Smith, R., Antonova, D., & Lee, D. S. (2009). Adapting the Tesseract open source OCR engine for multilingual OCR. Paper presented at the Proceedings of the International Workshop on Multilingual OCR, Barcelona, Spain.

Smith, R. W. (2009). Hybrid page layout analysis via tab-stop detection. Paper presented at the 10th International Conference on Document Analysis and Recognition, 2009 (ICDAR'09).

Tauschek, G. (1935). U.S. Patent 2026329.