Menoufia University
Faculty of Computers and Information
Department of Information Technology

‫جامعة المنوفية‬

Arabic Optical Character Recognition Using Local Invariant Features

Thesis submitted to the Faculty of Computers and Information, Menoufia University, in partial fulfillment of the requirements for the degree of Master of Computers and Information in [Information Technology]

By

Mohamed Dahi Abdel-Zaher Alkholy
Demonstrator at the Information Technology Department, Faculty of Computers and Information, Menoufia University, Egypt

Supervised By

Prof. Mohiy M. Hadhoud
Professor Emeritus of Communication Engineering, Former Vice President for Undergraduate Studies, Menoufia University [   ]

Dr. Noura Semary
Lecturer, Information Technology Department, Faculty of Computers and Information, Menoufia University [   ]

Menoufia University
Faculty of Computers and Information
Department of Information Technology
‫جامعة المنوفية‬

Arabic Optical Character Recognition Using Local Invariant Features

Thesis submitted to the Faculty of Computers and Information, Menoufia University, in partial fulfillment of the requirements for the degree of Master of Computers and Information in [Information Technology]

By

Mohamed Dahi Abdel-Zaher Alkholy
Demonstrator at the Information Technology Department, Faculty of Computers and Information, Menoufia University, Egypt

Examiner Committee

Prof. Mohiy M. Hadhoud
Professor Emeritus of Communication Engineering, Former Vice President for Undergraduate Studies, Menoufia University [   ]

Prof. Moawad Ibrahim Moawad
Professor Emeritus of Telecommunication Engineering, Former Vice Dean of the Faculty of Electronic Engineering at Menouf, Menoufia University [   ]

Assoc. Prof. Khalid Mohamed Amin
Associate Professor, Information Technology Department, Faculty of Computers and Information, Menoufia University [   ]

2016

AUTHOR BIOGRAPHY

Name: Mohamed Dahi Abdel-Zaher
Occupation: Demonstrator at the Information Technology Department
Occupation Place: Faculty of Computers and Information, Menoufia University
Email: [email protected]
Date of Birth: 1 October 1991
Educational Degree: B.Sc. in Information Technology
Grade: Very Good
Educational Institution: Information Technology, Faculty of Computers and Information, Menoufia University
Date of Graduation: May 2011
M.Sc. Registration Date: June 2014


ACKNOWLEDGEMENT

First and foremost, I give my deep thanks to Allah for giving me the opportunity and the strength to accomplish this work. Many people through the years have helped me reach where I am today: my family, all of my professors and teachers over the years, and my friends have made this educational journey a bearable trip. I would like to thank my supervisors, Prof. Mohiy Hadhoud and Dr. Noura Semary, for their help and support during my work and for creating a research environment that has been very inspiring. This work would not have been possible without their guidance, encouragement, and motivation; they deserve my respect and thanks. I would like to thank my family and my friends, who were a constant source of support and raised my spirits. Thanks go out especially to my father, my mother, my brother, my sister, and my dear wife for their support and encouragement. I extend my sincere thanks and gratitude to our colleague Marwa Rashad for her cooperation with me during this research. I dedicate my success in this work to her soul, asking God that it be written among her good deeds and that He admit her to Paradise.

Finally, special thanks to my faculty, my department, and my colleagues.


ABSTRACT

Arabic Optical Character Recognition (AOCR) is the science of converting images of Arabic text documents, whether typed, printed, or handwritten, into machine-encoded text. The role of OCR is to help or replace humans in computerizing paperwork in order to accelerate and improve the process and to reduce its cost, time, and effort. It also provides the ability to edit documents electronically, store them more compactly, and search them. OCR is not a recent research field; it started about 40 years ago, and the need for it has become increasingly urgent due to the growing volume of paperwork in our societies. Much research has been conducted on AOCR, as the Arabic script is used by over a quarter of the world's population; despite this fact, a robust and reliable AOCR system is still a challenge. This is unlike Latin-script OCR, for which reliable font-written systems have been readily in use for a long time.

This thesis aims to enhance the recognition accuracy of optical printed Arabic characters using local invariant features. A comparative study of four recent algorithms with highly reported recognition accuracy is presented. The algorithms have been evaluated on a proposed computer-generated Primitive Arabic Characters Noise Free (PAC-NF) dataset, since there is no publicly available dataset for primitive printed Arabic text. It contains two models, PAC-NFA and PAC-NFB. The accuracy of the algorithms is evaluated using the Character Recognition Rate (CRR) metric. Results show that one of the four approaches [1] achieved the highest CRR, with an average of 99.36% on PAC-NFA and 75.21% on PAC-NFB. Taking this algorithm as the base technique to be improved, a combination of additional features has been proposed to achieve higher recognition rates. Three types of classifiers were used to test the features (Random Forest, ANN, and SVM); the results showed that the Random Forest classifier achieved the highest CRR. The proposed technique achieved an average CRR of 100% on PAC-NFA and 92.81% on PAC-NFB using the Random Forest classifier. The robustness of the proposed technique against two types of noise (scanning noise and artificial Gaussian noise) was tested, and the results showed that it is more robust to both types of noise than the base technique. Another process, Optical Font Recognition (OFR), has been added to the AOCR system to automate the recognition of omni-font documents.

Keywords: AOCR; OFR; Local Features.
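The CRR metric named above can be illustrated with a short sketch. This is not the thesis's own code; it simply computes the fraction of correctly recognized characters as a percentage, under the assumption that predictions and ground truth are aligned one-to-one (the thesis's handling of rejects or segmentation errors is not shown here).

```python
def character_recognition_rate(predicted, ground_truth):
    """CRR = correctly recognized characters / total characters, as a percentage.

    Illustrative sketch of the metric; assumes one prediction per
    ground-truth character.
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("expected one prediction per ground-truth character")
    correct = sum(p == g for p, g in zip(predicted, ground_truth))
    return 100.0 * correct / len(ground_truth)

# Example: 4 of 5 isolated characters recognized correctly -> 80% CRR
print(character_recognition_rate(list("جحخعغ"), list("جحخعف")))
```

A dataset-level CRR, as reported in the results chapters, would simply average this rate over all test samples.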


SUMMARY

AOCR is an important real-world discipline, given the fast pace of our information society. Many AOCR applications (such as Tesseract, Sakhr, Readiris, Verus, OmniPage, TextPert, ICRA, and Al-Qari al-Ali) are used to help or replace humans in computerizing paperwork in order to accelerate and improve the process and to reduce its cost, time, and effort. They provide the ability to edit documents electronically, store them more compactly, and search them. These applications require a high recognition accuracy rate to satisfy user requirements. OCR systems for Latin and many other scripts have reached satisfactory recognition accuracy. On the other hand, a group of Arabic OCR techniques has been proposed to raise recognition accuracy to satisfactory levels, but some drawbacks remain. The major challenge is to distinguish between the different Arabic character shapes across different font types while achieving high recognition rates.

In this thesis, a Primitive Arabic Characters Noise Free (PAC-NF) dataset is generated for the purpose of evaluating and testing feature extraction systems. A comparative study is then performed to choose the base technique. After that, a combination of statistical features is proposed to increase the recognition accuracy. The proposed features are evaluated using three types of classifiers, and their robustness against scanning noise and artificial Gaussian noise is tested. Moreover, an automatic font-type recognition process is proposed by adding an Optical Font Recognition (OFR) stage before the traditional OCR stages.

The organization of the thesis is as follows:
Chapter one: Introduces the Arabic language, starting with the history of the AOCR process; the research objectives and thesis contributions are then given.
Chapter two: Presents a literature review that provides definitions, context, and a clearer understanding of previous research in printed Arabic text recognition.
Chapter three: Introduces the proposed dataset generated for evaluating feature extraction systems, the PAC-NF dataset.
Chapter four: Presents a recognition rate evaluation using different approaches through a comparative study of four recent primitive AOCR algorithms.
Chapter five: Presents a detailed description of the proposed enhanced AOCR features for increasing recognition accuracy, and studies their robustness against noise.
Chapter six: Presents a detailed description of the proposed system for automating optical font recognition in an AOCR system.
Chapter seven: Concludes the thesis and provides guidelines for future work in the area of AOCR.
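The artificial Gaussian-noise robustness test described for Chapter five can be sketched as follows. This is an illustrative example, not the thesis's code: the variance levels match those reported in the experiments, but the image representation (a NumPy array with pixel values in [0, 1]) and the clipping behavior are assumptions.

```python
import numpy as np

def add_gaussian_noise(image, variance, seed=None):
    """Add zero-mean Gaussian noise of the given variance to an image
    with pixel values in [0, 1], clipping the result back to [0, 1].
    Sketch only: the thesis may use a different noise model or range."""
    rng = np.random.default_rng(seed)
    noisy = image + rng.normal(0.0, np.sqrt(variance), image.shape)
    return np.clip(noisy, 0.0, 1.0)

# Variance levels used in the robustness study
variances = [0.01, 0.02, 0.03, 0.04, 0.05, 0.07, 0.1, 0.13, 0.17, 0.2]

# Toy binary "character" image: white background, black vertical stroke
char = np.ones((16, 16))
char[4:12, 7:9] = 0.0

for v in variances:
    noisy = add_gaussian_noise(char, v, seed=0)
    # A real robustness test would run feature extraction and
    # classification here and record the CRR at each variance level.
```

Comparing the CRR of the base and proposed techniques at each variance level yields curves like those reported in Chapter five.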


TABLE OF CONTENTS

AUTHOR BIOGRAPHY
ACKNOWLEDGEMENT
ABSTRACT
SUMMARY
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
LIST OF FONTS
CHAPTER 1 INTRODUCTION
  1.1 Introduction
      Arabic Language
      Arabic OCR
  1.2 Research Objective
  1.3 Research Contribution
  1.4 Thesis Organization
CHAPTER 2 LITERATURE REVIEW
  2.1 Arabic Optical Character Recognition
  2.2 Datasets
  2.3 Comparative Studies
  2.4 Omni Font Recognition Systems
CHAPTER 3 PROPOSED DATASETS
  3.1 Primitive Arabic Characters Noise Free Dataset
      PAC-NFA
      PAC-NFB
  3.2 Primitive Arabic Characters Optically Scanned Dataset
  3.3 Dataset Description
CHAPTER 4 CHARACTER RECOGNITION RATE EVALUATION USING DIFFERENT APPROACHES
  4.1 Compared Techniques
      First Approach
      Second Approach
      Third Approach
      Fourth Approach
  4.2 Experimental Setup
  4.3 Results and Discussion
CHAPTER 5 PROPOSED ENHANCED AOCR ACCURACY
  5.1 Introduction
  5.2 Feature Extraction
  5.3 Feature Selection
  5.4 Classification
  5.5 Results and Discussion
CHAPTER 6 PROPOSED AUTOMATED FONT RECOGNITION IN AOCR SYSTEMS
  6.1 Proposed Automated System
  6.2 Font Recognition
  6.3 Practical Test
      Experimental Setup
      Results and Discussion
CHAPTER 7 CONCLUSION AND FUTURE WORK
  7.1 Conclusion
  7.2 Future Work
REFERENCES
PUBLICATIONS

LIST OF FIGURES

Figure 1-1: Arabic read/write direction
Figure 1-2: Dots characteristic in Arabic characters
Figure 1-3: Overlapping example
Figure 1-4: Diacritics
Figure 1-5: Subwords or PWS
Figure 1-6: General AOCR system processes
Figure 3-1: Andalus font type optically scanned sample
Figure 4-1: Horizontal and vertical transitions
Figure 4-2: Image divided into 4 regions (1, 2, 3, 4)
Figure 4-3: A grid of 3×3 SIFT descriptors of the letter Taa in isolated form
Figure 4-4: Comparative study results on PAC-NFA
Figure 4-5: Comparative study results on PAC-NFB
Figure 5-1: Proposed system processes
Figure 5-2: Horizontal and vertical transitions
Figure 5-3: Dividing the image into four regions
Figure 5-4: The center of mass of the letter Jeem, marked by the (red) dot
Figure 5-5: The crosshair of the letter Jeem, marked by the (red) dot
Figure 5-6: The top outline of the letter Jeem, marked by the semi-filled (purple) dots
Figure 5-7: Horizontal black ink histogram feature
Figure 5-8: Vertical black ink histogram feature
Figure 5-9: Proposed system processes
Figure 5-10: Proposed feature selection flowchart
Figure 5-11: Results of proposed technique on PAC-NFA using Random Forest classifier
Figure 5-12: Results of proposed technique on PAC-NFB using Random Forest classifier
Figure 5-13: Results of proposed technique on PAC-NFA using SVM classifier
Figure 5-14: Results of proposed technique on PAC-NFB using SVM classifier
Figure 5-15: Results of proposed technique on PAC-NFA using ANN classifier
Figure 5-16: Results of proposed technique on PAC-NFB using ANN classifier
Figure 5-17: Results of proposed technique on PAC-OS dataset (600 DPI)
Figure 5-18: Results of proposed technique on PAC-OS dataset (400 DPI)
Figure 5-19: Results of proposed technique on PAC-OS dataset (300 DPI)
Figure 5-20: Results of proposed technique on PAC-OS dataset (200 DPI)
Figure 5-21: Results of proposed technique on PAC-OS dataset (150 DPI)
Figure 5-22: Results of proposed technique on PAC-OS dataset (100 DPI)
Figure 5-23: Results of proposed technique on PAC-OS dataset (75 DPI)
Figure 5-24: CRR-DPI relationship for F3 (Andalus) font type
Figure 5-25: Andalus (F5) Jeem character image: (a) raw image; (b) through (k) Gaussian noisy images with variance 0.01, 0.02, 0.03, 0.04, 0.05, 0.07, 0.1, 0.13, 0.17, and 0.2, respectively
Figure 5-26: Effect of adding Gaussian noise (variance = 0.01)
Figure 5-27: Effect of adding Gaussian noise (variance = 0.02)
Figure 5-28: Effect of adding Gaussian noise (variance = 0.03)
Figure 5-29: Effect of adding Gaussian noise (variance = 0.04)
Figure 5-30: Effect of adding Gaussian noise (variance = 0.05)
Figure 5-31: Effect of adding Gaussian noise (variance = 0.07)
Figure 5-32: Effect of adding Gaussian noise (variance = 0.1)
Figure 5-33: Effect of adding Gaussian noise (variance = 0.13)
Figure 5-34: Effect of adding Gaussian noise (variance = 0.17)
Figure 5-35: Effect of adding Gaussian noise (variance = 0.2)
Figure 5-36: Effect of adding artificial noise on the base algorithm versus the proposed algorithm for F3 (Andalus) font type
Figure 5-37: Consumed time in the feature extraction process
Figure 6-1: Proposed AOCR system with OFR
Figure 6-2: General OFR system process
Figure 6-3: Test image of Andalus font type (Italic)
Figure 6-4: Reference images of Andalus font type: (a) Normal, (b) Bold, (c) Italic, (d) Bold Italic
Figure 6-5: Matched descriptors of Tahoma (F11) test images with reference font types
Figure 6-6: Matched descriptors of Traditional Arabic (F12) test images with reference font types
Figure 6-7: Matched descriptors of Andalus (F3) test images with reference font types
Figure 6-8: Matched descriptors of DecoType Naskh (F7) test images with reference font types
Figure 6-9: Matched descriptors of DecoType Thuluth (F8) test images with reference font types
Figure 6-10: Matched descriptors of M Unicode Sara (F9) test images with reference font types
Figure 6-11: Matched descriptors of Simplified Arabic (F10) test images with reference font types
Figure 6-12: Matched descriptors of Akhbar MT (F1) test images with reference font types
Figure 6-13: Matched descriptors of Ariel (F2) test images with reference font types
Figure 6-14: Matched descriptors of Diwani Letter (F6) test images with reference font types
Figure 6-15: Matched descriptors of Arabic Transparent (F4) test images with reference font types
Figure 6-16: Matched descriptors of Advertising Bold (F5) test images with reference font types
Figure 6-17: Matched descriptors of Tahoma (F11) optically scanned test image with reference font types
Figure 6-18: Matched descriptors of Andalus (F3) optically scanned test image with reference font types
Figure 6-19: Matched descriptors of DecoType Thuluth (F8) optically scanned test image with reference font types
Figure 6-20: Matched descriptors of Traditional Arabic (F12) optically scanned test image with reference font types
Figure 6-21: Matched descriptors of Simplified Arabic (F10) optically scanned test image with reference font types
Figure 6-22: Matched descriptors of Akhbar MT (F1) optically scanned test image with reference font types
Figure 6-23: Matched descriptors of Advertising Bold (F5) optically scanned test image with reference font types
Figure 6-24: Matched descriptors of M Unicode Sara (F9) optically scanned test image with reference font types
Figure 6-25: Matched descriptors of DecoType Naskh (F7) optically scanned test image with reference font types
Figure 6-26: Matched descriptors of Diwani Letter (F6) optically scanned test image with reference font types
Figure 6-27: Matched descriptors of Arabic Transparent (F4) optically scanned test image with reference font types
Figure 6-28: Matched descriptors of Ariel (F2) optically scanned test image with reference font types

LIST OF TABLES

Table 1-1: Arabic language primitives
Table 1-2: Dots count and existence in Arabic characters
Table 1-3: Characters connectivity and interrupting
Table 1-4: Ligatures
Table 3-1: Font types used in PAC-NFA
Table 3-2: Arabic language characters in all positions
Table 3-3: Font types used in PAC-NFB
Table 3-4: Datasets description
Table 4-1: Neural network training parameters
Table 5-1: Proposed ANN training parameters
Table 5-2: Computed out-of-bag error for features
Table 5-3: Proposed technique improvement accuracy across all tests for F3 font type
Table 6-1: Misclassified assignments of font types in optically scanned test samples

LIST OF ABBREVIATIONS

Abbreviation  Term
AOCR          Arabic Optical Character Recognition
APTI          Arabic Printed Text Images
CRR           Character Recognition Rate
CEDAR         Center of Excellence for Document Analysis and Recognition
DPI           Dots Per Inch
ERIM          Environmental Research Institute of Michigan
FFT           Fast Fourier Transform
FDs           Fourier Descriptors
GRUHD         Greek Unconstrained Handwriting
HCC           Haar Cascade Classifier
HMM           Hidden Markov Model
ICR           Intelligent Character Recognition
IRONOFF       IRESTE ON/OFF
KNN           K-Nearest Neighbors
KPCA          Kernel Principal Component Analysis
KFDA          Kernel Fisher Discriminant Analysis
LDC           Linguistic Data Consortium
LDF           Linear Discriminant Function
LVQ           Learning Vector Quantization
MFS           Modified Fourier Spectrum
MCR           Minimum Covering Run
NN            Neural Network
NIST          National Institute of Standards and Technology
OFR           Optical Font Recognition
OCR           Optical Character Recognition
PATS          Printed Arabic Text Set
PPM           Pages Per Minute
PAC-NF        Printed Arabic Characters - Noise Free
PCA           Principal Component Analysis
QDF           Quadratic Discriminant Function
RAST          Recognition by Adaptive Subdivision of Transformation Space
RDA           Regularized Discriminant Analysis
RBF           Gaussian Radial Basis Function
SCSI          Small Computer System Interface
SVM           Support Vector Machines
SIFT          Scale Invariant Feature Transform
SRF           Sobel and Robert Features
SOCR          SIFT-based OCR
UNIPEN        NicIcon database of Iconic Pen

LIST OF FONTS

Font Abbreviation  Font Name
F1                 Akhbar MT
F2                 Ariel
F3                 Andalus
F4                 Arabic Transparent
F5                 Advertising Bold
F6                 Diwani Letter
F7                 DecoType Naskh
F8                 DecoType Thuluth
F9                 M Unicode Sara
F10                Simplified Arabic
F11                Tahoma
F12                Traditional Arabic

Chapter 1
INTRODUCTION

OCR stands for Optical Character Recognition (or Optical Character Reader). It is a research field in computer vision, artificial intelligence, and pattern recognition: the science of converting images of text documents, whether typed, printed, or handwritten, into machine-encoded text. It is widely used as a form of data entry from paper documents such as passports, bank statements, invoices, computerized receipts, mail, printouts of static data, business cards, and other suitable documentation. It is a common method of digitizing paper documents so that they can be stored more compactly, edited, displayed online, searched, and used in machine processes such as text-to-speech, machine translation, text mining, and key data extraction.

The beginnings of optical character recognition are tied to the invention of the retina scanner in 1870 [2] and to technologies involving telegraphy and devices created to help the blind read. The first OCR was proposed in 1914 [3] by Emanuel Goldberg, who developed a machine that read characters and converted them into standard telegraph code. On January 13, 1976, Ray Kurzweil [3] created a reading machine for the blind, which allowed blind people to have a computer read text to them out loud. A research group headed by A. G. Ramakrishnan has also developed the PrintToBraille tool, which can be used with any OCR to convert scanned images of printed books into Braille books [3]. In the early twenty-first century, OCR became available online as a service (WebOCR), and it is now found in mobile applications on smartphones and in the cloud-computing environment. Recently, various open-source and commercial OCR systems have become available for most common writing systems, including Arabic, Latin, Cyrillic, Bengali, Hebrew, Indic, Devanagari, Tamil, Japanese, Korean, and Chinese characters. OCR has also branched into a more advanced form called ICR (Intelligent Character Recognition). In other words, ICR refers specifically to handwriting recognition systems that allow different fonts and styles of handwriting to be learned by the computer during processing, improving recognition accuracy.

AOCR (Arabic Optical Character Recognition) started about 40 years ago, with Ahmed Nazif et al. [4] in 1975. Arabic text recognition, which has not been researched as thoroughly as Latin, Chinese, or Japanese, is receiving more attention from both Arabic- and non-Arabic-speaking researchers. Irrespective of the language under consideration, some traditional applications of text recognition include check verification, office automation, reading postal addresses, writer identification, and signature verification. Searching scanned documents available on the internet and searching Arabic historical manuscripts are also emerging applications. For Arabic, the need to advance each of these applications is pressing, as there is a lack of real applications in these areas, largely due to the cursive script of the Arabic language.

1.1 Introduction

This section presents a short general introduction to the properties of the Arabic language and to Arabic Optical Character Recognition (AOCR) systems, their types, and available software products.

Arabic Language

Arabic characters are used in many languages. Arabic is the first language of more than 400 million people in the world, and it is used as a second language by more than triple that number of Muslims worldwide, since it is the language in which the Holy Qur'an was revealed. Arabic was added to the official languages of the United Nations in 1973 as the sixth language; the other five official languages (Chinese, English, French, Russian, and Spanish) were chosen when the United Nations was founded. As reported by National Geographic, Arabic is also expected to be one of the five major languages by 2050. Arabic is one of the Semitic languages. Languages such as Urdu, Persian (Farsi), the Sorani and Luri dialects of Kurdish, Jawi, Pashto, Sindhi, Hausa, Kashmiri, Kazakh, Kyrghyz, Malay, Morisco, Punjabi, Tatar, Turkish, and Uyghur use Arabic characters with slight differences; see [5-7].

The properties of the Arabic language can be listed, according to [5] [6] [7], as follows:

a) The Arabic language consists of 28 characters (see Table (1-1)).
b) Multiple grapheme cases: as mentioned before, the Arabic script is cursive (connected). This characteristic makes the characters context sensitive: each character changes its form and has multiple variants according to its position in the word (Start, Middle, End, Isolated) (see Table (1-1)).

Table 1-1 : Arabic language primitives

Character name | Arabic name | Isolated | Connected end | Middle | Start
Alif  | الف | ا | ـا | - | -
Baa   | باء | ب | ـب | ـبـ | بـ
Taa   | تاء | ت | ـت | ـتـ | تـ
Thaa  | ثاء | ث | ـث | ـثـ | ثـ
Jeem  | جيم | ج | ـج | ـجـ | جـ
Haa   | حاء | ح | ـح | ـحـ | حـ
Khaa  | خاء | خ | ـخ | ـخـ | خـ
Daal  | دال | د | ـد | - | -
Thaal | ذال | ذ | ـذ | - | -
Raa   | راء | ر | ـر | - | -
Zaay  | زاي | ز | ـز | - | -
Seen  | سين | س | ـس | ـسـ | سـ
Sheen | شين | ش | ـش | ـشـ | شـ
Saad  | صاد | ص | ـص | ـصـ | صـ
Dhaad | ضاد | ض | ـض | ـضـ | ضـ
Ttaa  | طاء | ط | ـط | ـطـ | طـ
Dthaa | ظاء | ظ | ـظ | ـظـ | ظـ
Ain   | عين | ع | ـع | ـعـ | عـ
Ghen  | غين | غ | ـغ | ـغـ | غـ
Faa   | فاء | ف | ـف | ـفـ | فـ
Qaf   | قاف | ق | ـق | ـقـ | قـ
Kaf   | كاف | ك | ـك | ـكـ | كـ
Lam   | الم | ل | ـل | ـلـ | لـ
Mem   | ميم | م | ـم | ـمـ | مـ
Noon  | نون | ن | ـن | ـنـ | نـ
Haa   | هاء | ه | ـه | ـهـ | هـ
Wow   | واو | و | ـو | - | -
Yaa   | ياء | ي | ـي | ـيـ | يـ

c) Arabic language is written and read from right to left (see Fig(1-1)).

Figure 1-1 Arabic Read/Write direction

d) Dots: an Arabic character consists of two parts:
i. Grapheme: the main structure of the character. Multiple characters can share the same grapheme (see Fig (1-2)).

Figure 1-2 Dots characteristic in Arabic characters


ii. Dots: many characters share the same grapheme but differ in the number of dots; for example, (Baa, Taa, Thaa) share the same structure but not the same number of dots (see Fig (1-2)). Dotting is a significant source of confusion in AOCR systems, especially with noisy scanned documents. Fifteen characters in the language carry dots: ten characters have one dot, three characters have two dots, and two characters have three dots (see Table (1-2)).

Table 1-2 : Dots count and existence in Arabic characters

Arabic name | Latin name | Form | Number of dots
باء | Baa | ب | 1
جيم | Jeem | ج | 1
خاء | Khaa | خ | 1
ذال | Thaal | ذ | 1
زاى | Zaay | ز | 1
ضاد | Dhaad | ض | 1
ظاء | Dthaa | ظ | 1
غين | Ghen | غ | 1
فاء | Faa | ف | 1
نون | Noon | ن | 1
تاء | Taa | ت | 2
قاف | Qaf | ق | 2
ياء | Yaa | ي | 2
ثاء | Thaa | ث | 3
شين | Sheen | ش | 3

e) Connectivity: unlike Latin script, Arabic characters are connected (cursive) within the same word. This connection can be interrupted in the middle of a word by a few specific characters (أ, د, ذ, ر, ز, و) (see Table (1-3)).


Table 1-3 : Characters connectivity and interrupting

Word | Primitives
متصلة | ة+ل+ص+ت+م
أصوات | ت+ا+و+ص+أ
بذل | ل+ذ+ب
إزاء | ء+ا+ز+إ
تاريخ | خ+ي+ر+ا+ت
أبجدية | ة+ي+د+ج+ب+أ
عربية | ة+ي+ب+ر+ع

f) Ligatures: depending on the font type, several characters may be compounded together at certain positions in the word and represented by a single atomic grapheme, called a ligature. For example, lam-alif (أل) is a combination of lam (ل) and alif (أ) (see Table (1-4)); for more details see [5].

Table 1-4 : Ligatures

Ligature | Characters
لم | م+ل
ال | ا+ل
لحـ | ح+ل
لجـ | ج+ل
بحـ | ج+ب
بمـ | م+ب
لمحـ | ح+م+ل
ممـ | م+م
لهـ | ه+ل
بى | ى+ب
محـ | ح+م
لى | ى+ل
حمـ | م+ح
بم | م+ب
سمـ | م+س

g) Overlapping: characters in the same word may overlap vertically without touching. Notice in Fig (1-3) the vertical overlap between the (Wow) character and the following character (Alif).

Figure 1-3 Overlapping example

h) Size variation: Arabic words differ from English words in the structured size of their characters. In English, the characters of a single word have a fairly uniform width and height, while in Arabic a single word does not have a fixed width or height.
i) Diacritics: diacritics, called (Tashkyl), are used for correct and standard pronunciation; they are generally written only in practice (educational) texts (see Fig (1-4)).

Figure 1-4 Diacritics

j) Subword(s): some Arabic words are composed of sub-words, or PWs (pieces of words). A sub-word boundary exists when the middle of the word contains one of the following characters: (أ, د, ذ, ر, ز, و). For example (see Fig (1-5)), the word (فأسقيناكموه) contains four sub-words or PWs: (فأ, سقينا, كمو, ه).


Figure 1-5 Subwords or PWS
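The sub-word rule above can be sketched in a few lines of Python. This is a simplified illustration rather than the thesis's implementation: it treats both alif forms (ا, أ) together with د, ذ, ر, ز, و as non-connecting characters, and it ignores hamza variants, ligatures, and diacritics.

```python
# Characters that do not connect to the following letter; a sub-word
# ends immediately after any of them (both alif forms included).
NON_CONNECTORS = set("اأدذرزو")

def subwords(word):
    """Split an Arabic word into its sub-words (pieces of words)."""
    parts, current = [], ""
    for ch in word:
        current += ch
        if ch in NON_CONNECTORS:   # connection interrupted here
            parts.append(current)
            current = ""
    if current:                    # trailing connected piece, if any
        parts.append(current)
    return parts

print(subwords("فأسقيناكموه"))  # ['فأ', 'سقينا', 'كمو', 'ه']
```

Running it on the word from Fig (1-5) reproduces the four sub-words listed above.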

These are considered the most important general characteristics; for additional characteristics of Urdu (a special case of Arabic script), see [5]. In this work, omni-font Arabic text primitives written in different Arabic font types are considered. The proposed technique is intended to achieve higher recognition accuracy rates.

Arabic OCR

This section presents an introduction to Arabic optical character recognition, including detailed information about AOCR types as well as the general AOCR system processes.

1.1.2.1 Arabic OCR Types

1) Machine Printed
Also called the typewritten or computer-generated style, it uses the same writing style for all Arabic characters. It is the simplest among all styles because of the uniformity in writing a word.


2) Handwritten
This is assumed to be the most difficult style because of variations in character shape, even when the text is rewritten by the same person. The handwritten style may be further classified into scribe, personal, and decorative. Scribe writing is more careful than the personal handwriting style, which represents the daily use of the Arabic alphabet by individuals; few people can produce an exquisite scribe script. Scribe handwriting is certainly different from decorative handwriting, which is normally used for adornment. Note that vocabulary size and writer dependency both affect this style and can lead to recognition errors.

3) Typeset
This style is normally used to print books, journals, magazines, announcements, and newspapers. The typeset style is generally more difficult than the machine-printed style because of the existence of overlaps and ligatures, which pose a challenging problem. Ligatures occur when two or more letters overlap vertically and touch; by contrast, overlaps occur when two or more letters overlap vertically without touching. Recently, some computer-generated fonts have imitated the typeset style by providing ligatures and overlaps.

1.1.2.2 Arabic OCR System

According to recent research, the general AOCR system processes can be listed, as shown in Fig (1-6), as follows:

1) Image acquisition

This process includes acquiring the text image from paper through any type of acquisition tool (e.g., a scanner or camera). It generally means transforming the text from analog form into digitized form. In printed OCR there is only offline recognition; online recognition exists only for the handwritten type. The output of this process is a raw image containing the text to be recognized.


Figure 1-6 General AOCR System processes

This is the first step in the recognition system. The objective is to acquire the text and transform it into a digitized raster image. In terms of how input is acquired, there are two types of character recognition systems: online and offline. An online recognition system recognizes the text as it is being written. The preferred input device is an electronic tablet with a stylus pen; the tablet captures the (x, y) coordinate data of pen-tip movement, typically with a resolution of 10 points/mm, a sampling rate of 100 points/s, and an indication of pen-up and pen-down. Online recognition has two major advantages: high recognition accuracy and interactivity. First, online recognition captures a character as a set of strokes, each represented by a series of coordinate points. Second, it is very natural for the user to detect and correct unrecognized characters immediately by verifying the recognition results as they appear. On the other hand, online recognition is limited to recognizing handwritten text.

An offline recognition system recognizes the text after it has been written or typed. The system may acquire the text using a video camera or a scanner. The scanner is more commonly used because it is more convenient, it introduces less noise into the imaging process, extra features such as automatic binarization and image enhancement can be coupled with the scanning process to enhance the resulting text image, and, most importantly, it is more relevant to the problem of recognizing written script. For document management applications, the aim is to push the scanning process to maximum speed. Most scanners can run at 600 dots per inch (dpi); scanners designed with a high-volume document feeder and a high-throughput Small Computer System Interface (SCSI) can process up to 85 pages per minute (ppm). Low resolution and poor binarization can reduce readability when essential features of characters are deleted or obscured. The resulting image can also be affected by markings or stains, or by the document having been faxed or copied several times; the latter causes diminished contrast, the appearance of 'salt and pepper' noise, and falsely rendered text that becomes either thinner or thicker than in the original document.

Binarization, or thresholding, is the conversion of a grey-level image into a bi-level image. A bi-level image contains all of the essential information about the number, position, and shape of objects while carrying less data. The simplest method is to select a threshold value: all pixels with a grey level below the threshold are classified as black, and those above it as white. The threshold must be determined from the pixel values found in the image, for example by using the mean grey level of the image as the threshold. Another method uses the histogram of grey levels in the image. Given a histogram and a desired percentage of black pixels, one determines the target number of black pixels by multiplying the percentage by the total number of pixels, then counts the pixels in the histogram bins, starting at bin 0, until the count is greater than or equal to the target number. The threshold is the grey level associated with the last bin counted. Other approaches, such as using edge pixels, iterative selection, or entropy, can also be applied. In the edge-pixel method, the threshold is found by first computing the Laplacian of the input image and then selecting the pixels with large Laplacian values. In the iterative selection method, the threshold is first calculated as the average grey value.


The average values of the object and the background classes are iteratively calculated, and the mean of these two values represents the new threshold. The entropy method treats the image as a source of information. Each of the entropies associated with the black pixels and the white pixels is weighted with a calculated probability. The threshold is found by maximizing a predefined equation based on the implemented algorithm.
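The histogram-based p-tile method and the iterative selection method described above can be sketched as follows. This is a minimal illustration on a flat list of grey values rather than a production binarizer:

```python
def ptile_threshold(pixels, black_fraction, levels=256):
    """P-tile method: return the grey level at which the cumulative
    histogram first reaches the desired share of black pixels."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    target = black_fraction * len(pixels)
    count = 0
    for level, n in enumerate(hist):
        count += n
        if count >= target:
            return level          # the last bin counted
    return levels - 1

def iterative_threshold(pixels):
    """Iterative selection: start from the mean grey value, then set
    the threshold to the mean of the object and background averages
    until it stabilizes."""
    t = sum(pixels) / len(pixels)
    while True:
        black = [p for p in pixels if p <= t]
        white = [p for p in pixels if p > t]
        if not black or not white:
            return t
        new_t = (sum(black) / len(black) + sum(white) / len(white)) / 2
        if abs(new_t - t) < 0.5:
            return new_t
        t = new_t

# Bimodal toy "image": dark ink pixels near 40, light paper near 200.
pixels = [40, 42, 38, 41, 200, 198, 202, 199, 201, 203]
print(ptile_threshold(pixels, 0.4))            # -> 42
print(40 < iterative_threshold(pixels) < 200)  # -> True
```

On this bimodal sample both methods place the threshold between the two grey-level clusters, which is the behaviour the text above describes.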

2) Preprocessing

This process includes several steps that prepare the image so that it is better suited for recognition. Some of these steps are mandatory and some are optional, depending on the system. They may involve binarization, filtering and smoothing, slant correction, skew detection and correction, and thinning (skeletonization). The output of this process is an image that is optimally, or nearly optimally, noise free. Since an OCR system depends on both the quality of the original document and the quality of the registered image, the preprocessing stage attempts to compensate for poor-quality originals and/or poor-quality scanning by reducing both noise and data variation. All image acquisition processes are subject to noise of some type, so there is no ideal situation in which no noise is present. Noise can neither be predicted nor measured accurately from a noisy image; instead, it may be characterized by its effect on the image.

Smoothing: this reduces the noise in an image using mathematical morphology operations. Two operations are mainly used: Opening and Closing. Opening widens small gaps or spaces between touching objects in an image; this breaks narrow isthmuses and eliminates small islands. In contrast, Closing fills small gaps in an image; this eliminates small holes on the contour. Both Opening and Closing apply the same basic morphology operations, namely Dilation and Erosion, but in opposite order.

Skew detection and correction: one of the first steps in attempting to read a document is to estimate the orientation angle of the text lines, the skew angle. This process is called skew detection, and the process of rotating the document by the skew angle in the opposite direction is called skew correction. The common, and perhaps the most efficient, approach to estimating the skew angle is the Hough Transform, a method for detecting straight lines in a raster image.

Document decomposition (this part is more related to the segmentation process, but some researchers classify it under preprocessing): the document decomposition and structural analysis task can be divided into three phases [8]. Phase one is block segmentation, where the document is decomposed into several rectangular blocks, each a homogeneous entity containing one of the following: text, an image, a diagram, or a table. Phase two is block classification: each block is assigned a label (title, regular text, picture, table, etc.) using properties of the individual blocks from phase one. Phase three is the logical grouping and ordering of the blocks. For OCR, the focus is on text blocks. Work on Arabic has been limited to text documents, so document decomposition here means the separation of text lines and the segmentation of words and sub-words. The classical method for identifying text lines in an Arabic text image is to use a fixed threshold to separate pairs of consecutive lines [9] [10]; this threshold is obtained from the distances between the various baselines of the text, the median of the distance values being an appropriate choice.

Slant normalization: this problem is most clearly seen in handwritten words, although machine-printed words in italic fonts suffer from it as well. Kim and Govindaraju [11] [12] proposed an algorithm to correct the slant angle in which vertical and near-vertical lines are extracted by tracing chain-code components using a pair of one-dimensional filters, each a five-element array of different weights.
A convolution operation between the filter and five consecutive components is applied by sliding the filter one component at each iteration.

Thinning and skeletonization: these operations produce the skeleton, which is presumed to represent the shape of the object in a relatively small number of pixels, all of which are structural and roughly equidistant from two or more contour points. Thinning algorithms may be classified into parallel and sequential. Parallel algorithms operate on all pixels simultaneously; sequential algorithms, in contrast, examine pixels and transform them depending on the results of the preceding processing. In both cases the approach is to remove boundary pixels of the character that are essential neither for preserving the connectivity of the pattern nor for representing any significant geometrical feature of it. The process converges when the connected skeleton no longer changes or vanishes, even if iteration continues. A recapitulation of different preprocessing techniques for Urdu, Arabic, Persian, and Jawi is given in [13]. The authors employed spatial max and median filters, histogram equalization, and a frequency-domain Gaussian low-pass filter with the objective of enhancing dark and noisy Urdu documents for the later steps of segmentation and feature extraction.

Page orientation is defined as the printing direction of the text lines with the upright position of characters in a document (whether in portrait or landscape mode). The angle that the text lines make with the horizontal direction in a digital text image is called the skew angle of the document. Proper page orientation and accurate skew correction enable better document analysis, for example using a discriminative learning approach such as convolutional neural networks. In an OCR system, layout analysis is another key preprocessing step for effective text-line extraction and reading-order determination; one can rely on an existing system for Roman script, such as Recognition by Adaptive Subdivision of Transformation Space (RAST) [14].
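The Opening and Closing operations used in the smoothing step can be sketched on a binary image stored as nested 0/1 lists. This is a minimal illustration, not the thesis's implementation; it assumes a 3x3 square structuring element clipped at the image border:

```python
def dilate(img):
    """Binary dilation: a pixel becomes 1 if any 3x3 neighbour is 1."""
    h, w = len(img), len(img[0])
    return [[int(any(img[y + dy][x + dx]
                     for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                     if 0 <= y + dy < h and 0 <= x + dx < w))
             for x in range(w)] for y in range(h)]

def erode(img):
    """Binary erosion: a pixel stays 1 only if every in-bounds 3x3
    neighbour is 1 (border neighbourhoods are clipped)."""
    h, w = len(img), len(img[0])
    return [[int(all(img[y + dy][x + dx]
                     for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                     if 0 <= y + dy < h and 0 <= x + dx < w))
             for x in range(w)] for y in range(h)]

def opening(img):   # erosion then dilation: removes small islands
    return dilate(erode(img))

def closing(img):   # dilation then erosion: fills small holes
    return erode(dilate(img))

# A ring-shaped blob with a one-pixel hole, plus one isolated noise pixel.
img = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0],
]
print(closing(img)[2][2])   # hole filled -> 1
print(opening(img)[3][6])   # lone noise pixel removed -> 0
```

Note that Opening also erases one-pixel-wide strokes such as the thin ring in this toy image, which is why smoothing parameters must be chosen with the stroke width of the script in mind.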

3) Segmentation

A scanned text document is usually a collection of paragraphs, each of which is a collection of sentences, and sentences contain connected or partially connected character strings. After the layout preprocessing (orientation detection, skew elimination, and layout analysis), segmentation is carried out. Segmentation refers to the separation of paragraphs, text lines, words, characters, and strokes for effective feature extraction. It is a challenging task, especially in cursive-script OCR: cursiveness is the main obstacle facing any Arabic text recognition system, whether online or offline, and it directly affects the subsequent stages of feature extraction and classification. Segmenting online Arabic handwriting is much simpler than segmenting offline machine-printed Arabic words; this simplicity motivated work on developing an algorithm to restore the temporal information in offline Arabic handwriting. Systems differ in whether they require segmentation as a main process; accordingly, Arabic text recognition systems can be categorized, relative to their approach to word segmentation, into segmentation-based systems and segmentation-free systems:

a) Segmentation-free systems: the global segmentation approach, also called the segmentation-free or holistic approach, in which segmentation is required only to divide the text into words. This scheme of text recognition is motivated by discoveries in psychological studies of the human reading process: it attempts to recognize the whole representation of a word without trying to segment and recognize characters or primitives individually. The approach was originally introduced for speech recognition. One approach to word-level Arabic recognition was to describe the word shape with a unique vector of features; this feature vector could then be matched against a database of analogous feature vectors, or represented in attribute/value form to an inductive learning system. Another approach was based on marking the locations at which one structuring element fits within the pixel set corresponding to a shape of interest while another structuring element lies outside the pixel set; shape primitives located across the whole page were then combined into characters. Yet another approach chose the text line as the major unit for training and recognition. Once a page was decomposed into text lines, the horizontal position along each line was taken as an independent variable: a text line was scanned from right to left, and at each horizontal position a set of features was extracted from a narrow vertical strip. The system was based on hidden Markov models, with a separate model representing each character, and the output was the sequence of characters with the highest likelihood. Other techniques for recognizing typewritten and handwritten cursive Arabic words treated the word as a whole, representing each word by a set of Fourier coefficients extracted from the word image.

b) Segmentation-based systems: the analytical approach, in which the page is divided into lines, the lines into words, and the words into their primitive characters. These systems can be divided into four categories:

1. Isolated/pre-segmented characters: researchers here recognize numerals or isolated characters, or assume that the characters result from a reliable segmentation algorithm. Such systems are not practical except for mathematical formulas or the indexing of diagrams.

2. Segmenting a word into characters: this was the first approach used for segmentation. The system attempts to segment a word into its characters and then recognize each character separately. The appeal of this approach is the simplicity of the subsequent recognition, since the cursiveness obstacle is removed and the problem becomes similar to Latin OCR.

3. Segmenting a word into primitives: this segments a sub-word or connected component into symbols, where each symbol may represent a character, a ligature, or possibly a fraction of a character.

4. Integration of recognition and segmentation: this approach claims to resemble, to a large extent, the human recognition process; segmentation is performed after recognition. The word is scanned starting from the far right, and at each step a column is either clustered to one of the codebook entries or used to calculate accumulative moment invariants. The system is not always able to recognize all characters, which implies that not all succeeding characters in that sub-word will be processed.
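Before any of these word-level approaches can run, text lines must first be isolated. The following is a minimal sketch of line separation by horizontal projection, a simplification of the baseline-distance idea mentioned under document decomposition; it assumes a binary image as nested lists with 1 for ink:

```python
def text_line_bands(img):
    """Return text lines as (top, bottom) row-index pairs (bottom
    exclusive), found as maximal runs of rows containing ink."""
    profile = [sum(row) for row in img]        # horizontal projection
    bands, start = [], None
    for y, ink in enumerate(profile + [0]):    # sentinel row closes a run
        if ink and start is None:
            start = y
        elif not ink and start is not None:
            bands.append((start, y))
            start = None
    return bands

# Two short "text lines" separated by one blank row.
img = [
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 1],
]
print(text_line_bands(img))   # [(0, 2), (3, 4)]
```

A real system would additionally merge bands whose gap is below a threshold derived from the median baseline distance, as the text above describes.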

4) Feature extraction

The recognition of characters depends on the differences between their features. A feature is a measurement made on a glyph, and combining measurements into a vector is a simple way of collating them. Ideally, the features extracted from an image capture the essential characteristics of the character or word by filtering out all attributes that make a character/word in one font different from the same character/word in another, while preserving the properties that make one character/word different from another character/word. Features can be classified as statistical, structural, or global-transformation features, according to [15].

c) Structural
Structural, or topological, features depend on the geometrical information of the characters, such as convexities, concavities, end points, number of holes, etc. Structural features are the most popular features investigated by researchers. They describe a pattern in terms of its topology and geometry, giving its global and local properties. Structural features tolerate distortions and variations in writing style (multi-font) well, but extracting them from images is not always easy; a combination of several structural features can enhance the overall recognition rate. The structural features used in the literature depend on the kind of pattern to be classified. In the case of characters, the features include strokes and bays in various directions, end points, intersection points, loops, dots, and zigzags. Some researchers use the height, width, number of crossing points, and category of the pattern (character body, dot, etc.); the presence and number of dots and their position with respect to the baseline; the number of concavities in the four major directions, the number of holes, and the state of several key pixels; the number of strokes or radicals and the size of the stroke's frame; and the connectivity of the character. In the case of primitives, the extracted features include the direction of curvature (e.g., clockwise); the type of the feature point at which a curve was segmented (cross point, etc.); the direction, slope, and length of strokes; the length of a contour segment; the distance between the start and end points of the contour projected onto the x- and y-axes; the difference in curvature between the start and end points; and the lengths of vectors in the four major directions that approximate a curve. Sometimes the pattern is divided into several zones and several types of geometric features are registered and counted in each zone, with some features constrained to specific zones. These features include the number of concavities, holes, cross points, loops, and dots; the number and length of the contour segments; the zone with the maximum (or minimum) number of pixels; and concavities in the four major directions.

d) Statistical
Global, or statistical, features are obtained from the arrangement of the points constituting the character matrix; in other words, they are derived from the statistical distribution of pixels and describe the characteristic measurements of a pattern. In contrast to topological features, they are less affected by distortion or noise. Statistical features include zoning [16] [17], which measures the density distribution of character pixels; characteristic loci [18], which counts the one- and zero-runs and the length of each run; the ratio of pixel distribution between two parts of the image [19] [20] [21]; and moment invariants, which are certain functions of moments invariant to geometric transformations such as translation, scaling, and rotation [22] [23], although moment invariants are sensitive to small changes, which matters in multi-font recognition. There are other such features as well, including projection histograms (vertical, horizontal), n-tuples, distances, outlines (right, up, down, left), and crossings; some of them are used in this work.
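As a minimal sketch of the zoning feature mentioned above (the density distribution of character pixels), assuming the glyph is available as a binary matrix:

```python
def zoning_features(glyph, rows=2, cols=2):
    """Split a binary glyph into rows x cols zones and return the ink
    density (fraction of 1-pixels) of each zone, zones scanned row by
    row, left to right."""
    h, w = len(glyph), len(glyph[0])
    feats = []
    for zr in range(rows):
        for zc in range(cols):
            y0, y1 = zr * h // rows, (zr + 1) * h // rows
            x0, x1 = zc * w // cols, (zc + 1) * w // cols
            ink = sum(glyph[y][x] for y in range(y0, y1)
                                  for x in range(x0, x1))
            feats.append(ink / ((y1 - y0) * (x1 - x0)))
    return feats

# Toy 4x4 glyph with all ink in the top-left quadrant.
glyph = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
print(zoning_features(glyph))   # [1.0, 0.0, 0.0, 0.0]
```

The resulting vector can feed any of the classifiers discussed in the next process.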


Statistical features are easy to extract; nonetheless, they may be misleading due to noise introduced haphazardly by the binarization process.

e) Global Transformations
Global transformations depend on transformation schemes that convert the pixel representation of the pattern into another representation better able to discriminate between characters. In other words, global transformation techniques transform the pixel representation into a more compact form. This reduces the dimensionality of the feature vector and provides features invariant to global deformations such as translation, dilation, and rotation. Examples include the projection transform, chain-code transformation, direction codes, the Hough transform, the Walsh transform, and Fourier Descriptors (FDs).

5) Classification

Classification is concerned with decision making. This process and feature extraction are considered the pivotal elements of the recognition pipeline. In this process, the extracted features are compared with the features of previously known characters (the model) to find the closest match; classification is thus the major task after feature extraction, assigning the object to one of several categories. A number of classification techniques have been applied in text recognition, and different classification strategies have been proposed by researchers, e.g., statistical methods, kernel methods, syntactic methods, template matching, decision trees, and artificial neural networks.

Minimum Distance Classifier: given K different classes, each characterized by a prototype feature vector, the problem is to assign an input feature vector to one of these classes according to a predefined discriminant function. The features can be geometrical, statistical, or structural.
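A minimal sketch of such a classifier, using Euclidean distance as the discriminant function; the class names and prototype vectors below are hypothetical, not values from this work:

```python
import math

def minimum_distance_classify(x, prototypes):
    """Assign feature vector x to the class whose prototype feature
    vector is nearest in Euclidean distance."""
    return min(prototypes,
               key=lambda label: math.dist(x, prototypes[label]))

# Hypothetical prototype feature vectors for two character classes.
prototypes = {
    "baa": [0.8, 0.1, 0.3],
    "taa": [0.2, 0.9, 0.4],
}
print(minimum_distance_classify([0.7, 0.2, 0.3], prototypes))  # baa
```

Any metric can replace `math.dist` here (e.g. the Mahalanobis distance mentioned below) without changing the structure of the classifier.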


Decision Tree Classifier: this classifier splits the N-dimensional feature space into unique regions by means of a sequential method. The algorithm is such that not every class need be tested to arrive at a decision, which becomes advantageous when the number of classes is very large. In one such system the dictionary was arranged as a tree whose nodes were labelled with character names; each node was associated with a Boolean variable indicating whether the path joining the root to the terminal node corresponded to an existing word. If several models remain candidates during the sequential identification process, this attribute is evaluated to make the final decision. As in the previous category, classification here can be a two-step process: first, an input character is assigned to one of the main groups according to some syntactic rules; then, using a more detailed feature vector, the input character is matched to one of the group members.

Syntactic or structural methods: Syntactic methods are well suited to classifying handwritten text. This type of classifier assigns input patterns based on the components of the characters and the relationships among those components. First, the primitives of the character are identified; then strings of primitives are checked against predefined rules. Generally, a character is represented by production rules whose left-hand side is a character label and whose right-hand side is a string of primitives. The right-hand side of each rule is compared with the string of primitives extracted from a word, so classifying a character amounts to finding a path to a leaf.

Statistical Classifier: The purpose of statistical methods is to determine the category to which a given pattern belongs. Through observation and measurement, a set of numbers is obtained and assembled into a measurement vector. Statistical


classifiers are automatically trainable. They assume that the classes and the feature vector follow an underlying joint probability distribution. One approach is the Bayes classifier, which minimizes the total average loss in assigning an unknown pattern to one of the possible classes; when the densities are estimated from samples, the pattern is ultimately assigned to the class with the majority of nearby samples. In other words, the Bayesian classifier assigns a pattern to the class with the maximum a posteriori probability. Hidden Markov Models (HMMs) are statistical models that have proved extremely efficient for a wide spectrum of applications, especially speech processing, and this success has motivated researchers to apply HMMs to character recognition. The k-NN rule is a non-parametric recognition method: it compares an unknown pattern to a set of patterns labelled with class identities during the training stage, and the pattern is assigned the class of the training pattern to which it has the closest distance. Other statistical methods include the Quadratic Discriminant Function (QDF), Linear Discriminant Function (LDF), Euclidean distance, cross-correlation, Mahalanobis distance and Regularized Discriminant Analysis (RDA).
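As one concrete instance, the k-NN rule described above fits in a few lines of Python; the toy labelled training pairs are invented for illustration.

```python
import math
from collections import Counter

def knn_classify(sample, labeled_data, k=3):
    """Majority vote among the k training patterns nearest to the sample;
    no probability distribution is assumed (non-parametric)."""
    nearest = sorted(labeled_data, key=lambda p: math.dist(sample, p[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy labelled feature vectors (invented values).
TRAIN = [((0.10, 0.10), "daal"), ((0.15, 0.20), "daal"),
         ((0.20, 0.15), "daal"), ((0.90, 0.80), "raa"),
         ((0.85, 0.90), "raa")]
```

With k=1 this degenerates to nearest-neighbour matching; larger k trades sensitivity to noise against blurring of class boundaries.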

Neural Network Classifier: OCR is one of the most successful applications proposed for neural networks. A Neural Network (NN) is a non-linear system characterized by its network topology, which is determined by the characteristics of its neurons and the learning methodology. There are three main advantages to using NNs in OCR: faster development times, the ability to automatically take into account the peculiarities of different writing/printing styles, and the possibility of running on parallel processors. On the other hand, introducing a new shape requires that the network be retrained or, worse, retrained with a different architecture. NNs can simply cluster the feature vectors in the feature space, or they can integrate the feature extraction


and classification stages by classifying characters directly from images. NNs have also been applied to on-line recognition of Arabic words [24]. The common NN architecture in Arabic OCR is a network with three layers: input, hidden and output. The number of nodes in the input layer varies with the dimensionality of the feature vector or the segment image size, while the number of nodes in the hidden layer governs the variety of samples the network can correctly recognize. The family of neural networks most commonly used for pattern classification is the feed-forward network, which includes the multilayer perceptron and Radial-Basis Function (RBF) networks. Other neural networks used for classification include the Convolutional Neural Network, Vector Quantization (VQ) networks, autoassociation networks and Learning Vector Quantization (LVQ). A limitation of systems based on neural networks, however, is their poor generalization capability.
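A single forward pass through the three-layer architecture described above can be sketched as follows. The weights are fixed, invented values chosen only to make the example concrete; a real system would learn them, e.g. by backpropagation.

```python
import math

def forward(x, w_hidden, w_out):
    """Input -> hidden (sigmoid) -> output (softmax over classes)."""
    sig = lambda v: 1.0 / (1.0 + math.exp(-v))
    hidden = [sig(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]   # numerically stable softmax
    return [e / sum(exps) for e in exps]       # class probabilities

# 2 input features, 3 hidden units, 2 output classes (invented weights).
W_H = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]
W_O = [[2.0, 0.0, -2.0], [-2.0, 0.0, 2.0]]
probs = forward([0.9, 0.1], W_H, W_O)
```

The input layer size matches the feature-vector dimensionality and the output layer has one node per character class, as described above.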

Template matching: This is one of the simplest approaches to pattern recognition. A prototype of the pattern to be recognized is available, and the given pattern is compared with the stored prototypes; the size and style of the patterns are ignored during matching. According to [25] the matching techniques can be classified as deformable templates and elastic matching, relaxation matching, and direct matching.
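Direct matching on binary glyph bitmaps is the simplest variant. The 5×5 templates below are invented shapes standing in for stored character prototypes.

```python
# Invented 5x5 binary glyph templates (1 = ink).
TEMPLATES = {
    "dot": [[0, 0, 0, 0, 0],
            [0, 0, 1, 0, 0],
            [0, 1, 1, 1, 0],
            [0, 0, 1, 0, 0],
            [0, 0, 0, 0, 0]],
    "bar": [[0, 0, 1, 0, 0] for _ in range(5)],
}

def match_score(img, tmpl):
    """Fraction of pixels on which the image and the template agree."""
    hits = sum(a == b for ra, rb in zip(img, tmpl) for a, b in zip(ra, rb))
    return hits / 25.0

def recognize(img):
    """Return the name of the template with the highest matching score."""
    return max(TEMPLATES, key=lambda name: match_score(img, TEMPLATES[name]))
```

A noisy input (a few flipped pixels) still matches its template best, which is the whole appeal of the method; its weakness, as noted above, is sensitivity to style and deformation unless elastic variants are used.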

Kernel methods: Some of the most important kernel methods are Support Vector Machines (SVM), Kernel Principal Component Analysis (KPCA) and Kernel Fisher Discriminant Analysis (KFDA). Support vector machines are a group of supervised learning methods that can be applied to classification. In a classification task the data is usually divided into training and testing sets, and the aim of the SVM is to produce a model that predicts the target values of the test data. The different types of SVM kernel functions


are the linear kernel, polynomial kernel, Gaussian Radial Basis Function (RBF) kernel and sigmoid kernel.
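These four kernel functions are simple to state directly. The sketch below uses common default forms of the formulas; the parameter values are illustrative, not tuned.

```python
import math

def linear(x, y):
    return sum(a * b for a, b in zip(x, y))               # <x, y>

def polynomial(x, y, degree=3, coef0=1.0):
    return (linear(x, y) + coef0) ** degree               # (<x,y> + c)^d

def rbf(x, y, gamma=0.5):
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)                          # exp(-g*||x-y||^2)

def sigmoid_kernel(x, y, alpha=0.1, coef0=0.0):
    return math.tanh(alpha * linear(x, y) + coef0)        # tanh(a<x,y> + c)
```

An SVM only ever evaluates such pairwise kernel values, which is what lets it operate implicitly in a high-dimensional feature space.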

6) Post-processing

Thanks to our extraordinary error-recovery ability, humans can recognize and correct errors in text written in many font types, relying on the syntactic, lexical, discursive, semantic and pragmatic constraints of language that we routinely apply. The goal of OCR post-processing is to increase the probability that the OCR hypotheses are correct and compatible with the language constraints imposed by the task. The language model is formed from these constraints and can be as complex as unconstrained natural-language sentences or as simple as a small set of valid words.
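A minimal form of such a language model is a small lexicon of valid words, with each OCR hypothesis mapped to its closest entry by edit distance. The sketch below, with an invented toy lexicon, illustrates the idea.

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def correct(token, lexicon):
    """Replace an OCR hypothesis with the nearest valid word."""
    return min(lexicon, key=lambda w: edit_distance(token, w))

LEXICON = ["recognition", "character", "feature"]   # toy word list
```

A production post-processor would instead score whole sentences with an n-gram or similar language model, but the principle of choosing the hypothesis most compatible with the constraints is the same.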

1.1.2.3

ARABIC OCR SOFTWARE

Several OCR software products with Arabic text recognition capabilities are available on the market, free or paid. The following lists some of these products:

Readiris Pro from I.R.I.S. is an OCR solution for converting paper documents into digital files. The software supports multiple languages, and a Middle East version is available for Arabic, Farsi and Hebrew [26].

VERUS Middle East Standard from NovoDynamics [27] is designed to recognize Arabic, Farsi, Dari, and Pashto languages, including embedded English and French.


Automatic Reader from Sakhr [28] is an OCR solution that addresses the Arabic language. It supports Arabic, Farsi, Pashto, Jawi, and Urdu.

OmniPage from Nuance Communications [29] is an optical character recognition application that supports more than 25 languages, including Arabic. It runs under standard Arabic Windows 3.1 and Arabic Windows for Workgroups 3.11 without customization, and integrates with standard Arabic word processors, including Microsoft Word, Microsoft Write and Accent. OmniPage does not require any font training. This is not necessarily an advantage, since the program commits the same systematic errors repeatedly without being able to learn from its failures.

TextPert runs on the Macintosh Arabic system and is easy to use; however, training new fonts is not possible. The recognition rate approached acceptable standards when the program was tested on very clean, simple texts, but it recognized virtually nothing with more complicated fonts [30] [31]. Consequently, until a training feature is introduced, its use will be limited to those who only want to scan certain kinds of computer-generated documents.

ICRA runs under Microsoft Windows Arabic, and every typeface must be learned; the training process takes about one hour per typeface. An experiment training this software on a number of Arabic magazines [32] showed that using these texts at their ordinary size gave disappointing results. Enlarging the text by about 20% improved the recognition rate, which then ranged between 90% and 99.7%.

Al-Qari' al-Ali was first developed by Dr Rezvan of the Russian Academy of Science at the beginning of the 1990s [33]. It is a segmentation-based system which combines


vector and bitmap analysis. The program is delivered with a standard set of modern computer fonts, which are recognized automatically. Its main problem is the considerable time needed to train new fonts, especially typeset fonts with many ligatures. Versions 1.0 and 1.1 of the program run on al-Nawafidh al-Arabiya, an Arabic equivalent of the Microsoft Windows environment, and version 2.0 runs under Microsoft Arabic Windows.

1.2 Research Objective

Since 1975 a great deal of research has been done on AOCR and its main system processes, but researchers still lack well-established baseline techniques from which to start, as well as a standard dataset of Arabic word primitives on which to test and evaluate their work. Traditional AOCR systems recognize one font type at a time, and most depend on the user's knowledge of fonts (supervision) to assign each font image to its corresponding classifier. Optical Font Recognition (OFR) has rarely been employed as a stage within the AOCR pipeline, and after roughly 40 years there is still no recognition system whose results compare satisfactorily with Latin OCR systems. The objective of this thesis is to develop a highly accurate recognition system based on good features, and to generate a standard noise-free dataset containing all primitive forms of the Arabic characters for evaluating feature extraction systems, made available to scientists for research purposes. The developed system should recognize all font types automatically, without any user supervision in the AOCR pipeline.


1.3 Research Contribution

 Generating a new computer-generated Primitive Arabic Characters Noise-Free dataset (PAC-NF), free of scanning, quantization and segmentation noise, that addresses the variations of Arabic characters for evaluation and testing purposes.
 A comparative study of four recent primitive Arabic character recognition approaches (the first based on Gabor filters [34], the second on a combination of statistical features and geometric moments [35], the third (SOCR) on SIFT-based OCR [36], and the fourth on statistical features with a random forest classifier [1]) is presented to evaluate the different features and classifiers used in these systems. Experimental results show that the fourth approach [1] is the best in terms of character recognition rate (CRR), but it is still not the optimal solution for AOCR problems.
 The recognition accuracy of AOCR is enhanced by adding additional features.
 Font recognition in AOCR is automated by introducing an Optical Font Recognition (OFR) stage into the traditional AOCR pipeline.

1.4 Thesis Organization

The thesis is organized as follows:
 Chapter 1: Introduces the Arabic language and the history of AOCR, then states the research objectives and thesis contributions.
 Chapter 2: Presents a literature review that provides definitions, context, and a clearer understanding of previous research in printed Arabic text recognition.
 Chapter 3: Introduces the proposed dataset for evaluating feature extraction systems, the PAC-NF dataset.


 Chapter 4: Evaluates recognition rates through a comparative study of four recent primitive AOCR algorithms.
 Chapter 5: Gives a detailed description of the proposed enhanced AOCR features that increase recognition accuracy.
 Chapter 6: Gives a detailed description of the proposed system for automating optical font recognition within AOCR.
 Chapter 7: Concludes the thesis and provides guidelines for future work in the area of AOCR.


Chapter 2

LITERATURE REVIEW

As mentioned before, Arabic optical character recognition started about 40 years ago, and a great deal of research has been carried out in this field. This chapter reviews most of the available techniques in the field of Arabic optical character recognition. It is organized as follows: first, a literature review of Arabic optical character recognition techniques; second, the available datasets in the field of AOCR; third, a review of comparative studies in AOCR; and finally, some omni-font Arabic optical font recognition works.

2.1 Arabic Optical Character Recognition

In [37] the authors enhanced AOCR accuracy by exploiting the fact that normalized Fourier descriptors are invariant to scale, rotation and translation. They adopted this technique because it had yielded acceptable results for Latin OCR, and combined it with contour analysis, which had been used successfully in object recognition. This combination was deemed necessary because of the special characteristics of Arabic, which contains some very similar characters. The character images are smoothed by a statistically based algorithm to eliminate noise; then the contours of the image (the character's primary part, the dots and the hole contours) are extracted, and Fourier descriptor and curvature features of the primary part are computed. The features of the training set serve as model features: the features of an input character are compared to the model features using a distance measure, and the model with the minimum distance is taken as the class representing the character. The dots and holes features are then used to identify the particular character. The experimental results showed that the combination of the Fourier descriptors, the curvature features and the


use of dots and holes features is powerful in successfully classifying Arabic characters. Recognition rates of 100% were achieved for the model classes, though this rate came down to 98% in the post-recognition phase of identifying the specific characters; the major part of these errors came from corrupted data. In [38] the authors proposed a front-end OCR for Persian/Arabic cursive documents that combines an adaptive layout analysis system with a joint MLP-SVM recognition process. They reported an accurate OCR, independent of font size, for Persian/Arabic printed documents, with the ability to recognize omni-font scripts. Moreover, the system segments full-text Manhattan-style documents regardless of font size or page layout using a single adjustable parameter, and it tolerates a skew of up to 20 degrees in the scanned pages; implementation results on a comprehensive database showed a high degree of accuracy. In [39] the authors proposed an Arabic character recognition algorithm using the Modified Fourier Spectrum (MFS). The MFS descriptors are estimated by applying the Fast Fourier Transform (FFT) to the contour of the Arabic character's primary part: ten descriptors are computed from the Fourier spectrum by subtracting the imaginary part from the real part (rather than from the amplitude of the Fourier spectrum, as is usually the case). These descriptors are then used in training and testing. The computation of the MFS descriptors requires less time than that of ordinary Fourier descriptors, and the experimental results showed that the MFS features are suitable for Arabic character recognition, achieving an average recognition rate of 95.9% for the model classes. The authors note that error analysis indicates this rate could be improved by using the "hole" feature of a character and by cleaning corrupted data. In [40] the authors proposed a system based on a method called the Modified MCR (Minimum Covering Run) expression, developed to represent binary document images by a minimum number of runs of both types, horizontal and vertical.


This is in analogy with bipartite graphs in graph theory: runs correspond to the partite sets and pixels to the edges of the graph, so finding the MCR expression amounts to constructing a minimum covering in the corresponding bipartite graph. Characters are represented by a number of horizontal and vertical parts called strokes, with specific features associated with each stroke. Words are separated into characters only once the characters' composing parts have been successfully recognized; that is, the method deals with separating characters after their recognition. The strokes are labelled, and a database of prototypes for each character shape is built from the stroke features; recognition is achieved by simple matching against the reference prototypes. In [41] the authors proposed a system for recognizing Urdu optical characters using neural networks, in which pixel strength is measured to detect words in a sentence and the joins between characters in a compound/connected word; the segmented characters are then fed to a neural network for classification. A prototype of the system developed in Matlab currently achieves 70% accuracy on average. In [42] the authors proposed a new segmentation-free system to recognize cursive Arabic text that decomposes the document image into text-line images and extracts a set of simple statistical features from a narrow window sliding along each text line. The resulting feature vectors are fed to the Hidden Markov Model Toolkit (HTK). The proposed system was applied to a data corpus that includes Arabic text from more than 600 A4-size sheets typewritten in multiple computer-generated fonts. The system was capable of learning complex ligatures and overlaps, and its performance improved when the tri-model scheme was implemented. In [43] the authors presented a new technique for the recognition of Arabic text based on feature extraction and dynamic cursor sizing. Several rules govern the size and movement of the cursor through each segment. The features obtained from


each segment are termed strokes; each segment is defined by a number of strokes, and each stroke mainly by a sequence of directions. The basic concept is a logical, dynamically sized cursor used to "travel" through the text image one word at a time while extracting stroke features. The strokes obtained are then "pieced" back together and classified into character classes based on a knowledge base, yielding the eventual recognition of characters. In [44] the authors proposed an algorithm that performs recognition against a reference library of isolated characters and has very good immunity to noise. The large amount of computation during recognition, however, makes its execution very slow and restricts its use, and the proposed solutions generally require specific, high-cost architectures. The authors report an analytical and experimental performance study of distributed Arabic optical character recognition based on the dynamic time warping (DTW) algorithm within loosely coupled architectures. The results confirm that loosely coupled architectures, and more specifically grid computing, present a very interesting framework for speeding up DTW-based Arabic optical character recognition. In [45] the authors proposed offline isolated character recognition for the Urdu language, based on creating chain codes of the input image and matching them against stored references in an XML file. They reported 89% recognition accuracy at a rate of 15 characters/sec. In [46] the authors developed a system consisting of two main modules, segmentation and classification. In the segmentation phase, pixel strength is measured to detect words in a sentence and the joints between characters in a compound/connected word. In the next phase, the segmented characters are fed to a trained feed-forward neural network for classification and recognition, trained on 56 different character classes with 100 samples each. The main purpose of the system is to test the segmentation algorithm developed for compound characters. A prototype was developed in Matlab, achieving 70% accuracy on average.
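The segmentation step that several of these systems share (detecting words or joins by measuring ink strength along the writing line) can be sketched as a vertical projection profile. The toy bitmap below is invented for illustration.

```python
def vertical_projection(img):
    """Ink-pixel count per column of a binary image (1 = ink)."""
    return [sum(row[c] for row in img) for c in range(len(img[0]))]

def segment(img):
    """Return (start, end) column ranges of consecutive non-blank
    columns; blank columns (projection == 0) separate the segments."""
    proj = vertical_projection(img)
    segments, start = [], None
    for c, ink in enumerate(proj):
        if ink and start is None:
            start = c
        elif not ink and start is not None:
            segments.append((start, c))
            start = None
    if start is not None:
        segments.append((start, len(proj)))
    return segments

# Two ink blobs separated by a blank column.
IMG = [[0, 1, 1, 0, 1, 0, 0],
       [0, 1, 1, 0, 1, 0, 0]]
```

Real Arabic text complicates this picture considerably: overlapping and ligatured glyphs produce no blank column between characters, which is precisely why several of the surveyed systems avoid explicit segmentation altogether.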


In [34] the authors proposed a technique for automatic recognition of Arabic characters using Gabor filters, with K-Nearest Neighbor (KNN) for classification. The achieved recognition rates showed that Gabor filters are effective for classifying Arabic characters. Different numbers of orientations and scales were tested, resulting in feature vector sizes of 30 and 48 (the reported best configuration used 5 orientations). The results were compared with two previously published techniques using the Modified Fourier Spectrum and Fourier descriptors on the same data; this technique achieved recognition rates 2.6% and 4% higher than Fourier descriptors and Modified Fourier Spectrum descriptors, respectively. In [47] the authors proposed using SIFT descriptors in a holistic framework to represent Pashto ligatures, addressing the challenges posed by thousands of ligatures (variations in scale, orientation, font style and spatial location/registration of ligatures, and the limited number of training samples). The proposed approach is script-independent and can easily be adapted to other cursive languages. A comparison against classical methods such as PCA showed that the SIFT descriptor performs better at feature representation. The recognition is holistic, using ligature (word) based classification; testing 1000 unique ligature images in 4 sizes, along with their rotated versions, gave an average recognition rate of 74%. In [48] the authors proposed a method that segments words into letters and identifies the individual letters, using a grid of SIFT descriptors as features for letter classification.
Each word is scanned with increasing window sizes, and segmentation points are set where the classifier achieves maximal confidence. Using the fact that Arabic letters have four forms (isolated, initial, medial and final), they are also able to predict whether a word is correctly segmented. The performance of the algorithm on printed texts and computer fonts was evaluated on the PATS-A01 dataset. For fonts with non-overlapping letters, they achieve letter correctness of 87–96% and word correctness of


74–88%. For overlapping fonts, although word correctness is low, only 14–23% of words are not predicted to be wrong. In [49] the authors presented a simple template-based Arabic OCR algorithm that deploys correlation and dynamic-size windowing to segment and recognize Arabic characters. The proposed coherent template recognition process can recognize Arabic characters of different sizes; they reported 96% recognition accuracy. In [50] the authors proposed a system for recognizing isolated printed Arabic characters, using Zernike and invariant moments as statistical features and the Walsh transformation as a global transformation feature. With a multilayer neural network as classifier, they achieved recognition accuracies of 98% using Zernike moments, 94.82% for invariant moments and 93.11% for the Walsh transformation. In [35] the authors proposed a novel and effective procedure for recognizing Arabic characters using a combination of statistical features and geometric moment features that are independent of the font and size of the character. These features were used to train a backpropagation neural network to classify the characters; they reported a recognition rate of 97% across 6 different fonts. In [36] the authors' work is twofold: segmenting words into letters and identifying individual letters. They describe a method that combines the two tasks, using multiple grids of SIFT descriptors as features. To construct a classifier, they do not use a large training set of images with corresponding ground truth, as is usually done; rather, an image containing all possible symbols is created and a classifier is constructed by extracting the features of each symbol. To recognize the text inside an image, the image is split into pieces of Arabic words, and each piece is scanned with increasing window sizes.
Segmentation points are set where the classifier achieves maximal confidence. Using the fact that Arabic letters have four forms (isolated, initial, medial and final), they narrow the search space based on the position inside the piece. The performance of the proposed method on printed texts and computer fonts of different sizes was evaluated on


two independent benchmarks, PATS and APTI. They reported that their algorithm outperformed that of the creator of PATS on five out of eight fonts, achieving character correctness of 98.87%–100%, with further evaluation on the APTI dataset. In [51] the authors developed an Arabic OCR system with five stages: preprocessing, segmentation, thinning, feature extraction and classification. In the preprocessing stage they compared two skew estimation algorithms, skew estimation by image moments and by the skew triangle, and also implemented binarization and a median filter. In the thinning stage they used the Hilditch thinning algorithm augmented with two templates, one to prevent superfluous tails and the other to remove unnecessary interest points. In the segmentation stage, line segmentation is done by horizontal projection cross-verified with the standard deviation, sub-word segmentation by connected pixel components, and letter segmentation by the Zidouri algorithm. In the feature extraction stage, 24 features are extracted, grouped into main body features, perimeter skeleton features and secondary object features. In the classification stage they used a decision tree generated by the C4.5 algorithm. Functionality tests showed that skew estimation using moments is more accurate than the skew triangle, that the median filter tends to erode letter shapes, and that the template additions to the Hilditch algorithm give good results. Performance tests yielded the following: line segmentation had 99.9% accuracy; the standard deviation was shown to reduce over-segmentation and quasi-lines; letter segmentation had 74% accuracy, tested on six different fonts; the classification component had 82% accuracy, tested by cross validation; and the overall performance of the system reached only 48.3%. In [52] the authors proposed novel sub-character HMM models for Arabic text recognition.
Modeling at the sub-character level allows common patterns to be shared between different contextual forms of Arabic characters as well as between different characters. The number of HMMs is reduced considerably while still capturing the variations in shape patterns, resulting in a compact and efficient recognizer with a reduced model set that is expected to be more robust to imbalance in the data distribution. They reported experimental results using the sub-character-model-based


recognition of handwritten as well as printed Arabic text; the recognition results are not much different from other common HMM setups. In [53] the authors presented a new approach to offline optical character recognition of printed Arabic (Persian) subwords using the wavelet packet transform. The proposed algorithm extracts font-invariant and size-invariant features from 87,804 subwords in 4 fonts and 3 sizes; the feature vectors are compressed using PCA. The obtained feature vectors yield a pictorial dictionary in which each entry is the mean of a group consisting of the same subword in 4 fonts and 3 sizes. These features are combined with dot features for the recognition of printed Persian subwords. To evaluate the feature extraction, the algorithm was tested on a set of 2000 subwords from printed Persian text documents, achieving a recognition rate of 97.9% at the subword level. In [54] the authors modified the Haar-Cascade classifier (HCC), for the first time, to suit Arabic glyph recognition. The HCC approach eliminates problematic steps in the preprocessing and recognition phases and, most importantly, the character segmentation stage. A recognizer was produced for each of the 61 Arabic glyphs that remain after removing diacritical marks, each trained and tested on some 2,000 images. Tested on real text images, the system achieved a recognition rate of 87% for Arabic glyphs; the method is fast, with an average document recognition time of 14.7 seconds compared with 15.8 seconds for commercial software. In [55] the authors introduced a method for extracting features from patterns based on the relative density distribution of each numeral object, specifically the centralized moments. The method gives sufficient results for recognizing printed and highly stylized handwritten numeral images: the attained recognition rates are 97.47% for printed numeral images (198 samples) and 95.55% for highly stylized handwritten numeral images (90 samples). However, the rate is unacceptable when the system is applied to handwritten numeral samples which have


wide differences in their shapes (4,500 samples), where the attained recognition rate is 74.93%. Each tested numeral image was scanned at a resolution of 300 dpi. In [1] the authors investigate using both K-Nearest Neighbor (KNN) and random forest classifiers with previously tested statistical features that are independent of the font and size of the characters. They first binarize the input character images, then extract the main features, and finally compare the two classifiers on the training and testing datasets. The random forest classifier was found to be better than KNN by more than 11% in recognition rate. The effect of different parameters of these classifiers was also tested, as well as the effect of noisy characters. In [56] the authors proposed an omni-font, large-vocabulary Arabic OCR system using the Pseudo Two-Dimensional Hidden Markov Model (P2DHMM), a generalization of the HMM. The P2DHMM offers a more efficient way to model Arabic characters, with minimal dependency on font size/style (omni-font) and a high level of robustness against noise. The system was evaluated against a baseline HMM system and the best OCRs available on the market (Sakhr and NovoDynamics): the average word accuracy rates for the P2DHMM and HMM classifiers are 79% and 66% respectively, and the average word accuracy rates for the overall P2DHMM system, NovoDynamics, and Sakhr are 74%, 71%, and 61% respectively.
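Several of the surveyed systems ([37], [39]) build on contour Fourier descriptors, whose normalization yields invariance to translation, scale, rotation and starting point. The following is a minimal pure-Python sketch of that normalization (naive DFT, invented toy square contour), not any of the surveyed implementations.

```python
import cmath

def fourier_descriptors(contour, n_desc=4):
    """Invariant descriptors of a closed contour given as (x, y) points:
    dropping F_0 removes translation, dividing by |F_1| removes scale,
    and taking magnitudes removes rotation and starting point."""
    z = [complex(x, y) for x, y in contour]
    n_pts = len(z)
    def coeff(n):  # naive O(N) DFT coefficient F_n
        return sum(zk * cmath.exp(-2j * cmath.pi * n * k / n_pts)
                   for k, zk in enumerate(z)) / n_pts
    scale = abs(coeff(1))
    return [abs(coeff(n)) / scale for n in range(2, 2 + n_desc)]

# Toy 8-point square contour, plus a scaled and translated copy.
SQUARE = [(1, 0), (1, 1), (0, 1), (-1, 1),
          (-1, 0), (-1, -1), (0, -1), (1, -1)]
MOVED = [(3 * x + 5, 3 * y + 7) for x, y in SQUARE]
```

Because translating the contour only changes the F_0 coefficient and uniform scaling multiplies every coefficient by the same factor, the descriptor vectors of the two contours coincide, which is the property the surveyed systems exploit.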

2.2 Datasets

Databases containing a significant, representative number and selection of samples constitute a valuable resource. They are necessary tools for the different experiments of OCR systems as well as for advanced research in the optical recognition of writing. Such databases allow us to evaluate the performance of different approaches on common resources, and the recorded error analysis permits us to deduce


future developments in the subject matter. Their value grows with their size: the number of samples used during the training stage perceptibly affects the performance of an OCR system. Nevertheless, OCR-oriented databases are hard to build at the individual-researcher level, given the fastidious tasks and costs involved, particularly in data collection and digitization. Databases exist for different OCR systems, such as Latin and Japanese ones, and have contributed considerably to the advance of OCR research in those scripts. Some of these databases are public, a few of which are free, while others are commercialized. Examples include the following databases: NIST, CEDAR, UNIPEN, IRONOFF, and GRUHD. Unfortunately, such databases are nearly absent in the Arabic case. Consequently, most AOCR researchers have to gather data individually and hence test their systems on private images, which constitutes a big obstacle to the development of efficient methodologies in the field. There has been a great deal of previous research in three areas related to the topic:

1. Collected printed word databases

   a. The Linguistic Data Consortium (LDC) at the University of Pennsylvania produced "Arabic Gigaword Second Edition" [57]. It is a huge database of 1,500 million Arabic words. It has been collected over a period of years from news agencies, but has a number of drawbacks. First, the database was collected only from news agencies, whereas a set of more varied sources would be advantageous. Second, most of the files come from Lebanese news agencies, while it would be better to collect samples from many Arab countries. Third, the database format is in paragraphs and not in single words for testing and training, which makes it less immediately useful.

   b. The Environmental Research Institute of Michigan (ERIM) has created a printed database of 750 pages collected from Arabic books and magazines. This database contains different text qualities saved in appropriate file formats. However, it has two drawbacks: it is small and hard to access [58].


   c. APTI (Arabic Printed Text Images) [59] is a database composed of images of printed Arabic words. The database is synthetically generated using a lexicon of 113,284 words, 10 Arabic fonts, 10 font sizes, and 4 font styles. It contains 45,313,600 single-word images totaling more than 250 million characters. Ground-truth annotation is provided for each image. This dataset is available for the scientific community to evaluate their recognition systems.

   d. PATS (Printed Arabic Text Set) [60] contains two versions (PATS-A01 and PATS-A02). The first, Printed Arabic Text Set A01 (PATS-A01), consists of 2,766 text-line images. The text of 2,751 line images of this set was selected from two standard classic Arabic books; the text of the remaining 15 line images was added from the author's minimal Arabic script. The line images are available in eight fonts: Arial, Tahoma, Akhbar, Thuluth, Naskh, Simplified Arabic, Andalus, and Traditional Arabic.

2. Creating a lexical database of printed Arabic

   a. DIINAR.1 is an Arabic lexical database produced by the Euro-Mediterranean project. It comprises 119,693 lemmas distributed between nouns, verbs, and adverbials, and uses 6,546 roots [61].

   b. Xerox developed the Xerox Arabic Morphological Analyzer/Generator in 2001. This contains 90,000 Arabic stems, from which a derived database of 72 million words can be created [62]. This type of database partially solves the problem of not having a trusted Arabic corpus, but it misses many of the words used in practice.

3. Collected handwritten databases


In 2002, Al-Ma'adeed et al. introduced AHDB, a database of 100 different writers containing Arabic text and words. It contains the most common Arabic words used in writing cheques, together with some handwritten pages [63]. Also in 2002, another handwritten database of town/village names was created by the Institute for Communications Technology (IFN), Technical University Braunschweig, Germany, and Ecole Nationale d'Ingénieur de Tunis (ENIT). It was completed by 411 writers, who entered about 26,400 names [64]. This database has been used recently in a number of other research projects. A handwritten database has different characteristics than a typed one. In this thesis, the new Primitive Arabic Characters – Noise Free (PAC-NF) dataset is presented. It contains two versions (PAC-NFA and PAC-NFB). It is generated for evaluation and testing purposes using nearly the same procedures used to generate word datasets.

2.3 Comparative studies

According to the literature review, there is currently only one Arabic character recognition comparative study available to the scientific community: "Arabic Character Recognition based on Statistical Features: A Comparative Study", presented in [65]. Its authors present a comparative study of Arabic optical character recognition techniques based on the statistical approach. The work consists of experimenting with character image characterization and matching to identify the most robust and reliable techniques. For the feature extraction phase, they test invariant moments, affine moment invariants, Tsirikolias–Mertzios moments, Zernike moments, the Fourier-Mellin transform, and Fourier descriptors. For the classification phase, they used k-Nearest Neighbors and Support Vector Machine. Their data collection comprised three datasets. The first contained 2,320 multi-font and multi-scale printed samples. The second contained 9,280 multi-font, multi-scale, and multi-oriented printed samples. The third contained 2,900 handwritten samples, which were extracted from the IFN/ENIT


data. The aim was to cover a wide spectrum of Arabic character complexity. The best performance rates found for the three datasets were 99.91%, 99.26%, and 66.68%, respectively. However, it is not a comparative study of the algorithms previously proposed by other researchers. The other available studies in AOCR are evaluation studies of OCR software packages, such as those presented in [66] [67] [68] [69] [70] [71] [72]. In this thesis, a comparative study is performed between four of the most recent techniques that reported achieving high recognition ratios.

2.4 Omni Font Recognition Systems

In [16], in 1999, the authors presented research focused on two aspects of the OCR system. First, they addressed the issue of how to perform OCR on Omni-font and multi-style data, such as plain and italic, without the need for a separate model for each style. The amount of training data from each style used to train a single model becomes an important issue in the face of the conditional independence assumption inherent in the use of HMMs. They demonstrated mathematically and empirically how to allocate training data among the different styles to alleviate this problem. Second, they showed how to use a word-based HMM system to perform character recognition with an unlimited vocabulary. The method included the use of a trigram language model on character sequences. Using all these techniques, they reported achieving character error rates of 1.1% on data from the University of Washington English Document Image Database and 3.3% on data from the DARPA Arabic OCR Corpus.

In [73], in 2010, the authors proposed a new approach for the recognition of Arabic (Farsi) fonts. The font type of individual lines, at any font size, is recognized based on a new feature. They relied on the observation that not all text lines of the same block or paragraph have the same font; e.g., titles usually have different fonts. Although the Gabor filter performs this task fairly well, it is very time consuming: feature extraction from a texture of size 128×128 takes about 178 ms on a 2.4 GHz PC. They performed font


recognition at line level using a new feature based on Sobel and Roberts gradients in 16 directions, called SRF. They break each line of text into several small parts and construct a texture, from which SRF is extracted as the texture feature for recognition. This feature requires much less computation and can therefore be extracted much faster than common textural features such as Gabor filters, wavelet transforms, or moment features. In their experiments, they reported that it is about 50 times faster than an 8-channel Gabor filter. At the same time, SRF represents the font characteristics very well. They reported achieving a recognition rate of 94.16% on a dataset of 10 popular Farsi fonts, about 14% better than an 8-channel Gabor filter. If errors between very similar fonts are ignored, the recognition rate rises to about 96.5%.

In [74], in 2010, the authors applied a majority-vote approach to classify unlabeled data into reliable and unreliable classes. They then added the reliable data to the training set and classified the remaining data, including the unreliable data, in an iterative process. They tested this method on features extracted from ten common Persian fonts. Experimental results indicated that the proposed method improved classification performance.

In [75], in 2011, the authors proposed a new method for Arabic (Farsi) automatic font recognition based on the scale-invariant feature transform (SIFT). As SIFT features are scale-invariant, the final system is robust against variations of size, scale, and rotation. The system does not need a preprocessing stage, but in the case of low-quality images some noise removal processes can be used. Using a database of 1,400 text images, they reported achieving an excellent recognition rate of nearly 100%.

In [76], in 2011, the authors presented a new approach for font recognition of Farsi document images. Using two types of features, the font and font size of Farsi document images were recognized. The first feature relates to the holes of the letters in the document image's text; the second relates to the horizontal projection profile of the document image's text lines. This approach was applied to seven widely


used Farsi fonts and seven font sizes. A dataset of 10×49 images and another dataset of 110 images were used for testing, and a recognition rate of more than 93.7% was obtained. The images were made using paint software and are noiseless and without skew.

In [77], in 2013, the authors proposed a new font and size identification method for ultra-low-resolution Arabic word images using a stochastic approach. This work proposed an efficient stochastic approach to tackle the problem of font and size recognition. The proposed method treats a word image with a fixed-length, overlapping sliding window. Each window is represented by 102 features whose distribution is captured by Gaussian Mixture Models (GMMs). They present three systems: (1) a font recognition system, (2) a size recognition system, and (3) a combined font and size recognition system. They demonstrated the importance of font identification before recognizing word images with two multi-font Arabic OCRs (cascading and global). They reported that the cascading system is about 23% better than the global multi-font system in terms of word recognition rate on the Arabic Printed Text Image (APTI) database, which is freely available to the scientific community.

The work in [78], in 2014, was the first attempt to present an alternative method for Arabic font recognition based on diacritics. It presented the diacritics as the thumbprint of Arabic fonts, which can be used on their own to identify and recognize the font type. Diacritics are the marks and strokes that have been added to the original Arabic alphabet. In this research, two algorithms for diacritics segmentation were developed, namely a flood-fill-based and a clustering-based algorithm. The experiments conducted showed that the proposed approach achieves an average recognition rate of 98.73% on a typical database containing 10 of the most popular Arabic fonts. Compared with existing methods, they reported that the proposed approach has the minimum computation cost and can be integrated with OCR systems very easily. Moreover, it could recognize the


font type regardless of the amount of input data, since five diacritics, which in most cases can be found in a single word, are enough for font recognition. This thesis proposes automating the recognition of font type in AOCR systems rather than assigning it manually.


Chapter 3

PROPOSED DATASETS

This chapter presents two datasets: a) a computerized dataset, called the Primitive Arabic Characters – Noise Free dataset (PAC-NF). It is generated noise-free to evaluate and test the feature extraction and classification stages while avoiding any possible faults from earlier stages such as preprocessing and segmentation. It is available online for research and development at [79]. This dataset contains two models (Model (A): PAC-NFA and Model (B): PAC-NFB), which differ in their variations. The other dataset is b) an optically scanned dataset, called Primitive Arabic Characters – Optically Scanned (PAC-OS). This dataset was generated for testing the proposed system on real data. It is available online for research and development at [80]. The detailed description of these two datasets is as follows:

3.1 Primitive Arabic Characters – Noise Free Dataset

This section presents the Primitive Arabic Characters – Noise Free dataset (PAC-NF) with its two models (PAC-NFA and PAC-NFB). This dataset contains primitives, unlike all the previously available datasets (APTI, PATS, LDC, and ERIM), which contain words. It is called Primitive Arabic Characters as it contains the primitives of the Arabic characters in all of their four positions (Isolated, Start, Middle, Final); e.g., the Baa character has the primitives (Isolated (ب), Start (بـ), Middle (ـبـ), Final (ـب)).

PAC-NFA

In this dataset, the characters were written in all of their possible forms (Isolated, Start, Middle, and End) according to their position in a word. Each font type in the dataset contains 102 primitive shapes. It is free of scanning, quantization, and segmentation


noise. It contains 30,600 samples for each font type, yielding 214,200 samples in the whole dataset. It is available online at [79].

Generation process

1. It has been generated using Microsoft Office Word software.
2. It is generated using 7 different fonts: Ariel, Andalus, DecoType Thuluth, Simplified Arabic, Tahoma, Traditional Arabic, and Akhbar (see Table (3-1)).

Table 3-1 : Font Types used in PAC-NFA

Font Abbreviation | Font Type | Example
F11 | Tahoma | سبحان هللا والحمد هلل وال اله اال هللا
F1 | Akhbar MT | سبحان هللا والحمد هلل وال اله اال هللا
F3 | Andalus | سبحان اهلل والحمد هلل وال اله اال اهلل
F2 | Ariel | سبحان هللا والحمد هلل وال اله اال هللا
F8 | DecoType Thuluth | سبحان هللا والحمد هلل وال اله اال هللا
F10 | Simplified Arabic | سبحان اهلل والحمد هلل وال اله اال اهلل
F12 | Traditional Arabic | سبحان اهلل واحلمد هلل وال اله اال اهلل

3. The dataset consists of the 28 characters from 'Alif' to 'Yaa' plus the composite character 'LamAlif' (أل), which is composed of the 'Lam' (ل) and 'Alif' (أ) characters, yielding 29 characters (see Table (3-2)).

Table 3-2 : Arabic language characters in all positions

Character name | Arabic name | Isolated | Connected End | Connected Middle | Connected Start
Alif | الف | ا | ـا | - | -
Baa | باء | ب | ـب | ـبـ | بـ
Taa | تاء | ت | ـت | ـتـ | تـ
Thaa | ثاء | ث | ـث | ـثـ | ثـ
Jeem | جيم | ج | ـج | ـجـ | جـ
Haa | حاء | ح | ـح | ـحـ | حـ
Khaa | خاء | خ | ـخ | ـخـ | خـ
Daal | دال | د | ـد | - | -
Thaal | ذال | ذ | ـذ | - | -
Raa | راء | ر | ـر | - | -
Zaay | زاي | ز | ـز | - | -
Seen | سين | س | ـس | ـسـ | سـ
Sheen | شين | ش | ـش | ـشـ | شـ
Saad | صاد | ص | ـص | ـصـ | صـ
Dhaad | ضاد | ض | ـض | ـضـ | ضـ
Ttaa | طاء | ط | ـط | ـطـ | طـ
Dthaa | ظاء | ظ | ـظ | ـظـ | ظـ
Ain | عين | ع | ـع | ـعـ | عـ
Ghen | غين | غ | ـغ | ـغـ | غـ
Faa | فاء | ف | ـف | ـفـ | فـ
Qaf | قاف | ق | ـق | ـقـ | قـ
Kaf | كاف | ك | ـك | ـكـ | كـ
Lam | الم | ل | ـل | ـلـ | لـ
Mem | ميم | م | ـم | ـمـ | مـ
Noon | نون | ن | ـن | ـنـ | نـ
Haa | هاء | ه | ـه | ـهـ | هـ
Wow | واو | و | ـو | - | -
Yaa | ياء | ي | ـي | ـيـ | يـ

4. The Kashida (ــ) (Shift + ت) has been used to generate the characters in the Start, Middle, and End positions without requiring segmentation, as shown in Table (3-2); this resulted in 102 primitives.
5. Each one of the 102 primitives is written in 300 different font sizes (8 to 308 points) to address the different scattering of pixels across sizes.
6. After that, the characters were exported from Word as a PDF file.
7. The PDF file was converted to images using the PDFill_Free_PDF_Tools software [81] at 110 dpi (dots per inch) to address low-resolution scanned images.
8. The images were segmented line by line using Matlab software.
9. Then, each line was segmented into single characters.
10. At runtime, the dataset images are converted to binary images using the Matlab built-in function (im2bw) at threshold 0.8 (determined experimentally).
11. This model contains 214,200 character samples.
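The binarization in step 10 can be reproduced outside Matlab; the following is a minimal Python/NumPy sketch of im2bw-style thresholding (the 0.8 threshold is the experimentally chosen value from step 10; the function name is ours):

```python
import numpy as np

def im2bw(gray, threshold=0.8):
    """Binarize a grayscale image, mimicking Matlab's im2bw.

    gray: 2-D array with intensities normalized to [0, 1].
    Pixels above the threshold become white (True), the rest black (False).
    """
    return np.asarray(gray, dtype=float) > threshold
```
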

PAC-NFB

This is also a computer-generated dataset like PAC-NFA, so it is free of scanning, quantization, and segmentation noise. It contains 4,080 samples for each font type, yielding 40,800 samples in the whole dataset. This dataset contains much more variation than Model (A).

Generation process

1. It has been generated using Microsoft Office Word software.
2. It is generated using 10 different fonts: Andalus, Arabic Transparent, Advertising Bold, Diwani Letter, DecoType Thuluth, Simplified Arabic, Tahoma, Traditional Arabic, DecoType Naskh, and M Unicode Sara (see Table (3-3)).

Table 3-3 : Font types used in PAC-NFB

Font Abbreviation | Font Type | Example
F3 | Andalus | سبحان اهلل والحمد هلل وال اله اال اهلل
F4 | Arabic Transparent | سبحان هللا والحمد هلل وال اله اال هللا
F5 | Advertising Bold | سبحان هللا والحمد هلل وال اله اال هللا
F6 | Diwani Letter | سبحان هللا والحمد هلل وال اله اال هللا
F8 | DecoType Thuluth | سبحان هللا والحمد هلل وال اله اال هللا
F10 | Simplified Arabic | سبحان اهلل والحمد هلل وال اله اال اهلل
F11 | Tahoma | سبحان هللا والحمد هلل وال اله اال هللا
F12 | Traditional Arabic | سبحان اهلل واحلمد هلل وال اله اال اهلل
F7 | DecoType Naskh | سبحان هللا والحمد هلل وال اله اال هللا
F9 | M Unicode Sara | سبحان هللا والحمد هلل وال اله اال هللا

3. The dataset consists of the 28 characters from 'Alif' to 'Yaa' plus the composite character 'LamAlif' (أل), which is composed of the 'Lam' (ل) and 'Alif' (أ) characters, yielding 29 characters (see Table (3-2)).
4. The Kashida (ــ) (Shift + ت) has been used to generate the characters in the Start, Middle, and End positions without requiring segmentation, as shown in Table (3-2); this resulted in 102 primitives.
5. Each primitive is written in four styles (Plain, Bold, Italic, and Bold Italic).
6. Each one of the 102 primitives is written in 10 different sizes (6, 7, 8, 9, 10, 12, 14, 16, 18, and 24 points).
7. After that, the characters were exported from Word as a PDF file.
8. The PDF file was converted to images using the PDFill_Free_PDF_Tools software [81] at 142 dpi (dots per inch).
9. Then, each image was segmented into single characters using a simple script implemented in Matlab.


10. This model contains 40,800 character samples. At runtime, the dataset images are converted to binary images using the Matlab built-in function (im2bw) at threshold 0.8 (determined experimentally).

3.2 Primitive Arabic Characters Optically Scanned Dataset

As mentioned before, this dataset (PAC-OS) consists of real test samples. It is available online at [80], and it was generated using the following procedure:
1. It is written in 12 different font types: Andalus, Akhbar, Ariel, Arabic Transparent, Advertising Bold, Diwani Letter, DecoType Thuluth, Simplified Arabic, Tahoma, Traditional Arabic, DecoType Naskh, and M Unicode Sara, as shown in Tables (3-1, 3-3).
2. It is written using one font size (24).
3. The dataset consists of the 28 characters from 'Alif' to 'Yaa' plus the composite character 'LamAlif' (أل), which is composed of the 'Lam' (ل) and 'Alif' (أ) characters, yielding 29 characters (see Table (3-2)).
4. It is then printed at actual size (using the (Actual Size) option in the printing properties) on A4 papers.
5. The pages are scanned using a (Canon MG2400 series) scanner with Canon Quick Menu software version 2.2.1 at 7 different resolutions (600, 400, 300, 200, 150, 100, and 75 dpi) (see Fig (3-1)).
6. This dataset contains 8,568 characters.


Figure 3-1 : Andalus font type optically scanned sample

3.3 Dataset Description

This section presents the main properties of, and differences between, the two proposed dataset models (PAC-NFA and PAC-NFB), as well as the PAC-OS dataset and the Arabic Printed Text Images (APTI) dataset, across different points of view (font styles, font types, font sizes, contents, number of samples, dataset type, scanning resolution, and exporting resolution). This comparison is shown in Table (3-4).

Table 3-4 : Datasets description

Dataset Property | PAC-NFA | PAC-NFB | PAC-OS | APTI
Type | Computerized | Computerized | Optically Scanned | Computerized
Consists of | Primitives | Primitives | Primitives | Words
Font Types | 7 types | 10 types | 12 types | 10 types
Font Size | 8 to 308 | 6, 7, 8, 9, 10, 12, 14, 16, 18, 24 | 24 | 6, 7, 8, 9, 10, 12, 14, 16, 18, 24
Font Style | Plain | Plain, Italic, Bold, Bold Italic | Plain | Plain, Italic, Bold, Bold Italic
Number of Samples | 214,200 | 40,800 | 8,568 | 45,313,600
Scanning Resolution | - | - | 600, 400, 300, 200, 150, 100, 75 | -
Exporting Resolution | 110 | 142 | - | 360 (pixels/inch)
Number of Primitives | 102 | 102 | 102 | -
Noise | None | None | Yes | Yes


Chapter 4

CHARACTER RECOGNITION RATE EVALUATION USING DIFFERENT APPROACHES

This chapter presents different Arabic Optical Character Recognition approaches and performs a comparative study between them in order to reach higher character recognition rates (CRR). According to the literature review, there is a lack of comparative studies between the different proposed AOCR techniques. This chapter presents a comparative study between four of the most recent techniques that reported achieving high recognition rates. The chapter is organized as follows: first, the four recent techniques are presented; second, the experimental setup and environment specifications are described; finally, the results and discussion are presented.

4.1 Compared techniques

First Approach

In this approach [34], the authors proposed a technique for the automatic recognition of Arabic characters using Gabor filters.

a) FEATURE EXTRACTION

In the feature extraction process (see Fig (1-6)), the Arabic character image features are extracted using Gabor filters, which can be written as a two-dimensional Gabor function g(x, y) as given in Eq (4-1). Its Fourier transform G(u, v) is given in Eq (4-2) [34]:

g(x, y) = \frac{1}{2\pi\sigma_x\sigma_y} \exp\left[-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}\right) + 2\pi j w x\right]    (4-1)

G(u, v) = \exp\left[-\frac{1}{2}\left(\frac{(u-w)^2}{\sigma_u^2} + \frac{v^2}{\sigma_v^2}\right)\right]    (4-2)

where σ_x and σ_y are the standard deviations of the Gaussian envelope along the x and y axes, respectively, σ_u = 1/(2πσ_x), and σ_v = 1/(2πσ_y). Gabor functions form a complete but non-orthogonal basis set. Expanding a signal using this basis set provides a localized frequency description of the signal. Let g(x, y) be the mother Gabor wavelet; a self-similar filter dictionary can then be obtained through appropriate dilations and rotations of g(x, y) via the generating function [14]:

g_{mn}(x, y) = a^{-m} g(x', y')    (4-3)

x' = a^{-m}(x\cos\theta + y\sin\theta)    (4-4)

y' = a^{-m}(-x\sin\theta + y\cos\theta)    (4-5)

where a > 1, m and n are integers, θ = nπ/K, and K is the total number of orientations. The scale factor a^{-m} ensures that the energy is independent of m. After filtering the given input image, statistical features such as the mean and the variance of the filtered image are computed; the extracted feature vector is constructed from the means and variances of all filtered images. Filtering in the frequency domain requires taking the Fourier transform of an image, processing it, and then computing the inverse transform to obtain the result. Thus, given a digital image i(x, y), the basic filtering equation has the form [34]:

g(x, y) = \mathfrak{I}^{-1}[H(u, v) I(u, v)]    (4-6)

where \mathfrak{I}^{-1} is the Inverse Fourier Transform (IFT), I(u, v) is the FT of the input image i(x, y), H(u, v) is a filter function (i.e., the Gabor filter), and g(x, y) is the filtered (output) image. The FFT is computed for the Arabic character images. The Gabor filters are applied using the different orientations and scales. Then the inverse Fourier transform


of the filtered images is computed. The mean μ and the standard deviation σ of each filtered image are then computed to form the character feature vector.
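The pipeline of Eqs (4-1) to (4-6) can be sketched in Python/NumPy as follows. The kernel size, σ values, and center frequency w below are illustrative assumptions, not values from [34]; with 4 scales and 6 orientations the mean/standard-deviation pairs give a 48-dimensional feature vector, matching the value of r mentioned below:

```python
import numpy as np

def gabor_kernel(size, sigma_x, sigma_y, w, theta):
    # Spatial-domain Gabor function (Eq 4-1) rotated by theta (Eqs 4-4, 4-5)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-0.5 * (xr**2 / sigma_x**2 + yr**2 / sigma_y**2))
    carrier = np.exp(2j * np.pi * w * xr)
    return envelope * carrier / (2 * np.pi * sigma_x * sigma_y)

def gabor_features(image, scales=4, orientations=6):
    # Filter in the frequency domain (Eq 4-6) and collect mean/std per filter
    feats = []
    I = np.fft.fft2(image)
    for m in range(scales):
        for n in range(orientations):
            theta = n * np.pi / orientations          # theta = n*pi/K
            sigma = 2.0 * (m + 1)                     # assumed per-scale width
            k = gabor_kernel(15, sigma, sigma, w=0.25 / (m + 1), theta=theta)
            H = np.fft.fft2(k, s=image.shape)
            filtered = np.abs(np.fft.ifft2(H * I))    # g = IFT[H * I]
            feats.extend([filtered.mean(), filtered.std()])
    return np.array(feats)                            # 4*6*2 = 48 features
```
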

b) CLASSIFICATION

The classification phase (see Fig (1-6)) consists of two stages: training (modeling) and testing. 70% of the data was used for training and 30% for testing. In the training phase, the features of the Arabic characters (of the training data) are extracted and saved as reference models of the respective classes. The K-Nearest Neighbor (KNN) classifier is used in the classification stage, with the magnitude (city block) distance measure used to assess the similarity/dissimilarity between the input sample and the character reference models. The feature vector V of the unknown character is computed and then compared with the feature vectors of all the characters' reference models. The magnitude (city block) distance is computed using the simple formula [34]:

E_i = \sum_{j=1}^{r} |M_{ij} - V_j|    (4-7)

where E_i is the distance between the input character and reference model i (i.e., the sum of the absolute differences between the features of the input character and those of model i), r is the total number of parameters in the feature vector (i.e., 48 and 30 in our case), M_{ij} is the jth feature of model i, and V_j is feature j of the input character feature vector.
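A minimal sketch of this KNN classification with the city-block distance of Eq (4-7) (the function name and majority-vote tie-breaking are our illustrative choices):

```python
import numpy as np

def knn_cityblock(models, labels, v, k=1):
    """Classify feature vector v against reference models using Eq (4-7).

    models: (num_models, r) array of reference feature vectors M
    labels: class label of each reference model
    v:      (r,) feature vector V of the unknown character
    """
    # E_i = sum_j |M_ij - V_j|  (city block / Manhattan distance)
    dists = np.abs(models - v).sum(axis=1)
    nearest = np.argsort(dists)[:k]
    # majority vote among the k nearest reference models
    vals, counts = np.unique(np.asarray(labels)[nearest], return_counts=True)
    return vals[np.argmax(counts)]
```
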

Second Approach

In this approach [35], the authors proposed a novel and effective procedure for recognizing Arabic characters using a combination of statistical features and geometric moment features that are independent of the font and size of the character.

a) FEATURE EXTRACTION

According to the feature extraction process (see Fig (1-6)), the proposed system uses a combination of different types of features that categorically define the details of the character regardless of its font type and size. These features are:
1. Statistical features.
2. Geometric moment features.

A. Statistical features: there are 14 statistical features extracted from each character; four of them are global statistical features computed for the whole image, as listed below:
1. Height / width.
2. Number of black pixels / number of white pixels.
3. Number of horizontal transitions.
4. Number of vertical transitions.
The horizontal and vertical transitions are a technique used to detect the curvature of each character. The procedure runs a horizontal scan through the character box and counts the number of times the pixel value changes state from 0 to 1 or from 1 to 0, as shown in Fig (4-1). The total number of times the pixel status changes is the horizontal transition value. A similar process is used to find the vertical transition value.

Figure 4-1 : Horizontal and Vertical transitions
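The transition counts described above can be computed in a few lines of Python/NumPy (the function name is ours):

```python
import numpy as np

def transition_counts(img):
    """Count horizontal and vertical 0<->1 transitions in a binary character box.

    img: 2-D array of 0s and 1s (1 = black ink).
    """
    img = np.asarray(img)
    horizontal = int(np.abs(np.diff(img, axis=1)).sum())  # scan along each row
    vertical = int(np.abs(np.diff(img, axis=0)).sum())    # scan along each column
    return horizontal, vertical
```
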

The other 10 features are extracted after dividing the character image into four regions, giving the following ratios, as shown in Fig (4-2).


Black pixels in region 1 / white pixels in region 1.
Black pixels in region 2 / white pixels in region 2.
Black pixels in region 3 / white pixels in region 3.
Black pixels in region 4 / white pixels in region 4.
Black pixels in region 1 / black pixels in region 2.
Black pixels in region 3 / black pixels in region 4.
Black pixels in region 1 / black pixels in region 3.
Black pixels in region 2 / black pixels in region 4.
Black pixels in region 1 / black pixels in region 4.
Black pixels in region 2 / black pixels in region 3.

Figure 4-2 : Image divided into 4 regions (1, 2, 3, 4)
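A sketch of these 10 region ratios in Python/NumPy (the small epsilon guard against division by zero is our addition, not part of [35]):

```python
import numpy as np

def region_ratios(img):
    # Split the binary character image (1 = black) into quadrants 1..4
    img = np.asarray(img)
    h, w = img.shape
    quads = [img[:h//2, :w//2], img[:h//2, w//2:],
             img[h//2:, :w//2], img[h//2:, w//2:]]
    black = [float(q.sum()) for q in quads]
    white = [q.size - b for q, b in zip(quads, black)]
    eps = 1e-9  # assumption: guard against empty regions
    feats = [black[i] / (white[i] + eps) for i in range(4)]   # black/white per region
    for i, j in [(0, 1), (2, 3), (0, 2), (1, 3), (0, 3), (1, 2)]:
        feats.append(black[i] / (black[j] + eps))             # black/black pairs
    return feats   # 10 ratios
```
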

They also used the Euler number, which is defined as the difference between the number of connected components and the number of holes in a binary image. Hence, if an image has C connected components and H holes, the Euler number E of the image can be defined as [35]:

E = C - H    (4-8)
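Eq (4-8) can be computed with a plain flood-fill component count; the following is a self-contained sketch (the connectivity convention and function names are our choices, not specified in [35]):

```python
import numpy as np

def _components(mask):
    # 4-connected component count via iterative flood fill
    mask = mask.copy()
    count = 0
    rows, cols = mask.shape
    for r in range(rows):
        for c in range(cols):
            if mask[r, c]:
                count += 1
                stack = [(r, c)]
                mask[r, c] = False
                while stack:
                    i, j = stack.pop()
                    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ni, nj = i + di, j + dj
                        if 0 <= ni < rows and 0 <= nj < cols and mask[ni, nj]:
                            mask[ni, nj] = False
                            stack.append((ni, nj))
    return count

def euler_number(img):
    # Eq (4-8): E = C - H
    ink = np.asarray(img).astype(bool)
    C = _components(ink)
    # Holes are background components that do not touch the image border;
    # pad with background so the outer region is a single component.
    bg = ~np.pad(ink, 1, constant_values=False)
    H = _components(bg) - 1   # subtract the outer background region
    return C - H
```
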

B. Geometric moment features: the geometric features used here are the seven normalized central moments introduced by Hu [23]. They are invariant to translation, scale, and rotation.


b) CLASSIFICATION

In the classification phase (see Fig (1-6)), the authors used a feed-forward neural network with the backpropagation learning algorithm to classify the different types of features. The experimental setup and training parameters are shown in Table (4-1).

Table 4-1 : Neural network training parameters

Parameter | Value
Number of neurons in input layer | 22
Number of hidden layers | 3
Number of neurons in hidden layers | Layer 1 = 16, Layer 2 = 16, Layer 3 = 25
Performance function | MSE (mean square error)
Learning rate | 0.1
Initial weights and bias | Randomly generated

Third Approach

In this approach [36], the authors proposed an algorithm called SOCR, for "SIFT-based OCR", which segments and recognizes the letters of an image containing a single paw. Using a sliding-window technique, a candidate letter is isolated inside each window and classified using the appropriate classifiers. The appropriate classifiers are chosen based on the location of the window inside the paw and can be any of the four form classifiers (Isolated, Start, Middle, End). The best segmentation points and letters are chosen based on the confidence of the classifier that the letter ends at that segmentation point.

a) FEATURE EXTRACTION

In the feature extraction process (see Fig (1-6)), they used multiple grids of SIFT descriptors as the main feature set. To fine-tune the classification results produced by


the use of SIFT descriptors, they used additional features. The structure and extraction process of those descriptors and features are described below.

Extracting a Grid of SIFT Descriptors: Given an image that contains a whole paw or a part of a paw, the following steps are executed to extract the grid of SIFT descriptors:
1. The image is padded with white to the size of the smallest bounding square of the paw.
2. The image is split into G×G identical squares, where G is a small constant.
3. Let W be the width of each square and the scale be W/4.
4. In the middle of each square, N descriptors are extracted, where M = {m1, ..., mN} are the magnification factors.
5. Each extracted descriptor is identified by D_{x,y,m}, where x, y are the coordinates of the square in the grid and m is the magnification factor.
When no magnification is used, by design, the grid of descriptors covers the whole image without the descriptors overlapping each other; hence the scale of the descriptor is always set to W/4, due to the fact that each descriptor has 4 spatial bins in each spatial direction. Throughout this work, G = 3 and M = {0.5, 1.0, 1.5}. Fig (4-3) shows an example of a grid of SIFT descriptors that were extracted without using magnification.
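The grid geometry of the steps above (descriptor centers and scales, without the SIFT computation itself) can be sketched as follows; the function name is ours:

```python
def descriptor_grid(side, G=3, magnifications=(0.5, 1.0, 1.5)):
    """Centers and scales of the G x G grid of SIFT descriptors.

    side: side length of the padded square image.
    Returns (cx, cy, scale) triples; scale = m * W/4, where W = side / G.
    """
    W = side / G
    specs = []
    for gy in range(G):
        for gx in range(G):
            cx = (gx + 0.5) * W   # descriptor centered in its square
            cy = (gy + 0.5) * W
            for m in magnifications:
                specs.append((cx, cy, m * W / 4.0))
    return specs
```
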


Figure 4-3: A grid of 3×3 SIFT descriptors of the letter Taa in its isolated form.

Extracting Additional Features
Additional features are used to penalize the confidence of letters suggested by the SIFT classifier. This section describes the additional features that were used and how they are extracted. It is assumed that the text inside the image is bounded by its borders and consists of black pixels on a white background.

i.

Center of Mass Feature
The center of mass feature [36], fm, is the location of the center of mass of the black ink, relative to the height and width of the image. Given an image and a letter, where c and c' are their centers of mass, respectively, the center of mass penalty used is pm = 1/(1 + d^(1/2)(c, c')), where d^(1/2) gives the square root of the Euclidean distance (a commonly used measure).
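A minimal sketch of this feature and its penalty. The nested-list binary image format is an assumption of this illustration; the crosshair penalty described next has the same functional form.

```python
import math

def center_of_mass(img):
    """Relative (x, y) center of mass of the black ink; img is a 2-D
    list with 1 = black pixel, 0 = white background."""
    h, w = len(img), len(img[0])
    xs = ys = n = 0
    for y, row in enumerate(img):
        for x, v in enumerate(row):
            if v:
                xs += x
                ys += y
                n += 1
    return (xs / n / w, ys / n / h)   # relative to width and height

def com_penalty(c, c2):
    """p_m = 1 / (1 + sqrt(d(c, c'))), d the Euclidean distance."""
    return 1.0 / (1.0 + math.sqrt(math.dist(c, c2)))
```

Identical features give the maximum penalty value of 1, so the penalty only reduces the confidence when the image and the candidate letter disagree.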

ii.

Crosshair Feature


The crosshair feature [36], fc, is the location (relative to the height and the width of the image) of the vertical and horizontal slices with the largest portion of black ink compared to the white background. Given an image and a letter, where c and c' are their crosshair features, respectively, the crosshair penalty used is pc = 1/(1 + d^(1/2)(c, c')).

iii.

Ratio Feature
The ratio feature [36], fo, is the height divided by the width of the bounding box of the black ink. Given an image and a letter, where o' and o are their ratio features, respectively, the ratio penalty used is po = 1/(1 + (o' - o)^2). The exponent was arbitrarily set to 2 and was not optimized for any of the datasets tested.
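A sketch of the ratio feature and its penalty; the nested-list image format is again an assumption of this illustration.

```python
def ink_ratio(img):
    """Height / width of the bounding box of the black ink
    (img: 2-D list, 1 = black pixel)."""
    ys = [y for y, row in enumerate(img) for v in row if v]
    xs = [x for row in img for x, v in enumerate(row) if v]
    return (max(ys) - min(ys) + 1) / (max(xs) - min(xs) + 1)

def ratio_penalty(o_img, o_letter):
    """p_o = 1 / (1 + (o' - o)**2); the exponent 2 was not tuned."""
    return 1.0 / (1.0 + (o_letter - o_img) ** 2)
```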

iv.

Outline Features
Each image has four outline features [36]: top, left, bottom and right. The top-outline feature, ft = (t1, . . . , tW), where W is the width of the bounding box of the black ink, is calculated as follows:
1. For i = 1, . . . , W, let di be the distance from the top of the bounding box to the first occurrence of a black pixel in the ith column of the image.
2. For i = 1, . . . , W, let ti be (max{dj} - di)/(max{dj} - min{dj}), where the maximum and minimum are taken over all columns.
Given an image and a letter, where t and t' are their top-outline features, respectively, the top-outline penalty, pt, is calculated as follows:
1. If the two feature vectors are of unequal length, downscale the longer one, so that both are of some length n.
2. Define pt = 1/(1 + avg(i=1..n) |ti - ti'|), where avg takes the average.
The left (pl), bottom (pb) and right (pr) outline penalties are calculated in a similar manner.
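The top-outline computation can be sketched as follows. This is an illustration under the assumption that the image is already cropped to the ink bounding box (so every column contains a black pixel); plain truncation stands in for the downscaling step, which the thesis does not specify in detail.

```python
def top_outline(img):
    """Normalized top-outline feature t = (t_1, ..., t_W) over the
    columns of the ink bounding box (img: 1 = black, row 0 on top)."""
    W = len(img[0])
    # d_i: distance from the top to the first black pixel in column i
    d = [[img[y][x] for y in range(len(img))].index(1) for x in range(W)]
    lo, hi = min(d), max(d)
    return [(hi - di) / (hi - lo) if hi > lo else 0.0 for di in d]

def outline_penalty(t, t2):
    """p_t = 1 / (1 + avg_i |t_i - t'_i|).  Truncation to the shorter
    length stands in here for the downscaling step."""
    n = min(len(t), len(t2))
    return 1.0 / (1.0 + sum(abs(a - b) for a, b in zip(t, t2)) / n)
```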

v.

Black Ink Histogram Features


Each image has a horizontal black ink histogram feature and a vertical one. The horizontal black ink histogram feature, fh = (h1, . . . , hH), where H is the height of the bounding box of the black ink, is calculated as follows:
1. For i = 1, . . . , H, let bi be the number of black ink pixels in row i.
2. For i = 1, . . . , H, let hi be bi/max{bj}.
The vertical black ink histogram feature [36] (fv) is calculated in a similar manner. Given an image and a letter, where h and h' are their horizontal black ink histogram features, respectively, the horizontal black ink histogram penalty, ph, is calculated as follows:
1. If the two feature vectors are of unequal length, downscale the longer one, so that both are of some length n.
2. Define ph = 1/(1 + avg(i=1..n) |hi - hi'|).
The vertical black ink histogram penalty (pv) is calculated in a similar manner.
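The horizontal histogram feature reduces to a per-row count and a normalization; a minimal sketch (nested-list image format assumed):

```python
def horizontal_ink_histogram(img):
    """f_h = (h_1, ..., h_H): per-row black-pixel counts normalized by
    the largest row count (img: 2-D list, 1 = black pixel)."""
    counts = [sum(row) for row in img]
    peak = max(counts)
    return [b / peak for b in counts]
```

The vertical feature is the same computation over columns, and the penalty ph has the same averaged-absolute-difference form as the outline penalty above.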

FEATURE EXTRACTION (PREPARING FOR CLASSIFICATION)
Before extracting features, they assigned to each image Li,r a unique identification number σ, and then extracted the SIFT descriptors and additional features as described above. The number of extracted SIFT descriptors per symbol is NG^2, where, as noted, N is the number of different descriptor magnifications used and G is the size of the grid of descriptors. Each SIFT descriptor of symbol σ is denoted D^σ_{x,y,m}, where x, y and m are as described before. The descriptors are grouped into 4NG^2 groups, one group SDg for each combination g = (x, y, m, r), containing all the relevant descriptors D^σ_{x,y,m}.

b) CLASSIFICATION
In the classification phase of the system (see Fig (1-6)), the authors performed the following stages:

Constructing a Classifier


For each classifier C, for a font F and an alphabet Σ, they execute a series of operations as described below. They generated high-resolution images, each containing a unique symbol of the alphabet Σ written in font F. They explained how to extract the SIFT descriptors and additional features from each image and how to group the SIFT descriptors into four groups based on the location where the symbol can appear in a word (isolated, initial, medial and final). They described the quantization process performed on the SIFT descriptors, grouping them into four groups and creating a separate classifier for each of the four letter forms. Finally, they computed a base confidence for each unique symbol.

Creating Images for All Possible Symbols
To create an image for each unique symbol of the alphabet Σ written in font F, a Word document is created that contains |Σ| rows, representing all possible letters of the alphabet, and four columns, representing the different possible letter forms (isolated, final, medial and initial). Since some letter combinations are graphically represented by a single symbol (a ligature), these combinations are also referred to as letters and belong to Σ. Each row can have one to four symbols, since some letters do not have all four letter forms, but only isolated or initial forms, or even only an isolated form. The resulting document is exported as a high-resolution image. See Sect. 3 for details about the alphabet and the resolution of the image exported for each tested font. The exported image is split into lines, and each line is split into the number of unique symbols it contains, resulting in an image for each possible symbol. Each image is denoted Li,r, where r ∈ {isolated, initial, medial, final} is the form of the ith letter of the alphabet Σ.

Quantization
For each SDg, a quantization is performed using k-means clustering. For each g, kg is chosen to be the largest number such that the smallest energy among 1000 runs of kg-means is smaller than E. The k-means process is executed 1000 times to ensure,


with a high probability, that the clustering solution is near-optimal (has the smallest energy) and consistent over many runs. The centers of the kg clusters are the quantized descriptors of SDg, denoted QDg. Each quantized descriptor QD ∈ QDg is assigned a unique identification number, Γ. For each Γ, they save a mapping, MAP_g^Γ, to the σs of the descriptors for which QD is the quantized descriptor; id ∈ MAP_g^Γ if QD_Γ ∈ QDg is the center of the cluster to which D_id ∈ SDg belongs. They divide all QDg into four groups, based on r, the form of the letter; each group serves as the SIFT classifier for that form. The quantization process is designed to improve the recognition rate: since some letters look similar, their descriptors may also be very close to each other. By quantizing, a letter descriptor Dx,y,m is matched to one QD_Γ ∈ QDg, but since |MAP_g^Γ| ≥ 1, the descriptor can be matched to more than one symbol.
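The repeated-restart clustering can be sketched as follows. This is a minimal 1-D illustration of the 1000-run scheme, not the authors' implementation; the fixed Lloyd-iteration budget and the function name are assumptions, and real SIFT descriptors would be 128-dimensional.

```python
import random

def best_kmeans(points, k, runs=1000, iters=20, seed=0):
    """Run k-means `runs` times and keep the solution with the
    smallest energy (sum of squared distances to the nearest center),
    so that with high probability the clustering is near-optimal and
    consistent.  1-D points for brevity."""
    rng = random.Random(seed)
    best_energy, best_centers = None, None
    for _ in range(runs):
        centers = rng.sample(points, k)            # random initialization
        for _ in range(iters):                     # Lloyd iterations
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda i: (p - centers[i]) ** 2)
                clusters[nearest].append(p)
            centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        energy = sum(min((p - c) ** 2 for c in centers) for p in points)
        if best_energy is None or energy < best_energy:
            best_energy, best_centers = energy, sorted(centers)
    return best_energy, best_centers
```

The chosen cluster centers then play the role of the quantized descriptors QDg.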

Base Confidence
For each symbol, they compute its base confidence: the confidence value returned by executing the classification process on the image Li,r to which the σ of the symbol was assigned. Since the base confidence is used as a divisor in the last step of the classification process, its initial value is set to 1. Because all additional feature penalties equal 1 in this run, the base confidence is effectively the SIFT confidence. In the classification process, the SIFT confidence of the classifier in the symbol σ is divided by its base confidence to create a more comparable metric between different symbols of the same form.

Single Letter Classification
Given an image I, a classifier C and a letter form r, the classification process returns the pair (σ, c), where c is the confidence of the classifier that I contains just the symbol σ. First, the SIFT descriptors and the additional features are extracted. The grid size, G, and the magnification factors, M, must be the same ones that were used to create C. The extracted features of I are: a) SD, the set of descriptors Dx,y,m, where


x, y ∈ {1, . . . , G} and m ∈ M, and b) the additional features fm, fc, fo, ft, fr, fb, fl, fh, fv. Next, the following operations are executed:
1. Let P' be an empty list that will hold the predicted σs. The σs in P' can repeat, since two descriptors can be matched to the same σ, as can be seen in the next step.
2. For each Dx,y,m ∈ SD:
(a) Find QD_Γ ∈ QDg, where g = (x, y, m, r), such that the Euclidean distance between Dx,y,m and QD_Γ is smaller than or equal to that of any other descriptor in QDg.
(b) Add all the σs of MAP_g^Γ to P'.
3. Let P be the set of unique values of P'.
4. For each σ ∈ P:
(a) Calculate the additional feature penalties pm, pc, po, pt, pl, pb, pr, ph, pv.
(b) Let the SIFT confidence, ps, be the number of occurrences of σ in P' divided by |P'|.
(c) Let the confidence, cσ, of the classifier C in I being the symbol σ, be ps·pm·pc·po·pt·pl·pb·pr·ph·pv divided by the base confidence of σ.
5. The pair (σ, c) is the result of the classification process, where c = max over σ ∈ P of cσ, and σ is the symbol attaining that maximum.
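Steps 3-5 of this procedure reduce to combining the vote-based SIFT confidence with the penalty product; a minimal sketch (the dictionary-based inputs are an assumption of this illustration):

```python
def classify_single_letter(predicted, penalties, base_conf):
    """Combine the vote-based SIFT confidence with the additional
    feature penalties.  `predicted` is the list P' of matched symbols,
    penalties[s] the product p_m * ... * p_v for symbol s, and
    base_conf[s] its base confidence."""
    best_sym, best_c = None, -1.0
    for s in set(predicted):                         # P, unique values of P'
        p_s = predicted.count(s) / len(predicted)    # SIFT confidence
        c = p_s * penalties[s] / base_conf[s]
        if c > best_c:
            best_sym, best_c = s, c
    return best_sym, best_c
```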

Fourth Approach
In this approach [1], the authors investigated the use of random forest and KNN classifiers with the statistical features used in [35] (their previous paper). Using a random forest, they achieved a recognition rate 11% higher than with the KNN classifier.
a) FEATURE EXTRACTION
In the feature extraction process (see Fig (1-6)), the authors used the 14 statistical features previously used in [35], extracted from each character: four of them are computed over the whole image, and the other 10 are extracted after dividing the character image into four regions and calculating ratios.
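A sketch in the spirit of those region-based features, using black-ink ratios over the whole image and its four quadrants. The exact 14 features of [35] are not reproduced here; the quadrant split and the function name are assumptions of this illustration.

```python
def region_ink_ratios(img):
    """Black-ink ratio of the whole image followed by its four
    quadrants (img: 2-D list, 1 = black pixel)."""
    h, w = len(img), len(img[0])

    def ratio(y0, y1, x0, x1):
        ink = sum(img[y][x] for y in range(y0, y1) for x in range(x0, x1))
        return ink / ((y1 - y0) * (x1 - x0))

    my, mx = h // 2, w // 2               # split into four regions
    return [ratio(0, h, 0, w),            # whole image
            ratio(0, my, 0, mx), ratio(0, my, mx, w),
            ratio(my, h, 0, mx), ratio(my, h, mx, w)]
```

Such ratios form a fixed-length feature vector per character, which is what the random forest and KNN classifiers below consume.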


b) CLASSIFICATION
They used a random forest, an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees. The random forest algorithm was developed by L. Breiman. The random forest classifier was tested in this study because it is one of the most accurate learning algorithms available, producing a highly accurate classifier for many datasets; it runs efficiently on large databases; and it gives estimates of which variables are important in the classification. A random forest is composed of a number of decision trees. Each tree is built as follows:
1) Let the number of training objects be D, and the number of features in the feature vector be F.
2) The training set for each tree is built by choosing D times with replacement from all D available training objects.
3) Number f
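The bootstrap step (step 2 above) can be sketched as follows; the function name is hypothetical and this is only an illustration of the sampling scheme, not a full tree-building routine.

```python
import random

def bootstrap_sample(training_set, seed=0):
    """Draw D objects with replacement from the D available training
    objects, as in step 2 of building each tree in the forest."""
    rng = random.Random(seed)
    D = len(training_set)
    return [training_set[rng.randrange(D)] for _ in range(D)]
```

In Breiman's formulation, each tree then considers only a random subset of the F features when choosing every split, which keeps the trees decorrelated.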