Script Identification of Multi-Script Documents: a Survey

22 downloads 0 Views 695KB Size Report
Oct 11, 2009 - scripts which originated from the ancient Indian Brahmi script and makes up almost all of the scripts of India and. Southeast Asia. The northern ...
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2017.2689159, IEEE Access

> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT)
REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < distinguishing characteristic between Japanese and Han scripts. 3) Alphabetic System. The most important scripts in the alphabetic system are Greek, Latin, Cyrillic, and Armenian. The Greeks were the first Europeans to learn to write with an alphabet and from this system alphabetic writing spread to the rest of Europe, eventually leading to creation of all modern European alphabets. Latin script is used in many languages throughout the world such as Latin, Cyrillic, and Armenian, as well as many European languages like English, Italian, French, German, Portuguese, Spanish, and Austronesia, Modern Malay, Vietnamese, and Indonesian. Cyrillic script is used in some Eastern European, Asian, and Slavic languages such as Bulgarian, Russian, Macedonian, and Ukrainian. Some characters in the Cyrillic alphabet are borrowed from Latin and Greek and modified with cedillas, crosshatches, or diacritical marks. 4) Abjads. In this system, each symbol represents a consonantal sound. It includes Arabic and Hebrew scripts. The characteristic that clearly identifies Abjad-based scripts in pen computing systems is the right to left writing direction. 5) Abugidas. This system includes the Brahmic family of scripts which originated from the ancient Indian Brahmi script and makes up almost all of the scripts of India and Southeast Asia. The northern group of Brahmic scripts is used in the Devnagari, Bangla, Manipuri, Gurumukhi, Gujrati, and Oriya languages; the scripts in south India and southeast Asia are used in the Tamil, Telugu, Kannada, Malayalam Thai, Lao, Burmese, Javanese, Balinese languages. One important characteristic of Devnagari, Bangla, Gurumukhi, and Manipuri is that words are written together without spaces. The large number of horizontal lines in the textual portions of a document can distinguish these scripts from others. 6) Featural system. This system includes features that make up phonemes and includes the Korean Hangul script. In this system a script is formed by mixing logographic Hanja with featural Hangul. The Korean script is less complex and less dense compared to Chinese and Japanese scripts and it contains more circles and ellipses. The six writing systems mentioned above include many scripts with similar shaped characters: similar shaped characters are the major source of confusion in script identification. Usually, each script has several spatial characters, diacritics, multi graphs (include digraphs) or ligatures that differ with other scripts in a same writing system. They are significant in identifying scripts with similar shaped characters each other. For instance, the 11 languages such as, Afrikaans, Catalan, Dutch, English, French, German, Indonesian, Luxembourgish, Malay, Portuguese, Spanish alphabets consisted of Latin alphabet based 26 characters, and they are distinguish together with spatial characters, diacritics, multi graphs and ligatures. In particular, the 26 Latin characters are included in all language alphabets and only three of them (English, Indonesian and Malay) are without diacritics. Other 8 types of language alphabets (Afrikaans, Catalan, Dutch, French, German, Luxembourgish, Portuguese,

2

Spanish) contain several diacritics. These diacritics can be used to distinguish the 8 kinds of scripts in character level script identification. On the other hand, some diacritics are common for a number of scripts. For example, the diacritic ”é” is not useful for character level script identification since it is common for Afrikaans, Catalan, Dutch, French, Luxembourgish, Portuguese and Spanish. In this case, other factors of these languages should be considered, such as multigraphs and ligatures. TABLE I SCRIPT WRITING SYSTEMS

Writing Scripts Languages system

Sample writings

Chinese Logographi c system

Han

Japanese (Kanji) Korean (Hanja)

Syllabic system

Kanas Greek

Japanese Greek Afrikaans Catalan Dutch English French

Alphabetic system

Latin

German Italian Luxembourgis h Portuguese Spanish Vietnamese

Cyrillic

Russian

American

English

Hebrew

Hebrew Arabic

Abjads

Arabic

Farsi Urdu Uyghur

Abugidas

Featural system

Brahmi scripts

Hangul

Devanagari Bangla Manipuri Gurumukhi Gujrati Oriya Tamil Telugu Kannada Malayalam Thai Lao Burmese Javanese Balinese Korean

2169-3536 (c) 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2017.2689159, IEEE Access

> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < A comparison of multigraphs and ligatures in the 11 Latin alphabet based scripts shows that Afrikaans, Catalan, Dutch, and Luxembourgish alphabets have no multigraphs and ligatures. English, French, German, Indonesian, Malay, Portuguese and Spanish alphabets contain several multigraphs. Ligatures are present in English, French and German alphabets. Therefore, these multigraphs and ligatures can be used as significant factors to identify these types of scripts in character level script identification. It is worth noting that some diacritics are not useful since they are common for several scripts. For example, the digraph ”ch” is common for English, French, German, Portuguese and Spanish alphabets. Arabic, Persian and Uyghur characters are similar to each other: they have 18 common characters. Moreover, Arabic and Persian alphabets have 8 common characters, Arabic and Uyghur have 2 common characters, Persian and Uyghur have 6 common characters. There are 6 Uyghur characters different from Arabic and Persian. Thus, character level script identification is not efficient to the 3 scripts and word/connected component level script identification should be considered.

Hyb.

HandPrinted

Printed

III. SCRIPT IDENTIFICATION METHODS Most research in the field of script identification concerns documents either printed or handwritten scripts. However, since several documents may contain text blocks with both printed and handwritten scripts, some research is now addressing hybrid documents. Hence, on the basis of content type, documents can be classified into three categories: printed, handwritten and hybrid. Furthermore, document acquisition can be performed not only through optical scanners but also via cameras and camcorders. Of course, the device used for document acquisition can affect document image quality and therefore specific script identification methods have recently been proposed for video and camera-based acquisitions. TABLE II SCRIPT IDENTIFICATION: DOCUMENTS TYPES VS ACQUISITION Scanner Video and Camera Chanda et al. [13, 14, 15, 16], Chaudhuri and Gllavata et al. [42], Pal [21], Dhanya et al. [33], Ding et al. [34], Linlin Li and Tan Dhandra et al. [28, 29], Ghosh and Chaudhuri [54], Sharma et al. [41], Hochberg et al. [46], Jaeger et al. [48], [104], Trung et al. Padma et al. [68], Pal et al. [71, 72, 73], Peake [123], Zhou et al. et al. [85], Prakash et al. [90], Rani et al. [91], [134] Spitz [119] Chaudhuri [22], Dhandra and Hangarge [30], Ghosh et al. [39, 40], Hiremath et al. [44], Hochberg et al. [46], Namboodiri and Jain [64], Obaidullah et al. [67], Pal et al. [75], Razzak et al. [93], Roy et al. [96, 97, 98, 99], Zhou et al.[133] Ben Moussa et al. [6], Benjelil et al. [7], Kanoun et al. [50], Saidani et al. [101]

Table II summarizes some of the most relevant research in the field of script identification, categorized by document type and acquisition device. Earlier research on script identification of printed, handwritten and hybrid documents is discussed in

3

the following. For each document type, different methods presented in the literature are introduced according to the kind of data they use to perform script identification: Page / Paragraph / Text-block, Text line, Word, or Character. A. Script identification in printed documents Most script identification research was carried out on printed documents. The main sources of printed documents include books, magazines, journals, dictionaries, etc.. Some researcher prepared multi script texts using automatic translating software [36] or other software [5] firstly, then multi script documents are obtained as computer printouts. Since the diversity of scripts and the deficiency of available public databases, most of the researcher built their own databases/datasets, as Table III summarizes. 1) Page/Paragraph/Text block Level Script Identification Most research on printed document script identification has been carried out at the page level. Hochberg et al. [46] used cluster-based templates for discriminating 13 different scripts. Spitz [119] proposed a language identification scheme where the words of 26 different languages were first classified into Han-based and Latin-based scripts. Successively, the actual languages were identified using projection profiles of words and character shapes. Jie Ding et al. [34] presented a method which uses a combined analysis of several discriminating statistical features to categorize European and Oriental language scripts. Pal and Chaudhuri [21] developed a system for identifying Bangla and Devnagari (Hindi) scripts using a classification tree. Research on printed document script identification was also conducted at text block level. For instance, Peake and Tan [85] proposed a method based on Multiple Gabor filters and grey level co-occurrence matrices to extract the texture features of five major scripts. 2) Text-line Level Script Identification. In text-line level script identification, a text block is firstly divided into lines. Pal and Chaudhuri [71] developed an automatic technique for separating text lines using script characteristics and shape based features. They also proposed a system for the identification of printed Roman, Chinese, Arabic, Devnaguri and Bangla text lines from a single document [72] and a method of identifying text lines of different Indian scripts from a document [73]. An automatic technique for the identification of Japanese and English script portions from a single line of a printed document was proposed by Chanda et al. [13]. Padma et al. [68] developed a monothetic algorithmic model to identify and separate Telugu, Hindi and English text lines from a printed multilingual document. A simple and efficient technique of script identification for Kannada, Hindi and English text lines was presented by Prakash et al. [90]. Ferrer et al. [36] proposed a LBP-based line-wise script identification system to identify ten different scripts. 3) Word Level Script Identification. Dhanya et al. [33] presented a successful method for identifying script at word level in a bilingual document containing Roman and Tamil scripts. Jaeger et al. [48] used a Gabor filter analysis of textures and a multiple classifier system with four different classifiers to identify Arabic, Chinese, Hindi, and Korean scripts at word-level. Dhandra et al. [28, 29] proposed an automatic technique for script identification at word level based on the morphological reconstruction of two printed

2169-3536 (c) 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2017.2689159, IEEE Access

> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < scripts: Telugu and Devnagari. A SVM based method was proposed by Chanda et al. [14] for the identification of printed English and Thai scripts at word level from a single line of a document page. Chanda [15] proposed a SVM based technique for word-level identification of Sinhala, Tamil and English scripts from a single document page, and a SVM based scheme for the identification of printed word-level English, Devnagari and Bangla scripts [16]. 4) Character Level Script Identification. Pal and Sarkar [74] used a combination of topological, contour and water reservoir concept based features to identify printed Urdu script. Rani et al. [91] carried out experiments on multi-font and multi-sized characters with Gabor features and Gradient features to identify Gurumukhi and English scripts at character or numeral level. B. Script identification in handwritten documents Handwritten documents are another important area of application for script identification systems. Of course, script identification of handwritten documents is more challenging than script identification of printed documents. In fact, there are some relevant differences between printed and handwritten script identification. For example, some scripts resemble each other much more in handwritten documents than in printed ones. Moreover, handwriting styles can be very variable. The experimental documents, which are written by different individuals at different times, enlarge the inventory of possible character and word shapes in handwritten documents. In addition, ruling lines and character fragmentation are common in handwritten documents due to the variety of papers and writing instruments used. All these differences can create huge challenges for script identification in handwritten documents. 1) Page/Paragraph/Text block Level Script Identification. The first study conducted on handwritten script identification was carried out by Chaudhuri [22] and was similar to that proposed by Hochberg et al. [46] for printed documents. However, the resulting classification accuracy was lower than that for the printed documents. An online handwritten script recognition system was proposed by Namboodiri and Jain [64] for classifying six major scripts at word level. Eleven different features and six types of classifiers were considered. A method based on the texture features for script identification in a handwritten document image was proposed by Hiremath et al. [44]. Ghosh and Shivaprasad [39] proposed an handwritten script identification method in which a “possibilistic” approach was used for cluster analysis. 2) Text-line Level Script Identification. Namboodiri and Jain [64] proposed a method to classify words and lines into one of the six major scripts: Arabic, Cyrillic, Devnagari, Han, Hebrew or Roman. The classification is based on eleven different spatial and temporal features extracted from strokes of the words. 3) Word Level Script Identification. Roy et al. [96] proposed a word-wise handwritten script identification method for Indian postal automation regarding Bangla and English script identification at word-level. The method mainly uses water reservoir concept based features, fractal-based features and a Neural Network classifier. Roy et al. [97] also developed a technique for script separation of handwritten postal documents in Bangla, Roman and Devanagri scripts.

4

Run Length Smoothing Algorithm (RLSA) was used to segment the document pages into lines and then into words. Fractal-based, busy-zone and topological features were used along with a Neural Network (NN) classifier for script identification. A script separation technique of Roman and Oriya scripts for Indian Postal automation was proposed by Zhou et al. [133]. They presented a script identification method based on water reservoir concept based features, fractal dimension based features and topological features with an NN classifier. Sarkar et al. [103] presented an automatic separation system for word-level script identification from Bangla or Devanagri mixed with Roman scripts. Dhandra et al. [30] used a two-stage approach. In the first stage, some global and local features were applied to identify the text words. In the second stage, the numeral written in different scripts was identified. To test the system, Kanada, Devanagri and Roman scripted handwritten documents were considered. A word-wise handwritten script identification system for biscript documents written in Persian and Roman scripts was proposed by Roy et al. [99]. The system used a simple and fast computable sets of twelve features based on fractal dimension, position of small components and topology. A scheme for document level handwritten script identification from six popular Indian script documents was presented by Roy et al. [98]. In the proposed scheme, a small set of features based also on fractal dimension are computed using an MLP classifier. Obaidullah et al. [67] proposed a scheme to identify the six popular Bangla, Devnagari, Malayalam, Urdu, Oriya and Roman scripts in Indian documents, and compared performance using different well-known classifiers. 4) Character Level Script Identification. Pal et al. [75] proposed a modified quadratic classifier based scheme for the recognition of off-line handwritten numerals of six popular Indian scripts: Devnagari, Bangla, Telugu, Oriya, Kannada and Tamil. Razzak et al. [93] presented a fuzzy rule based approach for the recognition of both Urdu and Arabic numerals in an unconstrained environment. C. Script identification in hybrid documents. Hybrid documents include printed and handwritten texts. A multi-lingual automatic identification of Arabic and Latin in both handwritten and printed script was proposed by Ben Moussa et al [6]. A method for Arabic and Latin text block differentiation for both printed and handwritten scripts is discussed by Kanoun et al. [50]. The method is based on a morphological analysis for each script at the text-block level and a geometrical analysis at line and connected component levels. Benjelil et al. [3] proposed an accurate system based on a steerable pyramid transform for Arabic and Latin script identification at word level. By using new structural features, a successful attempt was made by Saidani et al. [101] to identify the Arabic or Latin script of a machine printed or handwritten document at word level.

2169-3536 (c) 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2017.2689159, IEEE Access

> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT)
REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT)
REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT)
REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < (MQDF) Bayes Quadratic classifier Rule-based classifier Linear Support Vector Machine (SVM)

Polynomial/Gau ssian radial basis functions (RBFs)

Namboodiri and Jain [64] Prakash et al. [90], Razzak et al. [93] Chanda et al.[14], Dhanya and Ramakrishnan [33], Ghosh and Chaudhuri [41], Sharma et al. [104], Wang et al. [127] Ben Moussa [6], Chanada et al. [14, 15, 16, 18, 19], Ferrer. et al. [36], Ghosh and Chaudhuri [41], Jaeger et al. [48], Namboodiri and Jain [64], Pan and Tang [79], Rani et al. [92], Sharma et al. [104], Wang et al. [127], Zhou et al. [134]

VI. PERFORMANCE ANALYSIS Performance of some of the most significant systems presented in the recent literature are reported in Table VI. More precisely, Table VI shows that different kinds of features, e.g., statistical features, structural features, symbols matching features and texture features, were generally used for script identification. Indeed, these features express only some characteristics of scripts that are generally not sufficient for script identification. Conversely, texture features are generally more efficient than others, but they cannot be applied reliably at word and character level within a document. Furthermore, to achieve better results, different kinds of features and classifiers were used. Experimental results show that a particular feature, that is generally efficient within a set of scripts, is not necessarily useful for other scripts. In English, Kannada, Hindi multi-script documents [32, 69, 81], texture features with a K-NN classifier were more useful than other features. For identifying different scripts in Indian documents, whatever classifier was used, both Global and Local features demonstrated to be efficient. Moreover, results obtained using a single feature (either local or global) were generally worse than those obtained using both features [31, 83]. In Han, Roman and some alphabet multi script identification methods, an SVM classifier with different extracted features achieved higher identification results than K-NN and NN classifiers [19, 55, 127]. For Arabic and Latin scripts, an RBF classifier was more efficient than a K-NN classifier [6].

References

Angadi and Kodabagi [3] Bashir and Quadri [5] Ben Moussa et al. [6] Benjelil et al. [7] Chanda et al. [13] Chanda et al. [14]

TABLE VI COMPARISON RESULTS Methods Script Feature Classifier Hindi, Kannada, Wavelet English, NA feature Malayalam, Tamil English, Project profile NA Kashmiri feature KNN, Arabic, Latin Fractal feature RBF Steerable Arabic, Latin pyramid KNN transform Statistical and Japanese, Tree structural English Classifier feature Statistical and English, Tail SVM structural

Accuracy (%)

94.33

96.2 96.64 98.72 97.5

98.79 99.36

Sinhala, Chanda et al. Tamil and [15] English English, Chanda et al. Devnagari, [16] Bangla English, Chanda et al. Devnagari, [17] Urdu

feature Statistical and structural feature Chain code histogram feature Statistical and structural feature

SVM

96.4

SVM

98.51

Tree classifier

97.51

Chain code histogram feature

SVM

98.39 (character ) 99.85 (word)

Morphological feature

KNN

97

Global and Local feature

KNN

99.96

Dhanya and Roman and Ramakrushnan Tamil [33]

Gabor filter

SVM NN KNN

96.03 91.86 90.02

Kanoun et al. Arabic, Latin [50]

Morphological and geometrical feature

KNN

96.1

Chanda et al. [18]

Dhandra et al. [28] Dhandra et al. [30]

Chinese, Japanese, Korean, Roman English, Kannada, Hindi Kannada, Roman, Devnagari

Chinese, NA SVM English Arabic, Cyrillic, Namboodiri KNN, NN, Devnagari, Stroke feature and Jain [64] SVM Han, Hebrew, Roman English, Padma and Kannada, Texture feature KNN Vijaya [69] Hindi Pal and English, Shape based Chaudhuri Devnagari NA feature [71] etc. Roman, Chinese, Statistical Pal et al. [72] Arabic, NA feature Devnagari, Bengla Statistical Pal et al. [74] Indian NA feature Chinese, Japanese, Pan et al. [80] Gabor filter NN Korean, English Patil and English, KNN Subbareddy Kannada, NA PNN [84] Hindi Pati and Gabor filter Ramakrishnan Indian NN, SVM and DTC [83] Chinese, English, Gabor filter, Greek, Peake and Tan Grey level coKorean, KNN [85] occurrence Malayalam, matrix Persian, Russian Philip and English, Gabor filter NA Samuel [88] Malayalam Gabor filter, English, Rani et al. [91] Gradient SVM Gurumukhi feature Structure, English, SVM, KNN, Rani et al. [92] Gabor, DCT Gurumukhi PNN feature Lin et al. [55]

99.6

95.5

99.33

98.5

97.33

97.52

98.5

98.0 98.89 99.6

95

96.5 98.9 99.45 99.402

2169-3536 (c) 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2017.2689159, IEEE Access

9

> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < Rezaee et al. Farsi, English [94] Roy et al. [96]

Indian

Roy et al. [97]

Indian

Saidani et al. Arabic, Latin [101] Arabic, Shijian Lu and Chinese, Tan [56] Hebrew, Roman Roman, Singhal et al. Devanagari, [116] Bangla, Telugu Roman, Waked et al. Ideographic, [126] Arabic Chinese, Wang et al. Japanese, [127] Korean, English Arabic, Chinese, Zhou et al. English, [134] Japanese, Korean, Thailand

Statistical and shape based feature Statistical feature Statistical and structural feature Structure feature

NA

96.05

NN

97.62

NN

96.79

NA

98.4

Statistical feature

KNN

95.36

Gabor filter

NA

91.6

Statistical feature Template based method

NA

SVM

Texture feature SVM, RBF

TABLE VII PUBLIC DATABASES

DB

Script Lang.

IAM

English

APTI

Arabic

APTID/ MF

Arabic

99.1

KAFD

Arabic

97.59

RIMES

French

91

It is worth noting that approaches in Table VI are difficult to be compared since performance have been estimated using databases collected in laboratory environments. In fact, although there existed several public datasets containing natural images with texts, they are mainly related to the text recognition task [109, 128]. Some datasets specifically devoted to script identification have been also realized [46, 87, 132]. Among the others, the SIW-10 dataset was developed for script identification [110]. The SIW-10 dataset contains more than 13,000 multi-scripts images including textual components from 10 languages. Therefore, to date, it can be considered as one of the most valuable benchmarking dataset for research in script recognition. VII. FUTURE WORK AND TRENDS Although in the last twenty years there have been many advances in the field of script identification, a great deal of work is still necessary to improve accuracy and efficiency of script identification systems. Some of the most valuable directions of research are here addressed. The first point is that, as Table III shows, many researchers constructed database/datasets by themselves collecting data in laboratory according to the requirement of their research work. These datasets are different in type, size, scanning resolution and image format. Besides, these datasets are not publicly available. Therefore, specific work is necessary to define some standard data formats and to realize and release public datasets for script identification.

Data type and size Handwritten, 1539 page of text images written by 657 different writers Printed, 113284 words, 10 fonts, 10 sizes, 4 styles Printed,1845 text-blocks images, 27402 characters images Printed,40 Arabic fonts, 115068 page images, 2576024 line images Handwritten mails, 1300 individuals, 12723 pages of 5605 letters

KHATT

Arabic

Handwritten, 1000 writers 4000 image pages

QUWI

English, Arabic

Handwritten, 1017 writers image pages

LAMIS- French, MSHD Arabic

Maurdor

English, Arabic, French

Special field

ReferenceYear

writer identification

U. Marti et. al [61]

2002

evaluation of screen-based OCR systems

F. Slimane et. al [117]

2009

text recog., OCR systems

F.K. Jaiem et. al [49]

2013

multi-font text recog.

H. Luqman et. al [58]

2014

writer identification, handwriting recog.

E. Augustin et. al [4]

2006

text recog., writer S.A identification, Mahmoud forms et. al [60] analysis, segmentation writer identification S.Aland gender, Maadeed age and et. al [2] handedness classification

2012

2012

Handwritten 600 Arabic, 600 Writer recog. French writer C. Djeddi handwritten, classification, 2014 et. al [35] 1300 signature signatures, verification 21000 digits Printed and S. handwritten, script Brunessau 2014 2500 document identification x et. al images [10]

Arabic, Chinese, English, Greek, Hebrew, SIW-10 Japanese, Korean, Russian, Thai and Tibetan

13045 word images, cropped from 7700, full images from 10 languages

script identification

Baoguang Shi et. al [110]

2015

APTID/ English, MF Arabic,

1. Printed Text images, English: 2328 Arabic: 1845; 2. Handwritten Text images, English: 582, Arabic: 460

script identification

I. Chtourou, A. Cheikh et. al [49]

2013

2169-3536 (c) 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2017.2689159, IEEE Access

> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < Table VII summarizes some of the public datasets that have been considered by the script identification research community. Most of these databases are referred to research in the field of OCR [117, 49], handwriting recognition [58, 4, 60], document analyzing [60], writer identification and classification [61, 4, 60, 2], signature recognition and verification [35]. Only few public databases are specifically devoted to script identification [110, 10, 49]. Moreover, databases for script identification are limited in terms of script/language type, font types and sizes, lack degraded/noisy images, etc.. In particular, many databases are devoted to specific scripts, such as English, Chinese, Arabic, German, French, Japanese, Korean, Devanagari, Bangla, whereas no databases are available for research on other scripts. Of course, along with the expansion of research on script identification, the blank of some scripts in this area will be amended in the future work. Furthermore, most of researches is based on offline script identification technology, but there are only few reports about online script identification technology. With the spreading of PDAs and smartphones, the demand of online script identification technology is increasing. At the same time, for everyday useful aims, also automatic translation should be considered. Just think to a tourist needing for a fast automatic translation of a signboard. Of course the development of online script identification systems will require a great amount of work and it is an extraordinary challenge for the research community. VIII. CONCLUSION Script identification is an important task in an OCR system for multi-lingual, multi-script documents. Many script identification methods have been proposed for written scripts at different levels within a document— page/paragraph level, text-line level, word level, and even character level. Compared to the field of document analysis and optical character recognition, research on script identification is still limited. In fact, studies were focused so far on identifying the major scripts in the world such as English, Arabic, Indian, Chinese and Japanese. Indeed, many other scripts exist that have received no apparent attention. The identification of video and camera based images is another research area for which more research is necessary, since mobile and cheap devices become more and more widespread. Concerning features and classification methods, although it is quite difficult to obtain conclusive results, Gabor filter and statistical features are certainly some of the most effective characteristics for script identification as well as kNN and SVM are the most valuable classifiers. It can be concluded that although many advancements have been made, additional research is necessary the field of script identification. A crucial step is certainly the creation of new standard databases for multi-lingual, multi-script identification, to evaluate comparatively different script recognition systems and support the research community active in the field. ACKNOWLEDGMENT This work is supported by the National Natural Science Foundation of China (No. 61163028,61363064, 61563052),

10

Special Training Plan Project of Xinjiang Uyghur Autonomous Region’s Minority Science and Technological Talents (No. 201323121), College Scientific Research Plan Project of Xinjiang Uyghur Autonomous Region (No. XJEDU2013I11), and the Open Projects Program of National Laboratory of Pattern Recognition (No. 201306321). This work is supported by the Italian Ministry of Education, University and Research (MIUR) within the PRIN2015HAND project (No. H96J16000820001).

[1]

[2]

[3]

[4]

[5] [6]

[7]

[8]

[9]

[10]

[11] [12] [13] [14] [15]

[16] [17] [18] [19] [20]

[21] [22]

[23] [24]

REFERENCES V.Ablavsky, M.R. Stevens,“Automatic feature selection with applications to script identification of degraded documents”, Proc. ICDAR, Aug. 3-6, 2003, pp. 750-754. S. Al-Maadeed, W. Ayouby, A. Hassaine, and 1. Aljaam, “QUWI: An Arabic and English Handwriting Dataset for Offline Writer Identification”, Proc 13th ICFHR, 2012, pp. 742-747,2012. S.A.Angadi, M.M.Kodabagi,“A fuzzy approach for word level script identification of text in low resolution display board images using wavelet features”, Proc. ICACCI, Aug. 22-25, 2013, pp.1804 – 1811. E. Augustin, M. Carre, G. E., J.M. Brodin, E. Geoffrois, and F. Preteux,”Rimes evaluation campaign for handwritten mail processing”, Proc. IWFHR, 2006, pp. 231-235. R. Bashir, S. Quadri, “Identification of Kashmiri script in a bilingual document image”, Proc. ICIIP, Shimla, Dec. 9-11, 2013, pp. 575 – 579. S.Ben Moussa, A.Zahour, A.Benabdelhafid, A.M.Alimi, “Fractal-based system for Arabic/Latin, printed/handwritten script identification”, Proc. ICPR, Tampa, Dec. 8-11, pp. 1–4. M. Benjelil, S. Kanoun, R. Mullot, A.M. Alimi,“Arabic and Latin Script Identification in Printed and Handwritten Types Based on Steerable Pyramid Features”, Proc. ICDAR, Barcellona, 2009, pp.591-595. M.Benjelil, R. Mullot, A. Alimi, “Language and script identification based on Steerable Pyramid Features”, Proc. ICFHR, Bari, Sept. 18-20, 2012, pp. 716-721. U. Bhattacharya, B.B.Chaudhuri, “Handwritten Numeral Databases of Indian Scripts and Multistage Recognition of Mixed Numerals”, IEEE T-PAMI, 2009, Vol. 31, No. 3, pp. 444- 457. S. Brunessaux, P.Giroux, B. Grilheres, M. Manta, M. Bodin, K.Choukri, O.Galibert, J.Kahn, ”The Maurdor project - improving automatic processing of digital documents”, Proc. DAS, 2014, pp. 349-354. A. Busch, W.W. Boles, S. Sridharan, “Texture for script identification”, in IEEE T-PAMI, 2005, vol. 27, no 11, pp.1720-1732. A.Busch, “Multi-Font script identification using texture-based features”, Image Analysis and Recognition, LNCS vol. 4142, 2006, pp. 844- 852. S.Chanda, U. Pal, F.Kimura, “Identification of Japanese and English script from a single document page”, Proc. IEEE-CIT,2007, pp.656-661. S.Chanda, O.R. Terrades, U. Pal, “SVM Based Scheme for Thai and English Script Identification”, Proc. ICDAR, Parana, 2007, pp. 551-555. S.Chanda, S. Pal, U.Pal, “Word-wise Sinhala Tamil and English Script Identification Using Gaussian Kernel SVM”, Proc. ICPR,Tampa, Dec. 8-11, 2008, pp. 1-4. S.Chanda, S. Pal, K.Franke, U.Pal, “Two-stage Approach for Word-wise Script Identification”, Proc. ICDAR,2009, pp. 926-930. S.Chanda, U.Pal,“English, Devanagari and Urdu text identification”, Proc. ICDAR, 2005, pp. 538-545. S.Chanda, K.Franke, U.Pal, “Identification of Indic scripts on torndocuments”, Proc. ICDAR,Beijing, Sept. 18-21, 2011, pp. 713 - 717. S.Chanda, U.Pal, K.Franke, F.Kimura, “Script identification – a Han and Roman script perspective”, Proc. ICPR, Istanbul, 2010, pp. 2708 - 2711. S.A.Chaudhari, R.M.Gulati, “An OCR for separation and identification of mixed English — Gujarati digits using kNN classifier”, Proc. ISSP, Gujarat, March 1-2, 2013, pp. 190 – 193. B.B.Chaudhuri, U.Pal,“An OCR system to read two Indian language scripts:Bangla and Devnagari(Hindi)”,ICDAR,Ulm,1997,pp. 1011-1015. B.B.Chaudhuri, “On multi-script OCR system evaluation”, Proc. Int. Workshop on Performance Evaluation Issues in Multi-Lingual OCR, 1999, http://www.kanungo.com/workshop/abstracts/chaudhuri.html S.Chaudhury, R.Sheth,“Trainable script identification strategies for Indian languages”, Proc. ICDAR, 1999, pp. 657-660. J.Cheng, X.Ping, G. Zhou, Y.Yang,“Script identification of document image analysis”, Proc. Int. Conf. on Innovative Computing Information and Control, Beijing, 2006, pp. 178-181.

2169-3536 (c) 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2017.2689159, IEEE Access

> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < [25] I. Chtourou, A. C. Rouhou, F. K. Jaiem, S. Kanoun, “ALTID : Arabic/Latin Text Images Database for recognition research”, Proc. 13th ICDAR, 2015, pp.836-840. [26] S.Dalal, L.Malik, “A survey of methods and strategies for feature extraction in handwritten script identification”, Proc. Int. Conf. on Emerging Trends in Engineering and Technology (ICETET), Nagpur, July 16-18, 2008, pp. 1164 – 1169. [27] M.S. Das, D.S. Rani, C.R.K. Reddy, “Heuristic based Script Identification from Multilingual Text Documents”, Proc. Int. Conf. on Recent Advances in Information Technology (RAIT), Chanbad, March 15-17,2012, pp. 487-492. [28] B.V.Dhandra, H.Maliikarjun, R. Hegadi, V. S.Malemath, “Word-wise script identification based on morphological reconstruction in printed bilingual documents”, Proc. IET Int. . on Visual Information Engineering (VIE), Bangalore, Sept. 26-28, 2006, pp. 389-393. [29] B.V.Dhandra, H.Mallikarjun, R.Hegadi, V.S.Malemath, “Word-wise Script Identification from Bilingual Documents Based on Morphological Reconstruction”, Proc. Int. Conf. on Digital Information Management, Bangalore, Dec. 6, 2006, pp. 389 – 394. [30] B.V. Dhandra, M. Hangarge,“Global and local features based handwritten text words and numerals script identification”, Proc. ICCIMA, Dec. 13-15, 2007, pp. 471-475. [31] B.V.Dhandra,M. Hangarge, R. Hegadi, V.S. Malemath, “Word level script identification in bilingual documents through discriminating features”, Proc. ICSCN, Chennai, Feb. 22-24, 2007, pp. 630-635. [32] B.V.Dhandra,P.Nagabhushan,M.Hangarge,R.Hegadi,V.S.Malemath,“Sc ript identification based on morphological reconstruction in document images”, Proc. ICPR, Hong Kong, 2006, pp. 950-953. [33] D.Dhanya, A.G.Ramakrishnan,“Script Identification in printed bilingual documents”, LNCS vol. 2423, 2002, pp. 13-24. [34] J. Ding, L. Lam, C.Y. Suen, “Classification of oriental and European scripts by using characteristic features”, Proc. ICDAR, Ulm, 1997, pp. 1023-1027. [35] C. Djeddi, A. Gattal, L.S.Meslati, I.Siddiqi, Y. Chibani, H.EI Abed, “LAMIS-MSHD: A Multi-Script offline Handwriting Database”, Proc. ICFHR, 2014, pp. 93-97. [36] M.A.Ferrer, A.Morales, U.Pal, “LBP based line-wise script identification”, Proc. ICDAR, Washington DC, 2013, pp. 369-373. [37] M. A. Ferrer, A. Morales, N. Rodríguez, U. Pal, “Multiple Training – One Test methodology for Handwritten Word-Script Identification”, Proc. 14th ICFHR, 2014, pp.754-759. [38] B.P.Gaikwad, R.R.Manza, G.R.Manza,“Video scene segmentation to separate script”, Proc. IACC, Ghaziabad, 2013, pp. 1269 – 1274. [39] D.Ghosh, A.P.Shivaprasad, “Handwritten script identification using possibilistic approach for cluster analysis”, Journal of the Indian Institute of Science, vol.80, no. 3, 2000. [40] D.Ghosh, T.Dube, A.P. Shivaprasad, “Script Recognition-A Review”, IEEE T-PAMI, vol. 32, no.12, 2010, pp. 2142-2161. [41] S.Ghosh, B.B. Chaudhuri, “Composite Script Identification and Orientation Detection for Indian Text Images”, Proc. ICDAR, Beijing, Sept. 18-21, 2011, pp. 294-298. [42] J.Gllavata, B.Freisleben, “Script Recognition in Images with Complex Backgrounds”, Proc. ISSPIT, Athens, Dec. 21, 2005, pp.589-594. [43] M.Hangarge, K.C.Santosh, R.Pardeshi, “Directional Discrete Cosine Transform for Handwritten Script Identification”, Proc. ICDAR,Washington DC, 2013, pp. 344- 348. [44] P.S.Hiremath, S.Shivashankar, J.D. Pujari, V.Mouneswara, “Script Identification in a handwritten document image using texture features”, Proc. IACC, Patiala, Feb. 19-20, 2010, pp. 110-114. [45] P.S.Hiremath, S.Shivashankar,“Wavelet based co-occurrence histogram features for texture classification with an application to script identification in a document image”, Pattern Recognition Letters, vol. 29, no. 9, 2008, pp. 1182-1189. [46] J.Hochberg, P.Kelly, T.Thomas,L. Kerns, “Automatic script identification from images using cluster-based templates”, IEEE TPAMI, vol. 19, no. 2, 1997, pp.176-181. [47] J. Hochberg, K. Bowers, M. Cannon, P. Kelly, “Script and language identification for handwritten document images”, IJDAR, vol.2, no.2, 1999, pp. 45-52. [48] S.Jaeger, H. Ma, D. Doermann, “Identifying Script on Word-Level with Informational Confidence”, Proc. ICDAR, 2005, pp. 416-420. [49] F.K. Jaiem, S. Kanoun, M. Khemakhem, H. El Abed, and S. Kardoun, ”Database for Arabic Printed Text Recognition Research”, Proc. ICIAP, 2013, LNCS 8156, pp. 251-259, 2013.

11

[50] S.Kanoun, A.Ennaji, Y.Lecourtier, A.M.Alimi, “Script and Nature Differentiation for Arabic and Latin Text Images”, Proc. IWFHR, 2002, pp. 309-313. [51] M.Khoddami, A. Behrad, “Farsi and Latin Script Identification using Curvature Scale Space Features”, Proc. NEUREL, Belgrade, Sept. 2325, 2010, pp. 213 - 217. [52] L.Lam, J. Ding, C.Y.Suen,“Differentiating between Oriental and European Scripts by Statistical Features”, IJRAI, vol. 12, no. 1, 1998, pp.63-79. [53] D.S.Lee, C.R.Nohl, H.S.Baird, “Language Identification in Complex, Unoriented, and Degraded Document Images”,Proc.DAS,1996,pp.76-98. [54] L.Li, C.L.Tan, “Script Identification of Camera-based Images”, Proc ICPR, Tampa FL, Dec. 8-11, 2008, pp. 1-4. [55] X.R. Lin, C.Y. Guo, F. Chang, “Classifying Textual Components of Bilingual Documents with Decision-Tree Support Vector Machines”, Proc. ICDAR, Beijing, Sept. 18-21, 2011, pp. 498- 502. [56] S. Lu, C.L.Tan, “Automatic Detection of Document Script and Orientation”, Proc. ICDAR, Parana, Sept. 23-26, 2007, pp. 237-241. [57] S. Lu, C.L.Tan, “Script and Language Identification in Noisy and Degraded Document Images”,IEEE T-PAMI, vol.30,no.1,2008,pp.14-24. [58] H. Luqman, S. A. Mahmoud and S. Awaida, “KAFD Arabic font database”, Pattern Recognition, vol. 47, no. 6, 2014, pp. 2231-2240. [59] H.Ma, D.Doermann, “Gabor Filter based Multi-Class Classifier for Scanned Document Images”, Proc. ICDAR, 2003, pp. 968- 972. [60] S. A. Mahmoud, A. Ahmad, M. Alshayeb, W.G. AI-Khatib, M.T. Parvez, G.A. Fink, V. Margner, and El Abed, “KHATT: Arabic Offline Handwritten Text Database”,Proc. 13th ICFHR, 2012, pp. 447-452. [61] U. Marti and H. Bunke, “The IAM-database: An English Sentence Database for On:line Handwriting Recognition”, IJDAR, vol. 5, pp. 3946, 2002. [62] S.Mori, C.Y.Suen, K.Yamamoto, “Historical Review of OCR Research and Development”, Proc. IEEE, 1992, vol. 80, no. 7, pp.1029-1058. [63] P.Nagabhushan, S.A.Angadi, B.S.Anami, “An Intelligent Pin code Script Identification Methodology Based on Texture Analysis using Modified Invariant Moments”, Proc. ICCR, 2005, pp. 615-623. [64] A.M.Namboodiri, A.K.Jain,“On-line Script Recognition”, IEEE TPAMI,2004, vol. 26, no. 1, pp. 124-130. [65] S.M.Obaidullah, C. Halder, N. Das, K. Roy, “Numeral Script Identification from Handwritten Document Images”, Proc. 11th IMCIP, Procedia Computer Science, 2015, pp.585–594. [66] S.M.Obaidullah, R. Karim, S. Shaikh, “Transform Based Approach for Indic Script Identification from Handwritten Document Images”, Proc. 3rd ICSCN, 2015, pp.1-7. [67] S.M.Obaidullah, K.Roy, N.Das,“Comparison of different classifiers for script identification from handwritten document”, Proc. ISPCC, Sept. 26-28, 2013, pp. 1 - 6. [68] M.C.Padma, P.A.Vijaya, “Monothetic Separation of Telugu. Hindi and English Text Lines from a Multi Script Document”, Proc. SMC, Oct. 11-14, 2009, pp. 4870 - 4875. [69] M.C.Padma, P.A.Vijaya, “Entropy Based Texture Features Useful for Automatic Script Identification”, Int. J. Computer Science and Engineering, Vol. 2, No. 2, 2010, pp. 115-120. [70] U. Pal, B.B.Chaudhuri, “Automatic separation of words in Indian multilingual multi-script document”, Proc. ICDAR, Ulm, 1997, pp. 576-579. [71] U.Pal, B.B.Chaudhuri, “Script line separation from Indian multi-Script documents”, Proc. ICDAR, Bangalore, 1999, pp. 406-409. [72] U.Pal, B.B.Chaudhuri, “Automatic Identification of English, Chinese, Arabic Devnagari and Bangla Script Line”, Proc. ICDAR, Seattle, 2001, pp. 790-794. [73] U.Pal, S. Sinha, B.B.Chaudhuri, “Multi-Script line identification from Indian documents”, Proc. ICDAR, Bangalore, 2003, pp.880-884. [74] U.Pal,A. Sarkar, “Recognition of Printed Urdu Script”, Proc. ICDAR, Bangalore, 2003, pp. 1183-1187. [75] U.Pal, N. Sharma, T.Wakabayashi, F.Kimura, “Handwritten Numeral Recognition of Six Popular Indian Scripts”, Proc. ICDAR, Parana, 2007, pp. 749-753. [76] U. Pal, B.B.Chaudhuri, “Identification of different script lines from Multi-script documents”, Image and Vision Computing, 2002, vol. 20, no.13-14, pp. 945-954. [77] U.Pal, B.B.Chaudhuri, “Indian script character recognition: a survey”, Pattern Recognition, 2004, vol. 37, no. 9, pp. 1887-1899. [78] U.Pal, “Automatic Script Identification: A Survey", J. Vivek, 2006, vol. 16, no.3, pp.26-35. [79] J. Pan, Y. Tang, “A rotation-robust script identification based on BEMD and LBP”, Proc. ICWAPR, Guilin, 10-13 July, 2011, pp. 165-170.

2169-3536 (c) 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2017.2689159, IEEE Access

> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < [80] W.M.Pan, C.Y. Suen, T.D. Bui, “Script Identification Using Steerable Gabor Filters”, Proc. ICDAR, 2005, pp. 883-887. [81] P.B.Pati, A.G.Ramakrishnan, “HVS Inspired System for Script Identification in Indian Multi-Script Documents”, Proc. DAS, LNCS vol. 3872, 2006, pp. 380-389. [82] P.B. Pati, S.SabariRaju, N.Pati, A.G.Ramakrishnan,“Gabor filters for document analysis in Indian Bilingual Documents”, Proc. ICISIP, 2004, pp. 123-126. [83] P.B. Pati, A.G.Ramakrishnan, “Word level multi-script identification”, Pattern Recognition Letters, 2008, vol. 29, no. 9, pp. 1218-1229. [84] S.B. Patil, N.V.Subbareddy, “Neural network based system for script identification in Indian documents”, Sadhana,2002,vol.27,no.1,pp.83-97. [85] G.S. Peake, T.N.Tan, “Script and language identification from document images”, LNCS vol. 1352, 8th BMVC, 1997, pp. 230-233. [86] L. Peng, C. Liu, X. Ding, H. Wang, “Multilingual document recognition research and its application in China”, Proc. DIAL, 2006, pp. 126-132. [87] T.Q. Phan, P. Shivakumara, Z. Ding, S. Lu, and C. L. Tan, “Video script identification based on text lines”, Proc. ICDAR, 2011, pp. 1240-1244. [88] B.Philip, R.D.S. Samuel, “A Novel Bilingual OCR for Printed Malayalam-English Text Based on Gabor Features and Dominant Singular Values”, Proc. DIP, Bangkok, March 7-9, 2009, pp. 361- 365. [89] R.Plamondon, and G. Lorette, “Automatic Signature Verification and Writer Identification – the State of the Art”, Pattern Recognition, Vol. 22, No. 2, 1989, pp. 107–131. [90] K.A. Prakash, G.Rajesh, U.A. Dinesh, M.Krisnamoorthi, N.V.Subbareddy,“Text Line Script Identification for a Trilingual Document”, Proc. ICCCNT, 2010, pp.1-3. [91] R.Rani, R.Dhir, G.S.Lehal,“Script Identification of Pre-Segmented Multi-Font Characters and Digits”, Proc. ICDAR, Washington DC, Aug. 25-28, 2013, pp. 1150–1154. [92] R.Rani, R.Dhir, G.S.Lehal, “Performance analysis of feature extractors and classifiers for script recognition of English and Gurmukhi words”, Proc. DAR, 2012, pp. 30-36. [93] M.I.Razzak,S.A.Hussain,M.Sher,“Numeral recognition for Urdu script in unconstrained environment”,Proc.ICET, 2009, pp.44-47. [94] H.Rezaee, M.Geravanchizadeh, F.Razzazi, “Automatic Language Identification of Bilingual English and Farsi Scripts”, Proc. AICT, Baku, Oct. 14-16, 2009, pp. 1- 4. [95] K.Roy, A.Banerjee, U.Pal, “A system for word-wise handwritten script identification for Indian postal automation”, Proc. INDICON, 2004, pp. 266 - 271. [96] K.Roy, U.Pal, B.B.Chaudhuri, “Neural Network based Word-wise Handwritten Script Identification for Indian Postal Automation”, Proc. ICISIP, Jan. 4-7, 2005, pp. 240-245. [97] K.Roy, K.Majumder, “Trilingual Script Separation of Handwritten Postal Document”, Proc. ICVGIP, Dec. 16-19, 2008, pp. 693-700. [98] K.Roy, S.K.Das, S.M.Obaidullah, “Script Identification from Handwritten Document”, Proc. NCVPRIPG, Dec.15-17, 2011, pp.66–69. [99] K.Roy,A.Alaei,U.Pal,“Wordwise Handwritten Persian and Roman Script Identification”, Proc. ICFHR, Kolkata, Nov. 16-18. 2010, pp. 628-633. [100] A. Saidani, A. Kacem, A. Belaid, “Co-occurrence Matrix of Oriented Gradients for Word Script and Nature Identification”, Proc. 13th ICDAR, 2015, pp.16-20. [101] A.Saidani, A.K.Echi, A. Belaid, “Identification of Machine-printed and Handwritten Words in Arabic and Latin Scripts”, Proc. ICDAR, Washington DC, Aug. 25-28, 2013, pp. 798 – 802. [102] K.R.Sanjeev, S.R.D.Sudhaker, “On Separation of Kannada and English Words from a Bilingual Document Employing Gabor Features and Radial Basis Function Neural Network”, Proc. ICCR, 2005, pp.640- 644. [103] R.Sarkar,N.Das, S.Basu, M.Kundu, M. Nasipuri, D.K.Basu, “Word level Script Identification from Bangla and Devanagri Handwritten Texts mixed with Roman Script”, Journal of Computing, Vol. 2, No. 2, 2010, pp.103-108. [104] N.Sharma, S.Chanda, U.Pal, M. Blumenstein, “Word-wise Script Identification from Video Frames”, Proc. ICDAR, Washington DC, Aug. 25-28, 2013, pp. 867 – 871. [105] N. Sharma, R. Mandal, R. Sharma, P. P. Roy, U. Pal, M. Blumenstein, “Multi-Lingual Text Recognition from Video Frames”, Proc. 13th ICDAR, 2015, pp.951-955. [106] N. Sharma, R. Mandal, R. Sharma, U. Pal, M. Blumenstein, “ICDAR2015 Competition on Video Script Identification (CVSI 2015)”, Proc. 13th ICDAR, 2015, pp. 1196 - 1200.

12

[107] N. Sharma, U. Pal, M. Blumenstein, “A Study on Word-Level Multiscript Identification from Video Frames”, Proc. IJCNN, Beijing, July 611, 2014, pp.1827-1833. [108] N. Sharma, R. Mandal, R. Sharma, U. Pal, M. Blumenstein, “Bag-ofVisual Words for Word-Wise Video Script Identification: A Study”, Proc. IJCNN, Killarney, Ireland, July 12-17, 2015, pp. 1-7. [109] A. Shahab, F. Shafait, and A. Dengel, “ICDAR 2011 robust reading competition challenge 2: Reading text in scene images”, Proc. of ICDAR, 2011, pp. 1485-1489. [110] B. Shi, C. Yao, C. Zhang, X. Guo, F. Huang, X. Bai, “Automatic script identification in the wild”, Proc. 13th ICDAR, 2015, pp. 531 – 535. [111] B. Shi, X. Bai, C. Yao, “Script identification in the wild via discriminative convolutional neural network”, Pattern Recognition, Vol. 52, 2016, pp. 448–458. [112] P. Shivakumara, N. Sharma, U. Pal, M. Blumenstein, “GradientAngular-Features for Word-wise Video Script Identification”, in Proc. ICPR, Stockholm, Aug. 24-28, 2014, pp. 3098-3103. [113] P. Shivakumara, Z. Yuan , D. Zhao , Tong Lu , C. L. Tan , “New Gradient-Spatial-Structural Features for video script identification”, Computer Vision and Image Understanding, Dec.130 ,2015, pp.35–53. [114] P. K. Singh, R. Sarkar, M. Nasipuri, “Offline Script Identification from multilingual Indic-script documents: A state-of-the-art”, Computer Science Review, Aug.15–16 , 2015, pp.1–28. [115] P. K. Singh, R. Sarkar, M. Nasipuri, D. Doermann, “Word-level Script Identification for Handwritten Indic scripts”, Proc. 13th ICDAR, 2015, pp.1106-1110. [116] V.Singhal, N.Navin, D.Ghosh, “Script-based classification of handwritten text documents in a multilingual environment”, Proc. IWRIDEMIM, March 10-11, 2003, pp. 47- 54. [117] F. Slimane, R. Ingold, S. Kanoun, M. A. Alimi, J. Hennebert, “A New Arabic Printed Text Image Database and Evaluation Protocols”, Proc. ICDAR, 2009, pp. 946-950, 2009. [118] A.L.Spitz, “Script and Language Determination from Document Images”, Proc. DAIR, Las Vegas, April, 1994, pp. 229-235. [119] A.L.Spitz, “Determination of the Script and Language Content of Document Images”, IEEE T-PAMI, Vol. 19, No. 3, 1997, 235-245. [120] G.X.Tan, C.Viard-Gaudin, A.C.Kot, “Information Retrieval Model for Online Handwritten Script Identification”Proc.ICDAR,2009,pp.336-340. [121] T.N. Tan, “Rotation invariant texture features and their use in automatic script identification”, IEEE T-PAMI, 1998, Vol. 20, No. 3, pp. 751-756. [122] Y. Tho, Y. Y. Tang, “Discrimination of Oriental and Euramerican Script Using Fractal Feature”, Proc. ICDAR, Seattle, 2001, pp.1115-1119. [123] Q.P. Trung, P.Shivakumara, Zhang Ding,Shijian Lu, C.L. Tan, “Video Script Identification based on Text Lines”, Proc. ICDAR, Beijing, Sept. 18-21, 2011, pp. 1240-1244. [124] A. Ul-Hasan, M. Z. Afzal, F. Shafait, M. Liwicki, T. M. Breuel, “A Sequence Learning Approach for Multiple Script Identification”, Proc. 13th ICDAR, 2015, pp.1046-1050. [125] T.N.Vikram, D.S. Guru, “Appearance Based Models in Document Script Identification”, Proc. ICDAR, Parana, 23-26 Sept. 2007, pp. 709-713. [126] B.Waked, S.Bergler, C.Y.Suen, Khoury, “Skew detection, page segmentation, and script classification of printed document images”, Proc. ICSMC, San Diego, Oct. 11-14, 1998, pp.4470- 4475. [127] N. Wang, L. Lam, C.Y. Suen, “Noise Tolerant Script Identification of Printed Oriental and English Documents Using a Downgraded Pixel Density Feature”, Proc. ICPR, Istanbul,Aug.23-26, 2010, pp. 2037-2040. [128] K. Wang and S. Belongie, “Word spotting in the wild”, Proc. ECCV, 2010, pp. 1-14. [129] S.L.Wood, Y. Xiaozhong, K.Krishnamurthi, L.Dang, “Language Identification for Printed Text Independent of Segmentation”, Proc. ICIP, Washington, Oct. 23-26, 1995, vol. 3, pp. 428-431. [130] S. Wshah, G. Kumar, V. Govindaraju, “Statistical script independent word spotting in off line handwritten documents”, Pattern Recognition, Vol. 47, 2014, pp.1039–1050. [131] L. Zeng, Y-Y. Tang, T-H. Cheni, “Multi-Scale Wavelet Texture-Based Scrip Identification Method”, Chinese J. of Computers, 2000, Vol. 23, No. 7, pp. 699-704. [132] D.Zhao, P.Shivakumara, S.J.Lu, C.L.Tan, “New Spatial-GradientFeatures for Video Script Identification”, Proc. DAS, 2012, pp. 38 – 42. [133] L.Zhou, Y.Lu, C. L.Tan, “Bangla/English Script Identification Based on Analysis of Connected Component Profiles”, LNCS, vol. 3872, 2006, pp. 243-254. [134] L.Zhou, X.J.Ping, E.G.Zheng, L. Guo,“Script identification based on wavelet energy histogram moment features”, Proc. ICSP, Beijing, Oct. 24-28,2010, pp. 980 – 983.

2169-3536 (c) 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Suggest Documents