Accepted manuscript to appear in IJPRAI
Accepted Manuscript
Int. J. Patt. Recogn. Artif. Intell. Downloaded from www.worldscientific.com by UNIVERSITY OF NEW ENGLAND on 04/20/18. For personal use only.
International Journal of Pattern Recognition and Artificial Intelligence
Article Title:
Handwritten Indic script identification in multi-script document images: A survey
Author(s):
Sk Md Obaidullah, K.C. Santosh, Nibaran Das, Chayan Halder, Kaushik Roy
DOI:
10.1142/S0218001418560128
Received:
25 May 2017
Accepted:
13 March 2018
To be cited as:
Sk Md Obaidullah et al., Handwritten Indic script identification in multi-script document images: A survey, International Journal of Pattern Recognition and Artificial Intelligence, doi: 10.1142/S0218001418560128
Link to final version:
https://doi.org/10.1142/S0218001418560128
This is an unedited version of the accepted manuscript scheduled for publication. It has been uploaded in advance for the benefit of our customers. The manuscript will be copyedited, typeset and proofread before it is released in the final form. As a result, the published copy may differ from the unedited version. Readers should obtain the final version from the above link when it is published. The authors are responsible for the content of this Accepted Article.
International Journal of Pattern Recognition and Artificial Intelligence
© World Scientific Publishing Company
Handwritten Indic script identification in multi-script document images: A survey
Sk Md Obaidullah
Dept. of Computer Science & Engineering, Aliah University, Kolkata, West Bengal, India
[email protected]
K.C. Santosh
Dept. of Computer Science, The University of South Dakota, SD, USA
[email protected]
Nibaran Das
Dept. of Computer Science & Engineering, Jadavpur University, Kolkata, India
[email protected]
Chayan Halder, Kaushik Roy
Dept. of Computer Science, West Bengal State University, Kolkata, India
{chayan.halder, kaushik.mrg}@gmail.com
Script identification is crucial for automating optical character recognition (OCR) in multi-script documents, since OCRs are script dependent. In this paper, we present a comprehensive survey of the techniques developed for handwritten Indic script identification. Different pre-processing and feature extraction techniques, as well as the classifiers used for script identification, are categorized, and their merits and demerits are discussed. We also provide information about some handwritten Indic script datasets. Finally, we highlight possible extensions and the future scope of this work, together with the challenges.
Keywords: Multi-script document, Online and Offline handwritten documents, Handwritten script identification, Optical character recognition.
1. Introduction
A paperless world would become a reality if we were able to digitize large volumes of physical documents. Researchers have been working towards this goal by developing several techniques to automate the processing of scanned documents. Digitized documents have several advantages, such as indexing and sorting of large volumes of data for further processing, for example document retrieval. Better storage is another important benefit, since hard copies can be damaged over time. Increasing efforts and
Fig. 1. Two multi-script postal document images: Bangla and Roman scripts (left), and Roman and Devanagari scripts (right) 51
suitable discussions on image to alphanumeric text conversion systems are found in the literature 68,29. The history of optical character recognition (OCR) dates back to 1870, when Carey 29 invented the retina scanner, which used photocell based image transmission. After the digital computer was invented, the need for OCR in document processing was felt in the late 1960s. As per record, IBM developed the first commercial OCR to read the special font of various IBM machines 29. Since then, smart systems have emerged that can handle complex and heterogeneous documents (containing text, graphics and mathematical symbols, in addition to degradation and noise). Ready-to-use commercial systems are in demand for several applications, such as automatic sorting of postal documents. However, OCR for multi-script documents (as in India, where 11 scripts and 22 languages are used officially) requires a priori knowledge of the particular script, since OCRs are script dependent. In our daily lives, we come across several different multi-script documents, such as postal documents, pre-printed application forms and railway reservation forms. Fig. 1 shows two real-world multi-script postal document images in which (i) Bangla and Roman scripts and (ii) Roman and Devanagari scripts are used to write the address block. Similarly, Fig. 2 shows examples of multi-script document images. In both cases, script identification is required so that the appropriate script-specific OCR can be selected. The problem becomes more complex when handwritten documents are considered because, unlike printed text, writing styles vary a lot over time and from one person to another. Therefore, handwritten script identification is still an open challenge. In this paper, we present a survey of handwritten Indic script identification techniques reported up to 2016 (plus a few major works from 2017).
From this survey, the following key issues are addressed: (1) We report script identification techniques up to 2016 (plus a few major works from 2017).
Fig. 2. Multi-script document images: Devanagari and Roman (left) and Bangla and Roman (right) 56
(2) In addition to the conventional script identification task, we discuss recent trends in this field: online script identification and video script identification. (3) We present the available datasets along with their sources. (4) We provide future directions together with challenges.
For a quick understanding, a block diagram of a multi-script document processing system is shown in Fig. 3. First, multi-script document images are pre-processed (noise removal, foreground-background separation, skew detection and correction, and segmentation). Script identification is then performed at different levels, namely page, block, line, word and character, depending on the need. This in turn makes OCR efficient, since OCRs are script dependent.
Fig. 3. Block diagram of a multi-script document processing system showing different modules
The rest of the paper is organized as follows. A brief overview of Indic languages and scripts is given in Section 2. In Section 3, we discuss state-of-the-art techniques for handwritten Indic script identification, including glimpses of online and video script identification techniques. In Section 4, we discuss different feature types, classifiers and evaluation protocols. State-of-the-art datasets on Indic scripts are presented in Section 5. Challenges and future scope are discussed in Section 6. Finally, we conclude the paper in Section 7.
2. Scripts and languages of India
A script can be described as a set of graphemes used to write a single language or a class of languages. It is very common for a language to be written in its own particular script. In such cases the phrases script identification and language identification have the same meaning and should not cause confusion. Examples of such scripts are Oriya, Tamil, Telugu, Urdu, Gujarati and Kannada. On the other hand, the Devanagari, Bangla and Roman scripts are each used by more than one language. For example, the Devanagari script is used by languages such as Bodo, Konkani, Marathi, Maithili, Nepali, Sanskrit, Sindhi and Hindi; the Roman script is used by English and Santali; and the Bangla script is used to write Bangla, Assamese and Manipuri. For such cases it is preferable to use the term script identification instead of language identification. Fig. 4 shows a map where the names of different states are written in their state-specific scripts 41. Detailed information about official Indic languages and scripts can be found in the literature 38,69,1,14.
Indic scripts follow alphabetic writing systems, which are further classified into three types: (i) Abjad, (ii) Abugida and (iii) True alphabetic. Abjad is one of the oldest writing systems, in which there is one symbol per consonant and vowels are not marked. A few Abjads, for example Hebrew and Arabic, also have markings for vowels; however, these are used only in special contexts, such as teaching. Many scripts derived from Abjads have been extended with vowel symbols to form full alphabetic writing systems 1. An example of an Abjad writing system is Urdu, a popular script in many South Asian countries that is also used in many places in India. In contrast, in an Abugida both vowels and consonants are present. The Brahmic family of scripts is the largest single group of Abugidas. This script family is further classified into three major categories, namely (i) Gupta, (ii) Kadamba and (iii) Grantha. The official scripts currently used in India are all descendants of the Brahmic family. These scripts are used mostly by the languages of South Asia and mainland Southeast Asia (the only exceptions are Malaysia and Vietnam). The primary division is into Southern and North Indic scripts. Southern Indic scripts come under the Gupta family.
These scripts are used in South India, Sri Lanka and Southeast Asia. North Indic scripts, which come under the Kadamba and Grantha families, are used in Northern India, Nepal, Tibet and Bhutan. We found that the characters of South Indic scripts are very rounded in shape and those of North Indic scripts less so, with the exception of the Oriya script, whose
Fig. 4. A map showing different scripts for different states 41
characters are mostly rounded in shape. A few of the North Indic scripts, such as Bangla, Devanagari and Gurumukhi, contain a ‘matra’ or ‘shirorekha’, which is a horizontal line at the top of characters or words (fully or partially). In contrast, the South Indic scripts do not contain any ‘matra’ or ‘shirorekha’ 69,14. The Latin or Roman script, one of the most popular scripts in India, follows the True alphabetic writing system. Fig. 5 presents the writing systems in India with an example of each script 35.
3. Script identification techniques
In general, script identification techniques are divided, according to how the raw data or image is acquired, into two main categories: offline and online. In the offline case, inputs are provided in the form of images, whereas in the online case, inputs are ordered sequences of points captured in real time. In an online system, additional information regarding stroke direction and pressure can also be captured. We further divide all the offline and online script identification techniques into five major categories based on the segmentation scheme applied prior to feature extraction. These are: (i) Page level script identification, (ii) Block level script identification, (iii) Line level script identification, (iv) Word level script identification, and (v) Character level script identification. Among these, character level script identification is not very common to date, as scripts are normally not differentiable at the character level; the script is mainly identified at one of the word, line, block
Fig. 5. Writing systems of Indic scripts 35
or page level. An artistic document image (where one word contains characters from different scripts) can be challenging (see Section 6), and to the best of our knowledge, very little work has been reported to date 48. In recent years, a few survey articles on script identification techniques have been reported 62,72. However, these reviews are not complete and do not focus on handwritten scripts. Considering the inherent challenges and complexities of handwritten script identification, in this paper we present a comprehensive survey of handwritten Indic script identification techniques.
3.1. Offline script identification techniques
3.1.1. Page level script identification
The page level approach ensures fast feature computation as it is completely segmentation free: the whole document page is taken as input for further processing. Hochberg et al. (1999) 18 reported connected component analysis for identifying six scripts: Arabic, Chinese, Cyrillic, Devanagari, Japanese and Latin. Small components were filtered out based on the pixel count in the bounding box, and long, thin components were identified in a similar way. The mean and standard deviation of the components’ bounding box height and width were then measured. In a second filtering pass, large components were removed. Using the remaining connected components and visual observations, a feature set was generated (relative centroid, number of white holes, sphericity and aspect ratio). Finally, a linear discriminant function (LDF) classifier was used for script identification; the authors reported that LDF performed better than a neural network-based classifier. Zhu et al. (2009) 74 proposed a scheme based on a shape codebook for identifying eight scripts: Arabic, Chinese, Roman, Hindi, Japanese, Korean, Russian and Thai. This was a mixed work covering two Indic and six non-Indic scripts. In their work, a shape codebook was constructed from geometrically invariant feature types, which were structurally indexed. Rather than selecting class-specific features, as is traditional, they distinguished texts collectively using the statistics of a large variety of generic, geometrically invariant feature types. After constructing the codebook, contour features were extracted using a two-step procedure. First, edges were computed using the Canny edge detector (Canny 1986) 7, which gives precise localization and a unique response to text content. Second, contour segments were grouped by connected components.
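Both schemes above start from connected components. As a rough illustration only, the following Python sketch labels 8-connected components in a small binary image, drops tiny components and computes a few bounding-box features of the kind mentioned above (aspect ratio, pixel density, relative centroid). The function names, the `min_area` threshold and the toy image are assumptions made for this example; they are not taken from Hochberg et al. or Zhu et al.

```python
from collections import deque


def connected_components(img):
    """Label 8-connected foreground (1) pixels; return one pixel list per component."""
    rows, cols = len(img), len(img[0])
    seen = [[False] * cols for _ in range(rows)]
    comps = []
    for r in range(rows):
        for c in range(cols):
            if img[r][c] != 1 or seen[r][c]:
                continue
            queue, pixels = deque([(r, c)]), []
            seen[r][c] = True
            while queue:  # breadth-first flood fill
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            comps.append(pixels)
    return comps


def component_features(pixels):
    """Bounding-box features of one component."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    return {
        "area": len(pixels),
        "aspect_ratio": width / height,
        "density": len(pixels) / (width * height),
        # centroid offset from the top-left corner, as a fraction of box size
        "rel_centroid_y": (sum(ys) / len(ys) - min(ys)) / height,
        "rel_centroid_x": (sum(xs) / len(xs) - min(xs)) / width,
    }


def filtered_features(img, min_area=3):
    """Drop components smaller than min_area (a hypothetical threshold)."""
    return [component_features(p) for p in connected_components(img)
            if len(p) >= min_area]


# Toy page: one ring-shaped component and one isolated noise pixel.
page = [
    [1, 1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
features = filtered_features(page)
```

In this toy image the isolated pixel is removed by the area filter, so only the ring-shaped component contributes features. A real system would add further descriptors, such as the number of white holes, and tune the thresholds on data.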
The grouped contours were then fitted locally into line segments using an algorithm that splits a line segment in two based on a threshold. For each connected component, every triplet of connected line segments starting from the current segment was extracted. Finally, a dissimilarity measure was computed between them: the overall dissimilarity between two contour features was quantified by the weighted sum of the distances in length and orientation. Using a multi-class SVM classifier, an average classification rate of 95.6% was achieved. Features such as fractal dimension, circularity, chain code and small components have proven capable of identifying six Indic scripts (Bangla, Devanagari, Roman, Oriya, Urdu and Malayalam) in the work reported by Obaidullah et al. (2013) 38. The fractal dimension is effective in distinguishing ‘matra’ based scripts from their counterparts: if the average fractal dimension of the top and bottom profiles is computed for these two types (with and without ‘matra’), there is a significant difference in average pixel density. The circularity feature was well suited to distinguishing scripts such as Oriya and Malayalam from the others. Similarly, using small component analysis, scripts such as Urdu can be easily distinguished from the others. Using
an MLP classifier, an average accuracy rate of 92.8% was reported, although for the Bangla and Devanagari scripts the rate dropped by 8.4% and 9.4%, respectively, from the average rate. It can be observed that the misclassification rates between these two scripts were 9.4% (Bangla as Devanagari) and 8.3% (Devanagari as Bangla). In 8.3% of cases the Devanagari script was misclassified as Malayalam because of structural similarity, which remains an open problem. Since the work used a small dataset, the evaluation may not be convincing. Obaidullah et al. (2015) 37 proposed a page level technique to identify Bangla, Devanagari, Roman and Urdu scripts using convolution-based features, namely a Gabor filter bank and directional morphological filters. The Gabor filter is a very popular texture analysis tool; it was used with varying frequencies and orientations to compute feature values, and the filter bank parameters were chosen experimentally. In addition, directional morphological filters were used. Observing the presence of different directional strokes in Indic scripts, four morphological filters were built: horizontal, vertical, left diagonal and right diagonal. The morphological operations of dilation and erosion were then carried out to extract the prominent directional strokes from the scripts. Feature values were computed as ratios between the original images and the dilated and eroded images. The reported average accuracies across the test settings, including the bi-script and tri-script tests, were 94.4%, 97.5% and 98.2%.
In Obaidullah et al. (2015) 36, different frequency domain techniques, namely the discrete cosine transform (DCT), distance transform (DT), Radon transform (RT) and fast Fourier transform (FFT), were applied to convert the page level images to the frequency domain, and statistical feature values were then computed to identify four eastern Indian scripts: Bangla, Roman, Devanagari and Oriya.
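To make the frequency domain idea concrete, the following minimal Python sketch computes a 2-D DCT of a small image patch and summarises the coefficients with two statistics. The patch size, the orthonormal scaling and the choice of mean and standard deviation of the absolute coefficients are assumptions made for illustration; the surveyed work does not state here which statistics were used, and a practical system would rely on an optimised transform rather than this direct implementation.

```python
import math


def dct_1d(values):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(values)
    out = []
    for k in range(n):
        total = sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, v in enumerate(values))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out


def dct_2d(img):
    """Separable 2-D DCT: transform every row, then every column."""
    rows = [dct_1d(row) for row in img]
    cols = [dct_1d(col) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]  # back to row-major order


def dct_statistics(img):
    """Mean and standard deviation of absolute DCT coefficients (assumed features)."""
    coeffs = [abs(c) for row in dct_2d(img) for c in row]
    mean = sum(coeffs) / len(coeffs)
    var = sum((c - mean) ** 2 for c in coeffs) / len(coeffs)
    return mean, math.sqrt(var)


# A flat patch has energy only in the DC coefficient; vertical stripes
# spread energy over horizontal frequencies.
flat = [[1.0] * 4 for _ in range(4)]
stripes = [[1.0, 0.0, 1.0, 0.0] for _ in range(4)]
flat_coeffs = dct_2d(flat)
stripe_coeffs = dct_2d(stripes)
stripe_mean, stripe_std = dct_statistics(stripes)
```

Because the transform is orthonormal, the total energy of a patch is preserved in its coefficients, which makes statistics computed on different patches comparable.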
For this frequency domain approach, the reported average accuracies across the test settings, including the bi-script and tri-script tests, were 88.1%, 94.3% and 89.7%. In this work, the performance of each individual frequency domain technique was not measured; rather, all of them were combined to compute the final feature vector. Comparing the two previous works by the same authors in terms of features, we found that component level features perform well in distinguishing different Indic scripts in comparison with the frequency domain approach. Another advantage of component level features is their speed, since frequency domain techniques are relatively slow when applied to whole images. Singh et al. (2015) 61 proposed a texture based approach using a modified log-Gabor filter to distinguish eight different scripts: Bangla, Devanagari, Gurumukhi, Oriya, Tamil, Telugu, Urdu and Roman. During feature computation, five scales (ns = 1, 2, 3, 4, 5) and six orientations (no = 0°, 30°, 60°, 90°, 120°, 150°) were considered. Each filter was then convolved with the input image to generate 30 different response matrices per image. These response matrices were converted into the final features, giving a 30 × 240 feature matrix, where 240 is the total number of document pages. Different classifiers: Naive Bayes, simple
logistic (SL), MLP, SVM, random forest, bagging and multi-class classifiers were used. Obaidullah et al. (2017) 35 proposed page level script identification for the eleven official Indic scripts. They built a dataset of 1458 handwritten pages covering all 11 official Indic scripts and reported baseline results. Structural, visual appearance-based and directional stroke-based features were used. For classification, two well-known classifiers, MLP and SL, were used. The authors also reported the performance of voting between these two classifiers. The average identification rates for the all-script, tri-script and bi-script tests were 98.60%, 99.37% and 99.66%, respectively.
3.1.2. Block level script identification
In block level script identification techniques, blocks of a predefined size are extracted from the document images. The block size can be, for example, 64 × 64, 128 × 128 or 512 × 512. In some cases, the extracted sub-image blocks require padding with white pixels, since some characters in a block touch the block boundary. Features are then computed from the sub-image blocks. Kanoun et al. (2002) 20 proposed a hybrid scheme for script identification from Arabic and Latin document images, designed for both printed and handwritten documents. They collected morphology-based features that were applied globally to text blocks, together with local features based on geometrical analysis at the line level and connected component level. During morphological analysis, they extracted the connected components of a text block and localized a reference line for each text line. Morphological features such as diacritic dots, occlusions and the alif character were extracted; other connected components were treated as traces. Features such as diacritic dots and the alif character relied mostly on heuristic and/or empirically designed thresholds. For each text line, they detected diacritic dots and the occlusion position (bottom or top) by comparing the coordinates of the last components with the reference line. During geometrical analysis, the physical structure of the textual entities was measured. They obtained features such as pixel density, eccentricity and spheroid on text lines and connected components. Finally, using a KNN classifier, they obtained classification rates of 88% for Arabic handwritten text and 98% for Latin handwritten text. Singhal et al. (2003) 66 used rotation invariant texture features based on multi-channel Gabor filtering and the gray level co-occurrence matrix for feature extraction.
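As a hedged illustration of the gray level co-occurrence matrix mentioned above, the sketch below builds a normalised co-occurrence matrix for a single pixel offset and derives three commonly used texture descriptors from it. The offset, the number of gray levels, the descriptor set and the function names are assumptions chosen for this example; they are not the configuration reported by Singhal et al.

```python
def glcm(img, levels, dy=0, dx=1):
    """Normalised co-occurrence matrix for one pixel offset (dy, dx)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(img), len(img[0])
    total = 0
    for y in range(rows):
        for x in range(cols):
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols:
                counts[img[y][x]][img[ny][nx]] += 1
                total += 1
    if total == 0:
        raise ValueError("image too small for the requested offset")
    return [[c / total for c in row] for row in counts]


def glcm_features(p):
    """Contrast, energy and homogeneity of a normalised co-occurrence matrix."""
    cells = [(i, j, p[i][j]) for i in range(len(p)) for j in range(len(p))]
    return {
        "contrast": sum(v * (i - j) ** 2 for i, j, v in cells),
        "energy": sum(v * v for _, _, v in cells),
        "homogeneity": sum(v / (1 + abs(i - j)) for i, j, v in cells),
    }


# Toy 4 x 4 block quantised to four gray levels (0..3).
block = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
matrix = glcm(block, levels=4)          # horizontal neighbour, distance 1
texture = glcm_features(matrix)
```

Averaging such descriptors over several offsets and directions is one common way to reduce their sensitivity to rotation.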
With these texture features, variations in writing style, character size, and inter-line and inter-word spacing can be tackled. At the pre-processing stage, they performed denoising, thinning and pruning using basic morphological operations. After that, a connectivity and linking process was carried out to repair broken components. Text size normalization was done by adjusting the text height, inter-word spacing and left-right justification. Features were then
extracted using multi-channel Gabor filtering and the gray level co-occurrence matrix. Finally, they used a multi-prototype classifier, a combination of K-means clustering, fuzzy C-means clustering and probabilistic clustering methods. They reported script identification rates of 90% for Devanagari, 86.6% for Bangla, 96.7% for Telugu and 93.3% for Latin scripts. On the whole, an average score of 91.64% was obtained. Zhou et al. (2006) 73 proposed a line level script identification technique for Bangla and Roman scripts based on connected component analysis; the same technique was also applied at the word and character levels. First, connected component labelling was done. They then selected meaningful connected components based on an empirically chosen pixel area, so that very small regions were eliminated from further processing. They also extracted the topmost and bottom-most profiles. Finally, on a dataset of 1200 images, they reported a script identification rate of 95%. Hangarge et al. (2010) 16 reported feature extraction from text blocks of size 128 × 128 based on 13 global spatial features. They emphasised visual cues, since visual observation is an important tool for identifying features from document images. Considering Devanagari, Roman and Urdu scripts, they first extracted features based on observations such as the ‘matra’ or ‘shirorekha’, which separates Devanagari from Urdu and Roman. Similarly, Roman script has more vertical strokes than horizontal strokes, compared with the other two scripts. Urdu script has a strong baseline as well as right diagonal strokes, and it has fewer holes than the other two scripts. Different stroke density based features, such as vertical stroke density and horizontal stroke density, were also considered in their work.
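One simple way to picture a stroke density feature is to count long runs of foreground pixels along rows and columns. The Python sketch below does this under stated assumptions: a stroke is taken to be a run of at least `min_len` consecutive foreground pixels, and density is the number of such runs divided by the image area. This definition, the threshold and the function names are hypothetical and are not the exact formulation used by Hangarge et al.

```python
def count_runs(lines, min_len):
    """Count runs of consecutive 1s with length >= min_len in each line."""
    runs = 0
    for line in lines:
        length = 0
        for pixel in list(line) + [0]:  # trailing 0 closes a run at the border
            if pixel == 1:
                length += 1
            else:
                if length >= min_len:
                    runs += 1
                length = 0
    return runs


def stroke_densities(img, min_len=3):
    """Horizontal and vertical stroke counts normalised by image area."""
    area = len(img) * len(img[0])
    horizontal = count_runs(img, min_len)        # scan along rows
    vertical = count_runs(zip(*img), min_len)    # scan along columns
    return {"horizontal": horizontal / area, "vertical": vertical / area}


# Toy word image: a headline across the top with two vertical strokes below.
word = [
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
]
densities = stroke_densities(word)
```

On this toy image the headline yields one horizontal stroke and the two stems yield two vertical strokes, so the two densities differ by a factor of two. Diagonal stroke densities could be obtained in the same way by scanning along diagonals.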
Hangarge et al. further used morphological features such as horizontal openings and bottom hat and top hat transformations. Finally, using a KNN classifier, they reported script identification rates of 99.2% for bi-script documents and 88.6% for tri-script documents. Rajput et al. (2010) 46 proposed a scheme for script identification using DCT and DWT-based features. They considered eight Indic scripts: Roman, Devanagari, Kannada, Tamil, Bangla, Telugu, Punjabi and Malayalam. First, input images were manually decomposed into blocks of size 512 × 512. Feature vectors were then computed using DCT and DWT. For classification, they considered Roman, Hindi and one regional language to form a tri-script scenario. Using a KNN classifier, they reported an average tri-script classification rate of 96.4%. Basu et al. (2010) 5 proposed a novel framework considering four scripts, Latin, Devanagari, Bangla and Urdu, for the problem of postal code digit identification from an address block in multi-script documents. First, they localized the postal address block within the postal document using a Hough transform-based concept. The isolated handwritten digit patterns were then extracted. They grouped the digit patterns of the four scripts into 25 clusters based on their appearance (pattern similarity). A script independent unified pattern classifier was used to classify the numeric postal codes into one of these 25 clusters. For classification, a rule
based script identification engine was designed. The authors used a quad-tree based image partitioning scheme for feature extraction. The quad-tree based scheme has the advantage of extracting local information from each of the zones, regardless of the computational cost. The reported average script identification accuracy from 10-fold cross validation was 92.03% (using SVM). Obaidullah et al. (2015) 33 reported a block level script identification technique for six scripts: Bangla, Devanagari, Malayalam, Oriya, Roman and Urdu. The approach used automatically extracted image blocks of size 512 × 512, followed by a two-stage binarization technique. A feature vector of size 34 was constructed using different frequency domain techniques: RT, DCT, FFT and DT. A greedy attribute selection technique was applied to reduce the final feature vector to size 20. They used a dataset of 600 image blocks, divided into training and testing sets in a 2:1 ratio (hold-out evaluation protocol). Using a logistic model tree classifier, the reported average, bi-script and tri-script identification rates were 84%, 95.33% and 88.89%, respectively.
3.1.3. Line level script identification
Multi-script documents at the line level are very common, so line segmentation is done first and features are then extracted from the lines. In what follows, we explain the line level script identification techniques in detail. Moussa et al. (2008) 31 used fractal based features for script identification of Arabic and Latin scripts. Their scheme works for both handwritten and printed document images. First, they applied a morphological transformation to the text line images. They then computed fractal analysis features for the original 2-D image texture as well as for the profiles. Finally, they obtained 12 features based on multidimensional fractal analysis. They tested their system on 1000 prototypes with various typefaces, script styles and sizes. The script identification rates were 96.64% (with KNN) and 98.72% (with RBF). Rajput and Anita (2011) 47 proposed a scheme based on Gabor filters for identifying an unknown script in a bi-script document. Eight Indic scripts were considered: Roman, Devanagari, Kannada, Tamil, Bangla, Telugu, Punjabi and Malayalam. First, they created a Gabor filter bank with six different orientations and three different frequencies, giving 18 filters. They convolved the input image with this Gabor filter bank. For each output image, they extracted the cosine and sine parts, computed their standard deviations and combined them. Using a KNN classifier, they reported a 100% script identification rate for bi-script documents containing Roman and one other regional script. Obaidullah et al. (2016) 40 proposed a line level script identification technique for eight Indic scripts: Bangla, Devanagari, Kannada, Malayalam, Oriya, Roman, Telugu and Urdu. Lines were automatically extracted and a dataset of 782 lines was
prepared. The features used in this work are structure based features such as circularity, rectangularity, convex hull and fractal dimension, together with directional morphological filters, interpolation and Gabor filters. The effectiveness of fractal dimension and structural features has already been reported by Obaidullah et al. (2013) 38. Fractal dimension is itself a very lightweight and efficient feature for distinguishing scripts with ‘matra’, such as Bangla and Devanagari, from scripts without ‘matra’, such as Roman and Urdu. Other script dependent features are used in combination with the fractal features. As per the reported results, they achieved average and bi-script identification accuracies of 95.7% and 98.51%, respectively, using an MLP classifier.
3.1.4. Word level script identification
Word level script identification is very common compared to the other approaches available. It requires thorough segmentation into words. We refer readers to a few works on line segmentation and word segmentation techniques 19,25,26,43,3 . In this section, we concentrate on script identification from segmented words (segmented either automatically or manually). Roy et al. (2004) 51 proposed a method for script identification from postal documents considering Bangla/Devanagari and English. In this work towards Indian postal automation, skewed documents were first corrected and non-text parts were separated using the run length smoothing algorithm. Using a piecewise projection method, the destination address block was decomposed into lines so that a set of words could be collected from each line. For feature extraction, they considered features that detect the shirorekha, water reservoir features and simple morphological features (e.g., small connected components). The water reservoir principle is effective in terms of performance but suffers from a heavy computational cost. Using a tree classifier, they reported a script identification rate of 89%. In another work, Roy et al. (2005) 53 used fractal based, busy-zone based, water reservoir based, small component based and topological features to classify Bangla and English. Fractal and busy-zone based features are useful for easily distinguishing ‘matra’ based scripts from their counterparts. An MLP was used for classification and achieved a script identification rate of 97.62%. In Roy et al. (2006) 52 , the authors extended the work to identify Bangla and Oriya and reported a classification rate of 97.69%. Similarly, in 2010 50 , the authors proposed a script identification scheme considering Roman and Persian scripts. Using a KNN classifier, they reported a script identification rate of 99.20%. Benjelil et al.
(2009) 6 reported a work considering Arabic and Roman scripts using steerable pyramid (SP) features. The SP is a linear multi-scale, multi-orientation image decomposition that provides a handy front end for image processing and computer vision applications. It is able to capture texture variation in both orientation and intensity. The image was first separated into low pass and high pass sub-bands using low pass and high pass filters. The low pass sub-band was
then divided into a set of oriented band pass sub-bands and a lower pass sub-band. This lower pass sub-band was sub-sampled by a factor of 2 along both axes. The pyramid was built recursively by applying the same decomposition to the sub-sampled lower pass sub-band. The basis functions of the steerable pyramid are directional derivative operators that come in different sizes and orientations. The reported script identification rates were 97% for Arabic and 96% for Roman. Sarkar et al. (2010) 55 proposed a word level script identification technique for Bangla and Devanagari handwritten texts mixed with Roman script. In this work, they extracted the text lines and words from document pages using a script independent neighboring component analysis technique (Khandelwal et al. 2009) 21 . Horizontalness, segmentation related and foreground-background transition related features were considered. The horizontalness property is directly related to the ‘matra’ or ‘shirorekha’, which is present in Bangla and Devanagari but not in Roman. This feature was extracted by calculating the row-wise sum of continuous black pixels. For the segmentation features, they considered the number of extra pixels and the number of segmentation point pixels. For the foreground-background transition feature, they observed that the horizontal pixel density varies across regions, and therefore took the changeover between foreground and background pixels as a feature for separating ‘matra’-based scripts from their counterparts. The average script identification rates were 99.29% for the Bangla-Roman combination and 98.43% for the Devanagari-Roman combination. Singh et al. (2013) 65 reported a technique that automatically identifies the script of handwritten words in a document page written in Devanagari script mixed with Roman script. In their work, 39 distinctive features were extracted and classification was done using an MLP with 3-fold cross validation.
An average script identification rate of 99.54% was reported. An application for automatically classifying the content type of torn documents, based on the script of the handwritten text, was proposed by Chanda et al. (2011) 8 . For classification, they used rotation invariant Zernike moment features with an SVM classifier. In addition, gradient features were computed for a comparative analysis between rotation dependent and rotation independent features. They reported an average identification rate over eleven scripts of 81.39% for isolated components and 94.65% at word level. Dhandra and Hangarge (2007) 11 reported a word level script identification technique for text words and numerals in three Indic languages: Kannada, Roman and Devanagari. They carried out (i) script identification of text words of these three languages using features based on morphological filters and regional descriptors, and (ii) script identification of Kannada and Roman handwritten numerals. Stroke density, pixel density, aspect ratio, eccentricity and extent were used. The overall results were 96.05% for word script identification and 99% for numeral script identification. In a recent work, Hangarge et al. (2013) 15 proposed a word level script identification technique considering Roman, Devanagari, Kannada, Telugu, Tamil and Malayalam scripts. They captured diagonal edge based shape information using 1D and 2D DCT, which they reported as directional DCT based features. First, the input word image matrix was normalized into a square matrix by zero padding. 1D and 2D DCT were then computed for each of the N − 2 upper and lower diagonals (assuming an N × N matrix), and their standard deviations were computed to reduce the feature vector size. Conventional DCT values were also computed by dividing the whole word image into four zones. Altogether, this gives a feature vector of size 10, comprising six features from the directional DCT and four features from the conventional DCT. In their tests, bi-script and tri-script identification rates of 96.95% and 96.42%, respectively, were reported. Their approach is novel in that it captures inner directional features that are very dominant in South Indian and Devanagari scripts. Pardeshi et al. (2014) 44 reported a technique based on different image transforms to identify 11 Indic scripts: Roman, Devanagari, Urdu, Kannada, Oriya, Gujarati, Bangla, Gurumukhi, Tamil, Telugu and Malayalam, using SVM and KNN classifiers. The Radon transform, DWT, statistical filters and DCT were used for feature extraction. Experiments were carried out on 28100 word images of these scripts, and the average bi-script and tri-script identification rates were 98% and 96%, respectively. A limitation of this work is its execution time in comparison with local or script dependent features, which are computationally cheaper. Singh et al. (2015) 63 reported a word level script identification technique for seven Indic scripts, namely Bangla, Devanagari, Gurumukhi, Malayalam, Oriya, Telugu and Roman. They used elliptical and polygon approximation based techniques.
The authors prepared a dataset of 7000 words from handwritten pages and, using 5-fold cross validation, reported an average identification rate of 95.35% with an MLP classifier. Obaidullah et al. (2015) 39 proposed a word level handwritten numeral script identification technique for four Indic scripts: Bangla, Devanagari, Roman and Urdu. The authors used a feature vector of size 55, composed of Daubechies wavelet decomposition features. The word images were decomposed into different coefficients at the first level, and the wavelet entropy was computed. The script identification rates were not competitive, since the features were not sufficient to distinguish similar digits that occur in two or more classes. To address this issue, instead of treating it as a four class problem in which each script is a separate class, one can treat it as a 31 class problem by combining all digits and excluding the common ones.

3.1.5. Character level script identification

Unlike the other levels, character level script identification has not attracted much attention since, in general, multi-script documents do not mix scripts at the character level. Therefore, a notable number of works is not available in the literature. However, we do come across artistic documents, where a single word contains multiple script
characters (see Section 6). Considering such a challenge, Rani et al. (2013) 48 introduced a script identification technique for pre-segmented, multi-font characters and digits. Two scripts, English and Gurumukhi, were used, and 19448 segmented characters and digits were considered. They computed 189 features using Gabor filters with different directional frequencies and 200 features using a gradient based technique. With a multi-class SVM classifier (four classes: Roman and Gurumukhi, each comprising characters and digits), the average script identification rates were 98.9% and 99.45%, respectively, with 10-fold cross validation. The Gabor filter was highlighted as an important texture feature for characterizing different Indic scripts. Gradient information works well at character level compared to word, line, block or page level.

3.1.6. Summary: script identification at different levels

We summarize our discussion as follows:
(1) The numbers of reported works can be ranked as follows: word level > block level > page level > line level > character level.
(2) Considering the number of scripts used, we found that all official Indic scripts have been considered at word level, which is not the case at the other levels.
(3) Regarding classifiers, MLP and KNN are the preferred classifiers irrespective of the level of work.

3.2. Online script identification techniques
An online script identification system handles handwritten data captured in real time. The available works are discussed below. Namboodiri and Jain (2004) 32 proposed a scheme for online script identification for six scripts: Arabic, Cyrillic, Devanagari, Han, Hebrew and Roman, of which three are Indic scripts. They used stroke and shirorekha-based features and a complex set of classifiers consisting of KNN, a Bayes quadratic classifier, a Bayesian classifier with a mixture of Gaussian densities, a decision tree, a neural network and an SVM for script identification. The reported script identification rate was 95% at word level and 95.5% at line level. Tan et al. (2009) 70 proposed a scheme based on an information retrieval model to identify Arabic, Roman and Tamil scripts. Their work consists of three stages: (i) prototype building, (ii) document indexing and (iii) retrieval. At the prototype building stage, they used seven of the eleven features inspired by Namboodiri and Jain (2004) 32 . A KNN classifier was used for clustering the line segments extracted from the training set. They then computed the script similarity based on the distribution of frequency vectors computed from each document. Using a chi-square distance classifier, the reported average script identification rate was 93.3%. Schenk et al. (2009) 57 proposed a novel approach for script line identification, for script normalization and feature extraction in online handwritten whiteboard
note recognition. They first used script lines to normalize the skew and size of the text lines. The feature vector of a standard recognition system was then segmented by the explicit script line membership of each sample point, aiming to reduce confusions between characters that differ in size rather than shape. Their experiments showed a relative improvement of 3.3% in character level accuracy and 3.4% in word level accuracy over a baseline system in which handwriting recognition is performed considering the baseline of the input words/lines 24 .

3.3. Video script identification
Besides online and offline script identification, dynamic (video) script identification, which refers to script identification from a moving scene such as a video frame, is an important issue. Few works have been reported in the literature to date 28,45,58 . Dynamic script identification helps in automatic content-based video indexing and retrieval, which is essential for handling the huge amount of multimedia data on the internet, for instance. The ICDAR 2015 competition on Video Script Identification 60 gave the domain a different and challenging orientation. Benchmark results on word level video script identification were reported for 10 different official Indic scripts: Roman, Devanagari, Bangla, Oriya, Gujarati, Gurumukhi, Kannada, Tamil, Telugu and Arabic.

4. Features, classifiers and evaluation protocol
In this section, the main issues are discussed: features, classifiers and evaluation protocol.
Features. Feature extraction and selection is one of the primary issues in any classification problem, since it largely affects the overall performance. The selected features should be able to discriminate between classes and be easy to compute (i.e., computationally inexpensive).
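As a minimal sketch of features that are both discriminative and cheap to compute, the following Python code illustrates two of the word level properties described in Section 3.1.4: a 'horizontalness' measure based on the longest run of continuous black pixels in a row (a long run suggests a 'matra'/'shirorekha'), and the number of foreground-background changeovers per row. The function names, the list-of-lists binary image representation (1 = black) and the normalization by image width are illustrative assumptions, not the exact formulation used in any of the surveyed works.

```python
def longest_run(row):
    """Length of the longest run of consecutive black (1) pixels in a row."""
    best = current = 0
    for pixel in row:
        current = current + 1 if pixel == 1 else 0
        best = max(best, current)
    return best


def horizontalness(image):
    """Longest black run over all rows, normalized by the image width.

    Values close to 1.0 suggest a headline ('matra'/'shirorekha') spanning
    the word; the normalization is an assumption made for illustration.
    """
    width = len(image[0])
    return max(longest_run(row) for row in image) / width


def transitions_per_row(image):
    """Number of foreground/background changeovers in each row."""
    return [sum(1 for a, b in zip(row, row[1:]) if a != b) for row in image]


# Toy 4x8 word image with a headline in the first row (hypothetical data).
with_matra = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 1, 1, 0, 1],
]
# Toy word image without a headline (hypothetical data).
without_matra = [
    [0, 1, 0, 0, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

print(horizontalness(with_matra))       # 1.0
print(horizontalness(without_matra))    # 0.25
print(transitions_per_row(with_matra))  # [0, 5, 5, 5]
```

Both features need only a single pass over the pixels, which is why features of this kind remain attractive when computational resources are limited.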
Classifiers. Like selecting the best features, choosing the best classifier is important. At the same time, one needs to understand the usefulness of their combination (feature-classifier). Such a feature-classifier combination cannot be generalized, since it is entirely application dependent. For the script identification problem, we have found that MLP has been widely used 18,33,36,37,38,39,40,53,50,52,55,65,63,59,28 in comparison to others. After that, KNN 20,16,46,47,31,6,15,44,32 and SVM 74,5,48,44,59 are the two other popular classifiers. In addition, LDA, simple logistic, multi prototype, etc. have also been used 18,66,61,15 . Besides these, a few other classifiers have been used: a Euclidean distance based classifier by Spitz 67 to classify Chinese, Japanese and Korean printed scripts; a statistical classifier by Lam et al. (1998) 23 to classify printed Latin and Oriental scripts; a frequency-based classifier by Lam et al. (1998) 23 to classify Chinese, Japanese and Korean scripts; a Hamming distance based classifier by Hochberg et al. (1997) 17 to classify printed Arabic, Armenian, Devanagari, Chinese, Cyrillic, Burmese, Ethiopic, Japanese, Hebrew, Greek, Korean, Latin and Thai scripts; and a feed forward NN by Elgammal and Ismail (2001) 13 to classify Arabic and Latin scripts.

Validation (evaluation protocol). How the data are split into training and test sets plays a major role in script identification.
(1) Cross validation is preferable to residual based evaluation. The problem with residual evaluation is that it gives no indication of how well the learner will do when it makes new predictions on unseen data. This problem can be tackled by not using the entire data set for training: some samples are removed before training begins and are then used to test the learned model.
(2) The holdout method is the simplest form of cross validation. The data set is divided into two sets, training and testing. The function approximator fits a function using the training set only. It is then asked to predict the output values for the data in the test set (it has never seen these output values before). The resulting errors are accumulated to give the mean absolute test set error, which is used to evaluate the model. The major advantage of this method is that it takes very little time to compute. However, its evaluation can have very high variance: it may depend heavily on which data points end up in the training set and which end up in the test set, and may therefore differ significantly depending on how the division is made.
(3) The holdout method can be improved with K-fold cross validation. Here, the data set is divided into k subsets and the holdout method is repeated k times. Each time, one of the k subsets is used as the test set and the remaining k − 1 subsets are put together to form the training set. The average error across all k trials is then calculated. The advantage of this method is that every data point is in a test set exactly once and in a training set k − 1 times. The variance of the resulting estimate is reduced as k is increased. The disadvantage is the computational cost: the training algorithm has to be rerun from scratch k times, so an evaluation takes k times as much computation.
(4) Leave-one-out cross validation is K-fold cross validation taken to its logical extreme, with K equal to N , the number of data points in the set. None of the surveyed methods have used this technique to validate their results.
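The K-fold and leave-one-out protocols described for validation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the helper names (`k_fold_indices`, `cross_validate`), the toy one-dimensional data and the nearest-class-mean classifier are all hypothetical and chosen only to keep the example self-contained; any classifier (MLP, KNN, SVM) could take its place.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_nearest_mean(samples, labels):
    """Toy classifier (an assumption): store the mean feature of each class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(model, x):
    """Assign x to the class whose stored mean is closest."""
    return min(model, key=lambda y: abs(model[y] - x))


def cross_validate(samples, labels, k):
    """Average error rate over k rounds; each fold is the test set once."""
    folds = k_fold_indices(len(samples), k)
    errors = []
    for test_fold in folds:
        held_out = set(test_fold)
        train_idx = [i for i in range(len(samples)) if i not in held_out]
        model = train_nearest_mean([samples[i] for i in train_idx],
                                   [labels[i] for i in train_idx])
        wrong = sum(predict(model, samples[i]) != labels[i] for i in test_fold)
        errors.append(wrong / len(test_fold))
    return sum(errors) / k


# Hypothetical 1-D feature values for two well separated script classes.
samples = [0.1, 0.2, 0.15, 0.3, 0.25, 0.9, 0.8, 0.85, 0.95, 0.7]
labels = ["Roman"] * 5 + ["Bangla"] * 5

print(cross_validate(samples, labels, k=5))             # 5-fold
print(cross_validate(samples, labels, k=len(samples)))  # leave-one-out (K = N)
```

Note that the model is retrained from scratch in every round, which is the source of the k-fold increase in computation mentioned in item (3).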
Performance evaluation metrics. Some standard and relevant performance evaluation metrics used for evaluating script identification systems are: recognition rate, precision, recall and F-measure. A brief description of each is provided below.
(1) Recognition rate: The recognition rate can be expressed as $\text{Recognition rate} = \frac{TP}{\text{Total}} \times 100$.
(2) Precision, recall and F-measure: Precision can be defined as the proportion of test samples that truly belong to a particular class among all those that were classified to that class, $\text{Precision} = \frac{TP}{TP + FP}$, where TP and FP are true positives and false positives, respectively. Recall can be defined as $\text{Recall} = \frac{TP}{TP + FN}$, where FN is false negatives. The F-measure is calculated as a combined value of precision and recall: $\text{F-measure} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$.

Apart from these standard evaluation metrics, there are a few other measures, such as kappa statistics, mean absolute error and relative absolute error.
(3) Kappa statistics: This measures the agreement of the predictions with the true class. It can be defined as $KS = \frac{P(A) - P(E)}{1 - P(E)}$, where P(A) is the percentage agreement and P(E) is the chance agreement. KS = 1 indicates complete agreement and KS = 0 indicates chance agreement.
(4) Mean and relative absolute error: The mean absolute error (MAE) is the average of the absolute differences between the predicted results and the target (actual) results over all test cases. The relative absolute error (RAE) is the absolute error relative to what the error would have been if the prediction had simply been the average of the target values.
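The metrics above can be computed directly from the true and predicted labels. The following Python sketch is illustrative only: the function names, the toy labels, and the use of the marginal label frequencies to estimate the chance agreement P(E) (as in Cohen's kappa) are assumptions, since the text does not fix these details.

```python
def precision_recall_f(y_true, y_pred, positive):
    """Precision, recall and F-measure for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def recognition_rate(y_true, y_pred):
    """Percentage of samples assigned to their true class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)


def kappa(y_true, y_pred):
    """KS = (P(A) - P(E)) / (1 - P(E)); P(E) from label marginals (assumed)."""
    n = len(y_true)
    p_a = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    classes = set(y_true) | set(y_pred)
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in classes)
    return (p_a - p_e) / (1 - p_e)


def mae_and_rae(targets, predictions):
    """Mean absolute error and relative absolute error for numeric outputs."""
    n = len(targets)
    abs_err = sum(abs(p - t) for t, p in zip(targets, predictions))
    mean_t = sum(targets) / n
    # Error of the trivial predictor that always outputs the target mean.
    baseline = sum(abs(mean_t - t) for t in targets)
    return abs_err / n, abs_err / baseline


# Hypothetical labels for a two-script test set of eight words.
y_true = ["Bangla"] * 4 + ["Roman"] * 4
y_pred = ["Bangla", "Bangla", "Bangla", "Roman",
          "Roman", "Roman", "Roman", "Bangla"]

print(recognition_rate(y_true, y_pred))              # 75.0
print(precision_recall_f(y_true, y_pred, "Bangla"))  # (0.75, 0.75, 0.75)
print(kappa(y_true, y_pred))                         # 0.5
print(mae_and_rae([1.0, 0.0, 1.0, 0.0], [0.5, 0.0, 1.0, 0.5]))  # (0.25, 0.5)
```

In this toy example the recognition rate is 75%, while kappa is only 0.5 because half of the observed agreement would already be expected by chance for two equally frequent classes.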
5. Indic script datasets (available)
For all research work, the availability of standard datasets provides an opportunity to make fair comparisons. Datasets for Bangla, Roman, Devanagari and Telugu scripts have been reported in the literature 56,4,10 . Several versions are available at two different levels, word and character, including numeral datasets. A huge Urdu dataset was developed in 2009 54 . This dataset contains isolated digits, numeral strings with and without decimal points, five special symbols, 44 isolated characters and 57 Urdu words. The IAM dataset is a very popular Roman script dataset, comprising 1539 pages, 5685 sentences, 13353 lines and 115320 words distributed at document, line, sentence and word levels 30 . A Kannada script dataset consisting of 204 documents, 4298 lines and 26115 words, distributed over document, line and word levels, was developed in 2011 2 . About 51 writers of varying age groups contributed to this dataset. In the same year, a character level Devanagari script dataset, covering both alphabets and numerals, was developed 12 from more than 750 different writers. An offline sentence dataset of Urdu handwritten documents was reported in 2012 49 , to which more than 200 native writers contributed. The Qatar University Writer Identification (QUWI) dataset 27 contains handwritten
sentences in Arabic and Roman scripts. This dataset was prepared in 2013 and consists of a total of 4068 handwritten documents written by 1017 volunteers of varying ages, nationalities, genders and education levels. The CVL dataset is a writer information retrieval dataset contributed by 309 writers, and it consists of a total of 2163 handwritten form pages 22 . Though the German script is not of interest here, researchers can use the Roman part of this dataset to test their algorithms. The Tamil-DB dataset was developed in 2013 and contains Tamil city names contributed by 500 writers 71 . In 2006, a huge Bangla numeral dataset was also reported 9 . This dataset consists of 8348 online numeral strings and 23392 offline isolated numerals. In Archana 2010 42 , 100000 words collected with the help of 600 different subjects were reported as part of a Tamil and Kannada handwriting recognition work. A page level dataset named PBOK, covering four different scripts (Persian, Bangla, Oriya and Kannada), was proposed in 2012 3 . This dataset has a total of 707 text pages containing 12565 text lines, 104541 words and 423980 characters. PBOK was built from the contributions of 436 writers. Ground truth based on pixel information and on content information was reported for this dataset. In 2015, a word level numeral dataset of four popular Indic scripts, namely Bangla, Devanagari, Roman and Urdu, was reported 34 . This dataset (5659 numeral strings) contains 1602 Bangla, 1139 Devanagari, 1602 Roman and 1316 Urdu images, and 43 different writers contributed to it. It is the first of its kind with numeral word images from four popular Indic scripts. Very recently, Singh et al. (2017) 64 presented a page level mixed type dataset consisting of a total of 300 document pages, of which 150 are Bangla-Roman and 150 are Devanagari-Roman.

6. Current challenges
Until 2017, most script identification works considered three popular scripts: Devanagari, Bangla and Roman. Moreover, the availability of datasets is another issue to be taken care of. In the works surveyed in this paper, handwritten script identification was done at different levels (page, block, line, word and character), where segmentation remains the major challenge (except at page level). In what follows, we outline major directions that researchers may take up as further open challenges.
(1) Standardized pre-processing module development. In any automatic script identification system, pre-processing is a crucial step. Standard steps are: normalization of the skew and size of the documents, removal of unwanted noise and binarization.
(2) Segmentation free technique for identification of document type (single/multi-script). To design a fully automatic OCR, human intervention needs to be avoided. As said earlier, segmentation techniques are required to be precise enough. To this end, to some extent, existing works do not offer a fully automated segmentation free approach.
(3) Mobile/handheld devices based script and language identification. Mobile and other handheld devices have far fewer computational resources than high-end processors. Developing algorithms for such applications is seriously constrained by these computational resource restrictions. Language or script identification from images captured with these devices is a very challenging task and an area of research interest for the near future.
(4) Artistic document script identification. Fig. 6 shows sample artistic document images where several scripts (multi-script) occur at character level. Finding scripts in these images is a challenging task due to segmentation problems, complex backgrounds and uneven contrast.
(5) Script and language identification from scene images. Text detection from scene images is one of the current research interests. It has several applications such as tracking license plates of moving vehicles, developing automated vehicles, building software that helps blind people walk freely on the road, building biometric devices, and extracting GPS information from Google map (see Fig. 7).

Fig. 6. Artistic documents: words showing multiple scripts at character level.
Fig. 7. Sample scene images (different scripts).
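As a small illustration of the binarization step listed under challenge (1), the following Python sketch selects a global threshold with Otsu's method and maps dark pixels to ink. Otsu's method is used here only as one common choice; it is an assumption and is not prescribed by the surveyed works. The function names and the toy gray values are hypothetical.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level that maximizes the between-class variance."""
    hist = [0] * levels
    for value in pixels:
        hist[value] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_threshold, best_variance = 0, -1.0
    weight_low, sum_low = 0, 0
    for level in range(levels):
        weight_low += hist[level]
        if weight_low == 0:
            continue  # no pixels at or below this level yet
        weight_high = total - weight_low
        if weight_high == 0:
            break  # all pixels are already in the low class
        sum_low += level * hist[level]
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(pixels, threshold):
    """Map dark pixels (<= threshold) to 1 (ink) and the rest to 0."""
    return [1 if value <= threshold else 0 for value in pixels]


# Hypothetical gray values: four dark ink pixels, four bright paper pixels.
gray = [10, 12, 11, 13, 200, 210, 205, 220]
threshold = otsu_threshold(gray)
print(threshold)                  # 13
print(binarize(gray, threshold))  # [1, 1, 1, 1, 0, 0, 0, 0]
```

A single global threshold of this kind is cheap, but it can fail on documents with uneven contrast, which is one reason a standardized pre-processing module remains an open issue.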
7. Conclusion
We have presented a comprehensive survey of handwritten Indic script identification techniques available in the literature. To date, several attempts have
been made to work on offline and online script identification. In general, previous works can be divided into categories by level: page, block, line, word and character. We have also reported the state of the art in Indic script datasets. Further, we have observed that no generic approach exists for the script identification problem (at least for two different scripts). Similarly, we have found that feature-classifier selection is application dependent. Besides, we have noticed that script identification in the online/portable device environment could be taken up as one of the challenging applications in the domain. Another potential application is video/dynamic script identification, which helps in automating video content archiving on the web. Interestingly, unlike the other levels, character level script identification in artistic document images is a new research direction in this field.

References
1. Abugida, Abugida writing system, http://en.wikipedia.org/wiki/Abugida (2016), Online; accessed 1 August 2016.
2. A. Alaei, P. Nagabhushan and U. Pal, A benchmark Kannada handwritten document dataset and its segmentation, in International Conference on Document Analysis and Recognition (2011) pp. 141–145.
3. A. Alaei, U. Pal and P. Nagabhushan, A new scheme for unconstrained handwritten text-line segmentation, Pattern Recognition 44 (2011) 917–928.
4. S. Basu, C. Chaudhuri, M. Kundu, M. Nasipuri and D. Basu, Text line extraction from multi-skewed handwritten documents, Pattern Recognition 40(6) (2007) 1825–1839.
5. S. Basu, N. Das, R. Sarkar, M. Kundu, M. Nasipuri and D. K. Basu, A novel framework for automatic sorting of postal documents with multi-script address blocks, Pattern Recognition 43(10) (2010) 3507–3521.
6. M. Benjelil, S. Kanoun, R. Mullot and A. M. Alimi, Arabic and Latin script identification in printed and handwritten types based on steerable pyramid features, in 10th International Conference on Document Analysis and Recognition (2009) pp. 591–595.
7. J. Canny, A computational approach to edge detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 8(6) (1986) 679–698.
8. S. Chanda, K. Franke and U. Pal, Identification of Indic scripts on torn-documents, in 2011 International Conference on Document Analysis and Recognition (2011) pp. 713–717.
9. B. Chaudhuri, A complete handwritten numeral database of Bangla – a major Indic script, in G. Lorette (ed.), 10th International Workshop on Frontiers in Handwriting Recognition (Suvisoft, La Baule (France), October 2006).
10. N. Das, R. Sarkar, S. Basu, M. Kundu, M. Nasipuri and D. K. Basu, A genetic algorithm based region sampling for selection of local features in handwritten digit recognition application, Applied Soft Computing 12(5) (2012) 1592–1606.
11. B. V. Dhandra and M. Hangarge, Global and local features based handwritten text words and numerals script identification, in International Conference on Computational Intelligence and Multimedia Applications (ICCIMA 2007), Vol. 2 (2007) pp. 471–475.
12. V. J. Dongre and V. H. Mankar, Development of comprehensive Devnagari numeral and character database for offline handwritten character recognition, CoRR abs/1309.5357 (2013).
13. A. M. Elgammal and M. A. Ismail, Techniques for language identification for hybrid Arabic-English document images, in Proceedings of Sixth International Conference on Document Analysis and Recognition (2001) pp. 1100–1104.
14. D. Ghosh, T. Dube and A. Shivaprasad, Script recognition: a review, IEEE Transactions on Pattern Analysis and Machine Intelligence 32(12) (2010) 2142–2161.
15. M. Hangarge, K. C. Santosh and R. Pardeshi, Directional discrete cosine transform for handwritten script identification, in 2013 12th International Conference on Document Analysis and Recognition (2013) pp. 344–348.
16. M. Hangarge and B. Dhandra, Offline handwritten script identification in document images, International Journal of Computer Applications 4(6) (2010) 6–10.
17. J. Hochberg, L. Kerns, P. Kelly and T. Thomas, Automatic script identification from images using cluster-based templates, in Proceedings of 3rd International Conference on Document Analysis and Recognition, Vol. 1 (1995) pp. 378–381.
18. J. Hochberg, K. Bowers, M. Cannon and P. Kelly, Script and language identification for handwritten document images, International Journal on Document Analysis and Recognition 2(2) (1999) 45–52.
19. C. Huang and S. N. Srihari, Word segmentation of off-line handwritten documents, in Electronic Imaging 2008 (2008) pp. 68150E–68150E.
20. S. Kanoun, A. Ennaji, Y. Lecourtier and A. M. Alimi, Script and nature differentiation for Arabic and Latin text images, in Proceedings Eighth International Workshop on Frontiers in Handwriting Recognition (2002) pp. 309–313.
21. A. Khandelwal, P. Choudhury, R. Sarkar, S. Basu, M. Nasipuri and N. Das, Text line segmentation for unconstrained handwritten document images using neighborhood connected component analysis, in S. Chaudhury, S. Mitra, C. A. Murthy, P. S. Sastry and S. K. Pal (eds.), Pattern Recognition and Machine Intelligence: Third International Conference, PReMI 2009, New Delhi, India, December 16-20, 2009, Proceedings (Springer Berlin Heidelberg, Berlin, Heidelberg, 2009) pp. 369–374.
22. F. Kleber, S. Fiel, M. Diem and R. Sablatnig, CVL-database: An off-line database for writer retrieval, writer identification and word spotting, in 2013 12th International Conference on Document Analysis and Recognition (2013) pp. 560–564.
23. L. Lam, J. Ding and C. Y. Suen, Differentiating between oriental and european scripts by statistical features, International Journal of Pattern Recognition and Artificial Intelligence 12(01) (1998) 63–79.
24. M. Liwicki, A. Graves, H. Bunke and J. Schmidhuber, A novel approach to on-line handwriting recognition based on bidirectional long short-term memory networks, in 9th International Conference on Document Analysis and Recognition, ICDAR 2007 (2007) pp. 367–371.
25. G. Louloudis, B. Gatos, I. Pratikakis and C. Halatsis, Text line detection in handwritten documents, Pattern Recognition 41(12) (2008) 3758–3772.
26. G. Louloudis, B. Gatos, I. Pratikakis and C. Halatsis, Text line and word segmentation of handwritten documents, Pattern Recognition 42(12) (2009) 3169–3183, New Frontiers in Handwriting Recognition.
27. S. A. Maadeed, W. Ayouby, A. Hassane and J. M. Aljaam, QUWI: An Arabic and English handwriting dataset for offline writer identification, in International Conference on Frontiers in Handwriting Recognition (2012) pp. 746–751.
28. Z. Malik, A. Mirza, A. Bennour, I. Siddiqi and C. Djeddi, Video script identification using a combination of textural features, in 2015 11th International Conference on Signal Image Technology and Internet Based Systems (SITIS) (2015) pp. 61–67.
29. J. Mantas, An overview of character recognition methodologies, Pattern Recognition 19(6) (1986) 425–430.
30. U. Marti and H. Bunke, The iam-dataset: An english sentence dataset for off-line handwriting recognition, International Journal on Document Analysis and Recognition 5 (2002) 39–46.
31. S. B. Moussa, A. Zahour, A. Benabdelhafid and A. M. Alimi, Fractal-based system for arabic/latin, printed/handwritten script identification, in 19th International Conference on Pattern Recognition (2008) pp. 1–4.
32. A. M. Namboodiri and A. K. Jain, Online handwritten script recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 26(1) (2004) 124–130.
33. S. M. Obaidullah, N. Das, C. Halder and K. Roy, Indic script identification from handwritten document images - an unconstrained block-level approach, in IEEE 2nd International Conference on Recent Trends in Information Systems (ReTIS) (2015) pp. 213–218.
34. S. M. Obaidullah, C. Halder, N. Das and K. Roy, A corpus of word-level offline handwritten numeral images from official indic scripts, in 2nd Springer International Conference on Computer and Communication Technologies (IC3T 2015) (2015) pp. 703–711.
35. S. M. Obaidullah, C. Halder, K. C. Santosh, N. Das and K. Roy, Phdindic 11: page-level handwritten document image dataset of 11 official indic scripts for script identification, Multimedia Tools and Applications (2017).
36. S. M. Obaidullah, R. Karim, S. Shaikh, C. Halder, N. Das and K. Roy, Transform based approach for indic script identification from handwritten document images, in 2015 3rd International Conference on Signal Processing, Communication and Networking (ICSCN) (2015) pp. 1–7.
37. S. M. Obaidullah, N. Das and R. Kaushik, Convolution based technique for indic script identification from handwritten document images, International Journal of Image, Graphics and Signal Processing 7(5) (2015) p. 49.
38. S. M. Obaidullah, S. K. Das and K. Roy, A system for handwritten script identification from indian document, Journal of Pattern Recognition Research 8(1) (2013) 1–12.
39. S. M. Obaidullah, C. Halder, N. Das and K. Roy, Numeral script identification from handwritten document images, Procedia Computer Science 54 (2015) 585–594.
40. S. M. Obaidullah, C. Halder, N. Das and K. Roy, An Approach for Automatic Indic Script Identification from Handwritten Document Images, in R. Chaki, A. Cortesi, K. Saeed and N. Chaki (eds.), Advanced Computing and Systems for Security: Volume 2. (Springer India, New Delhi, 2016), pp. 37–51.
41. S. M. Obaidullah, A. Mondal, N. Das and K. Roy, Script identification from printed indian document images and performance evaluation using different classifiers, Applied Computational Intelligence and Soft Computing 2014(article id 896128) (2014) 1–12.
42. A. C. P., N. B., V. Kumar, A. G. Ramakrishnan and S. K., Creation of a huge annotated database for tamil and kannada ohr, in International Conference on Frontiers in Handwriting Recognition (IEEE Computer Society, Los Alamitos, CA, USA, 2010) pp. 415–420.
43. U. Pal, S. Sinha and B. B. Chaudhuri, Multi-script line identification from indian documents, in 7th International Conference on Document Analysis and Recognition (ICDAR) (2003) pp. 880–884.
44. R. Pardeshi, B. B. Chaudhuri, M. Hangarge and K. C. Santosh, Automatic handwritten indian scripts identification, in 2014 14th International Conference on Frontiers in Handwriting Recognition (2014) pp. 375–380.
45. T. Q. Phan, P. Shivakumara, Z. Ding, S. Lu and C. L. Tan, Video script identification based on text lines, in International Conference on Document Analysis and Recognition (ICDAR) (2011) pp. 1240–1244.
46. G. G. Rajput and H. B. Anita, Handwritten script recognition using dct and wavelet features at block level, International Journal of Computer Application Special Issue on Recent Trends in Image Processing and Pattern Recognition (2010) 158–163.
47. G. G. Rajput and H. B. Anita, Handwritten script identification from a bi-script document at line level using gabor filter, in International Workshop on Soft Computing Applications and Knowledge Discovery (2011) pp. 94–101.
48. R. Rani, R. Dhir and G. S. Lehal, Script identification of pre-segmented multi-font characters and digits, in 2013 12th International Conference on Document Analysis and Recognition (2013) pp. 1150–1154.
49. A. Raza, I. Siddiqi, A. Abidi and F. Arif, An unconstrained benchmark urdu handwritten sentence database with automatic line segmentation, in 2012 International Conference on Frontiers in Handwriting Recognition (Sept 2012) pp. 491–496.
50. K. Roy, A. Alaei and U. Pal, Word-wise handwritten persian and roman script identification, in 2010 12th International Conference on Frontiers in Handwriting Recognition (Nov 2010) pp. 628–633.
51. K. Roy, A. Banerjee and U. Pal, A system for word-wise handwritten script identification for indian postal automation, in Proceedings of the IEEE INDICON 2004. First India Annual Conference, 2004. (2004) pp. 266–271.
52. K. Roy and U. Pal, Word-wise Hand-written Script Separation for Indian Postal automation, in G. Lorette (ed.), Tenth International Workshop on Frontiers in Handwriting Recognition (Suvisoft, La Baule (France), October 2006) http://www.suvisoft.com.
53. K. Roy, U. Pal and B. B. Chaudhuri, Neural network based word-wise handwritten script identification system for indian postal automation, in Proceedings of 2005 International Conference on Intelligent Sensing and Information Processing, 2005. (2005) pp. 240–245.
54. M. W. Saqheer, C. L. H, N. Nobile and C. Y. Suen, A new large urdu dataset for off-line handwriting recognition, in 15th International Conference on Image Analysis and Processing (2009) pp. 538–546.
55. R. Sarkar, N. Das, S. Basu, M. Kundu, M. Nasipuri and D. K. Basu, Word level script identification from bangla and devanagri handwritten texts mixed with roman script, Journal of Computing abs/1002.4007 (2010).
56. R. Sarkar, N. Das, S. Basu, M. Kundu, M. Nasipuri and D. K. Basu, Cmaterdb1: A database of unconstrained handwritten bangla and bangla–english mixed script document image, Int. J. Doc. Anal. Recognit. 15 (March 2012) 71–83.
57. J. Schenk, J. Lenz and G. Rigoll, Novel script line identification method for script normalization and feature extraction in on-line handwritten whiteboard note recognition, Pattern Recognition 42 (2009) 3383–3393.
58. N. Sharma, S. Chanda, U. Pal and M. Blumenstein, Word-wise script identification from video frames, in 2013 12th International Conference on Document Analysis and Recognition (2013) pp. 867–871.
59. N. Sharma, U. Pal and M. Blumenstein, A study on word-level multi-script identification from video frames, in 2014 International Joint Conference on Neural Networks (IJCNN) (2014) pp. 1827–1833.
60. N. Sharma, R. Mandal, R. Sharma, U. Pal and M. Blumenstein, Icdar2015 competition on video script identification (cvsi 2015), in 13th International Conference on Document Analysis and Recognition (ICDAR) (2015) pp. 1196–1200.
61. P. K. Singh, I. Chatterjee and R. Sarkar, Page-level handwritten script identification using modified log-gabor filter based features, in 2015 IEEE 2nd International Conference on Recent Trends in Information Systems (ReTIS) (2015) pp. 225–230.
62. P. K. Singh, R. Sarkar and M. Nasipuri, Offline script identification from multilingual indic-script documents: A state-of-the-art, Computer Science Review 15-16 (2015) 1–28.
63. P. K. Singh, R. Sarkar, M. Nasipuri and D. Doermann, Word-level script identification for handwritten indic scripts, in 2015 13th International Conference on Document Analysis and Recognition (ICDAR) (2015) pp. 1106–1110.
64. P. K. Singh, R. Sarkar, N. Das, S. Basu, M. Kundu and M. Nasipuri, Benchmark databases of handwritten bangla-roman and devanagari-roman mixed-script document images, Multimedia Tools and Applications (2017).
65. P. K. Singh, R. Sarkar, N. Das, S. Basu and M. Nasipuri, Identification of Devnagari and Roman Scripts from Multi-script Handwritten Documents, in P. Maji, A. Ghosh, M. N. Murty, K. Ghosh and S. K. Pal (eds.), Pattern Recognition and Machine Intelligence: 5th International Conference, PReMI 2013, Kolkata, India, December 10-14, 2013. Proceedings. (Springer Berlin Heidelberg, Berlin, Heidelberg, 2013), pp. 509–514.
66. V. Singhal, N. Navin and D. Ghosh, Script-based classification of hand-written text documents in a multilingual environment, in Proceedings. Seventeenth Workshop on Parallel and Distributed Simulation (2003) pp. 47–54.
67. A. L. Spitz, Determination of the script and language content of document images, IEEE Transactions on Pattern Analysis and Machine Intelligence 19(3) (1997) 235–245.
68. O. System, A literature survey, http://shodhganga.inflibnet.ac.in/bitstream/10603/4166/10/10 chapter%202.pdf (2016), Online; accessed 1st August 2016.
69. W. System, Writing system of india, http://en.wikipedia.org/wiki/Writing system (2016), Online; accessed 1st August 2016.
70. G. X. Tan, C. Viard-Gaudin and A. C. Kot, Information retrieval model for online handwritten script identification, in 2009 10th International Conference on Document Analysis and Recognition (2009) pp. 336–340.
71. S. Thadchanamoorthy, N. D. Kodikara, H. L. Premaretne, U. Pal and F. Kimura, Tamil handwritten city name database development and recognition for postal automation, in 12th International Conference on Document Analysis and Recognition (2013) pp. 793–797.
72. K. Ubul, G. Tursun, A. Aysa and I. Yibulayin, Script identification of multi-script documents: a survey, IEEE Access (99) (2017) 1–12.
73. L. Zhou, Y. Lu and C. L. Tan, Bangla/English Script Identification Based on Analysis of Connected Component Profiles, in H. Bunke and A. L. Spitz (eds.), Document Analysis Systems VII: 7th International Workshop, DAS 2006, Nelson, New Zealand, February 13-15, 2006. Proceedings. (Springer Berlin Heidelberg, Berlin, Heidelberg, 2006), pp. 243–254.
74. G. Zhu, X. Yu, Y. Li and D. Doermann, Language identification for handwritten document images using a shape codebook, Pattern Recogn. 42 (December 2009) 3184–3191.