Writer Identification from Handwritten Devanagari Script

Chayan Halder, Kishore Thakur, Santanu Phadikar and Kaushik Roy

Abstract This paper presents an analysis of Devanagari characters for writer identification. Derived from the Brahmic family of scripts, Devanagari is the most popular script in India and is used by over 400 million people around the world. Writer identification from handwritten Devanagari characters has a wide range of applications: Questioned Document Examination (QDE), an area of forensic science whose main purpose is to answer questions about a questioned document (authenticity, authorship and others); signature verification in banking; and graphology (the study of handwriting), a theory or practice for inferring a person's character, disposition and attitudes from their handwriting. We collected 5 copies of the handwritten characters from each of 50 different people, mainly students, to account for intra-writer variation. After preprocessing and character extraction, a 64-dimensional feature is computed from the gradient of each character image. Some manual processing is required because some noise lies too close to the characters to be removed automatically. We used the LIBLINEAR and LIBSVM classifiers of the WEKA environment to measure the individuality of characters. Writer identification using all the characters together reached 99.12 % accuracy with LIBLINEAR over all writers. The features collected in this work can be used at the next level to identify writers from their cursive writing.

C. Halder (✉) · K. Roy
Department of Computer Science, West Bengal State University, Barasat, Kolkata 700126, West Bengal, India

K. Thakur · S. Phadikar
Department of Computer Science, West Bengal University of Technology, Salt Lake, Kolkata 700064, West Bengal, India

© Springer India 2015
J.K. Mandal et al. (eds.), Information Systems Design and Intelligent Applications, Advances in Intelligent Systems and Computing 340, DOI 10.1007/978-81-322-2247-7_51




Keywords Individuality of handwriting · Writer identification · Handwriting analysis · WEKA · LIBLINEAR · LIBSVM · Devanagari

1 Introduction

India is a multilingual and multi-script country. Many languages are spoken, such as Sanskrit, Urdu, Hindi, Bengali, Tamil, Telugu and Punjabi. Hindi is an official language of India and the most widely spoken language in the country. Hindi is written in the Devanagari script, which is also used for many other languages such as Marathi, Konkani and Sanskrit. For more than three decades, language recognition and writer identification have been important areas of research, with applications in forensic science, banking, graphology and other fields [1]. According to graphology, aspects of human behavior such as a person's character, disposition and attitudes can be inferred from handwriting. It is therefore highly probable that a person can be authenticated and identified from his or her handwriting. Various works on writer identification have been reported in the literature; for example, Srihari et al. [2] worked on Roman script to develop a complete writer identification system.

The rest of the paper is organized as follows. Section 2 describes the properties of the Devanagari script. Section 3 describes data collection. Pre-processing steps are described in Sect. 4 and feature extraction in Sect. 5. Section 6 describes the WEKA tool, followed by the results in Sect. 7. Section 8 concludes the paper.

2 Properties of Devanagari Script

The Devanagari script evolved from one of the oldest scripts in India and has played an important role in the development of literature. It is written from left to right, does not have distinct letter cases, and is recognizable (along with most other North Indic scripts, with a few exceptions such as Gujarati and Oriya) by a horizontal line that runs along the top of full letters [3]. Since the 19th century, it has been the most commonly used script for writing Sanskrit. Devanagari is used to write Hindi, Marathi and Nepali, among other languages and dialects, and was formerly used to write Gujarati. Because it is the standard script for Hindi and these other languages, Devanagari is one of the most used and adopted writing systems in the world [4]. Figure 1a–c shows the vowels, consonants and numerals of the Devanagari script. If the window containing a character is subdivided into 3 × 3 sub-windows, the main distinguishing strokes of the character (for example a diagonal bar) can be located relative to the horizontal bar (matra) that is present on every character. Every character of the Devanagari script falls into one of three types according to its bar: end bar characters, middle bar characters and no bar characters. Figure 1d illustrates the concept of the bar. For instance, Ga, Gha, Ja and Jha are end bar characters, Ka and Pha are middle bar characters, and Ra, Da and Dha are no bar characters.

Fig. 1 Example of a Devanagari vowels, b Devanagari consonants, c Devanagari numerals, d-1 character with end bar, d-2 character with middle bar, d-3 character with no bar

3 Data Collection

For writer identification we used a sample document form consisting of all Devanagari alphabets, numerals and vowel modifiers. We were able to collect data from 50 writers, mainly students. All of them were asked to write within designated areas of the form, guided by printed suggestive characters. Each writer filled in five copies of the form so that intra-writer variation is captured. Most of the writers are students aged between 12 and 25 years; the others are between 25 and 60 years old. Most of the writers are right-handed, while the numbers of male and female writers are about equal. Each form contains 43 Devanagari alphabets, 10 Devanagari numerals and 12 Devanagari vowel modifiers, giving a total of 10,750 alphabets, 2,500 numerals and 3,000 vowel modifiers. An example of the character sample collection form is shown in Fig. 2. The forms were scanned on a flatbed scanner. All images are in gray tone, digitized at 300 dpi and stored in Tagged Image File Format (TIFF) [5].
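The totals quoted above follow directly from the number of writers, copies and symbols per form. The short check below reproduces that arithmetic; the function and variable names are illustrative and do not come from the paper.

```python
# Counts stated in the Data Collection section.
WRITERS = 50
COPIES_PER_WRITER = 5
SYMBOLS_PER_FORM = {"alphabets": 43, "numerals": 10, "vowel_modifiers": 12}


def dataset_totals(writers=WRITERS, copies=COPIES_PER_WRITER, per_form=SYMBOLS_PER_FORM):
    """Return the number of forms and the number of samples per symbol group."""
    forms = writers * copies
    return forms, {group: forms * count for group, count in per_form.items()}


if __name__ == "__main__":
    forms, totals = dataset_totals()
    print(forms, totals)
```

Each form therefore carries 43 + 10 + 12 = 65 distinct symbols, which is the count used later when the per-character features are combined.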

4 Preprocessing

This step is important in any digital image processing work because scanned images are error-prone. Various kinds of noise appear in the images after digitization because of dirt on the scanner glass, the quality of the paper, unwanted pen strokes and so on. Such noise is difficult to remove with automatic techniques because it sometimes lies too close to the original data.


Fig. 2 Sample data collection form used for collection of Devanagari handwritten isolated characters and vowel modifiers

Therefore, manual noise removal was carried out on the digitized data collection forms before the character extraction technique was applied to extract individual characters. Figure 3 shows the kind of noise that is removed manually.

4.1 Character Extraction Technique

This technique extracts each individual character from the form of handwritten characters. The steps are as follows. First, global binarization of the whole form is carried out. Then the maximum run length is computed on the horizontal and vertical histograms of the form. Using the maximum run lengths of the horizontal and vertical histograms, the horizontal and vertical lines of the form are identified. These lines are then deleted from the form image, leaving an image that contains only the suggestive characters and the original handwritten characters. Next, the horizontal and vertical line information is used to calculate the top corner points of each block, and the suggestive characters are removed. Finally, the bounding box of each handwritten character is calculated and this information is stored for further processing [5]. Figure 4 shows the individual characters after automatic character extraction.

Fig. 3 a The character with unwanted noise, extracted without manual noise removal, b The character extracted after manual noise removal

Fig. 4 Isolated characters after manual noise removal and character extraction
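The line-removal and bounding-box steps can be illustrated with a minimal sketch. It assumes the form is already binarized into a list of rows of 0/1 pixels (1 = ink) and treats any row or column whose longest ink run covers at least a chosen fraction of the image as a ruled line. The threshold `min_fraction` and all function names are assumptions made for this illustration; the paper does not report the exact rule it uses.

```python
def longest_run(values):
    """Length of the longest consecutive run of ink (non-zero) pixels."""
    best = current = 0
    for value in values:
        current = current + 1 if value else 0
        best = max(best, current)
    return best


def find_ruled_lines(image, min_fraction=0.8):
    """Return indices of rows and columns that look like ruled form lines.

    min_fraction is a hypothetical threshold, not a value from the paper.
    """
    height, width = len(image), len(image[0])
    rows = [r for r in range(height)
            if longest_run(image[r]) >= min_fraction * width]
    cols = [c for c in range(width)
            if longest_run([image[r][c] for r in range(height)]) >= min_fraction * height]
    return rows, cols


def remove_ruled_lines(image, rows, cols):
    """Return a copy of the image with the given rows and columns blanked."""
    cleaned = [row[:] for row in image]
    for r in rows:
        cleaned[r] = [0] * len(cleaned[r])
    for c in cols:
        for row in cleaned:
            row[c] = 0
    return cleaned


def bounding_box(image):
    """Return (top, left, bottom, right) of the ink pixels, or None if blank."""
    rows = [r for r, row in enumerate(image) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]
    return rows[0], cols[0], rows[-1], cols[-1]
```

Blanking a whole row or column also erases any handwriting that touches the ruled line, so a production version would need a gentler repair step; the sketch only shows how run lengths separate long form lines from short character strokes.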

5 Feature Extraction

To analyze individuality, perform writer identification and evaluate the quality of the database, a 64-dimensional feature extraction technique has been used [6]. First, the contour points of the two-tone image are computed. The contour points are then divided among 7 × 7 blocks, and a direction code is computed for each block that contains contour points. After down-sampling the initial 7 × 7 blocks into 4 × 4 blocks, 64 (4 × 4 × 4) direction code features are obtained. To normalize the features to the range 0 to 1, the maximum value of the histogram peaks in each direction over all the blocks is computed, and each feature is divided by the maximum value of its respective direction. The flow chart in Fig. 5 shows the feature extraction technique. For more details about the 64-dimensional feature extraction see [6].

Fig. 5 Flow chart of 64 dimensional feature extraction technique
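The block histogram, down-sampling and normalization stages can be sketched as follows. This is an illustration under stated assumptions, not the implementation of [6]: contour points are assumed to arrive already labelled with one of four direction codes (0 to 3), and the 7 × 7 to 4 × 4 reduction uses a simple 1-2-1 weighted window centred on every second block, which is a choice made here because the text does not specify the down-sampling filter. All names are hypothetical.

```python
GRID = 7        # initial blocks per side
OUT = 4         # blocks per side after down-sampling
DIRECTIONS = 4  # direction codes per block (4 x 4 x 4 = 64 features)


def block_histograms(points, height, width):
    """Count contour points per (block row, block column, direction code)."""
    hist = [[[0.0] * DIRECTIONS for _ in range(GRID)] for _ in range(GRID)]
    for row, col, direction in points:
        block_row = min(row * GRID // height, GRID - 1)
        block_col = min(col * GRID // width, GRID - 1)
        hist[block_row][block_col][direction] += 1.0
    return hist


def downsample(hist):
    """Reduce 7 x 7 blocks to 4 x 4 with a 1-2-1 window (assumed filter)."""
    weights = {-1: 1.0, 0: 2.0, 1: 1.0}
    out = [[[0.0] * DIRECTIONS for _ in range(OUT)] for _ in range(OUT)]
    for i in range(OUT):
        for j in range(OUT):
            for di, wi in weights.items():
                for dj, wj in weights.items():
                    r, c = 2 * i + di, 2 * j + dj
                    if 0 <= r < GRID and 0 <= c < GRID:
                        for d in range(DIRECTIONS):
                            out[i][j][d] += wi * wj * hist[r][c][d]
    return out


def normalise(blocks):
    """Divide each feature by the peak value of its direction over all blocks."""
    peaks = [max(blocks[i][j][d] for i in range(OUT) for j in range(OUT))
             for d in range(DIRECTIONS)]
    return [blocks[i][j][d] / peaks[d] if peaks[d] > 0 else 0.0
            for i in range(OUT) for j in range(OUT) for d in range(DIRECTIONS)]


def feature_vector(points, height, width):
    """Return the 64 normalised direction features for one character image."""
    return normalise(downsample(block_histograms(points, height, width)))
```

Normalizing per direction, as the text describes, keeps a direction that is rare in a given character from being swamped by a dominant one.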

6 WEKA

WEKA is one of the most widely used tools in the area of machine learning [7]. Its built-in tools can be called from one's own Java code, through the weka.jar file of the package, or directly from the GUI. It contains tools for data pre-processing, classification, clustering, regression, association rules, visualization and other tasks. For the current work, the Library for Large Linear Classification (LIBLINEAR) and LIBSVM have been used for computation.

6.1 LIBLINEAR

LIBLINEAR is a linear classifier that is suitable when the number of instances or features to be classified is large. For our data set its convergence is much faster than that of the other WEKA classifiers. We used the L2-loss Support Vector Machine (dual) as the SVM Type parameter of LIBLINEAR. The Bias and Cost parameters are both 1.0. The EPS parameter (the tolerance of the termination criterion) is 0.01. For more details see [8].
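For orientation, the optimization problem behind the L2-loss SVM setting can be written out as below, following the standard LIBLINEAR formulation [8]. Here $x_i$ is a feature vector, $y_i \in \{-1, +1\}$ is its label in one binary sub-problem, $l$ is the number of training instances and $C$ is the Cost parameter (1.0 in this work); the symbol names are chosen for this sketch.

```latex
% Primal form: L2-regularized, L2-loss support vector classification
\min_{w} \; \frac{1}{2} w^{T} w + C \sum_{i=1}^{l} \bigl( \max(0,\, 1 - y_i w^{T} x_i) \bigr)^{2}

% Dual form, which the "dual" solver type optimizes
\min_{\alpha} \; \frac{1}{2} \alpha^{T} \bar{Q} \alpha - e^{T} \alpha
\quad \text{subject to} \quad \alpha_i \ge 0, \; i = 1, \dots, l,
\qquad \bar{Q} = Q + D, \quad Q_{ij} = y_i y_j x_i^{T} x_j, \quad D_{ii} = \frac{1}{2C}
```

A non-negative Bias value $B$ appends a constant feature to every instance, so that $x_i$ becomes $[x_i; B]$, and EPS is the tolerance at which the solver stops iterating. With more than two writers, LIBLINEAR trains one such binary problem per class in a one-vs-rest scheme.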

6.2 LIBSVM

The Support Vector Machine (SVM) is a supervised learning classifier that, in its basic form, separates two classes. The LIBSVM classifier of the WEKA tool is used to obtain multi-class classification support along with the simplicity of the SVM classifier. For the present experiment the C-SVC type of SVM has been chosen, with the radial basis function as the kernel type. For more details see [8].
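The radial basis function kernel mentioned above measures similarity between two feature vectors as exp(-gamma * ||x - y||^2). The helper below is a minimal sketch of that formula; the value of gamma used in the experiments is not reported in the paper, so it is left as a required argument, and the function name is illustrative.

```python
import math


def rbf_kernel(x, y, gamma):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


if __name__ == "__main__":
    # gamma = 1/64 is only an illustrative choice for 64-dimensional features.
    print(rbf_kernel([0.2] * 64, [0.3] * 64, gamma=1.0 / 64))
```

Identical vectors give a kernel value of 1, and the value decays towards 0 as the vectors move apart, which is what lets C-SVC draw non-linear boundaries between writers.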

7 Results

7.1 Result of Individuality of Characters

The individuality of all the characters and vowel modifiers has been calculated using the 64-dimensional feature. The results of the LIBLINEAR and LIBSVM classifiers for the individuality of each character over all the writers are shown in Table 1. With the LIBLINEAR classifier, the character JA is the most individual, with a writer identification accuracy of 61.98 %, followed by the character AI with 60.74 %. With LIBSVM, the character OU is the most individual with 57.85 %, followed by CH with 56.61 %. The least individual character is the vowel modifier -UNG, with accuracies of 09.09 and 11.16 % for LIBLINEAR and LIBSVM respectively, preceded by -AHH (:) with 19.83 and 12.80 %.

Table 1 Individuality of each Devanagari character

Chars   LIBLINEAR (%)   LIBSVM (%)   Chars   LIBLINEAR (%)   LIBSVM (%)
A       52.07           46.69        BHA     49.17           38.43
AA      55.54           52.48        MA      40.08           37.60
I       45.04           43.80        RA      48.76           42.98
II      54.77           50.20        LA      51.65           46.28
U       52.89           41.74        WA      41.74           32.23
UU      54.36           51.04        TSA     52.07           46.69
E       52.07           51.24        MSA     41.32           36.36
AI      60.74           54.96        SS      54.54           50.41
O       54.96           50.50        HA      50.07           46.69
OU      58.26           57.85        -AA     28.93           21.90
K       58.68           50.83        -I      47.93           37.60
KH      59.91           56.20        -II     44.21           37.60
G       53.71           49.59        -U      36.78           35.95
GH      42.15           38.43        -UU     32.78           31.12
UN      48.76           43.80        -RI     37.34           26.97
C       53.30           45.87        -E      36.51           29.46
CH      57.43           56.61        -AI     46.02           38.91
JA      61.98           54.55        -O      33.20           23.24
JH      59.50           54.96        -OU     42.56           36.36
EN      47.10           41.74        -UNG    09.09           11.16
T       43.39           37.19        -AHH    19.83           12.80
TTA     41.32           37.19        YA      50.00           42.08
DDA     58.68           48.76        0       29.05           30.29
DDH     45.45           31.02        1       43.80           37.19
MNA     43.33           36.25        2       44.63           45.04
TA      38.33           32.08        3       42.74           39.00
THA     51.24           50.41        4       46.28           44.63
DA      55.79           47.93        5       51.65           46.69
DHA     51.24           38.84        6       60.33           54.54
NA      52.07           46.28        7       48.96           48.55
PA      41.73           30.58        8       37.71           28.93
PHA     50.41           46.28        9       45.87           45.04
BA      41.74           37.60

7.2 Result of Writer Identification

The present work has been carried out on 10,750 Devanagari alphabets, 2,500 numerals and 3,000 vowel modifiers from 250 documents written by 50 writers. A 5-fold cross-validation scheme has been used to compute writer identification accuracy. Writer identification has been computed by combining all alphabets, numerals and vowel modifiers. To use the combined individuality of all the characters, the 64-dimensional features of the characters are first extracted and then the features of all the characters are concatenated. In this way we extracted 4,160-dimensional (64 × 65) features for each sample from each writer. With the LIBLINEAR classifier we obtained 98.67 % accuracy. When the Discretize attribute filter (an unsupervised filter in WEKA) is also applied, the accuracy rises to 99.12 %.
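The feature combination and the 5-fold split can be sketched as follows. The concatenation mirrors the description above (65 characters × 64 features = 4,160 values per document). The fold assignment is an assumption: the paper does not say how folds were drawn, so this sketch spreads each writer's five documents across the five folds, similar in spirit to the stratified folds WEKA produces by default. Function names are illustrative.

```python
def combine_features(per_character_features):
    """Concatenate the per-character feature vectors of one document."""
    combined = []
    for vector in per_character_features:
        combined.extend(vector)
    return combined


def stratified_folds(labels, k=5):
    """Assign sample indices to k folds so each writer is spread across folds.

    The n-th sample seen for a given writer goes to fold n mod k
    (an assumed scheme, not taken from the paper).
    """
    folds = [[] for _ in range(k)]
    seen = {}
    for index, label in enumerate(labels):
        position = seen.get(label, 0)
        folds[position % k].append(index)
        seen[label] = position + 1
    return folds


if __name__ == "__main__":
    document = [[0.0] * 64 for _ in range(65)]
    print(len(combine_features(document)))
    writer_labels = [writer for writer in range(50) for _ in range(5)]
    print([len(fold) for fold in stratified_folds(writer_labels)])
```

In each round one fold is held out for testing and the other four are used for training, so every document is tested exactly once.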

8 Conclusion

The main emphasis of this work has been on data collection and evaluation of the individuality of characters. We collected 5 documents from each of 50 writers. After digitization, character extraction was carried out to isolate individual characters, the 64-dimensional feature was computed for each character, and the LIBLINEAR and LIBSVM classifiers of the WEKA tool were used for classification. The individuality of all the characters has been calculated using both classifiers. Using LIBLINEAR we achieved a writer identification rate of 99.12 %. In future we intend to increase the size of the database, in terms of both the number of writers and the number of samples, and to make the database openly available.

Acknowledgments One of the authors would like to thank the Department of Science and Technology (DST) for support in the form of an INSPIRE fellowship.

References

1. Hiremath, P.S., Shivashankar, S., Pujari, J.D., Kartik, R.K.: Writer identification in a handwritten document image using texture features. In: International Conference on Signal and Image Processing, pp. 139–142 (2010)
2. Srihari, S.N., Cha, S.H., Arora, H., Lee, S.: Individuality of handwriting, pp. 1–17. Report of National Criminal Justice Reference Services (2001)
3. Ramteke, A.S., Rane, M.E.: Offline handwritten devanagari script segmentation. Int. J. Sci. Technol. Res. 1(4), 142–145 (2012)
4. Patil, P.M., Ansari, S.: A research survey of devanagari handwritten word recognition. Int. J. Eng. Res. Technol. 2(10), 1010–1015 (2013)
5. Halder, C., Paul, J., Roy, K.: Individuality of Bangla numerals. In: Proceedings of 12th International Conference on Intelligent Systems Design and Applications, pp. 264–268 (2012)
6. Roy, K., Pal, U.: On the development of an OCR system for Indian postal automation. LAP LAMBERT Academic Publishing, Germany. ISBN: 978-38-443-1403-8 (2011)
7. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explor. 11(1), 10–18 (2009)
8. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: a library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008)
