An Efficient Method for Offline Text Independent Writer Identification

Golnaz Ghiasi, Reza Safabakhsh
Computer Engineering and Information Technology Department, Amirkabir University of Technology, Tehran, Iran
[email protected],
[email protected] Abstract—This paper proposes, an efficient method for text independent writer identification using a codebook. The occurrence histogram of the shapes in the codebook is used to create a feature vector for the handwriting. There is a wide variety of different shapes in the connected components obtained from handwriting. Small fragments of connected components should be used to avoid complex patterns. A new and more efficient method is introduced for this purpose. To evaluate the methods, writer identification is conducted on three varieties of a Farsi database. These varieties include texts of short, medium and large lengths. Experimental results show the efficiency of the method especially for short texts. Keywords-writer codebook;
identification;
I.
handwriting;
II.
Fragmented parts of different people’s handwritings are quite different. As a result, the histogram of incidence of different fragments appearing in handwriting can be employed as a feature vector for writer identification. To construct this histogram, a collection of fragments that usually appear in handwritings should be available. This collection is called the “codebook”, and each member of it is called a “code”. In non-cursive handwritings, each connected component can be considered as a code for making the codebook [4]. But in cursive handwritings, connected components may contain several characters; so, they may be too long and have a wide variety of shapes. Therefore, shorter fragments are desirable. In Section 2.1, two fragment extraction methods are introduced. The extracted fragments should be normalized as explained in Section 2.2. After extracting normal fragments from a sufficient number of handwritings, a codebook can be generated. Details of codebook construction and making a feature vector are described in Section 2.3. Writer identification using feature vectors is explained in Section 2.4. Contours are suitable for shape matching, and no problem regarding the starting point and inner loops appears when using them. As a result, a codebook from contours of fragments is made. Besides, contours of connected components are employed for segmentation or extraction of their fragments. Moore’s algorithm [9] is used for computing contours of connected components.
Farsi;
INTRODUCTION
Diversity in education and personal interests results in different writing habits in people. Writer identification is feasible through exploiting these differences in different people’s handwritings. This behavioural biometric is not as strong as physiologic biometrics such as the fingerprint; but in the cases in which only a person’s handwriting is available, this biometric is useful. Furthermore, this biometric can be used as a complement to other biometrics and in a variety of domains such as security, financial activities, forensic and criminal justice systems. In offline writer identification, the goal is to determine the writer of a text among a number of known writers using images of their handwritings. Writer identification can be text dependent, where writers should write a fixed text, or text independent, where any text can be used to establish the identification. There are different approaches for offline text independent writer identification. Texture analysis [1, 2], study of line directions and their changes [3, 4], examination of visual characteristics of handwritings like height, width and slant [5], and the application of HMM recognizers [6] are some common methods. Another method is based on a codebook of shapes appearing in people’s handwritings [4, 7, 8]. In this paper, a new and efficient method for offline text independent writer identification, based on a codebook, is proposed. This paper is organized as follows. Section 2 contains details of the proposed method and section3 introduces the experimental results. Finally, conclusions are presented in section 4. 1051-4651/10 $26.00 © 2010 IEEE DOI 10.1109/ICPR.2010.310
THE PROPOSED METHOD
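For concreteness, a minimal Python sketch of the contour extraction step is given below. It uses OpenCV's contour follower as a practical stand-in for Moore's tracing algorithm cited above; the function name and the binarization convention (ink pixels set to 255 on a black background) are illustrative assumptions, not details taken from the paper.

import cv2

def connected_component_contours(binary_img):
    # binary_img: uint8 image with ink pixels set to 255 and background to 0.
    # cv2.findContours (OpenCV 4.x) walks the outer boundary of every connected
    # component, playing the role of Moore's tracing algorithm in the paper.
    contours, _ = cv2.findContours(binary_img, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    # each contour becomes an (N, 2) array of (x, y) boundary points
    return [c.reshape(-1, 2) for c in contours]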
A. Extracting fragments
Two methods are discussed for fragment extraction from connected components. The first method was introduced by Schomaker and Bulacu [7, 8]; the second is a novel method presented in this work.
1) Segment on Y-minima. In this method, the vertical local minima of the lower contour are found and, for each of them, the nearest local minimum on the upper contour is determined. If the distance between the two corresponding points is in the order of the ink-line width, the connected component is segmented at this point. The contour of a sample connected component and its resulting fragments are shown in Fig. 1.
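A rough Python sketch of the cut-point detection in this method follows. It assumes the lower and upper contours of a component are already available as arrays of (x, y) points with the y axis pointing upward, so that a vertical local minimum has a smaller y value than its neighbours (in image coordinates, where y grows downward, the comparison should be reversed), and that the ink-line width has been estimated. The tolerance factor and the construction of the actual fragments from the cut points are illustrative details not specified in the paper.

import numpy as np

def vertical_local_minima(contour):
    # indices of points whose y value is smaller than that of both neighbours
    ys = contour[:, 1]
    return [i for i in range(1, len(ys) - 1)
            if ys[i] < ys[i - 1] and ys[i] < ys[i + 1]]

def y_minima_cut_points(lower, upper, ink_width, tol=1.5):
    # Pair each local minimum of the lower contour with the nearest local
    # minimum of the upper contour; keep the pair as a cut point when the two
    # points are roughly one ink-line width apart.
    cuts = []
    upper_minima = upper[vertical_local_minima(upper)]
    if len(upper_minima) == 0:
        return cuts
    for i in vertical_local_minima(lower):
        d = np.hypot(upper_minima[:, 0] - lower[i, 0],
                     upper_minima[:, 1] - lower[i, 1])
        j = int(np.argmin(d))
        if d[j] <= tol * ink_width:
            cuts.append((tuple(lower[i]), tuple(upper_minima[j])))
    return cuts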
Figure 1. Segmenting a connected component by the Y-minima method. a) Contour of a connected component. b) Resulting fragments of (a).

2) Novel method for fragment extraction. In the "segment on Y-minima" method, connected components are divided, preferably, in such a way that the resulting fragments are meaningful; in fact, a similar segmentation method has been used successfully for cursive text recognition [10]. But fragments extracted for writer identification do not need to be meaningful; only their shapes are important. Moreover, segmenting a connected component at its local minima loses information about the shape of those minima. An improved method is therefore introduced in this section. First, a number of segments with a specific length are chosen from the lower contour. Then, for each segment, the two points on the upper contour with the same x coordinates as the segment's end points are determined. If the distance between each end point and its corresponding point is in the order of the ink-line width, the fragment of the connected component delimited by the selected segment is considered as a code. Choosing all possible segments on the lower contour would produce an excessive number of codes; to avoid this, the starting points of successive segments on the lower contour are placed some distance apart. We call this distance parameter the "gap". The contour of a sample connected component and some codes extracted by this method are shown in Fig. 2. The number of codes obtained by this method is higher than with the previous one; as a result, the shapes of the fragments a person habitually uses in his/her handwriting can be determined more accurately. Besides, the codes extracted with this method contain information about all parts of the connected components. Moreover, with the Y-minima method there are sometimes no local minima on the lower and upper contours within the ink-line width, or the points with this property are far from each other; thus, some of the resulting fragments become complex. With the novel method, the length of the lower contour of every extracted fragment equals a specific value, so no large or complex fragments are extracted.
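The selection step of this method can be sketched in Python as follows, again assuming the lower and upper contours are given as (x, y) point arrays ordered along the contour and that an ink-line width estimate is available. The segment length of 40 points matches the value reported in Section 3, while the tolerance factor is an illustrative assumption; building the full fragment contour from the selected lower segment is omitted.

import numpy as np

def gap_fragment_segments(lower, upper, ink_width, seg_len=40, gap=10, tol=1.5):
    # Slide along the lower contour, taking a candidate segment of seg_len
    # consecutive points every `gap` points. A candidate is kept when both of
    # its end points lie roughly one ink-line width away from the upper-contour
    # point that has (nearly) the same x coordinate.
    kept = []
    for start in range(0, len(lower) - seg_len, gap):
        end = start + seg_len
        ok = True
        for p in (lower[start], lower[end]):
            j = int(np.argmin(np.abs(upper[:, 0] - p[0])))  # same-x point on the upper contour
            if np.hypot(upper[j, 0] - p[0], upper[j, 1] - p[1]) > tol * ink_width:
                ok = False
                break
        if ok:
            kept.append((start, end))  # indices delimiting the fragment's lower contour
    return kept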
Figure 2. Extraction of fragments using the new method. a) Contour of a connected component. b) Resulting codes of (a).

B. Fragment normalization
Normalization is necessary to improve fragment clustering and codebook search. The contours of the extracted fragments do not have an equal number of points, so each contour is resampled to a specific number of points by an interpolation algorithm. Furthermore, the coordinates of the contour points are normalized to the origin (0,0) and to a standard deviation of one, using the following relations:

x' = (x - µ_x) / σ_x ,   y' = (y - µ_y) / σ_y ,        (1)

where x and y are the collections of x and y coordinates of a contour fragment, µ_x and µ_y are their means, and σ_x and σ_y are their standard deviations, respectively.
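A minimal Python sketch of this normalization follows, assuming 80 points per contour (the value used later in the experiments) and simple linear interpolation along arc length; the interpolation scheme is not specified in the paper.

import numpy as np

def normalize_contour(contour, n_points=80):
    # Resample the contour to n_points by linear interpolation along its arc
    # length, then apply Eq. (1): subtract the mean and divide by the standard
    # deviation of each coordinate.
    c = np.asarray(contour, dtype=float)
    seg = np.hypot(*np.diff(c, axis=0).T)            # lengths of consecutive contour segments
    t = np.concatenate(([0.0], np.cumsum(seg)))      # cumulative arc length as parameter
    t_new = np.linspace(0.0, t[-1], n_points)
    resampled = np.column_stack([np.interp(t_new, t, c[:, k]) for k in range(2)])
    return (resampled - resampled.mean(axis=0)) / resampled.std(axis=0)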
C. Codebook construction and feature vector computation
Vectors containing the x and y coordinates of the normalized contours are used to train a clustering algorithm. After training, a specific number of common fragments appearing in people's handwritings is obtained. The use of k-means, 1D Self-Organizing Maps (SOM) and 2D SOMs as the clustering method is investigated in [11]; the results show approximately the same performance for these clustering methods. In this study, we use a standard 2D SOM for clustering. After construction of the codebook, the feature vector is computed as an occurrence histogram in which each bin corresponds to one codebook member. To construct this histogram, all fragments of the handwriting are first extracted and normalized. Then, for each fragment, the most similar codebook member is selected using the Euclidean distance, and the corresponding bin is incremented by one. Finally, the histogram is normalized by dividing each bin by the sum of all bins. With the new extraction method, the number of extracted fragments can be adjusted by changing the gap parameter: decreasing it increases the number of extracted fragments, which is especially useful when only a short text is available.
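A Python sketch of the feature vector computation is given below. The codebook is assumed to be already trained and stored as a matrix whose rows are flattened codebook entries (for example, the 900 weight vectors of a 30×30 SOM), and each fragment is a normalized contour as produced above; these storage choices are assumptions for the sketch.

import numpy as np

def occurrence_histogram(fragments, codebook):
    # codebook: (K, D) matrix of flattened codebook entries.
    # Each fragment is flattened to a D-dimensional vector, matched to its
    # nearest codebook entry by Euclidean distance, and the corresponding bin
    # is incremented; the histogram is finally normalized to sum to one.
    hist = np.zeros(len(codebook))
    for frag in fragments:
        v = np.asarray(frag, dtype=float).reshape(-1)
        k = int(np.argmin(np.linalg.norm(codebook - v, axis=1)))
        hist[k] += 1.0
    total = hist.sum()
    return hist / total if total > 0 else hist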
D. Writer identification
The χ² distance function is used to measure the dissimilarity between two handwritings. It is computed as

χ²(p, q) = Σ_{k=1}^{n} (p_k - q_k)² / (p_k + q_k),        (2)

where n is the size of the feature vectors, p_k is the kth element of the feature vector p, and q_k is the kth element of the feature vector q.
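Assuming Eq. (2) is the χ² distance as reconstructed above, a Python sketch of the comparison and of the resulting nearest-neighbour decision is:

import numpy as np

def chi_square_distance(p, q, eps=1e-12):
    # Eq. (2); eps only guards against empty bins and is not part of the formula.
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum((p - q) ** 2 / (p + q + eps)))

def identify_writer(query_hist, known_hists):
    # known_hists: {writer_id: feature vector}. The writer whose handwriting has
    # the smallest distance to the query is returned; a hit list is obtained by
    # sorting these distances instead of taking only the minimum.
    return min(known_hists, key=lambda w: chi_square_distance(query_hist, known_hists[w]))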
III. EXPERIMENTAL RESULTS
Two different databases are used in this study. The first one [1] contains Farsi handwritings of 40 individuals, in which every person has written a 12-line arbitrary text. This database is used for creation of the codebook. The second database was collected by the authors of this paper. It contains handwritings of 180 persons, each of whom has written 3 pages of arbitrary Farsi text; each page contains 8 lines. This database is used to test the methods. Three text lengths are used: short, medium and large, defined by their number of lines as 4, 8 and 12 lines, respectively. Sheets of both databases are scanned at 300 dpi with 256 gray levels.

During SOM training, the learning rate varied from 0.9 at the beginning to almost 0 at the end, and the neighborhood radius decreased from an initial value equal to the network size to a final value of 1; both parameters were decreased according to a power function. The size of the SOM was set to 30×30 and 500 epochs were used for its training. According to [8], 30×30 is a suitable size for the codebook.
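As an illustration only, the codebook training could be set up with the MiniSom library roughly as follows. MiniSom's built-in decay schedules differ from the power-function schedule described above, so this is an approximate stand-in under stated assumptions, not the authors' implementation.

import numpy as np
from minisom import MiniSom

def train_codebook(fragment_vectors, size=30, epochs=500, seed=0):
    # fragment_vectors: (N, D) array of flattened normalized contours
    # (D = 160 for 80 contour points with x and y coordinates).
    data = np.asarray(fragment_vectors, dtype=float)
    som = MiniSom(size, size, data.shape[1], sigma=size / 2.0,
                  learning_rate=0.9, random_seed=seed)
    som.train_random(data, num_iteration=epochs * len(data))
    # flatten the grid of weight vectors into a (size*size, D) codebook matrix
    return som.get_weights().reshape(size * size, data.shape[1])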
We have selected 40 points as the length of the segments chosen on the lower contour. In this way, the average number of points on the contours of the fragments extracted by both methods is about 80, and the number of points on the normalized contours is set to 80. The fragments are extracted from the 40 handwritings of the first database. The number of fragments extracted using "segmentation on Y-minima" is 35074; training the SOM with these data took about 50 hours on a personal computer with a 3.0 GHz Intel Core 2 Quad CPU. Using the new method with a gap value of 10, the number of extracted fragments was 75070 and the training time was about 105 hours. Patterns of a codebook trained using the new method are shown in Fig. 3.

Table I contains the experimental results. Writer identification is performed with both leave-one-out and two-fold cross validation, and results are reported for different hit-list sizes. For a hit list of size m, the m handwritings with the lowest distances are found; if the correct handwriting is among them, the identification is considered correct. The Gap and Time columns show the gap parameter and the average time taken to compute the feature vector, respectively. The results show that "segmenting on Y-minima" is not suitable when only a short text is available, but its performance improves as the size of the text increases. The proposed method, in contrast, performs well even for short texts. The correct performance of the former method is 83.8% for a short text and a hit list of size one, while the proposed method reaches 92.7%. Moreover, the proposed method shows 99.4% and 100% correct performance for medium and large texts, respectively, higher than the 96.1% and 99.7% of the former method. One may note that the feature extraction time is higher for the proposed method; this time can be adjusted by changing the gap parameter, so there is a trade-off between performance and computation time.

According to [12], which provides a summary of the results of writer identification studies on Farsi handwriting, the performance of the proposed method is remarkable. The best reported result is about 95% on 100 persons using five A5 pages for each person, whereas the proposed method reaches 100%, 99.4% and 92.7% on 180 persons using databases with 3, 2 and 1 A4 pages, respectively.
Figure 3. Patterns of a 30×30 codebook trained using the novel fragment extraction method.
TABLE I. EXPERIMENTAL RESULTS.

Database   Method       Gap  Time   Two-fold cross validation                  Leave-one-out cross validation
variety                       (sec)  Top-1 Top-2 Top-3 Top-4 Top-5 Top-10       Top-1 Top-2 Top-3 Top-4 Top-5 Top-10
Short      Y-minima     -    3.8    83.6  91.1  92.7  92.7  93.6  95.8         79.4  86.1  90.8  91.3  91.6  93.3
Short      New method   20   4.8    87.5  92.5  93.3  94.7  95    96.6         84.4  89.4  91.6  92.7  93.3  94.7
Short      New method   10   6      90.5  95.2  96.3  96.6  97.2  97.5         86.9  92.7  94.4  95.5  95.5  96.9
Short      New method   2    16.2   92.7  96.1  96.6  97.2  98    98.6         91.6  94.1  96.1  96.1  96.1  97.7
Medium     Y-minima     -    7.9    96.1  96.9  97.5  98.3  98.6  99.1         94.7  96.3  96.6  97.2  97.2  98.8
Medium     New method   20   8.8    96.9  98.8  99.1  99.4  99.4  100          95.8  97.7  98.6  98.8  99.1  99.7
Medium     New method   10   11.4   98.3  99.7  100   100   100   100          97.7  99.1  99.7  99.7  99.7  100
Medium     New method   2    31.4   99.4  100   100   100   100   100          98.8  99.7  100   100   100   100
Large      Y-minima     -    11.4   99.7  100   100   100   100   100          99.4  100   100   100   100   100
Large      New method   20   15.5   99.7  100   100   100   100   100          99.4  100   100   100   100   100
Large      New method   10   20     100   100   100   100   100   100          99.7  100   100   100   100   100
Large      New method   2    53     100   100   100   100   100   100          99.7  100   100   100   100   100
IV. CONCLUSIONS

This paper studied the writer identification problem using a codebook approach. Three database varieties with short, medium and long texts were used to test the methods. The results show that the "segment on Y-minima" method does not perform well on short texts. A new method is proposed with which correct performances of 92.7%, 99.4% and 100% are reached for samples with 4, 8 and 12 lines, respectively. The proposed method was applied to Farsi handwriting; nevertheless, it is language independent and can be applied to other languages as well.

ACKNOWLEDGMENT

The authors wish to express their gratitude to the Iran Telecommunications Research Center (ITRC) for partial financial support of this study.

REFERENCES

[1] F. Shahabinejad and M. Rahmati, "A new method for writer identification and verification based on Farsi/Arabic handwritten texts," Ninth International Conference on Document Analysis and Recognition, 2007, pp. 829-833.
[2] H.E.S. Said, T.N. Tan, and K.D. Baker, "Personal identification based on handwriting," Pattern Recognition, vol. 33, no. 1, 2000, pp. 149-160.
[3] M. Bulacu, L. Schomaker, and L. Vuurpijl, "Writer identification using edge-based directional features," Seventh International Conference on Document Analysis and Recognition, 2003, pp. 937-941.
[4] L. Schomaker and M. Bulacu, "Automatic writer identification using connected-component contours and edge-based features of upper-case western script," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 6, 2004, pp. 787-798.
[5] U.V. Marti, R. Messerli, and H. Bunke, "Writer identification using text line based features," Proc. ICDAR'01, 2001, pp. 101-105.
[6] A. Schlapbach and H. Bunke, "A writer identification and verification system using HMM based recognizers," Pattern Analysis & Applications, vol. 10, no. 1, 2007, pp. 33-43.
[7] L. Schomaker, M. Bulacu, and K. Franke, "Automatic writer identification using fragmented connected-component contours," Proc. 9th IWFHR, 2004, pp. 185-190.
[8] L. Schomaker, K. Franke, and M. Bulacu, "Using codebooks of fragmented connected-component contours in forensic and historic writer identification," Pattern Recognition Letters, vol. 28, no. 6, 2007, pp. 719-727.
[9] R. Gonzalez and R. Woods, Digital Image Processing. Addison-Wesley, 2002.
[10] A. El-Yacoubi et al., "An HMM-based approach for off-line unconstrained handwritten word modeling and recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 8, 1999, pp. 752-760.
[11] M. Bulacu and L. Schomaker, "A comparison of clustering methods for writer identification and verification," Proc. 8th International Conference on Document Analysis and Recognition (ICDAR 2005), vol. 2, 2005, pp. 1275-1279.
[12] S. Sadeghi and M.E. Moghaddam, "Text-independent Persian writer identification using fuzzy clustering approach," International Conference on Information Management and Engineering (ICIME), 2009, pp. 728-731.