Document Examiner Feature Extraction: Thinned vs. Skeletonised Handwriting Images Vladimir Pervouchine and Graham Leedham School of Computer Engineering, Nanyang Technological University N4-02C-77 Nanyang Avenue, Singapore 639798 Email:
[email protected] [email protected] Telephone: +65 6790 6250, Fax: +65 6792 6559

Abstract— This paper describes two approaches to the approximation of handwriting strokes for use in writer identification. One approach is based on a thinning method and produces a raster skeleton, whereas the other approximates handwriting strokes by cubic splines and produces a vector skeleton. The vector skeletonisation method is designed to preserve the individual features that can distinguish one writer from another. Extraction of structural character-level features of handwriting is performed using both skeletonisation methods and the results are compared. Use of the vector skeletonisation method resulted in a lower error rate during the feature extraction stage. It also enabled more structural features to be extracted and improved the accuracy of writer identification from 78% to 98% in an experiment with 100 samples of the grapheme “th” collected from 20 writers.
I. INTRODUCTION

For many centuries handwriting has been used to identify an individual. The hypothesis that handwriting is a personal biometric relies on the fact that the process of handwriting is an unconscious act learnt over time, and some pen movements are invariant and not easily changed when an attempt at forgery or disguise is made. When doubts about the authenticity of handwriting arise, forensic document examiners are asked to conduct an analysis of the questioned document. They seek characteristics of handwriting (features) that are consistent in a person’s normal writing by analysing the shapes and structure of the handwriting [1], [2]. While other branches of forensic science, such as DNA analysis, have been explained and proven by experimental studies, the forensic techniques used in handwriting analysis have far less scientific support. The methods used by the examiners are intuitively reasonable and have been derived from experience. It is the credibility of the document examiner, rather than the scientific basis of the techniques, that has been a key basis in a court of law. In recent cases the scientific acceptability of forensic analysis of handwriting has been successfully challenged [3]. To provide scientific support for handwriting analysis, pattern recognition techniques have been employed [4]. It has been demonstrated that handwriting can indeed be used to identify an individual with high accuracy [5], [6]. However, it has not been shown that writers can be distinguished using the techniques forensic document examiners use in their analysis. There are two types of features that can be extracted from handwriting. Document examiner features are those used by forensic examiners to establish the authorship of questioned
documents [7]. Many of those features, such as writing quality, are defined quite ambiguously and are thus subjective. Computational features are those that can be strictly defined in terms of the computational algorithms used to extract them [5]. Such features are unambiguous and hence remove subjectivity from feature measurement. Not all computational features correspond to document examiner features; some of them do not correspond to anything a person can see when looking at a handwriting image. To determine whether handwriting can be used to identify a person, all computational features are suitable. However, not all of them can be used when the question is whether the techniques of forensic document examination allow a person to be identified, since those techniques do not include computational features that cannot be measured by humans. Several studies have been carried out that formalise document examiner features and extract them from handwriting samples using computer algorithms [8]–[10]. Many document examiner features require analysis of stroke characteristics such as tangents, curvatures, junctions and end points. These are most naturally extracted when handwriting characters are represented in their original form, as a set of strokes. When dealing with scanned images it is necessary to approximate the original set of strokes in order to detect the important points and branches and measure their characteristics. To achieve this, various skeleton extraction methods have been tried.

II. BACKGROUND

There are a number of techniques for obtaining skeletons of handwritten samples. The applicability of a particular technique depends on the problem at hand. Thinning methods [11], [12] are very popular for skeletonisation, mainly because of their implementation simplicity and high computational speed. However, the results they produce have many drawbacks, such as erroneous branches and other artefacts, which affect the subsequent feature extraction (Figure 1(b), 1(c)). Thinning also destroys stroke order information, which could be used to extract additional features representing individual character formation. Because of the drawbacks of thinning methods, a number of skeletonisation and stroke extraction methods based on thinning have been proposed. Such methods use the
Fig. 1. Artefacts produced by most thinning methods: (a) thinned image, (b) junction point, (c) extra branch.
thinned image as the first approximation to the desired skeleton and then use various methods to correct junctions and branches, recover loops and sometimes also recover the writing sequence [13]–[15]. It is also possible to compensate for the changes in character shapes produced by thinning when extracting features. The combination of thinning-based skeletonisation with such compensation has been used by the authors in studies of document examiner features [10] and is presented in this paper in Section III-A. Other methods of handwriting skeletonisation and stroke extraction include methods based on contour tracing [16], [17], principal curves [18], Bézier curve fitting [19], self-organising maps [20], wavelet transforms [21] and bi-tangent circles [22]. In handwriting recognition the features are aimed at distinguishing different characters or their combinations and thus need to represent dissimilarities between different alphabet characters while having similar values for the same characters written by different writers. In writer identification the features of interest are aimed at distinguishing between writers and thus need to emphasise the differences in shapes of the same characters written by different writers. Unfortunately, all these stroke extraction and skeletonisation methods have been designed for handwriting recognition and do not aim to preserve the individual traits with which a writer endows a character. That is why, for writer identification, these methods give hardly better results than the thinning techniques [23]. In this paper the new skeletonisation method presented in [23] is briefly described, and extraction of document examiner features is presented using both the thinning-based (raster) and the new (vector) skeletonisation methods. The results of feature extraction are compared. For simplicity, in the remainder of this paper the thinning-based method is sometimes referred to as “thinning”.

III. EXTRACTION OF CHARACTER SKELETON
A. Thinning-based method

The thinning-based skeletonisation method was developed on the basis of the thinning function provided in the Matlab Image Processing Toolbox [24], which is a modified version of the Zhang and Suen thinning algorithm [25]. Other thinning methods were also tried [12], but no significant difference was found for the problem at hand. The method requires a binary image as input and produces a 1-pixel-thick image by erosion of the outer black pixels that satisfy the pixel removal conditions.
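As an illustration only, the following Python sketch shows how such a binarise-and-thin step might look using scikit-image; the Otsu threshold and the library's thinning routine are stand-ins for the Matlab functions used in this work, not the actual implementation.

    # Hypothetical sketch: binarisation followed by thinning, standing in for the
    # Matlab-based pipeline described above (the threshold choice is illustrative).
    import numpy as np
    from skimage.filters import threshold_otsu
    from skimage.morphology import thin

    def raster_skeleton(gray_image: np.ndarray) -> np.ndarray:
        """Binarise a grayscale handwriting image and thin it to a 1-pixel skeleton.

        Assumes dark ink on a light background.
        """
        threshold = threshold_otsu(gray_image)
        binary = gray_image < threshold   # ink pixels become True
        return thin(binary)               # iterative thinning to 1-pixel width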
After thinning, correction of some artefacts was performed as shown in the pseudo-code below:

    do {
        remove small connected components
        find junction points
        find end points
        correct spurious loops
        prune short branches
    } while there are changes in the skeleton image

Removal of small connected components was necessary because of noise in the image and because binarisation sometimes produced small disconnected parts at the ends of handwriting strokes. Removal of spurious loops was accomplished by analysing all loops with an area smaller than a predefined threshold and removing those loops that resulted from small white “holes” in the binarised image, as in Figure 2(b). Unfortunately, not all such loops could be removed correctly. Short branches that did not exist in the original image were pruned (Figure 2(c)). All these operations could make some connected components that had previously been too big to remove small enough to be removed, so the operations were repeated until one application of them no longer resulted in any changes to the skeleton. Figure 2 shows the original image and the resulting image after each stage.

B. New skeletonisation method

The new skeletonisation method uses a grayscale image as its input and approximates the skeleton by a set of B-splines [23]. The method has three stages. The first stage builds the initial skeletal branches. The idea behind the method is to divide the handwriting image into a set of almost rectangular regions and use the centres of the resulting rectangles as spline knots. Each spline knot is marked either as an end point (if it has only one neighbouring knot), a middle point (if it has two neighbours), or a junction point (if it has more than two neighbours). To form the initial branches, cubic B-spline interpolation is applied to each sequence of spline knots between two ends, two junctions, or an end and a junction (Figure 3(b)). The second stage forms handwriting strokes from the branches. It is based on optimisation of a cost function which consists of both global and local terms, similarly to [26]. The global cost includes the cost associated with the total number of strokes and the number of disconnected ends in junctions. The local cost consists of the cost of merging branches, which is low when the resulting curve is smooth, and the cost of retracing a branch or a sequence of branches, which is lower for vertical, straight, and short branches. For each retraced segment hidden loop restoration is tried, which changes the cost of the retracing. If this cost becomes lower the restored loop is accepted, otherwise it is rejected. The skeleton in Figure 3(c) contains only strokes with retracing (marked by arrows), whereas the skeleton in Figure 4(a) contains a hidden loop that was restored. Each possible connection of branches at each junction point was coded as a binary string and the
Fig. 2. Stages of thinning-based skeletonisation: (a) original, (b) binarised, (c) thinned, (d) corrected.
configuration of a skeleton was defined by the concatenation of those strings. Depending on the number of possible configurations, either exhaustive search or a genetic algorithm was applied to minimise the cost function. After the second stage any two intersecting or touching curves still shared a common spline knot (junction point). At the third stage each curve was adjusted using the pixels of the underlying grayscale image [27]. After the adjustment stage the new junction points normally did not coincide with the original junction points, as seen in the t-stem and crossbar junction in Figure 3(d). Figures 4(a) and 4(b) show examples of successful application of the new method. Figure 4(c) shows incorrect skeletonisation resulting from broken branches formed at stage I, which, in turn, were caused by faint strokes in the original image.
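The configuration search can be illustrated with the following hedged sketch: each candidate connection pattern is a bit string, and the string with the lowest total cost is selected. The cost functions here are placeholders, not the weights used in the method, and for large junctions the method switches to a genetic algorithm instead of the exhaustive enumeration shown.

    # Illustrative sketch of the configuration search: every bit string is
    # enumerated and scored with a (placeholder) global + local cost.
    from itertools import product

    def search_configurations(num_bits, global_cost, local_cost):
        """Return the bit string (tuple of 0/1) minimising global + local cost."""
        best_config, best_cost = None, float("inf")
        for config in product((0, 1), repeat=num_bits):
            cost = global_cost(config) + local_cost(config)
            if cost < best_cost:
                best_config, best_cost = config, cost
        return best_config, best_cost

    # Toy usage: prefer configurations with fewer set bits ("strokes"), with a
    # small smoothness-like penalty on adjacent differing bits.
    toy_global = lambda c: sum(c)
    toy_local = lambda c: 0.1 * sum(abs(a - b) for a, b in zip(c, c[1:]))
    print(search_configurations(4, toy_global, toy_local))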
Fig. 4. Examples of skeletonised images.

IV. EXTRACTION OF FEATURES

Extraction of document examiner features from thinned images was performed for the characters “d”, “y”, “f”, “t”, and the grapheme “th” [10]. Since it was found that the use of grapheme features has more advantages than the use of single-character features [28], further study was conducted mainly on the grapheme “th”, and extraction of document examiner features from the new vector skeleton was performed for this grapheme only. Feature extraction consisted of a main program and a number of subroutines called from it. The skeletal branches were traced and the shape of each grapheme sample was analysed so that the feature extraction program was able to handle many different possible grapheme forms. Stage after stage, features were extracted along with more information about the shape of the current grapheme. Based on the current knowledge of the shape, the main program selected an appropriate method to locate important points and branches of the skeleton and to extract features from them. At the end, the feature vector and some additional information that allowed the correctness of the feature extraction to be checked were given as output.

A. Features from thinned images

The main program for feature extraction from the thinning-based raster skeleton required the original, binarised, and thinned versions of an image. The end points and junctions were analysed and the branches were traced to determine the strokes they correspond to and to extract document examiner features [28]. The analysis took into account some distortions of the skeleton, such as changes of junction points and the possible existence of extra branches and spurious loops that had not been corrected at the skeleton extraction stage. The list of features extracted is shown in Table I.

TABLE I
ORIGINAL FEATURE SET. (T is the top of the t-stem, H is the top of the h-stem, C is the point of intersection of a vertical line drawn at H with a horizontal line drawn at T.)

    fi   feature
    1    height
    2    width
    3    height to width ratio
    4    distance HC
    5    distance TC
    6    distance TH
    7    angle between TH and TC
    8    slant of t
    9    slant of h
    10   position of t-bar
    11   connected / disconnected t and h
    12   average stroke width
    13   average pseudo-pressure
    14   standard deviation of f13
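For illustration, the sketch below computes a few of the Table I features (f3 to f7) from the landmark points T, H and C, assuming each point is given as (x, y) pixel coordinates and that the angle feature is measured at T between the segments TH and TC; the exact conventions used in the paper may differ.

    # Hedged sketch: simple geometric features from assumed landmark coordinates.
    import math

    def distance(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def th_geometric_features(T, H, C, height, width):
        ratio = height / width
        d_HC, d_TC, d_TH = distance(H, C), distance(T, C), distance(T, H)
        # Angle between segments TH and TC, taken at point T (assumed convention).
        a1 = math.atan2(H[1] - T[1], H[0] - T[0])
        a2 = math.atan2(C[1] - T[1], C[0] - T[0])
        angle = abs(a1 - a2)
        return {"f3": ratio, "f4": d_HC, "f5": d_TC, "f6": d_TH, "f7": angle}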
Slant features were measured as the angles between a vertical line and the line representing a slant. This representing line was fitted by the least squares method to the set of points corresponding to the pixels of the stem. Position of the t-bar was a binary feature, having the value of 1 if the t-bar crossed the stem and 0 otherwise, including the case where the crossbar touched the stem at its upper point. It was observed that writers who tend to produce t-bars touching the stem also
Fig. 3. Stages of the new skeletonisation method: (a) original, (b) branches, (c) strokes, (d) after adjustment.
tend to produce completely disconnected t-bars in the grapheme “th”. Pseudo-pressure was measured by averaging the gray level of pixels in the original image. Average stroke width was measured from the total number of black pixels and the number of border pixels in the binarised image.
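A minimal sketch of two of these measurements follows, under assumed conventions: slant as the angle between the vertical and a least-squares line fitted to the stem pixels, and pseudo-pressure as the mean gray level over the ink pixels. Names and details are illustrative, not the paper's code.

    # Hedged sketch of slant and pseudo-pressure measurement.
    import numpy as np

    def slant_angle(stem_pixels):
        """stem_pixels: integer array of (row, col) coordinates of the stem."""
        rows = stem_pixels[:, 0].astype(float)
        cols = stem_pixels[:, 1].astype(float)
        # Fit col = a*row + b (regressing on rows keeps near-vertical stems stable);
        # the slope a is then the tangent of the angle from the vertical.
        a, _ = np.polyfit(rows, cols, 1)
        return np.degrees(np.arctan(a))

    def pseudo_pressure(gray_image, ink_mask):
        """Average gray level of the original image over the binarised ink pixels."""
        return float(gray_image[ink_mask].mean())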
TABLE III
ACCURACY OF FEATURE EXTRACTION.

    Method                   Accuracy, %
    Thinning (old)           87
    Skeletonisation (new)    94
B. Features from vector skeletons

The program for feature extraction from vector skeletons required the original version of an image, its distance map calculated from the binarised image, and the vector skeleton. The kind of strokes in the skeleton (stem, crossbar, etc.) was recognised from the analysis of the strokes, their end points and their intersections. Some features from Table I were extracted differently; for example, average stroke width was calculated using the distance map. Such an approach did not result in a significantly different value of average stroke width but allowed the standard deviation of the stroke width to be calculated as an additional feature, partially representing writing quality (a document examiner feature). Vector skeletonisation also made possible the extraction of other features which could not be measured correctly when thinning was used. The additional features are listed in Table II.

TABLE II
ADDITIONAL FEATURES. EXTENDED FEATURE SET IS FORMED BY ADDING THESE FEATURES TO THE ORIGINAL SET.

    fi   feature
    15   standard deviation of stroke width
    16   number of strokes
    17   number of loops and retracings
    18   straightness of t-stem
    19   straightness of t-bar
    20   straightness of h-stem
    21   presence of loop at top of t-stem
    22   presence of loop at top of h-stem
    23   maximum curvature of h-knee
    24   average curvature of h-knee
    25   relative size (diameter) of h-knee
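A hedged sketch of the distance-map approach: the Euclidean distance transform of the binarised image gives, at each skeleton point, roughly half the local stroke width, from which the mean (f12) and standard deviation (f15) can be taken. The helper below is illustrative, not the paper's implementation.

    # Sketch: stroke width statistics from a distance map of the ink mask.
    import numpy as np
    from scipy.ndimage import distance_transform_edt

    def stroke_width_stats(ink_mask, skeleton_points):
        """ink_mask: boolean image of ink pixels; skeleton_points: (row, col) int array."""
        dist_map = distance_transform_edt(ink_mask)   # distance of each ink pixel to background
        widths = 2.0 * dist_map[skeleton_points[:, 0], skeleton_points[:, 1]]
        return float(widths.mean()), float(widths.std())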
V. RESULTS AND COMPARISON

Feature extraction was performed using the methods described above on 150 images of the grapheme “th” written by 30 different writers. The samples were manually extracted from samples of the CEDAR letter [6]. Accuracy of extraction was measured as the percentage of samples for which all features were extracted, i.e. the extraction algorithms did not fail. The values of accuracy are shown in Table III. The new skeletonisation algorithm makes the feature extraction algorithms less prone to failure than the thinning algorithm.

In order to estimate how accurately the feature values were extracted, it was decided to compare writer classification accuracy using features extracted from thinned and skeletonised images. For brevity, the thinning-based method together with extraction of features from the resulting images is referred to as the “old” method, and the new skeletonisation method together with extraction of features from the resulting vector skeletons is referred to as the “new” method. Only samples of writers with all 5 graphemes processed correctly by both the “old” and “new” methods were used in the test, which made a total of 100 samples from 20 writers. A DistAl constructive neural network was used as a classifier [29] with a normalised Manhattan distance measure. To determine the classification accuracy, the data was divided into 5 equal parts to perform 5-fold cross-validation. The average values of accuracy along with their standard deviations are shown in Table IV. The classification accuracy tests were performed for three cases: when the “old” method was used, when the “new” method was used but the feature set remained the same as in the “old” method, and when the “new” method was used with the extended feature set.

TABLE IV
ACCURACY OF WRITER CLASSIFICATION AND ITS STANDARD DEVIATION.

    Method                               Accuracy, %
    Original feature set + thinning      78
    Original feature set + new method    83
    Extended feature set + new method    98
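The evaluation protocol can be sketched as follows. Since DistAl is not readily available, a 1-nearest-neighbour classifier with a normalised Manhattan distance stands in for it purely to illustrate the 5-fold cross-validation setup; the data below are placeholders, not the experimental feature vectors.

    # Sketch of the 5-fold cross-validation protocol with a stand-in classifier.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    X = np.random.rand(100, 25)         # placeholder feature vectors (100 samples)
    y = np.repeat(np.arange(20), 5)     # 20 writers, 5 samples each

    clf = make_pipeline(MinMaxScaler(),  # normalise features before the Manhattan metric
                        KNeighborsClassifier(n_neighbors=1, metric="manhattan"))
    scores = cross_val_score(clf, X, y, cv=5)   # 5-fold cross-validation
    print(scores.mean(), scores.std())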
VI. CONCLUSION AND DISCUSSION
One of the main problems with the thinning method was loss of information due to binarisation, because it was not always possible to correctly binarise the whole character image using one threshold value. Figure 5 shows an example of binarisation in which either the loop or the stroke marked by the arrow could be binarised correctly, but not both. Binarisation also resulted in small “holes” that later led to spurious loops, which were very hard to remove and which affected the feature extraction algorithms, sometimes resulting in their failure.
Fig. 5. Loss of information due to binarisation: (a) grayscale loop, (b) low threshold, (c) high threshold.
The accuracy of classification with the original feature set is slightly higher when the new stroke extraction method is used. This is most likely caused by more precise measurement of some feature values, especially the angular features such as slants and stroke angles. The accuracy of writer classification improves significantly when more features are taken into account. The latter improvement is important because it is the new stroke extraction algorithm that made reliable extraction of the additional features possible. Thus the advantage of the new method is twofold: (i) it allows structural features to be extracted with higher precision than is possible when thinning-based skeletonisation is used, and (ii) it allows more features to be extracted, which contribute to writer discrimination. The main problem of the skeletonisation method arose when the input image was too faint, which resulted in broken skeleton branches produced at stage I and the inability of the method to recover the correct strokes, as shown in Figure 4(c). Another problem is related to hidden loop restoration. It seems that analysis of stroke thickness and length alone, which was used in the method, is not robust enough. Analysis of stroke shape is probably needed to make the loop restoration more dependable [30]. There are also several parameters in the new method at stages II and III, such as the weight coefficients for the local and global costs. Some of them, e.g. the weight on the number of strokes, are content-dependent. Also, some coefficients are hard to choose a priori even when the content is known, but they can easily be adjusted by trial and error. This suggests possible usage of the new skeletonisation method as an interactive tool in a document examiner's toolset.
REFERENCES
[1] W. R. Harrison, Suspect Documents, Their Scientific Examinations. Illinois, USA: Nelson-Hall, 1981.
[2] O. Hilton, Scientific Examination of Questioned Documents. Florida, USA: CRC Hall, 1993.
[3] “Daubert et al. v. Merrell Dow Pharmaceuticals,” 509 U. S. 579, 1993.
[4] S. N. Srihari and C. G. Leedham, “A survey of computer methods in forensic document examination,” in Proc. 11th Conf. Int’l. Graphonomics Society (IGS2003), H. L. Teulings and A. W. A. Van Gemmert, Eds., Scottsdale, AZ, USA, Nov. 2003, pp. 278–281.
[5] S. N. Srihari, S.-H. Cha, and S. Lee, “Establishing handwriting individuality using pattern recognition techniques,” in Proc. 6th Int’l Conf. Document Analysis and Recognition (ICDAR’2001), Seattle, USA, Sept. 2001, pp. 1195–1204.
[6] S. N. Srihari, S.-H. Cha, H. Arora, and S. Lee, “Individuality of handwriting,” Journal of Forensic Sciences, vol. 47, no. 4, pp. 1–17, 2002.
[7] R. A. Huber and A. M. Headrick, Handwriting Identification: Facts and Fundamentals. CRC Press, LCC, 1999.
[8] C. M. Greening, V. K. Sagar, and C. G. Leedham, “Automatic feature extraction for forensic purposes,” in Proc. 5th IEE Int’l Conf. Image Processing and its Applications, Edinburgh, UK, July 1995, pp. 409–414.
[9] P. J. Sutano, C. G. Leedham, and V. Pervouchine, “Study of the consistency of some discriminatory features used by document examiners in the analysis of handwritten letter ‘a’,” in Proc. 7th Int’l Conf. Document Analysis and Recognition (ICDAR’2003), Edinburgh, UK, Aug. 2003, pp. 1091–1095.
[10] C. G. Leedham, V. Pervouchine, W. K. Tan, and A. Jacob, “Automatic quantitative letter-level extraction of features used by document examiners,” in Proc. 11th Conf. Int’l. Graphonomics Society (IGS2003), H. L. Teulings and A. W. A. Van Gemmert, Eds., Scottsdale, AZ, USA, Nov. 2003, pp. 291–294.
[11] L. Lam, S.-W. Lee, and C. Y. Suen, “Thinning methodologies — a comprehensive survey,” IEEE Trans. Pattern Anal. Machine Intell., vol. 14, no. 9, pp. 869–885, Sept. 1992.
[12] C. Y. Suen and P. S. P. Wang, Eds., Thinning Methodologies for Pattern Recognition, ser. Series in Machine Perception and Artificial Intelligence. Singapore, New Jersey, London, Hong Kong: World Scientific, 1994, vol. 8.
[13] A. Amin and S. Singh, “Machine recognition of hand-printed Chinese characters,” Intelligent Data Analysis, vol. 1, no. 2, pp. 101–118, 1997.
[14] K. Liu, Y. S. Huang, and C. Y. Suen, “Identification of fork points on the skeletons of handwritten Chinese characters,” IEEE Trans. Pattern Anal. Machine Intell., vol. 21, no. 10, pp. 1095–1100, Oct. 1999.
[15] Y.-M. Su and J.-M. Wang, “A novel stroke extraction method for Chinese characters using Gabor filters,” Pattern Recognition, vol. 36, no. 3, pp. 635–647, Mar. 2003.
[16] R. Plamondon and C. M. Privitera, “The segmentation of cursive handwriting: An approach based on off-line recovery of the motor-temporal information,” IEEE Trans. Image Processing, vol. 8, no. 1, pp. 80–91, Jan. 1999.
[17] E. L’Homer, “Extraction of strokes in handwritten characters,” Pattern Recognition, vol. 33, no. 7, pp. 1147–1160, July 2000.
[18] B. Kégl and A. Krzyżak, “Piecewise linear skeletonization using principal curves,” IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 1, pp. 59–74, Jan. 2002.
[19] C.-W. Liao and J. S. Huang, “Stroke segmentation by Bernstein-Bezier curve fitting,” Pattern Recognition, vol. 23, no. 5, pp. 475–484, 1990.
[20] A. Datta, S. K. Parui, and B. B. Chaudhuri, “Skeletonization by a topology-adaptive self-organizing neural network,” Pattern Recognition, vol. 34, no. 3, pp. 617–629, Mar. 2001.
[21] Y. Y. Tang and X. You, “Skeletonization of ribbon-like shapes based on a new wavelet function,” IEEE Trans. Pattern Anal. Machine Intell., vol. 25, no. 9, pp. 1118–1133, Sept. 2003.
[22] L. M. Mestetskii, I. A. Reyer, and T. W. Sedeberg, “Continuous approach to segmentation of handwritten text,” in Proc. 8th Int’l Workshop on Frontiers in Handwriting Recognition (IWFHR-8), Ontario, Canada, Aug. 2002, pp. 440–445.
[23] V. Pervouchine, C. G. Leedham, and K. Melikhov, “Handwritten character skeletonisation for forensic document analysis,” in Proc. 20th Annual ACM Symposium on Applied Computing, Santa Fe, NM, USA, Mar. 2005, pp. 754–758.
[24] Z. Guo and R. W. Hall, “Parallel thinning with two-subiteration algorithms,” Comm. ACM, vol. 32, no. 3, pp. 359–373, 1989.
[25] T. Y. Zhang and C. Y. Suen, “A fast parallel algorithm for thinning digital patterns,” Comm. ACM, vol. 27, no. 3, pp. 236–239, 1984.
[26] H. Bunke, R. Ammann, G. Kaufmann, T. M. Ha, M. Schenkel, R. Seiler, and F. Eggimann, “Recovery of temporal information of cursively handwritten words for on-line recognition,” in Proc. 4th Int’l Conf. Document Analysis and Recognition (ICDAR’97), Ulm, Germany, Aug. 1997, pp. 931–935.
[27] V. Pervouchine, C. G. Leedham, and K. Melikhov, “Three-stage handwriting stroke extraction method with hidden loop recovery,” 2005, accepted to ICDAR’05; to be published.
[28] C. G. Leedham, V. Pervouchine, and W. K. Tan, “Quantitative letter-level extraction and analysis of features used by document examiners,” Journal of Forensic Document Examination, vol. 16, pp. 21–39, 2004.
[29] J. Yang, R. Parekh, and V. Honavar, “DistAl: An inter-pattern distance-based constructive learning algorithm,” Department of Computer Science, Iowa State University, Technical Report ISU-CS-TR 97-06, 1997.
[30] D. Doermann, N. Intrator, E. Rivlin, and T. Steinherz, “Hidden loop recovery for handwriting recognition,” in Proc. 8th Int’l Workshop on Frontiers in Handwriting Recognition (IWFHR-8), Ontario, Canada, Aug. 2002, pp. 357–380.