Construction of Handwriting Databases Using Transcript-based Mapping

Bin Zhang
Departments of Human Genetics and Biostatistics, School of Medicine, UCLA, Los Angeles, CA 90095
[email protected]

Catalin Tomai, Sargur Srihari and Venu Govindaraju
CEDAR, Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, NY 14228
{catalin,srihari,govind}@cedar.buffalo.edu

Abstract

A recognition-based system was developed for constructing handwriting databases. The system automatically recognizes the word and character images in handwritten document images by applying a transcript-mapping algorithm. The transcript-mapping process is modeled as an optimization problem involving multiple word-segmentation hypotheses, word recognition and word alignment. Extensive experiments show that a large number of character and word images can be automatically extracted with very high reliability.

1. Introduction

Handwriting processing generally involves two major tasks: handwriting recognition and handwriting identification, the latter with two subproblems, writer identification and writer verification. Handwriting recognition recovers the content of handwriting by filtering out the differences among writers. Handwriting identification tries to establish the identity of the writer from the idiosyncrasies of the writing (identification model) or to determine whether two writings originated from the same writer (verification model) [1, 2]. Handwriting recognition and identification have been actively studied over the past thirty years [3, 1, 4]. However, unconstrained handwriting recognition remains a challenge [5, 6]. The lack of automated or semi-automated handwriting identification systems (with few exceptions: FISH [7, 8] and SCRIPT [9]) is evidence of the difficulty of the problem.

The difficulty of handwriting recognition in unconstrained domains has led to a new technology, handwriting retrieval, for indexing handwritten documents [10, 11, 12]. Handwriting retrieval is the task of searching a repository of handwritten documents for those most similar to a given query writing, which may be an image of a handwritten word, phrase, or document. While handwriting recognition and writer identification are multi-class classification problems, writer verification is a two-class problem. The retrieval process makes no decision itself; users decide how to use the retrieved matches. The key problem underlying handwriting identification and retrieval is pattern matching, which involves seeking effective features and defining similarities between feature vectors.

The three research fields of handwriting processing, i.e., recognition, identification, and retrieval, have been studied relatively independently in the past because of their different concentrations. Recent research on the individuality of handwriting [13, 14, 15, 16] has shown the effectiveness of handwritten characters and words for identification and verification. In the previous studies [15, 16], all character and word images were manually extracted from a large number of handwritten documents and ground-truthed. As manual extraction of character and word images requires a lot of effort and time, there is a great need for coupling recognition with handwriting matching and retrieval. Handwriting recognition not only largely boosts the efficiency of handwriting matching and retrieval, it also provides more information and features for them.

Partially because of the lack of recognition capabilities, most existing handwriting identification techniques rely heavily on certain types of features and cannot be easily adapted to the many applications where these features may not be possible to extract. While a number of systems fulfill the identification task by analyzing graphemes of writing [17, 18, 19, 20, 21], others use characteristic handwritten words [1, 22]. Said et al. [23] used texture features from handwritten documents for deciding writership.

To be able to automatically extract character and word images, we constrain the recognition by the transcript of the content of the given document image, an approach we call transcript-based word recognition. Even though we have a transcript associated with a document image, we do not know the correspondences between the word images and the words in the transcript. Two main problems are to be solved: word segmentation, and matching segmented word images against the words in the transcript. As perfect segmentation and recognition are impossible for general handwritten documents, an algorithm employing multiple segmentation hypotheses may satisfactorily solve the segmentation problem. Transcript-based word recognition allows a lexicon reduction based on the local and global constraints from both a writing image and its truth transcript. The challenge is how to optimally use the information from a document image and its complete truth transcript to get the best mapping results.

In an early work by Hobby [24], matching machine-printed document images with ground truth was simplified to optimizing a transformation based on character bounding boxes, but it is effective only for machine-printed document images with well-separated characters. For handwritten document images, especially cursive handwriting, bounding boxes of character images are very difficult to obtain, so the method developed by Hobby is not viable for matching handwritten document images with ground truth.

We propose a recognition-based matching architecture with direct application in handwriting identification and retrieval, where handwritten document images are mainly described by effective binary features. The system transcribes a handwritten document image into a number of word and character images by applying a transcript-mapping algorithm. Binary features are then extracted from the recognized character and word images for the purposes of handwriting identification and retrieval.

The rest of the paper is organized as follows. Section 2 outlines the system's architecture and the modules' functionality. Section 3 describes word segmentation with multiple hypotheses. Section 4 details the transcript-based word recognition/mapping algorithm based on multiple word-segmentation hypotheses. Section 5 introduces a methodology for evaluating the performance of the transcript-based recognition/mapping algorithm. Section 6 presents the experimental results and analysis, and Section 7 concludes the paper.

2. Outline of the System

The system first converts a grayscale image into a binary image using Otsu's binarization algorithm [25], then represents the binary image as a series of contours encoded as chain codes [26]. It then performs line segmentation, word segmentation, and word mapping.

The line segmentation step, performed on the chain-code representation of the binary image, attempts to segment the handwritten image into lines so that each line can be further divided into words [27, 28, 29]. The contours are grouped into components, which are higher-order groups of contours (e.g., a word or a character). Groups of components are segmented into lines in three steps: (i) local maxima and minima are extracted and the average component height is computed; (ii) extrema are clustered, each cluster corresponding to a line; (iii) for components spanning multiple lines, ascender/descender information is used to merge or split lines. Line segmentation can handle variability in baseline position, line skew, character size and inter-line distance.

For each line image, the actual word images are segmented. We differentiate two types of word segmentation, direct segmentation and multi-hypothesis segmentation, which are detailed in the next section. If a truth transcript is associated with the document image, the segmented line images are decoded into a set of recognized word images by a transcript-mapping algorithm, and each recognized word image is further decomposed into character images by a character-segment based word recognizer. If a document image has no transcript, the system automatically recognizes the isolated alphabetic and numeric character images in the document.

To search a handwritten document image for patterns similar to a query word image, the document image is first decomposed into a set of word images using either direct word segmentation or the transcript-based mapping algorithm, and the images in the set are then ranked by their similarity to the query sample. In the subsequent sections, we detail the key part of the system, the transcript-based word mapping algorithm.
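The binarization step that opens the pipeline can be illustrated with a minimal, pure-Python sketch of Otsu's method. This is not the system's own code: the function names `otsu_threshold` and `binarize`, the list-of-rows image layout, and the convention that dark pixels are ink are assumptions made for the example.

```python
def otsu_threshold(histogram):
    """Return the gray level t that maximizes the between-class variance.

    histogram[g] is the number of pixels with gray level g. Pixels with
    level <= t form the dark class, all others the bright class.
    """
    total = sum(histogram)
    weighted_total = sum(g * count for g, count in enumerate(histogram))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of gray levels in the dark class so far
    for t, count in enumerate(histogram):
        w0 += count
        sum0 += t * count
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, the variance is undefined
        mean0 = sum0 / w0
        mean1 = (weighted_total - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(gray_rows, levels=256):
    """Binarize a grayscale image given as a list of rows of integers."""
    histogram = [0] * levels
    for row in gray_rows:
        for g in row:
            histogram[g] += 1
    t = otsu_threshold(histogram)
    # Assumption: dark pixels are ink (1), bright pixels are background (0).
    return [[1 if g <= t else 0 for g in row] for row in gray_rows]
```

Contour tracing and chain coding would then operate on the 0/1 image returned by `binarize`.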

3. Word Segmentation with Multiple Hypotheses

A general assumption for word segmentation is that inter-word spacing is greater than inter-character spacing. Punctuation information is used together with inter-word gaps for word segmentation. Direct word segmentation, which gives only one segmentation configuration, has been adopted in many handwriting recognition tasks [30, 6, 5]. In [31], the authors avoid line segmentation and word segmentation altogether and instead identify candidate locations by cross-correlating the document with a set of keyword prototypes extracted from a set of documents.

A better approach is to obtain several segmentation configurations and to combine them into a new configuration with all word images correctly separated. Figure 1 shows an example of word segmentation with multiple hypotheses.

Figure 1. Multiple word-segmentation hypotheses: (a) a line image; (b) a word-segmentation hypothesis; (c) another word-segmentation hypothesis. Neither of the two can correctly segment all words, but all correct word segmentations are contained in the two hypotheses.

We believe that it is essential to generate multiple word-segmentation hypotheses for a line to make sure that the right segmentation configuration is among them. Otherwise, the right configuration is missed, which badly affects the later stages of the matching process. To generate multiple word-segmentation hypotheses, the gaps between adjacent components, measured as the distance between the components' convex hulls (see [29] for details), are ranked. Then, for each admissible number of words $k$ in the given line, the hypotheses for choosing $k$ words are ranked and returned. With the help of word recognition, the correct segmentations can be identified among the generated hypotheses.

In the following section, we describe the algorithm for transcript-based mapping using multiple word-segmentation hypotheses. The goal is to build a correspondence between each word image in a document image and its content in the transcript of the document image.

4. Transcript-based Mapping with Multiple Word Segmentation Hypotheses

4.1. Problem Description

The result of the line and word segmentation of the handwritten document image $I$ is a set of word separation hypotheses for each line. Then, for each line, we have to match the word image hypotheses against a subset of words from the transcript by using word recognition algorithms. The subset of hypotheses that best matches the set of words is returned by a local optimization procedure. Moreover, global optimizations, i.e., paragraph match and page match, have to be satisfied. The global optimizations may refine the local optimization process by rebuilding the subset of words from the transcript that is used as a lexicon for the word recognizer.

Before we formalize the problem, we specify some definitions. A word break hypothesis, $h$, for a line image consists of a list of ordered disconnected word images, defined as

$$ h = (w_1, w_2, \ldots, w_n) \qquad (1) $$

where $w_i$, $1 \le i \le n$, represents a word image segment containing the word image data and some other information to be explained later. A line word-hypothesis set, $H_l$, contains multiple word break hypotheses for a line image $l$, given by

$$ H_l = \{h_1, h_2, \ldots, h_m\} \qquad (2) $$

A document word-hypothesis list, $\mathcal{H}$, is an ordered list of all line word-hypothesis sets for a document image $I$:

$$ \mathcal{H} = (H_1, H_2, \ldots, H_L) \qquad (3) $$

A truth transcript, $T$, for a document image $I$ is an ordered list of text words, expressed as:

$$ T = (t_1, t_2, \ldots, t_N) \qquad (4) $$

A mapping $M$ between a truth transcript $T$ and its image $I$ with a document word-hypothesis list $\mathcal{H}$ is defined as

$$ M = ((t_1, w_1), (t_2, w_2), \ldots, (t_N, w_N)) \qquad (5) $$

where $T = (t_1, t_2, \ldots, t_N)$ and $w_j$ ($1 \le j \le N$) is a word image segment embedded in $\mathcal{H}$. A word image segment $w$ contains the following information: (i) a line number $w.\mathrm{line}$, specifying its line location in the document image; (ii) a word sequential number $w.\mathrm{index}$, specifying its index in a line word-hypothesis; (iii) a bounding box ($w.\mathrm{left}$, $w.\mathrm{right}$, $w.\mathrm{top}$, $w.\mathrm{bottom}$), specifying its position in the original image; (iv) a recognized/mapped text content $w.\mathrm{text}$ and a confidence value $w.\mathrm{conf}$ assigned by the mapping process. To differentiate a set from a list, we write a set with braces, $\{\cdot\}$, and a list with parentheses, $(\cdot)$.

The problem of transcript-based word mapping can then be formalized as follows. Given a document word-hypothesis list $\mathcal{H}$ and a truth transcript $T$ for a document image $I$, the goal is to find an ordered list ($W$) of word image segments (embedded in $\mathcal{H}$) that best matches the transcript $T$:



 







$$ W^{*} = \arg\min_{W \subset \mathcal{H}} \sum_{j=1}^{N} d(w_j.\mathrm{text},\, t_j) \qquad (6) $$

where each ordered list $W = (w_1, w_2, \ldots, w_N)$ is constrained by the following two conditions: (i) $w_i.\mathrm{left} < w_j.\mathrm{left}$ if $i < j$ and $w_i.\mathrm{line} = w_j.\mathrm{line}$; (ii) $w_i.\mathrm{line} \le w_j.\mathrm{line}$ if $i < j$. In the formula above, $d(\cdot,\cdot)$ represents the distance between two text words, and the constraints on every ordered list of word image segments ensure that the order of words in $W$ conforms to the word/line layout in the document image $I$.

[Figure 2 flowchart blocks: Document Word-hypothesis Set; Truth Transcript; Constraint-base; Search for Global Anchors; Next Line Word-hypothesis Set; Lexicon Selection; Word Recognition for Each Element; Search for the Best Match by DP; New Constraints Confirmed (Yes); Refine Previous Results? (Yes/No); More Lines? (Yes/No); Post Processing; Word Spotting; End]

Figure 2. Algorithm Diagram of Word Recognition/Mapping
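The hypothesis generation of Section 3 and the layout constraints of the problem description can be sketched together in a few lines of Python. Everything named here is illustrative rather than taken from the system: `WordSegment` mirrors the four pieces of segment information listed above, gaps are measured between bounding-box x-extents instead of convex hulls, and the `extra` parameter (how many additional gaps to consider beyond the $k-1$ widest) is an assumption.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class WordSegment:
    line: int          # line number in the document image
    index: int         # sequential number within its hypothesis
    left: int          # bounding box
    right: int
    top: int
    bottom: int
    text: str = ""     # recognized/mapped content, filled in later
    conf: float = 0.0  # confidence assigned by the mapping process


def hypotheses_for_line(components, k, line=0, extra=1):
    """Enumerate word break hypotheses that cut a line into k words.

    components: (left, right) x-extents sorted left to right. The
    k-1+extra widest gaps are candidate cuts, and every choice of k-1
    candidates yields one hypothesis (vertical extents are left at 0).
    """
    gaps = [(components[i + 1][0] - components[i][1], i)
            for i in range(len(components) - 1)]
    ranked = sorted(gaps, reverse=True)[:k - 1 + extra]
    candidates = sorted(i for _, i in ranked)
    result = []
    for cuts in combinations(candidates, k - 1):
        bounds = [0] + [c + 1 for c in cuts] + [len(components)]
        words = [WordSegment(line, n, components[a][0], components[b - 1][1], 0, 0)
                 for n, (a, b) in enumerate(zip(bounds, bounds[1:]))]
        result.append(words)
    return result


def respects_layout(segments):
    """Ordering constraints: lines never decrease, left edges grow within a line."""
    for a, b in zip(segments, segments[1:]):
        if a.line > b.line:
            return False
        if a.line == b.line and a.left >= b.left:
            return False
    return True


def mapping_cost(segments, transcript, distance):
    """Summed word distance between mapped text and transcript words."""
    return sum(distance(w.text, t) for w, t in zip(segments, transcript))
```

Checking only consecutive pairs in `respects_layout` is sufficient because both constraints are transitive once line numbers are non-decreasing. The `distance` argument stands for any word distance, for instance an edit distance.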

4.2. MiWRM: An Algorithm for Transcript-based Mapping

An algorithm named Mixture of Word Recognition and Mapping (MiWRM) was designed to solve the problem described above. The diagram of MiWRM is shown in Figure 2. MiWRM dynamically takes advantage of all the global constraints (anchors) and local constraints obtained from the truth transcript $T$, the word-hypothesis list $\mathcal{H}$, and even the document image $I$, to find an optimal lexicon set from $T$ for each word image segment embedded in $\mathcal{H}$ (the coarse word mapping process). The input to the word recognizer is the word image and the previously computed lexicon set, and the output is the ranked list of lexicon words (the word recognition process). Therefore, for each line word-hypothesis set $H_l$, $1 \le l \le L$, each word image segment $w_j$ of each hypothesis $h_i \in H_l$ ($1 \le i \le m$, $1 \le j \le n_i$) is associated with its word recognition result, which is a list of (character string, confidence) pairs sorted in descending order of confidence. A dynamic programming (DP) algorithm is used to find an ordered list ($W_l$) of word image segments embedded in $H_l$ that best matches the part of the transcript corresponding to the line. Each line is thus associated with $W_l$ and a confidence value that expresses the degree of certainty of the system in the recognition/mapping result.


Once a line is successfully processed, we obtain a set of new mappings. The new mappings indicate the start point in the transcript for the subsequent mapping process, and this start point is recorded in a constraint database (explained in the next subsection).

4.2.1. Constraint-base

Constraints obtained from the previous mapping process are used to regulate the subsequent mapping process. Constraints include anchors and statistical information such as the average width of character images. An anchor is a reliable mapping between a word segment (image) and a text word. If a mapping is identified with very high confidence, it becomes an anchor. This information is stored in a constraint-base (CB). CB is dynamically updated during the recognition/mapping process whenever new mappings are confirmed. The coarse and fine mapping processes query CB intensively.

4.2.2. Coarse Word Mapping

Given a word-hypothesis set $H_l$ for the current line image (all line images above this line have already been processed), coarse word mapping first finds a list of candidate words from $T$ for the whole line. Such a list can be expressed as $T_l = (t_s, t_{s+1}, \ldots, t_{s+k})$ with $1 \le s \le N$ and $1 \le s + k \le N$. $T_l$ is a sublist of $T$. The computation of $s$ and $k$ needs to take into account the previous mapping results and the length of the line image.

Then, coarse word mapping decides a candidate word list (again a sublist of $T$) for each word segment in each word-hypothesis. For the $j$-th word segment in a word-hypothesis with $n$ segments, its candidate list is expressed as $T_l^{j} = (t_{s_j}, t_{s_j+1}, \ldots, t_{s_j+r})$ with $s \le s_j \le s + k$ and $s \le s_j + r \le s + k$. The start index $s_j$ is computed by

$$ s_j = \ldots \qquad (7) $$

The size of the list, $r$, is set to $k/2$ so that each word segment is mapped with half of the words in $T_l$. Such a rough mapping ensures that the right word candidate for each word segment is in $T_l^{j}$.

4.2.3. Word-model Word Recognition

Word-Model Word Recognition (WMR) [5] takes as inputs a word image and a lexicon, and finds the best match between a word in the lexicon and the image. To match the word image against a lexicon, WMR involves three major phases: segmentation, feature extraction and matching. The segmentation phase separates a word image into smaller pieces called segments. Each segment represents a character or a sub-character (i.e., a part of a character). During feature extraction, features are extracted from all possible combinations of 1 to 4 consecutive segments (called super-segments). A super-segment corresponds to a single character in a lexicon word. Given a lexicon word, the matching phase uses dynamic programming to match the features of the super-segments with the "ideal" features of the word's characters (obtained in the training procedure of WMR) and takes the edit distance as the matching/recognition score. The matching phase is repeated for all lexicon words. WMR's output consists of a list of lexicon entries ranked in descending order of their confidence values. Because the matching phase determines the segmentation points between the segments that correspond to the characters of a lexicon word, WMR can also be used to segment character images from a word image if a single true lexicon word is presented.
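The matching phase can be pictured with the following dynamic-programming sketch, in which every character of a lexicon word consumes between one and four consecutive segments. It is only an illustration under stated assumptions: `match_word`, `rank_lexicon` and the caller-supplied `char_cost` function (a stand-in for the comparison against trained "ideal" character features) are hypothetical names, and the score here is a plain summed cost where lower is better.

```python
def match_word(word, segments, char_cost, max_group=4):
    """Cheapest assignment of 1..max_group consecutive segments per character.

    Returns (score, cuts), where cuts[i] is the index just past the last
    segment of character i, or (inf, []) if no assignment exists.
    """
    inf = float("inf")
    n, m = len(word), len(segments)
    # best[i][j]: cost of matching the first i characters to the first j segments
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[0] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(i, m + 1):
            for g in range(1, min(max_group, j) + 1):
                prev = best[i - 1][j - g]
                if prev == inf:
                    continue
                cost = prev + char_cost(word[i - 1], segments[j - g:j])
                if cost < best[i][j]:
                    best[i][j], back[i][j] = cost, g
    if best[n][m] == inf:
        return inf, []
    cuts, j = [], m
    for i in range(n, 0, -1):  # walk back to recover the character boundaries
        cuts.append(j)
        j -= back[i][j]
    return best[n][m], cuts[::-1]


def rank_lexicon(lexicon, segments, char_cost):
    """Score every lexicon word and return (score, word) pairs, best first."""
    return sorted((match_word(w, segments, char_cost)[0], w) for w in lexicon)
```

Calling `match_word` with a lexicon of one true word returns the cut points, which corresponds to using the recognizer to segment character images from a word image.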



4.2.4. Fine Word Mapping

Given a word-hypothesis set $H_l$ of a line image $l$, in which each word image segment $w$ has been tagged with its word recognition result (a sorted list of character string and confidence pairs), fine word mapping consists of finding an ordered list ($W_l$) of word image segments embedded in $H_l$ that best matches the transcript $T_l$ corresponding to $l$. The transcript $T_l$ for $l$ is obtained in the coarse word mapping stage. A dynamic programming algorithm, Longest Common Subsequence (LCS), is used to find the common sub word-sequence (CSWS) between each word break hypothesis $h_i \in H_l$, $1 \le i \le m$, and $T_l$. Here, for each $w$, only the top entry with the highest confidence is considered. Moreover, if the confidence of the top entry in $w$ is lower than a threshold, $w$ is ignored. If the confidence of the top entry in $w$ is quite high, $w$ becomes an anchor. In this way, each word $t$ in the transcript $T_l$ is associated with several (or in some cases no) word images (hypotheses), and the word image $w$ with the highest confidence is chosen as the mapping to $t$. We thereby obtain a line mapping $M_l$ between $T_l$ and $H_l$.

Finally, we need to examine the correctness of the mapping $M_l$. Given

$$ M_l = ((t_1, w_1), (t_2, w_2), \ldots, (t_K, w_K)) \qquad (8) $$

a valid mapping should satisfy the following condition:

$$ w_i.\mathrm{left} < w_j.\mathrm{left} \quad \text{for all } 1 \le i < j \le K \qquad (9) $$
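As a rough illustration of the alignment and of the validity test in (9), the sketch below runs a standard LCS over the top recognition result of each segment in one hypothesis. The tuple layout `(text, conf, left)`, the function names, and the averaging of confidences over the matched pairs are assumptions made for the example.

```python
def lcs_align(recognized, transcript):
    """Index pairs (i, j) with recognized[i] == transcript[j] on one LCS.

    Entries of `recognized` that are None never match anything.
    """
    n, m = len(recognized), len(transcript)
    # table[i][j]: LCS length of recognized[i:] and transcript[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if recognized[i] is not None and recognized[i] == transcript[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if recognized[i] is not None and recognized[i] == transcript[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def line_mapping(segments, transcript, threshold):
    """Map one hypothesis to the line transcript.

    segments: (text, conf, left) tuples holding the top recognition result
    of each word image segment. Returns (pairs, confidence); the confidence
    is 0.0 when nothing matches or the left-to-right test fails.
    """
    words = [text if conf >= threshold else None for text, conf, _ in segments]
    pairs = lcs_align(words, transcript)
    lefts = [segments[i][2] for i, _ in pairs]
    valid = all(a < b for a, b in zip(lefts, lefts[1:]))
    if not pairs or not valid:
        return pairs, 0.0
    return pairs, sum(segments[i][1] for i, _ in pairs) / len(pairs)
```

Running `line_mapping` on every hypothesis of a line and keeping, for each transcript word, the matched segment with the highest confidence would give one possible realization of the line mapping.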

$M_l$ is assigned a confidence value computed by averaging the word-recognition confidences of all entries in $M_l$. If $M_l$ is illegal, its confidence is 0. Confidence values have a fixed upper bound. A fine word mapping $M_l$ imposes new constraints on the following lines: we can narrow down the search range of coarse word mapping for the remaining line word-hypothesis sets by starting from the word next to the last element in $M_l$. $M_l$ is included in the constraint-base (CB).

4.3. Post Processing

Up to now, we have partly mapped the truth transcript $T$ to the image $I$. The anchors are scattered inside the mapping $M$, and any word between two anchors is in a dangling (unconfirmed) state. Consider two consecutive anchors in $M$, $(t_a, w_a)$ and $(t_b, w_b)$:

$$ M = (\ldots, (t_a, w_a), (t_{a+1}, w_{a+1}), \ldots, (t_{b-1}, w_{b-1}), (t_b, w_b), \ldots) \qquad (10) $$

Assume the two anchors belong to the same line, and the line bounding box is ($L_{\mathrm{left}}$, $L_{\mathrm{top}}$, $L_{\mathrm{right}}$, $L_{\mathrm{bottom}}$). A rough mapping for any word $(t_j, w_j)$, $a < j < b$, between the two anchors can then be derived from the bounding boxes of the two anchors and the line bounding box.

In the constraint-base, the average width of character images is computed from the mappings with very high confidence. Therefore, if we know the width of a line image, we can estimate the number of characters written in the line. If the words mapped to a single line contain too many characters, some of the transcript words mapped to the line have to be removed according to the width of the line; if they contain too few characters, additional words from the transcript are added to the line mapping according to the width of the line.
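The character-count adjustment can be sketched as simple arithmetic on the line width and the average character width. The text does not say which words are removed or added, nor how large a deviation is tolerated, so the choices below (trim from the end of the line, extend with the next transcript words, a relative `tolerance`, and ignoring inter-word spaces) are assumptions for illustration only.

```python
def char_count(words):
    """Total number of characters in a list of words (spaces ignored)."""
    return sum(len(w) for w in words)


def fit_words_to_line(words, next_words, line_width, avg_char_width, tolerance=0.15):
    """Trim or extend the transcript words mapped to one line.

    words: transcript words currently mapped to the line.
    next_words: transcript words that follow them.
    Returns (words kept on the line, words left for later lines).
    """
    capacity = line_width / avg_char_width  # estimated characters on the line
    upper = capacity * (1 + tolerance)
    lower = capacity * (1 - tolerance)
    words, pending = list(words), list(next_words)
    # Too many characters: push trailing words back to the following lines.
    while words and char_count(words) > upper:
        pending.insert(0, words.pop())
    # Too few characters: pull in following words while they still fit.
    while (pending and char_count(words) < lower
           and char_count(words) + len(pending[0]) <= upper):
        words.append(pending.pop(0))
    return words, pending
```

For example, with a 100-pixel line and an average character width of 10 pixels, `fit_words_to_line(["We", "were", "referred", "to"], [], 100, 10)` keeps the first two words on the line and defers the other two.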

5. Methodology for Evaluating Performance of MiWRM

Evaluating the performance of the MiWRM algorithm is crucial. As mentioned before, visually inspecting and filtering the results requires a great deal of effort and time. When a large number of documents exists, this work becomes almost impossible. We propose a verification-based methodology to automatically estimate the mapping precision of MiWRM.

When verifying a word image I with a lexicon containing a single word W, we differentiate two scenarios, matched verification ($w_1$) and mismatched verification ($w_2$). Matched verification means that the lexicon word W is the content of I, and mismatched verification means that I represents a different word from W. The distribution of verification scores for matched or mismatched verification can be obtained by performing word verification on a set of manually extracted word images, each associated with a true or false lexicon word (for matched and mismatched verification, respectively).

Let $p(x|w_1)$ be the distribution of verification scores from a matched scenario, $p(x|w_2)$ be the distribution from a mismatched scenario, and $q(x)$ be the distribution from a real experiment that includes both matched and mismatched verification. Figure 3 shows the relationship among the three distributions. According to Bayes' theorem, the recognition accuracy of $q(x)$ can be computed as:

$$ A(q(x)) = \int q(x)\, \frac{p(x|w_1) P(w_1)}{p(x|w_1) P(w_1) + p(x|w_2) P(w_2)}\, dx = \int P(w_1|x)\, q(x)\, dx \qquad (11) $$

In handwriting identification and verification, only the correctly extracted word images are used for feature extraction or decomposed into character images. There is a tradeoff between acceptance rate and reliability, and usually both parameters need to be considered in order to make a decision. The reliability $R(t)$ associated with $q(x)$ is a function of the score threshold $t$, given by:

$$ R(t) = \frac{\int_{x \le t} P(w_1|x)\, q(x)\, dx}{\int_{x \le t} q(x)\, dx} \qquad (12) $$

The acceptance rate, $A_r(t)$, is simply computed by

$$ A_r(t) = \int_{x \le t} q(x)\, dx \qquad (13) $$

In this study, a set of 13,551 ground-truthed images of four words, "been", "Cohen", "Medical", and "referred", is used to generate the matched verification score distribution ($p(x|w_1)$) and the mismatched one ($p(x|w_2)$). Considering only the four words, there are three mismatches for a word image. For matched verification, each image and its true content (a lexicon of size 1) are given to the word recognizer WMR. For mismatched verification, each image is recognized three times, each time with a different mismatched lexicon. In the subsequent sections, $A(q(x))$ is used to evaluate the mapping performance of the MiWRM algorithm on every document image of the set.
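A discrete version of this evaluation, with scores grouped into histogram bins, can be sketched as follows. The equal priors (`prior1=0.5`) and the convention that bins up to the threshold are the accepted ones (low WMR scores being the confident ones) are assumptions of the sketch, as are the function names.

```python
def posterior_matched(p1, p2, prior1=0.5):
    """P(w1 | x) for every score bin, by Bayes' theorem.

    p1, p2: binned score distributions for matched and mismatched
    verification. prior1 is an assumed prior for the matched class.
    """
    posterior = []
    for a, b in zip(p1, p2):
        numerator = a * prior1
        denominator = numerator + b * (1.0 - prior1)
        posterior.append(numerator / denominator if denominator > 0 else 0.0)
    return posterior


def estimated_accuracy(q, posterior):
    """Expected fraction of correct mappings under the observed scores q."""
    return sum(qx * px for qx, px in zip(q, posterior))


def acceptance_and_reliability(q, posterior, t):
    """Acceptance rate and reliability when bins with index <= t are accepted."""
    accepted = sum(q[:t + 1])
    correct = sum(qx * px for qx, px in zip(q[:t + 1], posterior[:t + 1]))
    return accepted, (correct / accepted if accepted > 0 else 0.0)
```

Sweeping `t` over all bins and plotting the two returned values against each other gives an acceptance-rate vs. reliability curve.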

Figure 3. Illustration of the evaluation methodology, showing $p(x|w_1)$ (matched verification scores), $p(x|w_2)$ (mismatched verification scores) and $q(x)$ (scores observed in a real experiment) over the score $x$.

6. Experiments and Analysis

We first describe the experimental settings, then present the mapping results. We also present the results of applying the mapping to the applications considered (writer verification and identification, handwriting retrieval). We conclude with an analysis of these results.

6.1. Experimental Settings

Experiments to determine the transcript-based mapping performance were conducted on 2,997 documents written by 999 individuals in the US. Each writer provided three copies of the CEDAR letter, which contains 156 words [13]. Figure 5 shows a sample image and the content of the CEDAR letter. Word images in all 2,997 images are extracted using the mapping algorithm, then the character images in each word image are segmented. The extracted characters are then used for the tasks of writer identification and verification.

Figure 4. Distribution of WMR verification scores: $p(x|w_1)$ for matched verification and $p(x|w_2)$ for mismatched verification. (Axes: WMR recognition score vs. probability.)

6.2. Word Mapping Results

In total, 373,825 word images, about 80% of all 467,532, were extracted and recognized with the mapping algorithm. Figures 6 and 7 show the transcription result for the sample in Figure 5. Figure 7 presents the transcribed content of Figure 6; specifically, the n-th text word on the m-th line in Figure 7, T(m,n), corresponds to the n-th segmented word image on the m-th line in Figure 6, I(m,n). The tag '?...?' in Figure 7 indicates that the algorithm did not find the matching word image in Figure 6. In this example, only eight text words miss their counterparts, indicating a mapping accuracy of about 94.9% (148 of the 156 words).

The evaluation method developed in Section 5 is used to estimate the mapping performance. Figure 8 shows the distribution ($q(x)$) of the recognition scores of all 373,825 word images, and applying (11) to it gives the estimated mapping accuracy. Figure 9 displays the acceptance-rate vs. reliability curve, read at the thresholds t = 5.0, 5.8, 6.4 and 7.3; raising the threshold accepts a larger share of the extracted word images at a lower reliability.

The high reliability of the word images with low recognition scores allows their use for the writer identification and verification tasks without introducing significant errors. For example, if we choose t = 5.8, an average of 46 word images per document can be accepted with very high reliability.

Apart from the overall mapping performance, the mapping effect for individual documents is also examined. For each of the 2,997 images, a distribution of recognition scores from mapped word images is computed. Then we apply (11) to estimate the mapping accuracy. Figure 10(a) shows the distribution of the estimated mapping accuracies for all individual document images, and Figure 10(b) gives the trend of the accumulated probability of accuracy. It is observed that 200 document images are processed with high accuracy, and 1000 images at a lower accuracy threshold. In the following subsections, we detail some more promising results.

From
Jim Elder
829 Loop Street, Apt 300
Allentown, New York 14707

Nov 10, 1999

To
Dr. Bob Grant
602 Queensberry Parkway
Omar, West Virginia 25638

We were referred to you by Xena Cohen at the University Medical Center. This is regarding my friend, Kate Zack. It all started around six months ago while attending the "Rubeq" Jazz Concert. Organizing such an event is no picnic, and as President of the Alumni Association, a co-sponsor of the event, Kate was overworked. But she enjoyed her job, and did what was required of her with great zeal and enthusiasm. However, the extra hours affected her health; halfway through the show she passed out. We rushed her to the hospital, and several questions, x-rays and blood tests later, were told it was just exhaustion. Kate's been in very bad health since. Could you kindly take a look at the results and give us your opinion? Thank you! Jim

Figure 5. A handwriting sample with a transcript.

Figure 6. Located and recognized word images in the handwriting sample shown in Figure 5.

Figure 7. Recognized contents of the word images in Figure 6.

6.3. Decomposition of Word Images

From the 373,825 word images, 1,331,253 character images were extracted using the segmentation-based word recognizer WMR. We manually verified all 56,179 numeral images (about 4.2% of the total) to determine how many of them were correctly recognized. Three distributions of confidence values are computed from the 56,179 images, shown in Figure 11. Most character images are recognized with a high confidence value. Figure 12 displays the acceptance rate vs. reliability curve. Here we see an ideal case in which high acceptance rates coincide with high reliability. This implies that the actual word mapping performance should be better than the

estimated one in the previous subsection. The inaccuracy of the word mapping estimate results from the small word image set used for computing $p(x|w_1)$ and $p(x|w_2)$; this set will be expanded in the future.

Figure 8. The distribution ($q(x)$) of the recognition scores of all 373,825 word images. Two score distributions for matched and mismatched verification are also drawn for comparison. (Axes: WMR recognition score vs. probability.)

Figure 9. Acceptance rate vs. reliability for the extracted 373,825 word images.





6.4. Analysis

By (partially) solving handwritten document recognition using transcripts, the proposed MiWRM algorithm can automatically retrieve a large amount of information from handwritten document images. The initial version of the algorithm can recognize over 200 document images with high accuracy. The extensive experiments on 2,997 handwritten documents show that a large proportion of the recognized word images are of very high reliability. A large number of samples from over 1.33 million character images, automatically extracted by MiWRM, have been found to have high recognition accuracy.

The experiments have also revealed some problems with the system that remain to be solved:

• Line segmentation: sometimes words from different lines are grouped together.

• Word segmentation: the word recognizer WMR is sensitive to punctuation following a word image. The removal of punctuation will likely improve the mapping performance.

• Mapping algorithm: the MiWRM mapping algorithm can also be improved using direct word segmentation. When a document image is very noisy or a sequence of words is badly written, the dynamic lexicon generation for line images sometimes loses track of words in a transcript. Performing word recognition after direct word segmentation will help locate a number of global anchors so that MiWRM becomes aware of the local failures. MiWRM will also be more efficient if multiple word-segmentation hypotheses need to be generated only between two global anchors.

Figure 10. Performance of word mapping for individual documents: (a) the distribution of mapping accuracy (probability P(x) vs. mapping accuracy x); (b) the accumulated probability of accuracy (P(x>t) vs. accuracy threshold t).

7. Conclusion

We developed a recognition-based system for constructing very large handwriting databases. The system applies a transcript mapping algorithm to automatically match word and character images in handwritten documents with their counterparts in ground truth. The transcript-mapping process is modeled as an optimization problem involving multiple word-segmentation hypotheses, word recognition and word alignment. The extensive experiments show that a large number of character and word images can be automatically extracted with high reliability.

Figure 11. Three distributions of confidence values of the manually truthed 56,179 character images. (Curves: All Characters Extracted, Characters Correctly Extracted, Characters Wrongly Extracted; axes: probability vs. recognition confidence.)

Figure 12. Acceptance rate vs. reliability from the manually truthed 56,179 character images.

References

[1] Rejean Plamondon and Guy Lorrette, “Automatic signature verification and writer identification - the state of the art,” Pattern Recognition, vol. 22, no. 2, pp. 107–131, 1989.
[2] Sargur N. Srihari, Bin Zhang, Catalin Tomai, Sangjik Lee, Zhixin Shi, and Yong-Chul Shin, “A system for handwriting matching and recognition,” in 2003 Symposium on Document Image Understanding Technology (SDIUT03), Greenbelt, Maryland, April 9-11 2003, pp. 67–75.
[3] C. Y. Suen, M. Berthod, and S. Mori, “Automatic recognition of handprinted characters: the state of the art,” Proc. of IEEE, vol. 68, no. 4, pp. 63–84, April 1980.
[4] Rejean Plamondon and Sargur N. Srihari, “On-line and off-line handwriting recognition: A comprehensive survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 63–84, 2000.
[5] G. Kim, V. Govindaraju, and S. N. Srihari, “An architecture for handwritten text recognition systems,” International Journal on Document Analysis and Recognition, vol. 2, no. 1, 1999.
[6] S. N. Srihari and G. Kim, “Penman: A system for reading unconstrained handwritten page images,” in Symposium on Document Image Understanding Technology (SDIUT), Annapolis, MD, April 1997, pp. 142–153.
[7] V. Klement, R. D. Naske, and K. Steinke, “The application of image processing and pattern recognition techniques to the forensic analysis of handwriting,” in International Conference on Security through Science and Engineering, 1980, pp. 75–79.
[8] V. Klement, “An application system for the computer-assisted identification of handwritings,” in International Carnahan Conference on Security Technology, 1983, pp. 75–79.
[9] Wim C. de Jong, Leny N. Kroon van der Kooij, and Dick P. Schmidt, “Computer-aided analysis of handwriting, the nifotno approach,” in The 4th European handwriting Conference for Police and Government Handwriting Experts, 1994.
[10] R. Manmatha, C. Han, and E. M. Riseman, “Word spotting: A new approach to indexing handwriting,” in IEEE Computer Vision and Pattern Recognition Conference, San Francisco, CA, June 1996, pp. 631–637.
[11] A. Kolcz, J. Alspector, M. Augusteijn, R. Carlson, and G. Viorel Popescu, “A line-oriented approach to word spotting in handwritten documents,” Pattern Analysis & Applications, vol. 2, no. 3, pp. 153–168, 2000.
[12] Catalin I. Tomai, Bin Zhang, and Venu Govindaraju, “Transcript mapping for historic handwritten document images,” in Proceedings of the Eighth International Workshop on Frontiers in Handwriting Recognition (IWFHR-8), Niagara-on-the-Lake, Ontario, Canada, August 6-8 2002, pp. 413–418.
[13] Sargur N. Srihari, Sung-Hyuk Cha, Hina Arora, and Sangjik Lee, “Individuality of handwriting,” Journal of Forensic Sciences, vol. 47, no. 4, pp. 1–17, July 2002.
[14] Sargur N. Srihari, Catalin I Tomai, Bin Zhang, and Sangjik Lee, “Individuality of numerals,” in Proceedings of Seventh International Conference on Document Analysis and Recognition, Edinburgh, Scotland, August 3-6 2003.
[15] Bin Zhang, Sargur N. Srihari, and Sangjik Lee, “Individuality of handwritten characters,” in Proceedings of Seventh International Conference on Document Analysis and Recognition, Edinburgh, Scotland, August 3-6 2003.
[16] Bin Zhang and Sargur N. Srihari, “Analysis of handwritten individuality using handwritten words,” in Proceedings of Seventh International Conference on Document Analysis and Recognition, Edinburgh, Scotland, August 3-6 2003.
[17] F. Mihelic, N. Pavesic, and L. Gyergyek, “Recognition of writer of handwritten texts,” in Proc. 1977 Int. Conf. on Crime Countermeasures - Sci. Engin., University of Kentucky, Lexington, 1977, pp. 237–240.
[18] W. Kuckuck, “Writer recognition by spectrum analysis,” in Proc. 1980 Int. Conf. Security through Sci. Engin., West Berlin, 1980, pp. 1–3.
[19] R. D. Naske, “Writer recognition by prototype related deformation of handprinted characters,” in Proc. 6th Int. Conf. on Pattern Recognition, Munich, 1982, pp. 819–822.
[20] I. Dinstein and Y. Shapira, “Ancient hebraic handwriting identification with run-length histograms,” IEEE Trans. Syst. Man Cyber., vol. SMC-12, pp. 405–409, 1982.
[21] Isao Yoshimura and Mitsu Yoshimura, “Off-line writer verification using ordinary characters as the object,” Pattern Recognition, vol. 24, no. 9, pp. 909–915, 1991.
[22] Long Zuo, Yunhong Wang, and Tieniu Tan, “Personal identification based on pca,” in http://nlprweb.ia.ac.cn/english/irds/papers/zuolong/PR025.pdf.
[23] H. E. S. Said, G. S. Peake, T. N. Tan, and K. D. Baker, “Personal identification based on handwriting,” Pattern Recognition, vol. 33, pp. 149–160, 2000.
[24] John D. Hobby, “Matching document images with ground truth,” International Journal on Document Analysis and Recognition (IJDAR), vol. 1, no. 1, pp. 52–61, 1998.
[25] N. Otsu, “A threshold selection method from gray-level histograms,” IEEE Trans. Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62–66, 1979.
[26] G. Kim and V. Govindaraju, “Efficient chaincode based image manipulation for handwritten word recognition,” in Proc. of the IS&T/SPIE’s Symposium on Electronic Imaging: Science & Technology, San Jose, CA, Jan. 1996, pp. 262–272.
[27] G. Seni and E. Cohen, “External word segmentation of offline handwritten text lines,” PR, vol. 27, no. 1, pp. 41–52, January 1994.
[28] G. Seni and E. Cohen, “Segmenting handwritten text lines into words using distance algorithms,” pp. 61–72, 1992.
[29] U. Mahadevan and R. Nagabushnam, “Gap metrics for word separation in handwritten lines,” in ICDAR, 1995, pp. 124–127.
[30] R. Manmatha, Chengfeng Han, and E.M. Riseman, “Indexing handwriting using word matching,” in Digital Libraries ’96: 1st ACM International Conference on Digital Libraries, 1996.
[31] P. Keaton, H. Greenspan, and R. Goodman, “Keyword spotting for cursive document retrieval,” in Proceedings of the Workshop on Document Image Analysis - DIA’97, 1997.