Indexing and Retrieving Cursive Documents without Recognition
Antonio Clavelli1, Luigi P. Cordella2, Claudio De Stefano3 and Angelo Marcelli1
1 DIIIE, University of Salerno, Fisciano, ITALY
2 DIS, University of Napoli “Federico II”, Napoli, ITALY
3 DAEIMI, University of Cassino, Cassino, ITALY
Abstract
A large number of handwritten documents exist in image form, as scanned documents. The supporting electronic media allow for better preservation, but to access their content the documents must be processed by some kind of recognition technology that converts the image to searchable text. In the case of cursively written documents, even the best available technology introduces recognition errors that may drive down the performance of a document retrieval system. We propose a recognition-free approach with two main components: a shape matching algorithm, working on the ink, and a string matching algorithm, working on the ink interpretation of a reference set. Experiments on a data set of 16,500 cursive words produced by hundreds of writers show promising results and suggest that the proposed method can be a viable tool for building inexpensive retrieval systems for cursive documents.
1. Introduction
The advances in scanning and storage technologies over the past 20 years have favored the conversion of paper documents into electronic image files for better preservation and easy access. Preservation is ensured by the nature of the media, which are less sensitive to environmental factors, such as lighting conditions, temperature, humidity and dust, that greatly degrade the quality of paper documents over the years. Easy access, on the other hand, is ensured when the content of the documents can be handled by information retrieval (IR) technology, so that the system can process user queries asking for documents with specific features or content. For that to happen, document images must be processed and
978-1-4244-2175-6/08/$25.00 ©2008 IEEE
their content extracted and stored in an electronic format that can be manipulated by IR systems. An excellent survey of the whole field of document image analysis has been published recently [1]. In this paper we are interested in documents that mainly contain cursive text, such as city and parish records of events, court decisions, medical files and prescriptions, just to mention a few. Documents of this kind have been scanned and stored in image format, but their content is not available to IR systems and will not become available in the short term either. The main reason is the large variability exhibited by those documents, usually produced by different writers, which requires very powerful recognition engines as well as human post-processing to correct recognition errors at character level, since such errors greatly affect IR performance. Both factors raise the system cost, so that only large organizations can afford them. As a matter of fact, cursive recognition systems have been successfully adopted by such organizations (banks, insurance companies, postal services and nation-wide institutions), but only in very specific domains, where contextual information from other sources, in the form of dictionaries, word statistics, etc., limits human intervention and thus reduces the operational cost of achieving a recognition rate suitable for the specific application. Nonetheless, the acquisition cost of such systems is too high and discourages small and medium organizations from adopting them. On the other hand, many possible applications can trade errors in document retrieval for cost, on condition that the errors can be corrected easily, effectively and as cheaply as possible by human intervention. In this framework, we present an approach to building such an inexpensive, yet effective, system for retrieving documents containing cursive words. It is assumed that the query word is typed in by the user, and that the system produces as a result the bitmaps of the cursive specimens of the query word found in the archive, in ranked order. Thus, by visual inspection, the user can easily remove from the list the bitmaps of cursive specimens erroneously associated with the query word. Each bitmap is linked to the document where it appears, so that deleting the bitmap removes the document, unless the latter is associated with other specimens still in the ranked list. The match between the query and the word images does not rely on any kind of recognition, but rather on matching the character string of the query with those associated with each word image during an off-line indexing phase. Such an approach thus avoids both expensive recognition technologies and costly human correction of character-level recognition errors. The proposed system has two modes of operation. During the off-line installation, the set of scanned documents is processed in order to extract the bitmap of each word and to build an index for each of them. During the operational phase, the query typed in by the user is manipulated and matched against the index of each word to retrieve cursive instances of the query. In the following we illustrate the algorithms for indexing the word images and retrieving them in response to a user’s query, report the experiments for assessing the performance of the proposed system, draw some conclusions and outline future work.
2. Indexing and retrieval
In the following, we assume that a bitmap of each cursive word has been extracted from the documents, and that the skeleton of each bitmap has been extracted and unfolded, as if the word had been drawn on-line. We also assume that the ink has been segmented into strokes, each one corresponding to an elementary writing movement. Many methods have been proposed in the literature for recovering dynamic information from off-line data, as well as for performing stroke segmentation of on-line handwriting. In the following we assume that both unfolding and stroke segmentation have been accomplished by using algorithms previously developed by us [2-4]. Fig. 1 a) and b) show the bitmap of a word and the segmentation into strokes of its corresponding ink.
2.1 Word indexing
The indexing of a word relies upon an ink matching followed by a labeling. The ink matcher finds sequences
Figure 1: The indexing of a cursive word. a) the bitmap; b) the stroke segmentation; c-d) the ink matching with two reference words; e) the index associated to the word.
of strokes with similar shapes between the inks of a pair of words. The algorithm adopts a multi-scale representation of the ink, where the scale is represented by the number of strokes of the sequences we want to match, computes a shape similarity measure between the sequences at each scale, and finally adopts a saliency map to combine the similarity measures across the scales and decide which are the matching sequences of strokes between the two inks [5]. As regards the labeling algorithm, it has been developed under the assumption that a set of reference words is available, i.e. words whose strokes are labeled with a symbol representing the character to which they belong in that word. This is achieved by estimating the distribution of the number of strokes for each character class, $p(n_i|c_i)$, $i = 1, \dots, C$ (where $n_i$ and $c_i$ are, respectively, the number of strokes and the character class label), from a Training Set (TS) for which the stroke labels have been manually entered. Once the distributions have been estimated, given the string of characters $c_1 c_2 \dots c_M$ of the reference, the actual number of strokes $n_i$ to be assigned to each character $c_i$ can be computed by maximizing

$$\sum_{i=1}^{M} p(n_i \mid c_i) \tag{1}$$

with the constraint that

$$n_1 + n_2 + \dots + n_M = N_R \tag{2}$$

where $N_R$ is the total number of strokes extracted from the ink of the reference. The distributions $p(n_i|c_i)$ have been estimated by using a variant of the algorithm described in [6], while details on the algorithm developed for solving the maximization problem can be found in [7]. At the end, for each reference an index is built, made of as many symbols as the number of strokes found in the ink, each one representing the character to which the stroke belongs, as shown in fig. 1.c)-d). Thus, the indexing is achieved as follows. The data, i.e. the word whose index is being built, is matched with each reference and, once the matching sequences have been found, the label of each stroke of the reference is copied to the matching stroke of the data. Since each data word is matched with all the references, it may happen that one of its strokes receives different labels. This is not unusual because, due to shape variability in handwriting, pieces of ink similar to those of the data may be found in different references with different labels. In this case, the final label for the stroke of the data is the most frequent one among those assigned to it. Eventually, each data word is associated with a string of as many symbols as its strokes, where each symbol is either a character or the symbol “-”, meaning that the corresponding stroke did not receive any label. Fig. 1.e) shows an example of such an index.
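The label transfer described above can be illustrated with a minimal sketch. It assumes that the ink matcher has already produced, for one data word, a list of (stroke position, label) pairs copied from the matching reference strokes; the function name `vote_labels`, the input format and the toy matches are hypothetical and are not taken from the system described here. The tie-breaking rule (the label seen first wins) is also an assumption, since the text does not specify one.

```python
from collections import Counter

UNLABELED = "-"  # symbol for strokes that received no label


def vote_labels(num_strokes, transferred):
    """Build the index of a data word by majority vote.

    num_strokes: number of strokes in the ink of the data word.
    transferred: iterable of (stroke_position, label) pairs, one for each
                 label copied from a matching reference stroke.
    Returns a string with one symbol per stroke.
    """
    votes = [Counter() for _ in range(num_strokes)]
    for position, label in transferred:
        votes[position][label] += 1

    symbols = []
    for counter in votes:
        if counter:
            # Most frequent label; on ties the label seen first is kept
            # (an assumption of this sketch).
            symbols.append(counter.most_common(1)[0][0])
        else:
            symbols.append(UNLABELED)
    return "".join(symbols)


# Toy example: stroke 1 receives "p" twice and "r" once,
# strokes 2 and 4 are never matched.
example_matches = [(0, "p"), (1, "p"), (1, "r"), (1, "p"), (3, "o")]
example_index = vote_labels(5, example_matches)
```

Keeping the unmatched strokes in the index as “-” preserves the one-symbol-per-stroke length of the string, which the retrieval step relies on when comparing lengths.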
2.2 Word retrieval
The retrieval step is based on an elastic string matching between the string representing the query word typed in by the user and the index of each data word. For this purpose, the query is stretched by repeating each of its characters a number of times corresponding to the mean value of the distribution $p(n_i|c_i)$, estimated as described previously. The matching is elastic in the sense that the length $L_Q$ of the stretched query string is compared with the length $L_D$ of the index of the current data word. If the difference $|L_Q - L_D|$ is larger than the threshold $T_L$, it is assumed that the data word does not match the query, and it is skipped. Otherwise, the query is shortened or elongated so as to have $L_Q = L_D$ while maximizing the same quantity as in eq. (1). Then, the score $S$ of the matching is computed as the ratio between the number of positions at which the two strings have the same label and $L_Q$, normalized to the interval [0,100]. No penalty is applied to mismatches. If $S$ is larger than the threshold $T_S$, the data word is assumed to match the query, and its bitmap is retrieved from the data set (DS) and shown to the user as a result of the query. Otherwise it is skipped, and the whole procedure is iterated on the next word of DS. At the end, the retrieved words are shown on the screen ranked by score in descending order, for the user to perform the final selection. Since we expect false positives to receive lower scores than true positives, such an arrangement should help reduce the time spent by the user on the final selection. Fig. 2 shows an example of the output of the system.
Figure 2: An example of the final output provided by the system when the query word is “problemi” (Italian for problems).
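The elastic matching of Section 2.2 can be sketched as follows. This is only an illustration under stated assumptions, not the algorithm of [7]: the maximization of eq. (1) under the constraint of eq. (2) is solved here with a plain dynamic program, stroke counts that do not appear in a distribution are treated as impossible, the mean is rounded to the nearest integer, and the stretched query is kept unchanged when the two lengths already agree. The names `allocate_strokes`, `stretch`, `match_score` and the toy distributions in `DIST` are hypothetical.

```python
NEG_INF = float("-inf")


def allocate_strokes(chars, dist, total):
    """Pick n_i for each character so that sum_i p(n_i|c_i) is maximal
    and n_1 + ... + n_M == total. dist[c] maps a stroke count n to p(n|c).
    Returns the list of counts, or None if no allocation is feasible."""
    best = [NEG_INF] * (total + 1)  # best[j]: best value using j strokes
    best[0] = 0.0
    back = []  # back[i][j]: strokes given to character i when j are used
    for c in chars:
        new = [NEG_INF] * (total + 1)
        choice = [0] * (total + 1)
        for used, value in enumerate(best):
            if value == NEG_INF:
                continue
            for n, p in dist[c].items():
                if used + n <= total and value + p > new[used + n]:
                    new[used + n] = value + p
                    choice[used + n] = n
        best = new
        back.append(choice)
    if best[total] == NEG_INF:
        return None
    counts, j = [], total
    for choice in reversed(back):  # walk back from the last character
        counts.append(choice[j])
        j -= choice[j]
    return counts[::-1]


def stretch(query, dist):
    """Repeat each character as many times as the rounded mean of p(n|c)."""
    parts = []
    for c in query:
        mean = sum(n * p for n, p in dist[c].items())
        parts.append(c * max(1, round(mean)))
    return "".join(parts)


def match_score(query, index, dist, t_len):
    """Score in [0, 100] of a query against a word index, or None when the
    word is skipped (length gap above t_len or no feasible resizing)."""
    stretched = stretch(query, dist)
    if abs(len(stretched) - len(index)) > t_len:
        return None
    if len(stretched) != len(index):
        counts = allocate_strokes(query, dist, len(index))
        if counts is None:
            return None
        stretched = "".join(c * n for c, n in zip(query, counts))
    # Count positions with the same label; mismatches carry no penalty.
    hits = sum(a == b for a, b in zip(stretched, index))
    return 100.0 * hits / len(index)


# Toy stroke-count distributions (invented for illustration only).
DIST = {
    "a": {1: 0.2, 2: 0.6, 3: 0.2},
    "b": {2: 0.3, 3: 0.7},
}
score_same_length = match_score("ab", "a-bbb", DIST, t_len=2)
score_resized = match_score("ab", "aabbbb", DIST, t_len=2)
```

A word would then be retrieved when its score exceeds the score threshold, and the retrieved words sorted by score in descending order before being shown to the user.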
3. Experimental results
The training set (TS), the reference set (RS) and the data set (DS) were extracted from a collection of 1,000 forms, each produced by a different writer, containing approximately 160,000 words. It is worth mentioning that the forms used in the experiments reported below were not collected by us, but made available by a sponsoring organization. In this study, TS contains 10 forms with approximately 1,000 words, RS is made of 90 forms and contains about 13,000 words, while DS is made of 15 forms containing about 2,500 words. For performance evaluation, the ground truth for the words of DS was provided manually by three subjects and cross-checked for error correction. Each word of both RS and DS has been processed as described in Section 2.1 and shown in fig. 1. The performance of the system on DS is reported in fig. 3. By correct (C) we mean the retrieved words whose transcripts match the query exactly, while almost correct (AC) denotes retrieved words whose transcripts differ from the query by two characters at most, and error (E) the remaining ones. AC has been introduced to deal with prefixes and suffixes, which in Italian are typically less than two characters long. The results reported in fig. 3 have been obtained by averaging C, AC and E over the responses to 100 different queries. As regards the thresholds, $T_L$ is computed as min(6, L/3), where L represents the number of strokes of the reference word to be matched with the query, and $T_S$ ranges from 30 to 50. It is worth noticing that correct words are always ranked at the top of the list provided by the system, and that there is a large difference between the scores of correct and almost correct words and those of the error words that may be present in the ranked list. In another experiment, we presented to the system 100 queries, none of which had a cursive specimen in DS. For those queries, the system retrieved, on average, fewer than 20 words, but the difference between the score of the word at the top of the list and that of the one at the bottom is much smaller than in the previous experiment. The last issue we dealt with during the experiments is the computational cost of the method. It took about 40 s to index a word, while answering a query takes on average 300 ms. Finally, the user spent about 15 s to get the desired output.
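The C, AC and E categories used in the evaluation can be made concrete with a short sketch. The text does not state how the character mismatch between a transcript and the query is counted; the sketch assumes the Levenshtein edit distance, which is only one possible reading. The function names and the toy transcripts are hypothetical and are not experimental data.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def classify(query, transcript):
    """Return 'C' (exact match), 'AC' (at most two characters off) or 'E'."""
    distance = edit_distance(query, transcript)
    if distance == 0:
        return "C"
    if distance <= 2:
        return "AC"
    return "E"


def summarize(query, transcripts):
    """Percentage of retrieved words falling in each category."""
    counts = Counter(classify(query, t) for t in transcripts)
    return {k: 100.0 * counts[k] / len(transcripts) for k in ("C", "AC", "E")}


# Invented transcripts of words retrieved for the query "problemi".
example_summary = summarize(
    "problemi", ["problemi", "problema", "casa", "problemi"]
)
```

Averaging such per-query percentages over a set of queries would give figures of the kind reported in fig. 3.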
4. Conclusion and future work
We have proposed a system for the retrieval of handwritten cursive documents, especially tailored to small and medium-sized organizations whose applications can trade errors in document retrieval for cost, on condition that the errors can be corrected easily, effectively and as cheaply as possible by human intervention. The proposed solution does not use any recognition technology, but is based on matching the character string of the query with those associated with each word image during an off-line indexing phase.
The experimental results reported in the previous section confirm that pursuing the “human in the loop” approach in document processing is a viable solution for building cursive document retrieval systems that avoid both expensive recognition technologies and costly human correction of character-level recognition errors while providing satisfactory performance. Future work will concentrate on improving the indexing, namely the stroke labeling, by adopting a more accurate probabilistic model for labeling a stroke in the case of multiple matches, and on exploiting the scores of the ranked list for better performance.
Figure 3: Performance of the system. C, AC and E are expressed as percentage of retrieved words.
References
[1] G. Nagy, Twenty years of Document Image Analysis in PAMI, IEEE Trans. on PAMI, 22(1), 2000, 38-62.
[2] G. Boccignone, A. Chianese, L.P. Cordella and A. Marcelli, Recovering Dynamic Information from Static Handwriting, Pattern Recognition, 26(3), 1993, 409-418.
[3] A. Clavelli, C. De Stefano, A. Marcelli, Handwriting Generation for Writing Order Recovery, Proc. IGS’07, Melbourne (Australia), November 11-14, 2007, 32-35.
[4] C. De Stefano, G. Guadagno, A. Marcelli, A saliency-based segmentation method for on-line cursive handwriting, IJPRAI, 18(6), 2004, 1139-1156.
[5] C. De Stefano, M. Garruto, L. Lapresa and A. Marcelli, Detecting Handwriting Primitives in Cursive Words by Stroke Sequence Matching, in: Advances in Graphonomics, A. Marcelli and C. De Stefano (eds.), June 2005, 281-285.
[6] Y. Xu and G. Nagy, Prototype Extraction and Adaptive OCR, IEEE Trans. on PAMI, 21(12), 1999, 1280-1296.
[7] R. Senatore, Where are the characters? A Bayesian approach to stroke labeling of cursive words, M.Eng. Thesis, University of Salerno, 2006 (in Italian).