FormPad: A Camera-Assisted Digital Notepad

Tanveer Syeda-Mahmood, Thomas Zimmerman
IBM Almaden Research Center, 650 Harry Road, San Jose, CA, USA 95120
{stf, tzim}@almaden.ibm.com

Abstract. A camera-assisted digital writing tablet was invented recently. It preserves the familiar experience of filling out a paper form while allowing automatic conversion of relevant handwritten field entries into electronic form, without explicit form scanning. In this paper, we focus on two key computer vision problems associated with the invention of this device, namely, form indexing and field projection. These are needed for accurate association of tablet writing with corresponding entries in the electronic form. Form indexing is modeled as the problem of shape-based content retrieval using the perspectively-distorted form appearances seen from the tablet camera. Fast form indexing is achieved using geometric hashing based on projective invariants. The invariants derived from curve and line features reduce the basis search space considerably while still providing for robust localization. We derive field projection as a sequence of projective transformations between the tablet, the camera and the original electronic form coordinates. Results of extensive testing on a medical form database are reported.

1. Introduction

Paper-based forms are ubiquitous in hospital environments. With the high volume of forms being scanned, and the difficulty of recognizing handwriting in filled form entries, most electronic record systems simply store the form images, with the field label information entered manually. A camera-assisted writing tablet called the FormPad was invented recently to enable direct electronic conversion of form entries. Unlike other digital notepad-like devices such as the CrossPad, FormPad can recognize the form and accurately project the filled entries against their correct field labels. Such digital notepads are a low-cost alternative to tablet PCs for routine form filling. They also preserve the familiar experience of filling in a paper form without disturbing the existing workflow of end users. The FormPad device is a conventional clipboard with a pen digitizing tablet [8] underneath and a VGA digital camera [9] with a fish-eye lens (64 x 86 degrees) attached to the metal clip of the clipboard. A wireless inking pen allows the user to write directly on the form, while the digitizer captures pen coordinates and pen tip pressure. Thus, form-filling actions are recorded as online handwriting signals by the tablet. In order to use the data from the tablet, however, the identity of the form being filled must be known. Further, the handwritten data must be correctly registered against the relevant entries in the electronic form. Accurate form identification and field projection using cameras is a challenging problem. Ease-of-use considerations require that the camera be placed in unobtrusive locations on the notepad, leading to significant perspective distortion in the captured images. Also, since the camera is very close to the imaged object (i.e., the form), weak-perspective projection models do not hold, requiring the use of full projective transforms. Existing methods for recognition under perspective projection are either compute- or space-intensive, requiring at least 5-point correspondences and raising the complexity to O(N^5). Even after the correct form image has been identified, pose registration errors, if present, can lead to the tablet data being recorded against the wrong field label in the electronic form. Thus, careful analysis of the geometric relationships between the tablet, the camera and the electronic-form coordinates must be performed.

Figure 1: Illustration of field projection of tablet writing. (a) Original form. (b) A filled form. (c) The reference model for the form of (a) as seen through the camera. (d) The camera appearance of the form of (b) before filling; note the skew in this appearance. (e) Tablet writing corresponding to the filled entries. (f) Tablet writing projected into the electronic form of (a) using our method of field projection.

2. Related Work

The technology we exploit in FormPad builds on prior work on object indexing and form recognition, for which a large body of literature already exists. In particular, recognition of scanned forms has been addressed by a number of researchers [6, 7, 10, 11, 12-16], and the technology has matured into many products including AccuForm, CharacTell, iRead, ReadSoft, etc. Several low-level form processing and feature extraction methods [10, 11] exist, including those that analyze layout [7, 17], fields [14], and hand-filled entries [11, 12]. Registration methods based on projective geometry have been used for scanned form alignment and recognition [15]. While almost all form recognition work assumes scanned forms, the only significant work on camera-grabbed forms we found was the document imaging camera system ScanWorks by Xerox [16]. The focus of this system is on image processing of the document to filter, de-skew and produce a better document appearance, rather than on form identification and automatic field extraction. The predominant techniques for identifying the form type use bar codes or OMR technology. The recognition of printed text on forms is handled fairly well by commercial OCR engines, and most OCR vendors also offer their engines bundled into form recognition software. The recognition of handwritten text, however, is still a difficult problem even for scanned forms. The work on form indexing we report is based on the technique of geometric hashing, previously introduced for the model indexing problem in computer vision [2]. Several variants of this technique have appeared in the literature, including line hashing [3], where the basis space is formed from lines, location hashing [4], region hashing, hashing based on projective invariants [1], etc. While geometric hashing using affine-invariant features has found some practical applications, much of the work on geometric hashing using projective invariants has remained largely academic. Building practical embedded systems using such techniques has been a challenge due to the large number of combinations of basis features that need to be retained per model, and their sensitivity to noise and occlusions. Thus, building practical form recognition systems using geometric hashing requires an intelligent choice of basis features that reduces the time and space complexity while still giving effective recognition.

3. Form Recognition

We now turn to the problem of form identification, which can be stated as follows. Given a sample form C' seen by the FormPad camera, determine the original form O corresponding to C', using the reference appearance images C in the database. In practice, since the number of forms in the database is large, and live form processing is desirable (as manual on-the-spot correction of form entries by the FormPad user may be required), it should be possible to identify the original form without exhaustively searching the form database. To recognize the original form O corresponding to the given sample form on the tablet, it is sufficient to determine whether the associated reference form C in the model database and C' are two views of O. Since forms are planar objects, and since the distance between the camera and the form is smaller than the form dimensions, the relation between the two views C and C' is a projective transform P. That is, a point (x, y) in C' and its corresponding point (x', y') in C are related by

x' = \frac{p_{11} x + p_{12} y + p_{13}}{p_{31} x + p_{32} y + 1}, \quad y' = \frac{p_{21} x + p_{22} y + p_{23}}{p_{31} x + p_{32} y + 1}    (1)

where the coefficients are elements of the projective transform P given by

P = \begin{pmatrix} p_{11} & p_{12} & p_{13} \\ p_{21} & p_{22} & p_{23} \\ p_{31} & p_{32} & 1 \end{pmatrix}    (2)

It is well known that the above 8-parameter projective transform for planar objects can be recovered from a set of 4 corresponding points through a linear system of equations. Once the projective transform is recovered, it can be verified by projecting the rest of the sample form features into the model form and noting the fraction of sample form features that fall near model form features.
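To make the recovery step concrete, here is a minimal sketch in Python with NumPy (the function names and the pixel tolerance are our illustrative choices, not the paper's). It sets up the 8x8 linear system implied by Equations (1) and (2) from 4 point correspondences, and computes the verification fraction described above:

import numpy as np

def solve_homography(src, dst):
    # Recover the 8-parameter projective transform of Eqs. (1)-(2)
    # from 4 correspondences (x, y) -> (x', y').
    A, b = [], []
    for (x, y), (xp, yp) in zip(src, dst):
        # x' (p31 x + p32 y + 1) = p11 x + p12 y + p13
        A.append([x, y, 1, 0, 0, 0, -xp * x, -xp * y]); b.append(xp)
        # y' (p31 x + p32 y + 1) = p21 x + p22 y + p23
        A.append([0, 0, 0, x, y, 1, -yp * x, -yp * y]); b.append(yp)
    p = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(p, 1.0).reshape(3, 3)  # P with p33 = 1

def project(P, pts):
    # Apply P to an (n, 2) array of points via homogeneous coordinates.
    h = np.c_[pts, np.ones(len(pts))] @ P.T
    return h[:, :2] / h[:, 2:]

def verify(P, sample_feats, model_feats, tol=5.0):
    # Fraction of projected sample features within tol pixels of some model feature.
    d = np.linalg.norm(project(P, sample_feats)[:, None, :]
                       - np.asarray(model_feats)[None, :, :], axis=2)
    return float(np.mean(d.min(axis=1) <= tol))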

3.1 Form hashing

Because of their text and graphical content, form images tend to have a large number of features, for example, 3000 corners and 2000 lines. If each model form in the database were exhaustively searched, this would take O(m^4 n^4 N) time, where m and n are the average numbers of features per model form and sample form respectively, and N is the number of forms in the database. The relevant forms can be identified without detailed search using the principle of geometric hashing for form indexing. The basic principle is well known, and involves recognizing an object by verifying that a sufficient number of object features have the same pose-invariant coordinates with respect to some chosen basis frame [2]. Detailed search is avoided by pre-computing pose-invariant feature information, and indexing the recorded data structure using pose-invariant features derived from the current form on the tablet. To provide robustness to occlusions and noise, many more basis frames may have to be used, leading to a large number of redundant features. Much of the space complexity of indexing by geometric hashing is accounted for by these additional basis frames and the pose-invariant features so derived. Our form indexing is based on the area cross-ratio, a projective invariant given by

\tau(A, B, C, D, E) = \frac{P(A, B, C)\, P(A, D, E)}{P(A, B, D)\, P(A, C, E)},    (3)

where A, B, C, D form the basis frame and E is a new point on the planar object; P(A, B, C) is the area of the triangle with vertices A, B and C, as shown in Figure 2a.
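As an illustration, Equation (3) transcribes directly into code (a sketch in the same Python setting; tri_area is a hypothetical helper computing signed triangle area via the 2-D cross product):

def tri_area(a, b, c):
    # Signed area of triangle abc: half the 2-D cross product of ab and ac.
    return 0.5 * ((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]))

def area_cross_ratio(A, B, C, D, E):
    # Projective invariant of Eq. (3) for basis frame (A, B, C, D) and point E.
    return (tri_area(A, B, C) * tri_area(A, D, E)) / \
           (tri_area(A, B, D) * tri_area(A, C, E))

Because a projective transform scales each triangle area by a factor that cancels between numerator and denominator, mapping all five points through project() from the earlier sketch leaves area_cross_ratio unchanged up to numerical error.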

Figure 2: Illustration of 5-point cross-ratios. (a) Cross-ratios from arbitrary 5 points. (b) Cross-ratios from carefully chosen basis frames derived from curves and lines.

If we retain all possible basis frames to provide robustness to noise and occlusions, the space complexity of hashing is very large. In fact, for N = 500 corner features on the object, the basis frames and projective invariants computed would be O(N^5), or about 1000 terabytes, a very large hash table indeed! Furthermore, by choosing features from all over the form image to form the basis frame, the chance of false positives increases, with many spurious matches. Both these issues can be avoided if we generate the basis frames carefully to reduce their number, while choosing features that better capture shape-specific information. For this reason, we choose features from curves to form basis frames. Curves are well-known grouping units that capture shape information present in the model object. Further, due to the order present in curves, the number of basis features can be reduced. Using this rationale, we generate the basis frame as follows.

3.2 Basis frame generation

We take corner features from curves to form candidates for point A in the basis frame. Feature points B and C are then taken to be the adjacent corners in the ordered sequence of features along the curve. A fourth point is derived from the intersection of a line anywhere in the image with AB or AC, as shown in Figure 2b. Note that at least one intersection point D or D' exists, since the line cannot be parallel to both lines emanating from a corner. Since there are two possible directions along which a curve can be traversed, we consider both intersection points D and D', if available, to form two basis frames, (A, B, C, D) and (A, B, C, D'). Any new feature point E can now be expressed in terms of the basis frame (A, B, C, D) through its projective invariant as defined in Equation (3). The labeling of feature points uses the convention of A for a corner on the curve, B and C for the adjacent corners on either side, and D and D' for the intersections of a line with AB and AC respectively. The ambiguity in matching caused by permutations of feature points is thus reduced by this naming convention. Using consecutive features along the curve makes the choice of 3 of the basis-frame features linear in the number of features along the curve. However, this can yield too narrow a basis frame when the features are close together, leading to instabilities in pose computations. By choosing the fourth basis point from an arbitrary line in the image, the basis frame is widened to allow robust computation of projective invariants. Using this method of basis frame generation, the number of basis frames is O(N^2), with the total number of projective invariants using all features in the image being O(N^3). Using 500 features as the typical example as before, the size of the hash table per form will now be O(10^9), or about 1 GB, a reduction by a factor of 10^6!
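Under the conventions just described, the basis-generation step might look as follows (a sketch under our assumptions: curve corners arrive as ordered (x, y) tuples and image lines as point pairs; line_intersection is a hypothetical helper):

def line_intersection(p1, p2, q1, q2):
    # Intersection of the line through p1, p2 with the line through q1, q2,
    # or None if the lines are (nearly) parallel.
    d1 = (p2[0] - p1[0], p2[1] - p1[1])
    d2 = (q2[0] - q1[0], q2[1] - q1[1])
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-9:
        return None
    t = ((q1[0] - p1[0]) * d2[1] - (q1[1] - p1[1]) * d2[0]) / denom
    return (p1[0] + t * d1[0], p1[1] + t * d1[1])

def basis_frames(curve_corners, image_lines):
    # For each corner A with curve-adjacent neighbors B and C, intersect an
    # image line with AB and AC to obtain D and D', yielding up to two
    # basis frames (A, B, C, D) and (A, B, C, D') per corner-line pair.
    frames = []
    for i in range(1, len(curve_corners) - 1):
        B, A, C = curve_corners[i - 1], curve_corners[i], curve_corners[i + 1]
        for (q1, q2) in image_lines:
            for other in (B, C):                 # line AB, then line AC
                D = line_intersection(A, other, q1, q2)
                if D is not None:
                    frames.append((A, B, C, D))
    return frames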

3.3 Form indexing

Form hashing involves two stages, namely model creation and form indexing. In the model creation stage, curves are extracted from reference form images using a technique described in earlier work [4]. Points (corners) and lines are extracted from the curves through line-segment approximation. Basis frames are generated as described above, and the projective invariant of Equation (3) is computed for all corner features in the form images. In generating the basis frames, we traverse the curves in both directions to account for reversal of ordering during query processing. The resulting information is recorded in a hash table as

H(\tau) = \{ \langle \mathrm{Basis}_i, F_j \rangle, \ldots \}    (4)

where Basis_i are the basis frame coordinates and F_j is the form index. For a given sample form on the tablet, features are extracted through a similar process and candidate basis frames are derived. The projective invariants computed for all feature points on the sample form are used to index the hash table, and a histogram of basis indexes is taken. The form index corresponding to the peaks identifies the relevant form. Since hashing only indicates likely matches, a detailed verification step is still needed to confirm the presence of a reference form, using the 4-point correspondences generated from the matching basis frames. The fraction of sample form features that project close to a model feature constitutes the verification measure. Such features can also be taken as additional corresponding points for robust computation of the actual projective transform of Equation (1) for form registration later. Although the same cross-ratio can be derived from many combinations of basis frames, using the constrained basis generation process ensures that hashing points to related shapes in the form database.
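Given the pieces above, the two stages reduce to a few lines (a minimal sketch; the quantization step QUANT and the bin layout are our assumptions for tolerating feature noise, not parameters reported in the paper):

from collections import defaultdict

QUANT = 0.05  # hypothetical quantization step for the invariant tau

def tau_key(A, B, C, D, E):
    # Quantized invariant of Eq. (3), or None for degenerate configurations.
    try:
        return round(area_cross_ratio(A, B, C, D, E) / QUANT)
    except ZeroDivisionError:  # E collinear with a basis pair
        return None

def build_hash_table(models):
    # Model creation: models is a list of (form_id, frames, points), with
    # frames produced by basis_frames() on the reference camera image.
    H = defaultdict(list)
    for form_id, frames, points in models:
        for (A, B, C, D) in frames:
            for E in points:
                key = tau_key(A, B, C, D, E)
                if key is not None:
                    H[key].append(((A, B, C, D), form_id))  # Eq. (4) entry
    return H

def index_form(H, frames, points):
    # Indexing: histogram votes over (basis, form) entries; the peak names
    # the candidate form, to be confirmed by detailed verification.
    votes = defaultdict(int)
    for (A, B, C, D) in frames:
        for E in points:
            key = tau_key(A, B, C, D, E)
            if key is not None:
                for entry in H.get(key, []):
                    votes[entry] += 1
    return max(votes, key=votes.get) if votes else None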

4. Field projection

We now turn to the field projection problem, which is illustrated in Figure 1. Figure 1e shows the handwriting recorded on the tablet in terms of tablet coordinates. The actual filled form is shown in Figure 1b, and the electronic form in Figure 1a. In order to capture the handwritten entries against the appropriate field labels in the electronic form of Figure 1a, we need to project the fields as recorded in the tablet writing of Figure 1e into the electronic form of Figure 1a. Since the form on the tablet can be placed with skew, it is not possible to perform the field projection of the tablet coordinates without using camera-generated information. Our method of field projection exploits the geometric relationships between the tablet coordinates, the form coordinates in the camera, and the original electronic-form coordinates. Let O be the original electronic form. Let C be the image of the printed version of this form as seen through the camera on FormPad when placed within fixed alignment reference markers for model generation. Let T be the tablet frame. Let C' be the image of a sample form (possibly inserted with a skew) that is currently being filled by a user. The problem of field identification is to convert the tablet coordinates corresponding to the image C' into the original form coordinates O. As shown in Figure 3, the projection of handwriting coordinates (x_t', y_t') into their corresponding field location (x_0, y_0) on the original form involves a sequence of transforms T' -> C' -> C -> T -> O.

Figure 3: Illustration of the sequence of transformations needed for field projection.

The relationship between tablet coordinates and original form coordinates (T -> O) can be modeled by an affine transform P_{TO}. For all electronic forms of the same size, say 8.5 x 11 in, and using a systematic generation of the original form image (by PDF-to-TIFF conversion software, for example), such a transform need only be computed once. To use this transform directly for any paper form placed on the tablet, though, we print the electronic form and place it on the clipboard within fixed reference markers. This ensures that all forms are subject to the same reference model creation process, and allows the use of a single alignment transform from tablet to original form. In camera coordinates, the form skew (C' -> C) can simply be estimated by the projective transform P_{C'C} computed during the form indexing process. Because of the close positioning of the camera on the tablet, the relationship between the camera and the tablet is also modeled by a projective transform: P_{TC} from tablet to camera (T -> C) and P_{CT} from camera to tablet (C -> T). Since the camera is fixed, these are computed only once per tablet during a factory calibration stage. The overall transformation can thus be modeled as a sequence of projections,

P = P_{TC} \circ P_{C'C} \circ P_{CT} \circ P_{TO},    (9)

of which only the transform P_{C'C} needs to be computed dynamically for each form filling.
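In code, the chain of Equation (9) is a single 3x3 matrix product computed once per filled form (a sketch; the variable names are ours, with P_cpc standing for the dynamically estimated skew transform P_{C'C}):

import numpy as np

def field_projection_matrix(P_tc, P_cpc, P_ct, P_to):
    # Eq. (9), column-vector convention: the rightmost factor acts first,
    # so pen coordinates go tablet -> camera -> de-skewed camera -> tablet -> form.
    return P_to @ P_ct @ P_cpc @ P_tc

def project_strokes(P, strokes):
    # Map an (n, 2) array of tablet pen coordinates into electronic-form
    # coordinates via homogeneous coordinates.
    h = np.c_[strokes, np.ones(len(strokes))] @ P.T
    return h[:, :2] / h[:, 2:]

Only P_cpc changes from one filled form to the next; the other three factors are fixed by factory calibration and the one-time reference model creation.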

Note that the handwriting on the tablet is not required to be visible to the camera, since the camera-to-tablet transform is pre-defined and can be applied as long as the reference model creation process was consistent in its use of the alignment markers.

5. Results

Form indexing and field projection were tested on a large medical form database. These forms were collected from actual forms used in hospitals, as well as forms available on the internet. The current collection has 180 electronic forms and 200 scanned forms, and is growing rapidly. Models of the forms were created by placing the forms on the FormPad, as well as by screen-grabbing the electronic documents to make the original form images. Some original forms were obtained by scanning the available printed forms. To test form indexing using form hashing, we recorded 5-13 different appearances for each of the forms assembled above, increasing our model database size to 1800. The form images were processed for edge detection using the Canny edge detector. Curves were extracted using a procedure previously described in [4], and corners and lines were assembled. The basis generation process described earlier was used to record the hash table entries for form indexing. We tested the performance of form indexing by querying 40 sample forms against the 1800-form model database. The precision-recall results are shown in Table 1. As can be seen, form indexing retains good identification accuracy while still containing the number of false positives. It can also be seen from this table that retaining only well-separated basis frames has not degraded the recognition performance. Our experiments showed an average precision of 75.21% and an average recall of 92.5%. Table 2 shows the time performance of form hashing in comparison to an actual search for matching triples during object recognition. As can be seen, pre-computing the features improves the time performance by several orders of magnitude. The average number of features per model was 424.62 for our model database. The column on all possible basis triples uses O(N^5 M^5) for its calculation over model and image features, as a straw-man comparison. The search is listed in terms of the number of basis triples explored, since the number of pose-invariant features computed is the same in both approaches. By using well-separated basis frames derived from curves instead of all possible basis frames, the storage requirement was reduced remarkably. With an average of 424.62 model features, the size of the hash table was roughly 424.62 x 424.62 x 424.62, or about 1 GB. As a result, the hash table for all 1800 forms in the model database was stored entirely in main memory.

5.1 Field projection results

To test field projection, the sample forms were derived from the electronic forms whose models were already available. Twenty subjects were recruited to write on the sample forms using the FormPad. The subjects were asked to follow their normal writing process as if on a clipboard. Thus, considerable skew could be present in the sample forms due to inexact insertion into the clipboard. Using the alignment process described in Section 4, the skewed writing on the tablet was projected onto the electronic form to identify nearby field labels. We now illustrate the results of field projection. Figure 1e shows digitizer tablet output from writing on the sample form of Figure 1a. Using the sequence of projective transforms, the writing of Figure 1e was projected onto the original form of Figure 1a. The resulting image is shown in Figure 1f (please zoom in on the images for better viewing). As can be seen, 8 of the 9 text regions are projected close to their correct field labels. In addition, there is a close resemblance between the automatically projected writing and its actual physical appearance as shown in Figure 1b. Since all pose computations were done using minimal features, there is some error in the pose computations, leading to alignment errors at the edges, as seen for the phone number field. With a higher number of features for alignment, such pose errors can be reduced. It is interesting to note, however, that there is a built-in tolerance to pose errors in the field extraction problem, due to the finite space left on the form for the field entries. As long as the projected text is within the space provided for the content against the field label, the correct label can still be recovered. Such tolerance is generally greater for the x than the y coordinate. Even so, a neighborhood search of the field labels may still have to be performed. To test the performance of field projection, we measured the pixel difference between the projected tablet writing and the corresponding field label. Text projected within +/- 5 pixels was taken as a correct projection. The handwriting data was collected by filling out a total of 80 sample forms showing varying amounts of skew in the image. Each form had 3-10 regions filled, including those near the bottom of the form page not visible to the camera. The results of field identification for the writing tested are shown in Table 3. As can be seen, a large fraction of the tablet text projects within +/- 5 pixels of the original field label. The field identification performance is sufficient for further post-processing using attribute label extraction and online OCR to successfully populate an electronic medical record.

6. Conclusions

In this paper we have described methods for form indexing and field projection that enable rapid paper-to-electronic form conversion without the explicit need to scan filled forms or manually populate electronic medical records.

Query Form | Actual Occurrences | Matches Retrieved | False Matches | Correct Matches
1          | 10                 | 13                | 5             | 8
2          | 4                  | 7                 | 3             | 4
3          | 8                  | 10                | 3             | 7
4          | 13                 | 13                | 3             | 10
5          | 9                  | 14                | 6             | 8
6          | 11                 | 15                | 4             | 11

Table 1: Illustration of precision-recall for form indexing.

Query | Query Features | Retained Features | Retained Triples | Search, All Possible Triples (x 10^14) | Search Using Form Hashing
1     | 2032           | 445               | 445              | 67.3                                   | 445
2     | 1453           | 230               | 230              | 9.25                                   | 230
3     | 2240           | 760               | 760              | 335                                    | 760
4     | 970            | 340               | 340              | 30.1                                   | 340

Table 2: Time reduction due to indexing.

Query | # of writing segments | Number correctly projected | Number projected within +/- 10 pixel error | Number projected within +/- 20 pixel error
1     | 8                     | 6                          | 2                                          | 0
2     | 10                    | 8                          | 1                                          | 1
3     | 14                    | 12                         | 2                                          | 0
4     | 15                    | 10                         | 2                                          | 1
5     | 4                     | 4                          | 0                                          | 0
6     | 17                    | 13                         | 2                                          | 3

Table 3: Field identification pixel error results. Most incorrectly projected entries are still within a +/- 10 pixel error.

REFERENCES
[1] D. Jacobs, "The space requirements of indexing under perspective projections," IEEE Trans. PAMI, 1996, pp. 330-333.
[2] Y. Lamdan, J. Schwartz, and H.J. Wolfson, "Object recognition by affine-invariant matching," Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 335-344, 1988.
[3] F.D. Tsai, "Geometric hashing with line features," Pattern Recognition, 27:377-389, 1994.
[4] T.F. Syeda-Mahmood, "Locating indexing structures in engineering drawing databases using location hashing," CVPR 1999, pp. 1049-1055.
[5] W.E.L. Grimson, "On the sensitivity of geometric hashing," ICCV 1997.
[6] J. Mao, M. Abayan, K. Mohiuddin, "A model-based form processing subsystem," ICPR '96, pp. 691-695, 1996.
[7] T. Watanabe, Q. Luo, N. Sugie, "Layout recognition of multi-kinds of table-form documents," IEEE Trans. PAMI, vol. 17, no. 4, pp. 432-445, 1995.
[8] Wacom Graphire II Digitizer Tablet and Inking Pen, http://www.wacom.com/graphire/4x5.cfm
[9] Aiptek VGA PenCam, Irvine, CA 92618, www.aiptek.com
[10] A. Pizano, "Extracting line features from images of business forms and tables," ICPR '92, pp. 399-403, 1992.
[11] D.S. Doermann, A. Rosenfeld, "The processing of form documents," ICDAR '93, pp. 497-501, 1993.
[12] A.K. Chhabra, "Anatomy of a hand-filled form reader," Proc. IEEE Workshop on Applications of Computer Vision, pp. 195-204, 1994.
[13] F. Cesarini, M. Gori, S. Marinai, "A system for data extraction from forms of known class," ICDAR '95, Montreal, Canada, pp. 1136-1140, 1995.
[14] J.X. Yuan, Y.Y. Tang, C.Y. Suen, "Four directional adjacency graphs (FDAG) and their application in locating fields in forms," ICDAR '95, Montreal, Canada, pp. 752-755, 1995.
[15] R. Safari, N. Narasimhamurthi, M. Shridhar, "Document registration using projective geometry," ICDAR '95, Montreal, Canada, pp. 1161-1164, 1995.
[16] "Xerox mobile camera document imaging," http://www.ipvalue.com/technology/docs/Xerox_Mobile_Camera_Imaging_Document_Capture.pdf, 2004.
[17] T. Watanabe, Q. Luo, N. Sugie, "Structure recognition methods for various types of documents," MVA, vol. 6, no. 2-3, pp. 163-176, 1993.