Building Cameras for Capturing Documents

Stephen Pollard, Maurizio Pilu
HP Labs Europe, Bristol, BS34 8QZ, UK
December 9th, 2003

HP Restricted

Abstract

This paper explores those aspects of document capture that are specific to cameras. Each of them must be addressed in order to close the gap between taking a photograph of a document and capturing the document itself. We present results in five areas: (1) framing documents using structured light; (2) robustly dealing with ambient illumination when capturing glossy documents; (3) improving text quality when using mosaiced color sensors; (4) robust passive recovery of perspective and image plane skew using text flow; and (5) measuring and undoing page curl using structured light and an applicable surface model. The ultimate success of subsequent document recognition tasks will be heavily dependent on the successful completion of these tasks.

1 Introduction

Since working with documents involves both electronic and paper media, the need to transfer data across this divide is of increasing importance. While a variety of scanning technologies (e.g., flatbed, handheld, drum) are in common use, there is growing interest in the camera as a method for capturing documents. With digital technology developing at its present rate (the current crop of consumer cameras have 5-8 million pixels), it is clear that image sensors will provide the resolution necessary to capture the content of full-page documents. However, a number of important factors, in addition to raw resolution, must be overcome in order for documents to be captured effectively with digital cameras. We have examined a number of the most important limitations of digital photography for capturing document images, and in this paper we present an overview of this endeavor, with many of our methods and results presented for the first time.

In our studies we have considered both handheld and desk-mounted camera configurations. Clearly the constraints, limitations and advantages of each scenario are rather different, leading us to different solutions. For a desk-mounted camera the chief advantages of camera technology are the potential for a capture device with a very small footprint and the opportunity to capture documents face up. Furthermore, such a device can offer multi-functionality, potentially doubling as a presentation device or a video conferencing camera (see [1] for a discussion of additional opportunities for supporting collaboration with this and related concepts). For a handheld device the primary advantage can in large part be attributed to portability [2]. However, there is also a large degree of commonality between the two modalities. In each case there are a number of fundamental issues that must be addressed in order to capture document images rather than just digital photographs of documents. The ultimate success of more specific document and text recognition tasks will be heavily dependent on how well the document is captured in the first place.

First of all, even before we worry about the image itself, it is essential that we can accurately and easily frame the document of interest, preferably without perspective skew or image rotation. To this end we have explored the use of structured light to assist the framing process; in Section 2 we outline our approach and the results of user studies comparing it with traditional camera framing methods. Once we are in a position to capture the image there are still a number of issues that must be overcome. Primary amongst these is the issue of illumination. In comparison to a flatbed scanner, ambient lighting is very unpredictable and is especially troublesome for glossy documents with specular reflections.
In Section 3 we explore the thorny issue of document specularity and present a novel solution to the illumination problem that utilizes a dual flash/strobe. While sensor resolution is steadily increasing, it is always important to get the highest effective resolution possible when it comes to capturing text documents. At the same time it is desirable to capture color documents as well as monochrome. However, the goals of good color reproduction and maximum resolution are in direct conflict due to the way color is captured using a single mosaiced sensor. In Section 4 we show that an alternative treatment of the raw color data can lead to improved document resolution and improved OCR. Finally, we must deal with the image distortion that results from the unconstrained relationship between the document camera and the document surface. This has two major components. In Section 5 we present a robust method for dealing with perspective skew and image rotation based on the structure of the lines of text within a document, while in Section 6 we present a way of dealing with page curl using structured light.

Figure 1: DOE based framing aids for cameras. In (a) we illustrate the use of collimated laser optics and a diffractive optic element (DOE) to generate a light pattern. In (b) we illustrate two versions of the combination of a camera device and laser based framing aids. The top one uses a single device for framing the document while the bottom one is a distance-based configuration based on triangulation. The corresponding laser patterns are illustrated in (c).

2 Document Framing

One major limitation of existing digital cameras with respect to capturing documents is the apparent difficulty of the task. Existing framing methods, such as optical viewfinders, were developed for scenic capture with the head held erect and the task of roughly pointing the camera at the center of the scene. When it comes to capturing a document on a desktop this framing model is unsuitable and results, without extreme care, in poorly framed and perspectively distorted images.


2.1 Using a structured light pattern to aid framing

We have proposed [3] the use of structured light as an alternative to traditional forms of viewfinder. In our experiments the structured light patterns are generated using a collimated light source and a diffractive optical element (DOE). Figure 1-a illustrates the arrangement; as the laser light is shone through the DOE, complex interference generates a structured light pattern [4]. The details of the pattern are dependent upon the microstructure of the specific DOE used. Virtually any bi-level light pattern can be generated, from a simple line or grid of dots to a complex shape such as text or a company logo. In practice most commercial DOE pattern generators [5] tend to be rather small, with total fan angles of less than 30°. However, we were able to obtain specially fabricated DOEs from the Optical Interconnect Group of the Solid State Application Department of Hewlett Packard Laboratories in Palo Alto (now part of Agilent) with much larger fan angles (90°), which we used in some of our experiments. Figure 1-b also shows a number of example devices that are able to frame a document using structured light.

2.2 User studies

We have performed a number of user trials comparing different framing aids: an optical viewfinder, an LCD with a tilting lens and different configurations of structured light (2-corner, 4-corner, full rectangle and triangulation), the details of which are presented in [6]. In these experiments reaction time, percentage framed, absolute offset in X and Y, scaling in X and Y, skew, rotation and handshake were measured for both horizontally and vertically orientated documents over a range of viewing distances. Figures 2-b to 2-d illustrate some of the experimental devices in use and Figure 2-a shows how the image of a target document was processed to compute the experimental parameters (for more details see [6]). Significant interactions between device and image plane were found for reaction time and each of the accuracy/error measures. A small but illustrative subset of the results is shown in Table 1. We found that all the laser based devices significantly outperformed the optical viewfinder and the LCD. In general the triangulation device was the most accurate; however, despite this superior performance, subjective data collected immediately following the experiments highlighted the requirement for additional cues to relay the success or failure of the capture task. As we can see from the graph of Figure 3, the experiments also showed that, as expected, motion blur tends to increase with viewing distance. Interestingly, however, the visual feedback afforded by laser based viewfinders served to reduce handshake in both X and Y. The graph also includes the case of using the elbow as a stabilizer, which resulted in a large reduction in handshake with respect to free-holding the device at similar distances.

Figure 2: The image at the top is an example of a document captured using one of the experimental devices. Simple image processing methods enable us to identify calibration targets embedded in the document from which we accurately compute a number of viewing parameters using standard geometrical techniques. Three example cameras with laser framing devices are shown at the bottom of the figure. Those on the top row are based on framing the corners of the document (4-corner on the left; 2-corner on the right) while the device shown at the bottom uses triangulation (where the height of the device is altered to bring a pair of crosses into registration - the user must simultaneously target the center of the document).

3 Specular Reflections

One of the most important technology hurdles for document cameras is controlling the illumination in a way that gives the required image quality across all environments and for a sufficiently wide variety of documents. A particular problem arises with glossy paper (especially in the presence of page curl, as is typically the case along the spine of a magazine page). The specular reflection of an illumination source in the environment from the glossy surface will typically be 500 times brighter than the ambient diffuse reflection of the content of the document itself, and no amount of image processing and/or exposure correction can overcome its effects.

Table 1: Subset of data culled from the experiments, showing the average time taken to capture the document, the percentage of wasted pixels resulting from over-framing, the skew error (deviation from parallel viewing) and the document rotation error for 4 of our capture devices.

3.1 Modelling the specularity

Most glossy surfaces are not perfect mirrors and hence specular reflections, including those of point sources, tend to be extended. This is because the surface structure of such materials tends to be randomly perturbed, either visibly or at a microscopic scale. The complete relationship between light incident at a specific angle and that reflected in a particular direction is contained in the Bidirectional Reflectance Distribution Function (BRDF) of a surface material [7]. In general the BRDF must be measured experimentally; however, in computer graphics the specular part of the BRDF is approximated using the Phong model [8], shown in Figure 4, where the distribution of the specularity is given by a bell-shaped curve according to the function cos^n(φ), where φ is the angle between the line of sight and the principal (mirror-like) reflected ray. When n is very large (> 10000) the specular reflection is mirror-like, while for smaller n it becomes more extended. We shall see later that the Phong model gives a useful approximation to the specular reflections of many glossy document surfaces.
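As a quick illustration of the fall-off this model predicts, the sketch below (ours, not from the paper) evaluates cos^n(φ) and inverts it to find the angle at which the specular term drops below a given fraction of its peak; the exponent n = 200 and the 5%/1% thresholds anticipate the fits reported later in Section 3.3.

```python
import math

def phong_specular(phi_deg, n):
    """Phong specular fall-off cos^n(phi), relative to the peak at phi = 0."""
    return math.cos(math.radians(phi_deg)) ** n

def limiting_angle(n, fraction):
    """Angle phi (degrees) at which cos^n(phi) falls to `fraction` of its peak."""
    return math.degrees(math.acos(fraction ** (1.0 / n)))

# For the n = 200 fit reported for Matte Photo Paper, the specularity
# falls to 5% of its peak near 9.9 degrees and to 1% near 12.25 degrees.
print(round(limiting_angle(200, 0.05), 2))
print(round(limiting_angle(200, 0.01), 2))
```

Inverting the model this way gives a direct route from a fitted exponent n to the limiting specular angle γ used in the placement constraints of Section 3.5.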


Figure 3: Comparison of a single camera device with and without visual feedback. The graph shows mean and standard deviation of the velocity in the Y direction (very similar in X) of the image of the document in mm per second for a number of different viewing distances of the camera above the document. Also shown are the results where the elbow is used to stabilize the arm. As we would expect the perceived motion depends on viewing distance as rotational components dominate. Interestingly visual feedback tends to reduce image motion.

Figure 4: Phong model of specularity

3.2 Constraints on the imaging geometry

Clearly there will be an angle φ_limit beyond which the contribution of the specular reflection term cos^n(φ_limit) becomes practically insignificant with regard to the imaging process. We shall call this limiting angle γ in this paper. Figure 5 shows three examples of how a specularity will appear to the camera. For the simple case of Figure 5-a, where the light source is located at infinity, the angle subtended by the specularity θ will equal the deviation from specular reflection φ, and hence the fall-off of the specularity in the image will also be predicted by the Phong model and we can fit cos^n(θ) directly. Furthermore, the angular visual limit of the specular reflection within the image, θ_max, will give a direct measure of the limiting specular angle γ.

A more realistic location for an active illumination source is at approximately the same distance from the document as the camera itself, as illustrated in Figure 5-b. In this case θ no longer equals φ; it is approximately half of it, the Phong fall-off observed in the image is approximately cos^n(2θ), and similarly γ = 2θ_max. Notice that in both this and the previous case the angular extent of the specularity is independent of the viewing distance, and hence for longer distances the size of the specularity will grow with respect to the underlying image structure. This is not true for the third condition, shown in Figure 5-c, where the surface is cylindrical (a good model of local page curl). In this case, as the surface orientation changes, the overall change in reflected angle will be twice as large, so that the total range of surface angles over which a specularity will continue to be visible will be plus and minus half the specular limiting angle γ. Hence the physical extent of the specularity across a tight-radius cylinder will not tend to change significantly with viewing distance, as it is tied more closely to the surface structure than was the case for specularities on planar surfaces.

Figure 5: Three reflection conditions. (a): Parallel illumination; (b): illuminant at the same distance as the camera; and (c): the case of a highly curved surface. See text.


Figure 6: Cylindrical (left) and fronto-planar (right) specularities on Matte Photo Paper.

3.3 Real specularities

Figure 6 shows planar and cylindrical (60mm diameter) specularities on Matte Photo Paper. Notice that the specularity along the cylinder looks longer than that on the plane. This results from the additional intensity fall-off across the cylinder along its length due to Lambert's Law (the reflectance of a perfectly diffuse surface is proportional to the cosine of the angle between the incident ray and the surface normal); the actual extent of the specularity is in fact unchanged. If we fit the Phong model through vertical cross-sections of the image in Figure 6 passing through the peak of each specularity, we get the results in Figure 7, left and right, for the plane and cylinder respectively. In each case we must double the angle measured in the image θ and fit cos^n(2θ), because the illuminant is as far from the document as the camera (the case of Figure 5-b). In each case the specularity is reasonably well fitted with an exponent n = 200 (only the offset and gain of the fit are changed between the specularities). Such a specularity falls to 5% of its peak value at 9.9° (half angle) and to 1% at 12.25°. Our studies [9] indicate that this represents a useful worst-case model for specular reflections.

Further experiments show that the Phong model is reasonably representative over the full range of flash locations. Specifically, the angle subtended by the specularity is largely independent of viewing distance, and hence for higher camera positions the size of the specularity grows with respect to the document. Also, the shape of the specularity (as opposed to its location) is largely independent of the angle of the flash with respect to the document normal (although the Phong model is well known to break down as the incident angle approaches the grazing angle).

More examples of specular reflections are given in Figure 8 for both a small illuminant aperture (12mm diameter) and an extended aperture (the whole of the 42x55 aperture of a xenon flash). Note that while the mirror-like reflections are very dependent on the size of the aperture, this is not the case for the more diffuse forms of specularity, where the size is almost constant. This fact is very important, as it shows that the worst-case performance cannot be improved simply by reducing the size of the illumination source, as might naively be expected.

Figure 7: Left: Plot of intensity value against pixel location for vertical cross-sections through the specularities of Figure 6; Right: the Phong model fitted through the intensity data with n = 200.

Figure 8: Examples of specular reflections for 3 different paper types for small (top) and large (bottom) strobe apertures. The three paper types shown cover a large range of glossiness from mirror-like to very diffuse: paper type (a) is HP LaserJet Transparency; (b) is the cover of a Black n' Red Notebook; and (c) is a document sleeve.

3.4 Single Flash Issues

As is well known in photography and industrial vision, the best solution to this kind of problem is to "swamp" the ambient light using a combination of fast strobe illumination (a flash) and a short exposure that tightly brackets it. If the exposure is short enough then all the light from the flash can contribute to the image without allowing much build-up from the ambient specular reflection (or any other ambient light for that matter). However, for exposures much over a millisecond the effects of the ambient illumination may be noticeable (but not dominant), in which case it may be possible to take a second snapshot with the same exposure (but no flash) and subtract this from the first to remove the ambient contribution.

Using a single strobe at a low angle of illumination (in order to avoid specular reflections from planar surfaces) presents a number of issues. First, there is a considerable illumination profile across the document, though this can be ameliorated to some extent using a compensating reflector. Second, small variations in the surface orientation of the document lead to relatively large changes in reflected intensity when compared to less oblique illumination, and also lead to shadows. Third, and most importantly, for curled documents with a high gloss finish it is still easy to produce specular reflections from the strobe; this is also true for 3D objects.

Figure 9: (a): At any point on the object plane the location of a first strobe presents a constraint 2γ on the allowed location of a second in order to prevent specular reflections overlapping on the document surface. (b) Realistic dual flash location for 1/3 shift-lens. X = 90 and Y = 265.
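The flash-plus-ambient-subtraction step described above amounts to simple image arithmetic; the numpy sketch below (our illustration, with made-up array names) shows the idea for linear sensor data.

```python
import numpy as np

def remove_ambient(flash_exposure, ambient_exposure):
    """Subtract an ambient-only exposure from a flash exposure.

    Both frames must share the same exposure time so that the ambient
    contribution is identical; pixel values are assumed to be linear
    sensor counts (i.e. before any gamma correction).
    """
    diff = flash_exposure.astype(np.int32) - ambient_exposure.astype(np.int32)
    return np.clip(diff, 0, 255).astype(np.uint8)

# Toy example: an ambient contribution of 200 counts on top of 50 counts
# of flash-lit document content.
flash = np.full((4, 4), 250, dtype=np.uint8)
ambient = np.full((4, 4), 200, dtype=np.uint8)
print(remove_ambient(flash, ambient)[0, 0])
```

The widening to int32 before subtracting matters: subtracting uint8 arrays directly would wrap around rather than clip where the ambient frame is brighter (e.g. in a moving highlight).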

3.5 Dual Flash Solution

Given the near-impossibility of removing glare we have proposed a dual flash approach [10]. In this scheme, rather than trying to eliminate the specularity from the document surface, we take two pictures with separately located glare spots such that a single, glare-free document image can be constructed from them. This requires that the specularities produced by the two strobes are (practically) always spatially distinct on the document surface. Figure 9-a shows that, given a first strobe location, a constraint is placed on the possible location of the second strobe in order to prevent specular reflections. We have shown [9] that the constraint is in fact remarkably simple: anywhere in the field of view, the angle subtended between the two strobes must be at least 2γ (twice the limiting specular angle). Using a limiting specular angle of 15°, which accounts for the vast majority of specular images we have measured plus a small safety margin, implies that the angle between the strobes must be at least 30° for every point in the field of view. Figure 9-b shows a simple example of using this constraint for a 1/3 partial shift-lens design (i.e. the field of view of the camera is asymmetric) where one flash is placed as close as possible to the camera. Note that in practice a similar analysis is done into the corners of the document, not just the top and bottom of the page.

Figure 10: Dual-flash experimental rig.
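The 2γ placement rule is easy to check numerically. The sketch below (ours; the geometry and units are illustrative, not the paper's rig dimensions) samples points on the document plane and verifies that the angle subtended at each point by the two strobe positions never falls below 2γ.

```python
import numpy as np

def min_subtended_angle(strobe_a, strobe_b, doc_points):
    """Smallest angle (degrees) subtended by two strobes over document points.

    strobe_a, strobe_b: 3D strobe positions.
    doc_points: (N, 3) array of points on the document surface (z = 0).
    """
    va = np.asarray(strobe_a) - doc_points          # vectors point -> strobe A
    vb = np.asarray(strobe_b) - doc_points          # vectors point -> strobe B
    cosang = np.sum(va * vb, axis=1) / (
        np.linalg.norm(va, axis=1) * np.linalg.norm(vb, axis=1))
    return np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0))).min()

# Illustrative A4-sized document on z = 0 (mm), with two strobes at camera
# height z = 300, placed either side of the page.
xs, ys = np.meshgrid(np.linspace(0, 210, 22), np.linspace(0, 297, 30))
points = np.column_stack([xs.ravel(), ys.ravel(), np.zeros(xs.size)])
strobe_left = (-60.0, 148.0, 300.0)
strobe_right = (270.0, 148.0, 300.0)

gamma = 15.0  # limiting specular angle, degrees
ok = min_subtended_angle(strobe_left, strobe_right, points) >= 2 * gamma
print(ok)
```

In a real design the same check would be run over the full field of view including the document corners, as the text notes, since the subtended angle is smallest at the points furthest from both strobes.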

3.6 Experiments with the dual flash system

We have built an experimental document capture rig (Figure 10) based upon the DVC 1300C 1.3 megapixel colour camera, which uses the Sony ICX085AK sensor with 6.7 µm square pixels. In one mode of operation, see Figure 11, it is equipped with a pair of controllable xenon flash units (from the company Active Silicon) which deliver 1.0 Joule in 7 µs. These are arranged symmetrically either side of the camera, subtending an angle in excess of 30° over the whole field of view.

Some of the results obtained from the experimental rig are shown in Figure 11-a to Figure 11-c. In each case the images with the left and right hand strobes are shown, with the combined result in the foreground. Each image is compensated according to a calibration profile obtained with a clean white test card. This makes allowance for the illumination profile of the individual flash units and performs a white balance operation for the spectral properties of the strobe illumination. As can be seen, all results are very promising, giving good, even illumination and freedom from the glare present in the individual strobe images. Notice that in Figure 11-a the curl of the document is in the same direction as the separation between the strobe units while in Figure 11-b the curl is in the orthogonal direction. In either case the specularities are easily dealt with, but the distance between them is much larger (and hence more robust) for the horizontal curl (Figure 11-b), as would be expected. The final example in Figure 11-c is more complex as it includes curl in every direction. Even in this case the method deals nicely with the problem.

Figure 11: Examples of results obtained using the dual flash system. See text.

4 Colour Mosaic Imaging

Digital color cameras almost exclusively employ a mosaiced image sensor in which a single CCD array employs three or more color filters, with each pixel capturing a single colour. The most popular arrangement is the Bayer pattern, which has alternating rows of Green-Red and Blue-Green filtered pixels. Hence 50% of the pixels are green, approximating luminance, and 25% each are red and blue, reflecting the lower spatial sensitivity to chrominance of the human visual system. Furthermore, to overcome sampling errors in the under-sampled colour channels, an optical anti-aliasing filter is usually employed immediately on top of the sensor. Clearly this arrangement is sub-optimal from the point of view of capturing high resolution images of text. While a color capture device is desirable and increasingly becoming essential, it is unfortunate that we must suffer such a penalty in terms of monochrome text resolution in order to achieve it. The first and most obvious thing to do is to dispense with the anti-aliasing filter, though this does tend to introduce some aliasing artifacts. However, this still leaves the issue of full color plane reconstruction, or demosaicing, which requires the interpolation of each color plane (see [11] for a review of interpolation schemes).

4.1 HiPass demosaicing

In order to maintain as high a text resolution as possible and reduce aliasing artifacts we have developed the HiPass demosaicing method [12] outlined in Figure 12. The method attempts to decompose the mosaic into high frequency monochrome data and low frequency RGB data. These can then be combined to approximate the true full colour image. While some chrominance edges are attenuated, the method provides full resolution monochrome text and avoids most colour aliasing artifacts in photographic images.

Figure 12: HiPass Demosaicing. The red, green and blue pixels that constitute the mosaic are separately smoothed using a low pass filter; monochrome high-frequency data is obtained by pixel-wise subtracting from each red, green and blue pixel of the mosaic the correspondingly colored low-pass filtered version; following a gradient based correction to reduce zippering, this high-frequency data is added to each of the low-pass images to create a full color, full spectrum reconstruction.

For each red, green or blue pixel in the raw mosaic (shown expanded into red, green and blue color planes in Figure 12), the corresponding pixel in the HiPass mosaic is constructed by subtracting from it a corresponding low pass red, green or blue pixel constructed from the appropriate color plane of the mosaic. This is then corrected to remove zippering artifacts using a variant of the standard gradient-based interpolation schemes outlined in [11]. For example, each pixel of the HiPass mosaic that is derived from a red or blue pixel can be corrected using 1D quadratic interpolation of the 5 neighboring pixels in the vertical or horizontal direction, depending on which has the lower intensity gradient. The interpolation is of the form C_0 = (2I_0 − I_{−2} − I_2 + 4I_{−1} + 4I_1)/8, where I and C are the intensities of the HiPass mosaic and its corrected version, respectively. The corrected HiPass mosaic is then added to each of the low pass color images to generate the fully reconstructed image.
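A minimal numpy sketch of this decompose-and-recombine idea follows; it is our simplified reading of the pipeline (a GRBG Bayer layout is assumed, box low-pass filters are used, and the gradient-based zippering correction is omitted), not the exact HP implementation.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def bayer_masks(shape):
    """Boolean masks for a GRBG Bayer layout: G R / B G row pairs."""
    h, w = shape
    yy, xx = np.mgrid[:h, :w]
    r = (yy % 2 == 0) & (xx % 2 == 1)
    b = (yy % 2 == 1) & (xx % 2 == 0)
    g = ~(r | b)
    return r, g, b

def hipass_demosaic(mosaic, size=5):
    """HiPass-style demosaic: low-pass each sparse color plane, take the
    per-pixel high-frequency residual, and add it back to every plane."""
    mosaic = mosaic.astype(np.float64)
    masks = bayer_masks(mosaic.shape)
    lowpass = []
    for m in masks:
        # normalized box filter over the sparse samples of one color plane
        num = uniform_filter(mosaic * m, size)
        den = uniform_filter(m.astype(np.float64), size)
        lowpass.append(num / den)
    # high-frequency mosaic: each pixel minus the low-pass of its own color
    hipass = mosaic - sum(lp * m for lp, m in zip(lowpass, masks))
    # recombine: every color plane receives the shared high-frequency detail
    return np.stack([lp + hipass for lp in lowpass], axis=-1)  # R, G, B
```

For a neutral (gray) mosaic the residual carries all the spatial detail, which is why text, being mostly neutral, survives at full sensor resolution; strong chrominance edges, by contrast, leak partly into the low-pass planes and are attenuated, as the text notes.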

4.2 Results and OCR tests

Figure 13 shows how HiPass demosaicing gives improved monochrome text when compared to a demosaicing scheme optimized for photographic reproduction in the presence of an anti-aliasing filter (in this case the demosaicing scheme used in the HP850 digital camera). In each case the same raw mosaic image data was used as input. This improved image quality feeds straight through to OCR error rates, as shown in the block graph of Figure 14-b for test charts similar to the one in Figure 14-a. In each case text is captured at 150 dpi and interpolated to 300 dpi after demosaicing, with the option of applying sharpening (using the standard unsharp mask [13]) in between. It is clear that, with or without sharpening, HiPass demosaicing results in much improved OCR, with almost half the error rate in each case. Finally, Figure 15 shows a comparison between text captured with a camera (the DVC 1300C) and a scanner (the HP ScanJet 6100C). Thanks to HiPass demosaicing the results from the camera are very close to those of the scanner, despite the fact that the original image was in the form of a mosaic.

Figure 13: A comparison between HiPass demosaicing and a typical photo demosaicing scheme for a 150 dpi image of 6 point Times Roman text, sharpened and bi-cubically interpolated to 300 dpi.

Figure 14: (a) 6 point OCR test chart. (b) OCR error rates for text and photo demosaicing with and without additional sharpening for 6 and 8-point Times and Arial test data.

5 Correcting perspective page skew

One of the main disadvantages when capturing a document with a camera is that the non-contact image capture process causes geometric distortions dependent upon the camera orientation, in particular perspective skew, as typified in Figure 16-a. In our investigation of this problem we have identified a robust and general solution to rectify text or highly structured document images which can be seen as complementary to other approaches in the literature. Whereas the geometry of the rectification is well known [14], the problem of passively and robustly detecting the geometric features needed is still open. Even recent works explicitly addressing camera-based document imaging, such as [15], treat only rotation-induced skew. We address the problem of passively deskewing a captured document, namely the detection of linear clues that can be used as geometric primitives to determine the document plane orientation with respect to the camera and hence the rectifying homography [16].

Figure 15: Comparison of camera text vs scanner text at the same resolution (camera on the left, scanner on the right). The first column of numbers gives the dots (pixels) per standard character height that defines the point size. Each subsequent column gives the character size this represents for a given dpi (at the top) and equivalent megapixel count (at the bottom).

Figure 16: (a): Illustration of perspective deskew in documents and the linear clues. (b): the desired output that can be obtained with a rectifying homography using vertical and linear clues.


Figure 17: (a): the binarized image; (b): the association network; (c): the curvilinear groups extracted; (d): the fitted line bundle.

A substantial body of research has been dedicated to text and page segmentation in document images, but any distortion considered is again only rotation-induced. The main bottom-up methods used include many variations on projection-profile approaches, Hough-inspired techniques [17] and nearest-neighbor clustering [18]. An interesting approach that employs perceptual organization principles is [19], although it assumes parallel lines. Works dealing with the recovery of vanishing points from images are numerous, some from edges (e.g. [20]) and others from texture and other soft clues [21]. However, neither edges nor regular textures can be generally assumed in text documents. The work in [22] is one of the extremely few that tries to extract text from perspectively skewed documents, but it does so using the document quadrilateral, which we do not assume is always visible¹.

5.1 Illusory clues

Figure 16-a shows several kinds of linear clues that may arise in practice. Clue A is a vertical illusory [25] clue; clue B is a vertical hard line, in this case the projection of the actual document edge; clues C are horizontal illusory lines, inferred from the arrangement of characters into text lines; clue D is a horizontal hard line; finally, clue E is a quadrilateral that can be either illusory or correspond to an actual rectangular outline in the document (e.g. a figure box or the four document boundaries). Hard edges of the type B and D can be detected rather trivially in many ways (e.g. the Hough transform) and will not be addressed as such in this paper. Illusory edges such as C and A are difficult to find reliably in practice, and most of the literature on document analysis has focused only on the problem of recovering groups of parallel text lines to compensate for rotation for OCR and other scanning applications.

¹ A later publication by the same authors [23] follows the same principles laid out in our earlier work [16] and [24].

5.2 Extraction of horizontal linear clues

The algorithm to extract the horizontal illusory lines, which is extensively described in [16], is summarized here. A preprocessing stage [26] binarizes the input image, turning it into blobs representing single characters, (portions of) words or lines, depending upon the font size and the resolution considered (Figure 17-a). These blobs are divided into elongated (major axis longer than three times the minor axis) or compact. A pairwise saliency measure, representing how likely two neighboring blobs are to be part of the same text line, is computed for pairs of neighboring blobs. A network (Figure 17-b) is then built using the blobs and their associations. The network is then traversed to extract salient linear groups of blobs, which constitute the illusory horizontal clues (Figure 17-c). Isolated elongated blobs are also considered as individual clues. Note that the saliency measure used is based on perceptual organization rules and is hence independent of font type and size.
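As a toy stand-in for the saliency network, the sketch below (ours; a greedy nearest-neighbour chainer, much simpler than the perceptual-organization measure of [16]) groups blob centroids into roughly horizontal chains.

```python
import math

def group_text_lines(centroids, max_gap=30.0, max_angle_deg=20.0):
    """Greedily chain blob centroids into near-horizontal lines.

    centroids: list of (x, y) blob centers.
    max_gap: largest allowed jump between consecutive blobs in a chain.
    max_angle_deg: largest allowed deviation of a link from horizontal
        (a real system must allow for perspective convergence here).
    """
    pts = sorted(centroids)                 # left-to-right scan order
    used, lines = set(), []
    for start in range(len(pts)):
        if start in used:
            continue
        chain, cur = [pts[start]], start
        used.add(start)
        while True:
            best, best_d = None, max_gap
            for j in range(cur + 1, len(pts)):
                if j in used:
                    continue
                dx = pts[j][0] - pts[cur][0]
                dy = pts[j][1] - pts[cur][1]
                d = math.hypot(dx, dy)
                if d < best_d and abs(math.degrees(math.atan2(dy, dx))) < max_angle_deg:
                    best, best_d = j, d
            if best is None:
                break
            chain.append(pts[best])
            used.add(best)
            cur = best
        lines.append(chain)
    return lines

# Two synthetic rows of character blobs should give two separate chains.
row_a = [(x, 0.0) for x in (0.0, 12.0, 24.0, 36.0)]
row_b = [(x, 100.0) for x in (0.0, 12.0, 24.0, 36.0)]
lines = group_text_lines(row_a + row_b)
print([len(chain) for chain in lines])
```

Unlike this greedy sketch, the saliency of [16] scores each candidate link, so competing groupings are resolved globally over the network rather than first-come-first-served.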

5.3 Partial rectification

Once we have a pool of horizontal clues (Figure 17-c), we can perform a partial rectification (or deskew) of the document. The first step is to fit a line to each horizontal clue using straightforward linear regression. Figure 17-c shows an example of the fitted lines, which now represent the linear clues. The second step is to find the vanishing point [27] (vx, vy) in the image. There are several techniques in the literature, but most of them would be impractical here due to the relatively imprecise linear clues. We found that excellent results are obtained using a RANSAC approach [28] in the linear clue feature space to explicitly fit a line bundle of the form y − vy = m(x − vx). The bundle is shown in Figure 17-d. In the absence of vertical clues, full rectification is clearly not possible, and hence the final step of this partial rectification is to find a homography in the image plane that makes all the bundle lines horizontal. The homography we compute follows the suggestion of [29], which argues that, amongst the infinitely many plane homographies that would perform this transformation, good results are achieved by the homography closest to a Euclidean transformation. Following [29], the homography is a concatenation of four transforms: shifting the image to the origin of the coordinates, rotating the bundle to position the vanishing point on the horizontal axis, sending the bundle to infinity, and shifting the image back. More details can be found in [16]. Figures 18-a to 18-d show four results of the application of this partial deskewing.

Figure 18: Four partial deskewing results using the horizontal bundle fitted to the detected illusory linear clues.
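A minimal sketch of the RANSAC bundle fit: hypothesize a vanishing point from a random pair of clue lines and keep the hypothesis that the most lines agree with. The line parameterization y = mx + c and the tolerance are illustrative choices, not the exact formulation of [28] or [16]:

```python
import random

def intersect(l1, l2):
    """Intersection of lines y = m*x + c, or None if near-parallel."""
    (m1, c1), (m2, c2) = l1, l2
    if abs(m1 - m2) < 1e-9:
        return None
    x = (c2 - c1) / (m1 - m2)
    return x, m1 * x + c1

def point_line_dist(p, line):
    """Perpendicular distance from point p to the line y = m*x + c."""
    m, c = line
    x, y = p
    return abs(m * x - y + c) / (m * m + 1.0) ** 0.5

def ransac_vanishing_point(lines, iters=200, tol=5.0, seed=0):
    """Fit a bundle y - vy = m*(x - vx): hypothesize (vx, vy) from a
    random pair of clue lines, keep the hypothesis with most inliers."""
    rng = random.Random(seed)
    best, best_inliers = None, []
    for _ in range(iters):
        p = intersect(*rng.sample(lines, 2))
        if p is None:
            continue
        inliers = [l for l in lines if point_line_dist(p, l) < tol]
        if len(inliers) > len(best_inliers):
            best, best_inliers = p, inliers
    return best, best_inliers
```

In practice the inlier set would be used for a final least-squares refit of the bundle, which the sketch omits.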

5.4 Extraction and use of illusory vertical clues

Setting aside the case of true vertical edges (clues B in Figure 16), we would like to extract illusory clues arising from high-level text organization, such as a paragraph's left justification. The approach followed is similar to the one for horizontal clues in that we use an association network of blobs (Figure 19-a) that is searched for linear groups. However, the way the associations are made is substantially different. In fact, since the vertical associations are rather weak, we have noticed that it is best not to risk weakening them further with saliency measures such as those used for the horizontal clues. Rather, we have decided to reject associations only on the grounds of near-impossibility and let all the others propagate until further committal. The rejection rules are simple and based on perceptual organization; more details can be found in [16]. The application of these pruning criteria dramatically reduces the dense association network, as shown in Figure 19-a. Noticeably, most of the associations belonging to actual illusory vertical clues are preserved, albeit amongst a sea of other meaningless ones. Finally, a greedy split-and-merge strategy is used to group all these associations into extended near-vertical linear groups, as shown in Figure 19-b. The strongest groups are elected as vertical clues, but incorrect ones always crop up amongst them. Figure 20 shows six examples of the sets of vertical clues detected. Now, armed with the knowledge of the horizontal vanishing point, and a few correct vertical clues that can be used to determine a second vanishing point, we have all the geometric information [24] both to test the correctness of vertical clues (if we know the focal length) and then to perform full rectification, resulting in an undistorted document like that shown earlier in Figure 16-b.

Figure 19: (a): Example of pruned association network for vertical clues; (b): Best illusory vertical clues, still with a large number of outliers.
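Once both vanishing points are known, the projective component of the distortion can be removed by a homography that sends their join (the vanishing line of the page plane) to the line at infinity. This is only the projective part of the full rectification described in the text; the Euclidean-closest decomposition of [29] and the focal-length test are not reproduced. A sketch under that assumption, with hypothetical vanishing-point values:

```python
def cross3(a, b):
    """Cross product of homogeneous 3-vectors (used as line join)."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def affine_rectify_row(vp_h, vp_v):
    """Third row (l1, l2, 1) of the homography
    H = [[1,0,0],[0,1,0],[l1,l2,1]] that maps the vanishing line
    l = vp_h x vp_v to the line at infinity, sending both vanishing
    points to infinity and removing the projective distortion."""
    l = cross3(vp_h, vp_v)
    return (l[0] / l[2], l[1] / l[2], 1.0)

def apply_H(row3, p):
    """Apply H (identity except for the given third row) to a
    homogeneous point p; only the scale component w changes."""
    w = row3[0] * p[0] + row3[1] * p[1] + row3[2] * p[2]
    return (p[0], p[1], w)
```

A vanishing point mapped to w = 0 lies on the line at infinity, i.e. the corresponding pencils of image lines become parallel after the warp.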

6 Undoing Paper Curl Distortion

Figure 20: Six examples of vertical clues found.

When scanning a book or document with a flatbed scanner, the user can minimize page curl effects by pressing it against the platen. When capturing with a camera, in this case a desktop camera, there is no way to control page curl such as that shown in Figure 21-a, and a scanned document might well look like that of Figure 21-b. This unsightly image distortion caused by page curl can be corrected given the three-dimensional profile of the surface, which can be used to geometrically undo the effect of curl and produce an undistorted version of the document. We have explored the two main aspects of the problem, namely designing a viable, cheap method to collect 3D profile data and using that data to undo the curl. In this section we present an overview of the techniques, which are discussed in more detail in [30] and [31].

6.1 Recovery of 3D information

We have developed a viable and cheap method to recover the 3D profile (in a general sense) of an imaged document by projecting a known 2D structured light pattern and triangulating it with the image of it taken by the camera. The use of a 2D pattern, as opposed to a single stripe or point, is particularly desirable in this application because it does not require expensive moving parts (and electronics) and allows us to recover the page curl in a single shot, rather than by sweeping a 1-D pattern over the page, which requires several seconds with an ordinary camera.

Figure 21: Example of a thick curled book under a desktop camera-based scanner (a) and its image (b).

6.1.1 Pagewide structured light projector and its modelling

Figure 22-a shows the prototype we have built to demonstrate the approach. We used a 2D light pattern in the form of a set of 15 stripes generated by a laser with a diffractive grid that covers an A4 document, but other configurations are obviously possible. The structured light projector was manufactured by Lasiris (now part of Stocker & Yale, Inc. [5]) and uses a patented diffractive grid technology to produce non-Gaussian stripes and various other patterns, illustrated in Figure 22-b. To generate a set of stripes, the product cleverly uses two diffractive grids. The first transforms a laser beam into a line by overlapping many Gaussian light profiles, each centered around a peak of the diffraction pattern. The second diffracts this line into 15 lines. This particular laser was chosen because it had the right parameter configuration to cover a whole A4 page with its stripes at a distance of about 30 centimeters. Due to the formation process of the stripes, each light sheet is subject to conical aberration and is hence a conic surface. From information provided by the manufacturer of the diffractive gratings we were able to use specific diffractive optics equations to model the peaks of each stripe. However, what we actually needed was a way of mathematically modelling the entirety of each light sheet, to be able to express the triangulation in closed form. To do so we pragmatically used the diffractive model to create another, more useful model consisting of the equation of each conical light sheet.

Figure 22: (a) The rig used in the experiments consisted of a camera and a structured light projector generating a sparse pattern on the surface of the document. (b) The structured light pattern used generates 15 conic light sheets which can be modelled in closed form.

6.1.2 Stripe identification and triangulation

In order to perform the triangulation it is necessary to determine where the lines are and which one corresponds to which conical sheet. There are two distinct parts in this process: stripe detection and stripe labelling, the latter ensuring that we know which conical light sheet (and hence its equation) generated each detected stripe we see on the document. To reliably detect the stripes in the image we made some assumptions suitable to our context: the acquisition of 3D data for a static document can be done simply by briefly flashing the laser pattern and synchronizing it with the image capture. In this way we always have two perfectly overlapping images, one with the pattern and one without it, which allows us to use image differencing to make the stripes stand out. Even so, it was necessary to apply an enhancing operator to the image difference. Given the prevalently horizontal lines (see Figure 23-a), the operator we used was a 1-D Laplacian kernel (second derivative) applied only in the direction orthogonal to the stripes. Stripe labelling is carried out by a voting method, which is very robust in general situations and can smoothly cope with gaps. First the binary stripes are thinned down to one-pixel thickness and connected pixels are joined together into strings via a standard connected-component module. Next, strings that are too short are removed from the classification and deemed noise. Then, for each string, a heuristic strength measure is computed and, for each image column, starting from the top of the image, we assign successive, increasing label numbers to, and only to, the 15 strongest stripe points. Finally, each string is assigned the label most popular amongst all the points of that string. Three-dimensional data points are then obtained via triangulation, a well-known method for laser-based range finders (see [32],[33] for a good introduction), which consists of finding the intersection between the known conical sheet of light and the optic ray going through a given point of the stripe projection in the image. An example of the semi-sparse data we obtain is given in Figure 23-b.

Figure 23: (a): Stripes projected on a curled document surface. (b): Corresponding sparse profile data recovered by detecting and labelling the stripes, followed by triangulation.
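Modelling each labelled light sheet as an ideal cone (apex, unit axis, half-angle), the triangulation reduces to intersecting the camera's optic ray with a quadratic surface. The sketch below solves the resulting quadratic; the cone parameterization is an idealization of the aberrated sheets described above, and all numerical parameters are hypothetical:

```python
import math

def triangulate_ray_cone(o, r, apex, axis, half_angle):
    """Intersect the camera ray p = o + t*r with the conical light
    sheet {p : angle(p - apex, axis) = half_angle}.  Substituting the
    ray into ((p-a).u)^2 = cos^2(theta)*|p-a|^2 gives a quadratic in t;
    the smallest positive root (in front of the camera) is returned,
    or None if the ray misses the cone."""
    dot = lambda a, b: sum(a[k] * b[k] for k in range(3))
    c2 = math.cos(half_angle) ** 2
    w = [o[k] - apex[k] for k in range(3)]
    A = dot(r, axis) ** 2 - c2 * dot(r, r)
    B = 2.0 * (dot(r, axis) * dot(w, axis) - c2 * dot(r, w))
    C = dot(w, axis) ** 2 - c2 * dot(w, w)
    disc = B * B - 4.0 * A * C
    if abs(A) < 1e-12 or disc < 0.0:
        return None
    ts = [(-B - math.sqrt(disc)) / (2.0 * A),
          (-B + math.sqrt(disc)) / (2.0 * A)]
    ts = [t for t in ts if t > 1e-9]   # discard the opposite nappe / behind-camera hits
    if not ts:
        return None
    t = min(ts)
    return [o[k] + t * r[k] for k in range(3)]
```

Repeating this for every labelled stripe pixel yields the semi-sparse profile of Figure 23-b.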

6.2 Undoing curl

This section is concerned with reconstructing the flat state of a curled document or book, supported by sparse depth measurements of its surface such as those obtained with the method of the previous section. A novel method based on a physical model of paper deformation as an applicable surface is proposed, and a relaxation algorithm is described that allows us to fit this model and unroll it to a plane so as to produce the undistorted document.

6.2.1 The problem and related techniques

A number of techniques have attempted to correct page curl using some form of depth measurement (e.g. [34],[35],[36]), but each of them has been restricted to the recovery and correction of curl using a simplified cylindrical model of deformation rather than the more general case of interest here.

Figure 24: A surface cannot in general be unrolled onto a plane, as it is not isometric with it. This example shows unrolling onto a plane using finite differences, which results in a highly distorted mapping.

Naively, we could fit a surface to the 3D data and unroll it to a planar state. However, a general surface cannot be unrolled to a plane without stretching or tearing, and thus this approach would inevitably cause local and global distortions. Figure 24 illustrates this point. A cloud of imprecise surface measurements has been smoothed and a B-spline fitted, which exhibits little bumps in some regions due to the inevitable measurement errors and bias. If we want to undo the curl of the document and make it look flat, we have to texture-map patches from the original image onto patches of a plane. This mapping could be computed, for instance, by integration of finite differences over the meshed surface, as shown in Figure 24. However, by definition, a non-applicable surface can only be unrolled onto a plane by either tearing or stretching, which would cause unnatural texture distortions in the unfolded document image. In addition, due to the integrative nature of unrolling a surface, locally small errors tend to build up and lead to unsightly distortions at the edges. Figure 25-a shows the distortions in the reconstructed texture. A problem similar to the unfolding of curled paper arises in cloth deformation modelling in computer graphics, as inextensible cloth has the same differential-geometry properties as paper [37]. In particular, [38] explicitly tries to map a flat texture onto an arbitrary surface and points out that the mapping is isometric (distortion-free) only if the surface is applicable; this technique, although addressing the opposite problem, resembles the one we used.

6.2.2 Modelling curled paper with a discrete applicable surface

Curled paper is mathematically represented by an applicable surface, which has the property of being isometric with the plane and thus easily unrolled. A G0-continuous surface is called an applicable surface when its Gaussian curvature [39] vanishes at every point. For this very reason, applicable surfaces are isometric with the plane and can be flattened without stretching or tearing. The analytical definition of an applicable surface is impractical, and thus we use a triangular truss (mesh) as a finite-element approximation to the surface. The mesh can deform, but if adjacent node inter-distances are kept constant, it is the best discrete approximation to a developable surface that a particular mesh resolution and topology allow. Making the mesh finer can make the approximation error arbitrarily small, allowing us to model, for instance, paper creases. However, in general it is not possible to split triangles and refine the mesh locally in an adaptive fashion to reduce the fitting error once the mesh has started deforming, since the resulting mesh might no longer unroll onto a plane.

Figure 25: (a): The book of Figure 21 curl-corrected by unrolling a B-spline surface. (b): The same book curl-corrected with the approach presented here, showing considerably less distortion thanks to the use of applicable surfaces.
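The vanishing-Gaussian-curvature condition has a simple discrete analogue: at an interior mesh vertex, Gaussian curvature is approximated by the angle deficit, 2π minus the sum of the angles of the incident triangles, which is zero everywhere on a developable mesh. A small sketch of that check (the mesh representation here is illustrative, not the paper's truss data structure):

```python
import math

def angle(p, q, r):
    """Angle at vertex p in the 3D triangle (p, q, r)."""
    a = [q[k] - p[k] for k in range(3)]
    b = [r[k] - p[k] for k in range(3)]
    dot = sum(a[k] * b[k] for k in range(3))
    na = math.sqrt(sum(c * c for c in a))
    nb = math.sqrt(sum(c * c for c in b))
    return math.acos(max(-1.0, min(1.0, dot / (na * nb))))

def angle_deficit(vertex, ring):
    """Discrete Gaussian curvature at an interior vertex: 2*pi minus
    the sum of the incident triangle angles, where `ring` lists the
    one-ring neighbours in cyclic order.  Zero deficit everywhere is
    the discrete counterpart of a developable (applicable) surface."""
    total = 0.0
    for i in range(len(ring)):
        total += angle(vertex, ring[i], ring[(i + 1) % len(ring)])
    return 2.0 * math.pi - total
```

A flat (or merely bent) neighbourhood has zero deficit, while a cone-like point such as a dog-ear tip has a strictly positive one.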

6.3 Fitting the applicable surface to 3D data

The computational approach to fitting an applicable surface to sparse 3D data is based on representing the applicable surface as a polygonal mesh. First, a mesh of suitable dimension, with known inter-distances between nodes, is initialized on the sparse 3D data points; at this stage the surface could not possibly be unrolled onto a plane, as the nodes have moved about. Next, an iterative optimization method adjusts the points such that the initial distances between them are restored, thereby obtaining isometry with the plane. In [31] it is shown that this process converges to a developable state and that, more importantly, in doing so it optimally approximates the noisy 3D data with an applicable surface. A final texture-mapping stage produces the undistorted version of the document. One of the advantages of the process illustrated here is that we do not need to unroll the fitted surface onto a plane: since we started with a plane in the first place, we already have a one-to-one correspondence between fitted surface tiles and the tiles of the plane. Figure 25 shows two different curl-corrected versions of the curled book shown in Figure 21. The image of Figure 25-a has been reconstructed by fitting a B-spline surface to lightly smoothed input data and unfolding by integration starting from the center. Where there is no 3D data, the reconstruction goes astray (see, e.g., the top-left side of the book), causing considerable distortion. On the other hand, the result in Figure 25-b is the output of the proposed method. The reconstruction presents far less distortion (even in the areas that are not covered by 3D data; see the gaps between the stripes of Figure 23-a) and the page edges are straight. (Note that the irregular edge at the right-hand side of the image is the boundary of the mesh, not that of the reconstructed document image, which is straight.) The slight indentation on the book spine is due to the coarse mesh representation used (20 by 15 nodes), which could not accurately follow the spine curvature. By using a denser mesh we can easily overcome this problem, but at the cost of a longer convergence time. The slight residual global distortion is due to a bias in the 3D data caused by imprecise calibration of the cheap 3D acquisition system used.
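The distance-restoring relaxation can be sketched as an iterative constraint projection, in the spirit of (but much simpler than) the optimization analysed in [31]; the step size and iteration count below are arbitrary:

```python
def relax_mesh(nodes, edges, rest, iters=500, step=0.5):
    """Toy relaxation sketch: repeatedly move the endpoints of each
    edge toward/away from each other so that the edge length returns
    to its rest (flat-state) value.  Driving all edge strains to zero
    is the discrete analogue of restoring isometry with the plane.

    nodes: list of 3D points; edges: list of (i, j) index pairs;
    rest: the flat-state length of each edge."""
    nodes = [list(p) for p in nodes]
    for _ in range(iters):
        for (i, j), length in zip(edges, rest):
            dx = [nodes[j][k] - nodes[i][k] for k in range(3)]
            d = sum(c * c for c in dx) ** 0.5
            if d < 1e-12:
                continue
            corr = step * (d - length) / d   # fraction of the error to remove
            for k in range(3):
                shift = 0.5 * corr * dx[k]   # split the correction between endpoints
                nodes[i][k] += shift
                nodes[j][k] -= shift
    return nodes
```

The full method additionally pulls the nodes toward the measured 3D points at each sweep, so the converged mesh both satisfies the length constraints and approximates the data; that data term is omitted here for brevity.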

7 Conclusion

In this paper we have explored those aspects of document capture that are specific to cameras. In particular we have presented the following results and methods.

Firstly, we have presented the results of a number of user trials that show that direct visual feedback using structured light provides a very effective form of framing for document capture with a camera-like device. Subjects were both faster and more accurate with such framing aids than with traditional solutions. Even very restricted forms of direct visual feedback make users better able to frame documents and tend to reduce handshake, which in turn allows for increased exposure times.

One of the great strengths of traditional scanner-based document capture devices is their tight control of illumination. In an attempt to mirror this for a desktop document camera, we proposed to overwhelm ambient illumination using a flash in conjunction with a tightly bracketed exposure. This is particularly important for glossy media such as magazine pages, where the specular reflection of an ambient light source would otherwise dominate the image. We have shown that, for a dual flash configuration in which the angle subtended between the two strobes is greater than 2γ, it is unlikely for specularities from the two flashes to overlap, and a single image free from glare can be robustly generated.

Thirdly, we have shown that improved document image resolution can be obtained from a mosaiced color sensor if we dispense with the anti-aliasing filter and use a novel demosaicing algorithm to interpolate the missing colors. This in turn leads to improved OCR performance when compared with methods designed to optimize photographic quality.

Next, we have addressed the issue of how to deal with the perspective skew that is typical of hand-held capture. We show that even in cases where the full page is not visible (the page boundary being what is typically used to perform perspective deskew), we can use the little structure available in the document to compute the rectifying homography that can be used to perspectively correct the document.

Finally, we tackled another important issue typical of camera-based document capture: page curl. We have shown that extremely cheap diffractive optical elements can be used to generate structured light from which we can recover sparse 3D profile information. This information can be used to dewarp the document, and we have done so using a geometrical model of paper fitted onto the sparse and noisy 3D data.

Regardless of the specific solutions we have explored, only if we address each of these issues will the gap between taking a photograph of a document and capturing the document itself be closed. It is clear that each of them is of great importance if we are to build a real document capture device based on camera technology.

References

[1] D. Frohlich. Scanners for remote collaboration: identifying requirements from the literature. HP Labs Internal Technical Report HPL-2001-27, 2001.

[2] D. Frohlich, C. Stott, H. Lacohee, and A. Kidd. Condor field trial evaluation. HP Labs Internal Technical Report HPL-94-49, 1994.

[3] S. B. Pollard, M. Pilu, and A. C. Goris. Framing aid for a document capture device. European Patent Application EP1128655, 2000.

[4] V. A. Soifer and M. A. Golub. Laser beam mode selection by computer generated holograms. CRC Press, Boca Raton, 1994.

[5] Stocker & Yale, Inc. www.stockeryale.com.

[6] P. Frost, S. Pollard, and M. Pilu. Framing aids to support document capture using digital cameras: a user study. HP Labs Internal Technical Report HPL-99-146, 1999.

[7] D. B. Judd. Gloss and glossiness. Am. Dyest. Rep., 26, 1937.

[8] J. Foley, A. van Dam, S. Feiner, and J. Hughes. Computer Graphics: Principles and Practice. Addison Wesley, 1990.

[9] S. B. Pollard and M. Pilu. Practical modelling of specularity from strobes in close-up imaging. HP Labs Internal Technical Report HPL-2000-150, 2000.

[10] S. B. Pollard and M. Pilu. Digital cameras. European Patent Application EP1233606, 2002.

[11] J.E. Adams. Design of practical color filter array interpolation algorithms for digital cameras. Proc. SPIE, Real Time Imaging II, 3028, 1997.

[12] A.A. Hunter and S.B. Pollard. Image mosaic data reconstruction. U.S. Patent Application 09/906,786, 2002.

[13] R.C. Gonzalez. Digital Image Processing, pages 196–197. Addison Wesley, 1992.

[14] R.M. Haralick. Monocular vision using inverse perspective projection geometry: analytic relations. In CVPR'89, pages 370–378, 1989.

[15] M.J. Taylor, A. Zappala, W.M. Newman, and C.R. Dance. Documents through cameras. Image and Vision Computing, 17(11):831–844, September 1999.

[16] M. Pilu. Extraction of illusory linear clues in perspectively skewed documents. In IEEE Computer Vision and Pattern Recognition, December 2001.

[17] Y. Nakano, Y. Shima, H. Fujisawa, J. Higashino, and M. Fujinawa. An algorithm for the skew normalization of document images. In ICPR'90, volume 2, pages 8–13, 1990.

[18] A. Hashizume, P.S. Yeh, and A. Rosenfeld. A method of detecting the orientation of aligned components. Pattern Recognition Letters, 4:125–132, 1986.

[19] S. Messelodi and C.M. Modena. Automatic identification and skew estimation of text lines in real scene images. Pattern Recognition, 32:791–810, 1999.

[20] J.M. Coughlan and A.L. Yuille. Manhattan world: compass direction from a single image by Bayesian inference. In International Conference on Computer Vision, pages 941–947, 1999.

[21] J.S. Kwon, H.K. Hong, and J.S. Choi. Obtaining a 3D orientation of projective textures using a morphological method. Pattern Recognition, 29:725–732, 1996.

[22] P. Clark and M. Mirmehdi. Location and recovery of text on oriented surfaces. In SPIE Conf. on Electronic Imaging 2000: Document Recognition and Retrieval VII, January 2000.

[23] P. Clark and M. Mirmehdi. Rectifying perspective views of text in 3D scenes using vanishing points. Pattern Recognition, 36(11):2673–2686, November 2003.

[24] M. Pilu. Perspective deskewing of documents from linear clues. HP Labs Internal Technical Report HPL-2001-6, January 2001.

[25] V. Bruce and P.R. Green. Visual Perception. 2nd edition, 1991.

[26] M. Pilu and S. Pollard. A light-weight text image processing method for handheld embedded cameras. In British Machine Vision Conference, September 2002.

[27] R. Haralick and L. Shapiro. Computer and Robot Vision. Addison Wesley, 1992.

[28] M.A. Fischler and R.C. Bolles. A RANSAC-based approach to model fitting and its application to finding cylinders in range data. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 637–643, 1981.

[29] R.I. Hartley. Theory and practice of projective rectification. International Journal of Computer Vision, 35(2):1–16, November 1999.

[30] M. Pilu. Page curl recovery with structured light. HP Labs Internal Technical Report HPL-98-174, October 1998.

[31] M. Pilu. Undoing page curl using applicable surfaces. In IEEE Computer Vision and Pattern Recognition, Kauai, HI, December 2001.

[32] Y.F. Wang and J.K. Aggarwal. An overview of geometric modeling using active sensing. IEEE Control Systems, 8(3), 1988.

[33] P.J. Besl and R.C. Jain. Three-dimensional object recognition. Computing Surveys, 17(1):75–145, March 1985.

[34] Xerox Corporation. Platenless book scanning system with a general imaging geometry. US Patent 5,760,925, June 1998.

[35] Xerox Corporation. Platenless book scanner with line buffering to compensate for image skew. US Patent 5,764,383, June 1998.

[36] Minolta Camera Kabushiki Kaisha. Document reading apparatus for detection of curvature in documents. US Patent 5,084,611, January 1992.

[37] H.N. Ng and L. Grimsdale. Computer graphics techniques for modeling cloth. IEEE Computer Graphics and Applications, September 1996.

[38] S.D. Ma and H. Lin. Optimal texture mapping. In Eurographics. Elsevier Science Publishing, 1988.

[39] M.P. Do Carmo. Differential Geometry of Curves and Surfaces. Prentice-Hall, 1976.