Segmentation of Handprinted Letter Strings using a Dynamic Programming Algorithm Thomas M. Breuel∗ Xerox PARC, Palo Alto, CA, USA
[email protected]
Abstract
Segmentation of handwritten input into individual characters is a crucial step in many connected handwriting recognition systems. This paper describes a segmentation algorithm for letters in Roman alphabets, curved pre-stroke cut (CPSC) segmentation. The CPSC algorithm evaluates a large set of curved cuts through the image of the input string using dynamic programming and selects a small “optimal” subset of cuts for segmentation. It usually generates pixel-accurate segmentations, with character subimages that are indistinguishable from characters written in isolation. At four times oversegmentation, segmentation points are missed with an undetectable frequency on real-world databases. The CPSC algorithm has been used as part of a high-performance handwriting recognition system.
1
Introduction
Off-line recognition of handwriting has numerous practical applications in areas such as banking, census taking, mail routing, and commerce. The recognition of isolated characters is addressed by many techniques in pattern recognition and neural networks. However, in many real-world situations, handwritten characters are touching, and a handwriting recognition system either needs to segment the input into characters prior to character recognition [5, 11, 3, 1, 14], or it needs to take an integrated segmentation/recognition approach, such as HMMs [4, 7], recognition in context [10], or attentional methods [6, 8]. Two frequently used segmentation algorithms are the valley point segmentation method [9] and the hit-and-deflect method [16]. Other methods rely on information about the production of handwriting (strokes, motor skills), perceptual models, and neural networks. A general survey of the problem of segmentation and further references can be found in [13].
This paper describes a segmentation algorithm that can be used as part of a connected handwriting recognition system that uses separate segmentation and character recognition phases. While a number of segmentation algorithms specifically aim at segmenting printed or handwritten cursive characters, the algorithm presented here specifically addresses the problem of segmenting touching handprinted characters. This is the most common class of unsegmented handwriting input on US Census forms and in many other forms reading applications, and it probably represents the commercially most important character segmentation problem for handwritten input.
The segmentation algorithm presented here takes advantage of an empirical observation about properties of the Roman alphabet. It relies on a very small number of tunable parameters, it is easy to implement, and it generates very high quality segmentations: character subimages generated by the algorithm are often indistinguishable from characters written in isolation. It can likely find applications in some other alphabetic writing systems and in OCR systems for printed matter. The algorithm has been used with great success in the handwriting recognition system described in [1], which has been evaluated as part of the Second Census OCR Conference [12].
2
Segmentation and Recognition
Figure 1 shows the processing steps in a typical handwriting recognition system that uses separate segmentation and character recognition steps. The purpose of the first processing step is to identify lines (cuts) that separate characters. This is what we will refer to as segmentation in this paper. We do not require the segmentation step to work perfectly; a certain amount of oversegmentation is tolerable and will be accounted for at later stages in the recognition process [1]. In fact, since handwriting and printed text contain inherent segmentation ambiguities that require language modeling to resolve (e.g., m vs. rn and d vs. cl), a system must perform oversegmentation if it does not want to pay the computational cost of backtracking.
Finding cuts that separate characters is not by itself sufficient to identify individual characters in the input. Instead, individual characters correspond to the part of the input that is found between two cuts. Character subimage extraction, the processing step following segmentation, performs this task. Because we allow oversegmentation, not all the cuts hypothesized by the segmenter actually separate characters. Some of the cuts may cut apart single letters into subletter units. In our example (Figure 1), the handwritten letter M, in addition to the correct interpretation as a single letter, could also be interpreted as the pair of letters nn. Since any region in the input can only have a single interpretation in the answer, we cannot choose those two interpretations simultaneously. This results in a number of constraints among possible interpretations of the set of character subimages.
These constraints can be captured in a structure called here the hypothesis graph, a structure analogous to the phone lattice in speech recognition systems. A complete and consistent interpretation of the input as a sequence of individual characters consists of a simple path starting at the leftmost node in the hypothesis graph and ending at the rightmost node. For recognition in [1], each of the character subimages in the hypothesis graph is classified using a multilayer perceptron (MLP). Given these classifications, a path through the hypothesis graph corresponds to an interpretation of the input as a string. The optimal interpretation of the input is finally chosen by taking into account the quality with which each subimage in the hypothesis graph matches its corresponding model, and by constraining paths through the hypothesis graph using a language model.
Figure 1. Processing steps in a handwritten text recognition system that uses separate segmentation and character recognition steps.
type of left edge                    characters
straight left edge                   BDEFHIKLMNPRU bhikmnpru
slanted left edge                    AVW vw
left concavity                       XY xy
curved left edge                     CGOQS acdegloqs
protruding left horizontal stroke    JTZ fjtz
Figure 2. Classification of handprinted letters according to their left edge.
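As an illustration of the hypothesis graph described in this section, the following is a minimal sketch, not the system of [1]. It assumes that cuts are numbered from left to right, that the subimage between cuts a and b is one edge of the graph, and that a hypothetical `classify(a, b)` callback returns a label and a cost for that subimage. The function name `best_interpretation` and the `max_span` limit are likewise assumptions made for the example; no language model is applied.

```python
from typing import Callable, Tuple


def best_interpretation(
    n_cuts: int,
    classify: Callable[[int, int], Tuple[str, float]],
    max_span: int = 4,
) -> Tuple[str, float]:
    """Return the lowest-cost string and its cost.

    Nodes 0 .. n_cuts-1 are cuts ordered left to right. An edge (a, b) with
    a < b stands for the character subimage between cut a and cut b.
    """
    inf = float("inf")
    best_cost = [inf] * n_cuts
    best_text = [""] * n_cuts
    best_cost[0] = 0.0  # the leftmost node is the start of every path
    for b in range(1, n_cuts):
        # Only subimages spanning at most max_span cut intervals are tried.
        for a in range(max(0, b - max_span), b):
            if best_cost[a] == inf:
                continue
            label, cost = classify(a, b)
            total = best_cost[a] + cost
            if total < best_cost[b]:
                best_cost[b] = total
                best_text[b] = best_text[a] + label
    return best_text[-1], best_cost[-1]


# Toy version of the M vs. nn ambiguity: three cuts, two readings.
toy = {(0, 1): ("n", 1.0), (1, 2): ("n", 1.0), (0, 2): ("M", 0.5)}
print(best_interpretation(3, lambda a, b: toy[(a, b)]))
```

Because every path runs from the leftmost to the rightmost node, each region of the input is covered by exactly one edge, which is the consistency constraint stated above.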
3
Handwriting Styles
Handwriting occurs in a variety of styles. The two broadest classes are handprinted handwriting and cursive handwriting. In a cursive handwriting style, characters are deliberately linked together, while in a handprinted style, characters are generally produced as more-or-less distinct units, but may touch accidentally. In the task for which the segmentation method described in this paper was developed [12], a handprinted style predominated, that is, a style in which individual handwritten characters are shaped in a way similar to a sans-serif printed font. Nevertheless, the great majority of fields in the task contained multiple touching characters, and this is probably fairly typical of this kind of style.
Cursive handwriting is often based on the copper-plate style. In that style, characters are linked regularly and predictably using festoon-like strokes. Minima of the upper contour of a cursive input string (valley points) will therefore correspond to segmentation points between characters (in a few cases, it is necessary to split some horizontal stretches between characters [9]). Of course, there are valley points that do not correspond to cuts between characters, but since we are allowing for oversegmentation, this does not present a problem. Figure 3a shows the valley points for a typical cursive input string. We see that between each pair of letters there is a cut, in addition to a few spurious cuts.
In handprinting, letters are linked much more erratically. The string in Figure 3b is a good example. Note that the individual letters are well-formed, even though they are linked in non-standard ways. If we try to apply valley point segmentation to this kind of input, we find that many of the cuts are in the wrong place and, worse, many essential cut points are missed entirely, almost certainly resulting in recognition errors.
Another problem with the valley point segmentation method is that it only indicates a single cutting point on the upper contour of a connected component. In the presence of multiply connected components or kerning (moving one character under another, as in Co), this leads to ambiguities as to how the component should be cut.
Figure 3. Examples of valley point segmentation applied to different writing styles: (a) cursive, (b) touching handprint. Valley points are not good segmentation points for a touching handprint style (b).
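For comparison with the method developed below, valley points can be located with a few lines of code. This is a minimal sketch under stated assumptions, not the method of [9]: the image is a list of rows with row 0 at the top and nonzero values marking ink, so a minimum of the upper contour as seen on the page is a local maximum of the row index. The function names are chosen for illustration only, and the horizontal-stretch splitting mentioned above is not attempted.

```python
from typing import List


def upper_contour(image: List[List[int]]) -> List[int]:
    """Row index of the topmost ink pixel in each column.

    Columns without ink get the sentinel value len(image).
    """
    height, width = len(image), len(image[0])
    contour = []
    for x in range(width):
        top = height
        for y in range(height):
            if image[y][x]:
                top = y
                break
        contour.append(top)
    return contour


def valley_points(contour: List[int]) -> List[int]:
    """Columns where the upper contour dips (local maxima of the row index).

    A flat-bottomed valley is reported once, at the middle of its plateau.
    """
    valleys = []
    n = len(contour)
    x = 1
    while x < n - 1:
        if contour[x] > contour[x - 1]:
            end = x
            while end + 1 < n and contour[end + 1] == contour[x]:
                end += 1
            if end + 1 < n and contour[end + 1] < contour[x]:
                valleys.append((x + end) // 2)
            x = end + 1
        else:
            x += 1
    return valleys


# Two tall strokes joined by a low connecting stroke.
image = [
    [1, 0, 0, 0, 1],
    [1, 1, 0, 1, 1],
    [1, 1, 1, 1, 1],
]
print(valley_points(upper_contour(image)))
```

Each valley yields only an x-coordinate on the upper contour, which illustrates the ambiguity noted above: the code says nothing about how the cut should continue below that point.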
4
Finding Cuts
The segmentation algorithm described in this paper begins with a connected component analysis of the input. Cuts are always hypothesized between different connected components whose vertical projections do not overlap significantly. After this preprocessing step, the main task is to find cuts that divide connected components into their individual characters.
The basic idea behind the segmentation method described in this paper is to cut apart an input string just to the left of every near-vertical stroke. We call this pre-stroke segmentation. At first sight, it is perhaps surprising that this works. Figure 2 suggests why. The largest class of handprinted characters in the Roman alphabet, upper case or lower case, has a left edge that consists of a single vertical line (e.g., B), a single slanted line (e.g., A), a convex curve (e.g., C), or two slanted lines (e.g., X). Only 7 out of 52 letters deviate from this pattern. These are the letters with horizontal strokes that protrude to the left, viz. J T Z f j t z. If these letters are cut before their near-vertical stroke, they leave behind a short, disconnected horizontal line segment that is assigned to the previous character subimage.
There are two ways of coping with this: we can introduce a special case into the segmentation algorithm that tries to recognize these cases, or we can make the recognition algorithm robust to the occurrence of these events. In the system described in [1], the latter approach was taken. This phenomenon does not occur with great frequency in handprinted text anyway, since all these letters except T and t are comparatively rare. Furthermore, such letters appear to be connected to the preceding letter less frequently than other letters, meaning that a correct cut will often have been found (in addition to the spurious cut) as part of the connected components preprocessing step.
A simple application of these ideas might lead us to identify cut points using this technique and attempt to divide the image using straight cuts. In practice, this does not work very well, both because characters are not separated cleanly and because the vertical projections of many character pairs actually overlap.
Figure 4. The value of the cost function for all x-coordinates.
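The preprocessing step at the start of this section can be sketched as follows. The text does not quantify when vertical projections overlap "significantly", so the `max_overlap` threshold, the function name, and the representation of each connected component by its horizontal extent are all assumptions made for this example.

```python
from typing import List, Tuple


def inter_component_cuts(
    extents: List[Tuple[int, int]], max_overlap: int = 2
) -> List[int]:
    """Hypothesize straight cuts between neighbouring connected components.

    extents holds one inclusive (x_min, x_max) pair per component. Only
    components that are adjacent in left-edge order are compared, which is
    a simplification.
    """
    ordered = sorted(extents)
    cuts = []
    for (_, right0), (left1, _) in zip(ordered, ordered[1:]):
        # Positive: number of shared columns. Zero or negative: a gap.
        overlap = right0 - left1 + 1
        if overlap <= max_overlap:
            cuts.append((right0 + left1) // 2)
    return cuts


# Four components; only the first pair is separated by a gap.
print(inter_component_cuts([(0, 10), (12, 20), (18, 30), (25, 40)]))
```

Components that overlap by more than the threshold, such as kerned pairs, are left for the curved cuts of the next section.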
5
Curved Pre-Stroke Cuts
To allow touching and kerned characters to be separated more cleanly, we allow the cuts to be curved. The idea of using curved lines for separating handwritten characters is not new (see, for example, the hit-and-deflect strategy [15, 16]). However, the approach taken in this paper differs both in how the location and in how the shape of cuts are identified. The basic idea is to use a dynamic programming algorithm to find a globally optimal set (according to a certain evaluation function to be specified below) of cuts through the input string. The set of cuts and their precise shape are found simultaneously.
Consider a path consisting of a sequence of pixels $(x_i, y_i)$, $i = 0, \ldots, h$; $x_i, y_i \in \mathbb{N}$, in a $w \times h$ image $I$ of a connected component. The paths we will consider as cuts satisfy $y_i = i$. We assign a cost to each path as follows:
\[
c = \sum_{i=1}^{h} \left[ c_s(x_i - x_{i-1}) + c_i(x_i, y_i; I) \right]
\]
The costs $c_s$ and $c_i$ should be chosen so as to make a reasonable tradeoff between keeping paths simple, making them conform to the left edge of a vertical stroke, and keeping them from cutting through characters. In the actual system, the cost $c_s(\Delta x)$ was chosen to be 0 for $\Delta x = 0$, 1 for $|\Delta x| = 1$, and $\infty$ for all other step sizes. Clearly, this limits the set of paths with finite cost to paths that are contained inside a cone within an angle of $\pi/4$ of the vertical. This was found to be sufficient for most segmentations. In addition, allowing paths that are more sloped significantly increases the cost of the computation.
In general, $c_i$ should be smallest at a left vertical edge of a stroke, i.e., where $\Delta I / \Delta x > 0$, which keeps paths close to the left edges of strokes. $c_i$ should be largest at points inside a stroke, to discourage cuts from going through strokes. It should be intermediate for background pixels. In the actual system, these values are chosen as -5 at left edges, 2 inside strokes, and 0 for background pixels, based on experimentation.
To find a set of possible cuts, we proceed as follows. First, we compute the centroid $(x_c, y_c)$ of the connected component that we are trying to segment. Then, for all $x = 0, \ldots, w - 1$, we compute the cost $c(x)$ of the optimal path passing through the point $(x, y_c)$. Finally, the set of cuts is the set of paths corresponding to points $x$ where $c(x)$ is a local minimum.
It is important to appreciate that we are considering the set of cuts passing through each point $(x, y_c)$ and are looking for local minima of the cost function $c(x)$. The line $y = y_c$ passes through the region of a connected component where the characters tend to be most strongly connected to one another. The effect is that the algorithm is forced to choose the locally optimal cut hypothesis even in difficult-to-segment regions of a connected component. If we considered the set of cuts passing through, say, each point $(x, 0)$, difficult-to-segment regions of a connected component would simply be avoided by cuts; the cuts would “navigate around” such regions. This is a crucial difference from methods like the hit-and-deflect approach [15, 16].
An application of this method is shown in Figure 4. Curved black lines running through the string (FLIGHT ENGINEER) are cuts found by the method. The graph at the bottom shows the value of the cost function at each x-coordinate and the location of local minima.
Figure 5. Examples of pre-stroke cuts applied to cursive and handprinted writing. Cuts found on touching handprint, the predominant style in many forms reading applications, result in high recognition rates.
Figure 6. The final set of aligned subimages chosen by the recognizer for the handprinted string in Figure 5(a) (actual output from the recognizer).
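The procedure of this section can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the implementation used in [1]. It uses a plain row-by-row dynamic program, which is possible because every path advances exactly one row per step. A "left edge" is taken to be a background pixel whose right neighbour is ink, which is one possible reading of the gradient condition. The pixel cost of the first row is included in the sum, local minima are only searched at interior columns, and the image is a list of rows with nonzero values marking ink. All function names are hypothetical.

```python
from typing import List, Tuple

INF = float("inf")


def step_cost(dx: int) -> float:
    """c_s: 0 for a vertical step, 1 for a diagonal step, infinite otherwise."""
    return {0: 0.0, 1: 1.0, -1: 1.0}.get(dx, INF)


def pixel_cost(image: List[List[int]], x: int, y: int) -> float:
    """c_i with the values quoted above: -5 (left edge), 2 (ink), 0 (background)."""
    if image[y][x]:
        return 2.0
    if x + 1 < len(image[0]) and image[y][x + 1]:
        return -5.0  # assumption: background pixel directly left of ink
    return 0.0


def sweep(image: List[List[int]], rows: List[int]) -> Tuple[List[float], List[List[int]]]:
    """Best cost and path (one x per visited row) ending at each x of rows[-1].

    Whole paths are stored for brevity; back-pointers would use less memory.
    """
    width = len(image[0])
    cost = [pixel_cost(image, x, rows[0]) for x in range(width)]
    paths = [[x] for x in range(width)]
    for y in rows[1:]:
        new_cost, new_paths = [], []
        for x in range(width):
            best, best_prev = INF, x
            for dx in (-1, 0, 1):
                px = x - dx  # predecessor column in the previous row
                if 0 <= px < width and cost[px] + step_cost(dx) < best:
                    best, best_prev = cost[px] + step_cost(dx), px
            new_cost.append(best + pixel_cost(image, x, y))
            new_paths.append(paths[best_prev] + [x])
        cost, paths = new_cost, new_paths
    return cost, paths


def curved_cuts(image: List[List[int]], y_c: int) -> Tuple[List[float], List[List[int]]]:
    """Return c(x) for every column and the cuts at its interior local minima."""
    height, width = len(image), len(image[0])
    down, down_paths = sweep(image, list(range(0, y_c + 1)))
    up, up_paths = sweep(image, list(range(height - 1, y_c - 1, -1)))
    # The pixel at (x, y_c) is counted by both sweeps, so subtract it once.
    total = [down[x] + up[x] - pixel_cost(image, x, y_c) for x in range(width)]
    cuts = []
    for x in range(1, width - 1):
        if total[x] < total[x - 1] and total[x] <= total[x + 1]:
            # Join the two halves; drop the duplicate entry for row y_c.
            cuts.append(down_paths[x] + up_paths[x][-2::-1])
    return total, cuts


# A single vertical stroke in column 4 of a 3-row image.
stroke = [[0, 0, 0, 0, 1, 0, 0] for _ in range(3)]
costs, cuts = curved_cuts(stroke, y_c=1)
print(costs)
print(cuts)
```

In this toy input the only cut runs straight down column 3, immediately to the left of the stroke, which is the pre-stroke behaviour that the cost values are meant to encourage.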
6
Dynamic Programming Algorithm
What remains is the question of how to compute $c(x)$ efficiently. For this, we use a dynamic programming algorithm reminiscent of the brushfire algorithm for computing distance transforms. The cost $c(x)$ is computed in two stages. First, for each $x$, we compute the costs $c_l(x)$ and optimal paths starting at $(x, 0)$ and ending at $y = y_c$. Then, we repeat the computation to compute the costs $c_u(x)$ and optimal paths starting at $(x, h)$ and ending at $y = y_c$. Finally, the cost $c(x)$ is given by the sum $c_l(x) + c_u(x)$, and the optimal path through $(x, y_c)$ is given as the concatenation of the optimal paths corresponding to $c_l(x)$ and $c_u(x)$.
Pseudo-code describing the computation of the set of optimal paths starting at each $(x, y_c)$ is given in Figure 7. We describe only the computation of the cost of paths between $y_c$ and $h$; the computation of the lower half is analogous. The algorithm begins by creating a cost array and a source array, both with the same size as the image, padded by one pixel in the y direction. Initially, all costs are set to infinity and all source points are set to undefined. During the execution of the algorithm, the value of the cost array at point $(i, j)$ is either the special value infinity, if the point has not been reached yet, or the cost of the best path to pixel $(i, j)$ found so far. Similarly, the value of the source array is either the special value undefined, if the pixel has not been reached yet, or the immediate predecessor of the pixel $(i, j)$ on the best path to $(i, j)$ from some point with $y = h$.
Next, the cost for the pixels on the line from $(0, h)$ to $(w, h)$ is set to zero and those pixels are added to a FIFO queue. Until the queue is empty, pixel coordinates are taken from its front. The set of neighboring pixels that could be part of a valid path and their corresponding costs are computed.
If the cost of reaching one of those neighboring pixels via a path through the current pixel is lower than any previously known path to that neighboring pixel, the cost and source arrays are updated, and the coordinates of the neighboring pixel are added to the back of the queue. The algorithm finishes when the queue of pixel coordinates becomes empty. All pixels that can be part of a cut will have finite and well-defined entries in their cost and
for i,j in cost do
    cost[i,j] := infinity
    source[i,j] := undefined
end for
for i from 0 below w do
    add point (i,h) to queue
    cost[i,h] := 0
end for
while not empty(queue) do
    take point (i,j) from queue
    if j>=y_c then
        for delta in (-1,0,1) do
            new_cost := cs(delta) + ci(i+delta,j-1,image)
            if new_cost