Robust Text Detection from Binarized Document Images

Oleg Okun, Yu Yan and Matti Pietikäinen

Machine Vision Group, Infotech Oulu and Department of Electrical Engineering
P.O. Box 4500, FIN-90014 University of Oulu, Finland

Abstract

Many document images are rich in color and have complex backgrounds. To detect text in them, a standard approach utilizes both color and binary information. This often leads to time-consuming processing and requires many parameters to be tuned. In contrast, we propose a new method for text detection that uses a binary image alone. The main virtues of our method include the detection of both normal and inverted text and robustness to various font types, styles and sizes as well as to small skew angles, combined with a moderate number of free parameters.

1. Introduction

Text detection is an important task in document image analysis, since text is the main source of information in various types of media. The accuracy of text detection greatly influences the performance of information retrieval and OCR. Text can be found in various documents, which may be categorized into two groups. Documents of the first group are characterized by strict and precise rules on font style and size, text and background colors, interline spacing, etc. Articles in technical journals are a typical example of such documents. The paper [3] describes a comprehensive methodology for their analysis, based on estimating free parameters on a training set of representative images and using the obtained estimates during testing. In contrast, documents of the second group, such as advertisements, are typically composed with few restrictions on character font, orientation of text lines or background color. This means that parameter values determined during training cannot be optimal, because of the difficulty of selecting a training set representative of the whole population of images. In this paper, we concentrate only on documents of the second group, because they have not been studied extensively.

A document can be captured as a color, grayscale or binary image. There have been many efforts to detect text directly from color or grayscale images so as not to lose useful information [1, 5, 7]. In many cases, however, color/grayscale analysis is combined with the analysis of a binarized image, because the latter is simpler. In this paper, our goal is to rely on a binary image alone for text detection, in order to avoid sophisticated and often time-consuming grayscale/color analysis and the numerous parameters accompanying it. The task is to deal with both horizontal and vertical text printed on both white and black backgrounds. We assume that the quality of the input binary images is reasonably good, i.e., the text is readable, though characters can be broken, merged or degraded, and the number of connected components can reach tens of thousands due to binarization noise. This means that standard methods intended for text detection from high-quality binary images are not suitable in this case.

2. Properties of text characters

We relied on the following well-known properties:

Property 1. Characters are normally arranged either horizontally or vertically.

Property 2. Given a fixed font type, style and size, the heights of characters belonging to the same group (groups include ascenders, descenders, capitals and lowercase letters) are approximately constant.

Property 3. Characters form regular (or weakly periodic) structures in both the horizontal and vertical directions.

Property 4. Given a fixed font type, style and size, characters are composed of strokes of approximately constant width.

In this paper, we interpret these properties in a unique way, which relaxes the requirements on parameters in the sense that the parameters become less sensitive to the various font styles and sizes of characters. Although these properties are somewhat restrictive, we believe that they cover a vast majority of possible cases.

Slanted or curved text lines, though present in modern magazines and advertisements, are not frequent and are therefore not considered here.

3. Method description

An input image $I$ is binary, and text can be black on a white background or white on a black background within the same image. According to Property 1, we assume either horizontal or vertical text. The origin of coordinates is at the upper-left image corner, and the X-axis (Y-axis) is directed to the right (downwards) from the origin. Text of both orientations is first detected on the white (normal) background, followed by text detection of both orientations on the black (inverse) background. Fig. 1 shows a flow chart of the whole method.

Figure 1. Flow chart of our method: image filtering, 'black' connected component detection, horizontal and vertical text detection, then 'white' connected component detection followed again by horizontal and vertical text detection.

3.1. Image filtering

This operation consists of order-statistic filtering, followed by the removal of isolated black and white pixels. The order-statistic filtering replaces each pixel of $I$ by a fixed order statistic of its neighbors in a 3x3 neighborhood. The obtained image is ANDed with $I$, and the remaining isolated black and white pixels are removed from the result. As a result, we reduce the number of noisy pixels while preserving character shapes as much as possible.
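As a concrete illustration, the following is a minimal sketch of this filtering step in Python (NumPy/SciPy), assuming foreground pixels are 1; since the exact order statistic is not specified above, the median (rank 4 of 9 in a 3x3 window) is used as an assumption.

import numpy as np
from scipy.ndimage import rank_filter

def filter_image(img):
    # img: 2D uint8 array with values in {0, 1}; 1 is assumed foreground.
    # Order-statistic filtering: replace each pixel by the median
    # (rank 4 of 9) of its 3x3 neighborhood.
    filtered = rank_filter(img, rank=4, size=3)
    # AND the filtered image with the original, as described above.
    anded = filtered & img
    # Remove remaining isolated black and white pixels: a pixel whose
    # eight neighbors all disagree with it takes the neighbors' value.
    out = anded.copy()
    for r in range(1, out.shape[0] - 1):
        for c in range(1, out.shape[1] - 1):
            neighbors = anded[r-1:r+2, c-1:c+2].sum() - anded[r, c]
            if anded[r, c] == 1 and neighbors == 0:
                out[r, c] = 0   # isolated foreground pixel
            elif anded[r, c] == 0 and neighbors == 8:
                out[r, c] = 1   # isolated background pixel
    return out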



3.2. Connected component analysis

For each 'black' component detected, the parameters $(x, y, w, h)$ of its bounding box are determined, where $(x, y)$ are the coordinates of the upper-left corner of the bounding box and $w$ and $h$ are its width and height, respectively.

3.3. Horizontal text detection

Figure 2. Flow chart of horizontal text detection: look-up table creation, initial line candidate generation, line candidate partitioning, search for missed characters.

The flow chart for this step is shown in Fig. 2. Processing starts by creating three 1D look-up tables, denoted $T_{start}$, $T_{end}$ and $T_{mark}$. The first two tables have length equal to the height of $I$: for each row of $I$, $T_{start}$ and $T_{end}$ store the indices of the components whose bounding boxes start and end at this row, respectively. $T_{mark}$ has $N$ elements, where $N$ is the number of connected components; its purpose is to mark components already included in horizontal text lines. All elements of $T_{mark}$ are initially set to 0.
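The table construction can be sketched as follows, assuming the bounding boxes of Section 3.2 are available as a list; the names $T_{start}$, $T_{end}$ and $T_{mark}$ follow the notation above.

def build_tables(boxes, image_height):
    # boxes: list of (x, y, w, h) tuples, one per connected component.
    T_start = [[] for _ in range(image_height)]  # components whose boxes start at each row
    T_end = [[] for _ in range(image_height)]    # components whose boxes end at each row
    T_mark = [0] * len(boxes)                    # set to 1 once a component joins a line
    for i, (x, y, w, h) in enumerate(boxes):
        T_start[y].append(i)
        T_end[y + h - 1].append(i)
    return T_start, T_end, T_mark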

Once the look-up tables are created, initial line candidates are formed by using them. Every component satisfying two conditions is considered to be a seed for line candidate generation:

Condition 1. It has not yet been included in any of the lines.

Condition 2. Its height is at least $h_{min}$ pixels.

Other components are included in the current line only if they meet the following conditions (in addition to Conditions 1 and 2), resulting from Property 2:

Condition 3. The height $h$ of the component's bounding box satisfies Eq. 1 [6]:

$h_{seed}/2 \leq h \leq 2\,h_{seed}$,   (1)

where $h_{seed}$ is the height of the bounding box of the seed component.

Condition 4. The component's bounding box should either (1) start at a row between $y_{seed} - \delta$ and $y_{seed} + \delta$, or (2) end at a row between $y'_{seed} - \delta$ and $y'_{seed} + \delta$, where $y_{seed}$ and $y'_{seed}$ are the y-coordinates of the upper-left and lower-right corners of the seed's bounding box, so that $y'_{seed} = y_{seed} + h_{seed} - 1$, and $\delta$ is an adjustable tolerance (see Table 1).
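Conditions 2-4 can be expressed compactly as below (Condition 1 amounts to checking $T_{mark}$); the bounds of Eq. 1 and the role of $\delta$ follow the reconstruction above, so the exact constants should be treated as assumptions.

def may_join_line(comp_box, seed_box, delta=3, h_min=8):
    # comp_box, seed_box: (x, y, w, h) bounding boxes.
    x, y, w, h = comp_box
    xs, ys, ws, hs = seed_box
    if h < h_min:                      # Condition 2
        return False
    if not (hs / 2 <= h <= 2 * hs):    # Condition 3 (Eq. 1)
        return False
    ys_end = ys + hs - 1               # row where the seed's box ends
    starts_near = abs(y - ys) <= delta              # Condition 4, case (1)
    ends_near = abs((y + h - 1) - ys_end) <= delta  # Condition 4, case (2)
    return starts_near or ends_near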

 

When verifying Conditions 3 and 4, using $T_{start}$ and $T_{end}$ provides instant access to the required components. Entries in $T_{mark}$ for all components included in line candidates are set to 1. A special list is associated with each candidate, storing the indices of the components assigned to it.

Since initial line candidates may span two columns of text or several regions, the candidates are checked for partitioning into smaller lines. Candidates containing fewer than $n_{min}$ components are dismissed from further analysis, and the corresponding entries in $T_{mark}$ are restored to 0. For the other candidates, the components constituting each candidate are arranged in reading order, and the distance between each pair of spatially adjacent components is computed according to Eq. 2:

$d_{i,i+1} = x_{i+1} - x'_{i}$,   (2)

where $x_{i+1}$ and $x'_{i}$ are the x-coordinates of the upper-left corner of the $(i+1)$-th component and of the lower-right corner of the $i$-th component, respectively.

Property 3 is applied to locate the places of partition. According to it, the intercharacter distances between characters belonging to the same text line should not differ significantly from each other. This implies that an unusually large $d_{i,i+1}$ points to a line cut. When $d_{i,i+1} > c \cdot h_{line}$, where $h_{line}$ is the height of the line and $c$ is a fixed multiplier (Table 1), the line is partitioned into two smaller lines: the $i$-th component ends one line, while the $(i+1)$-th component starts the other. The parameters of the bounding boxes of both new lines are recomputed. Once again, if (as a result of partitioning) some lines become too short, they are dismissed and their components become unattached. This procedure is repeated until all partitions are identified and carried out.
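A sketch of one partitioning pass, assuming the components of a candidate are already sorted in reading order and are non-empty; the multiplier $c$ is the fixed parameter from Table 1.

def partition_line(boxes, h_line, c=2):
    # boxes: bounding boxes (x, y, w, h) of one candidate, sorted left to right.
    lines, current = [], [boxes[0]]
    for prev, nxt in zip(boxes, boxes[1:]):
        d = nxt[0] - (prev[0] + prev[2] - 1)   # Eq. 2: x_{i+1} - x'_i
        if d > c * h_line:                     # unusually large gap: cut here
            lines.append(current)
            current = [nxt]
        else:
            current.append(nxt)
    lines.append(current)
    return lines   # each element is a new, smaller line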

Simultaneously with line partitioning, the dominant stroke width of characters is computed over all the components assigned to each line. For a stroke width to be dominant, it must correspond to that peak of the stroke-width histogram (the two highest peaks are considered) for which a feature called the area of coverage is maximal. This feature is computed by Eq. 3:

$AC(w_p) = \sum_{w=1}^{w_{max}} H(w)\,[u(w - w_p + \Delta w) - u(w - w_p - \Delta w)]$,   (3)

where $u(\cdot)$ is the step function, $H$ is the stroke-width histogram, and $w$, $w_{max}$ and $w_p$ represent the stroke width, the maximal stroke width and the stroke width associated with a particular peak of the histogram, respectively.
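The selection of the dominant stroke width can be sketched as follows; the histogram $H$ and the half-window $\Delta w$ are as in Eq. 3, with a windowed sum playing the role of the step-function difference.

import numpy as np

def dominant_stroke_width(hist, delta_w=3):
    # hist: 1D array, hist[w] = number of stroke pixels with stroke width w.
    peaks = np.argsort(hist)[-2:]          # the two highest peaks
    def coverage(w_p):                     # area of coverage, Eq. 3
        lo = max(0, w_p - delta_w)
        hi = min(len(hist) - 1, w_p + delta_w)
        return hist[lo:hi + 1].sum()
    return int(max(peaks, key=coverage))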



A set of lines generated so far may not include all characters in $I$. These characters, called missed, can lie both inside and outside a line. Outside each line, rectangular regions on both sides (left and right) of it form the regions of interest. Each region has height equal to the height of the current line and width equal to the maximal width of the bounding box among all components in this line. These parameters are not fixed but change after each inclusion of a component in the line. Line expansion stops when no component to be included is found.

To recover missed characters, all components still unassigned to lines and located in the regions of interest are tested. Both the vertical overlap of a line by a component and the vertical overlap of a component by a line should be larger than a parameter $T_{overlap}$. Property 4 and the area of coverage $AC$ then help to decide whether a particular component will be included in a line or not: $AC$ indicates how large the area occupied by strokes with widths around the dominant stroke width is. From experiments with characters of a fixed font type, style and size, we found that strokes with widths within $[w_d - \Delta w, w_d + \Delta w]$, where $w_d$ is the dominant stroke width, are present in every character of such a font, though the area of coverage can vary dramatically from character to character. We therefore require that a component that is a candidate for inclusion in a line must have at least one stroke with width within $[w_d - \Delta w, w_d + \Delta w]$.

After each inclusion, the parameters of the line's bounding box are adjusted accordingly, and the list containing the indices of the components attached to the line is updated. To prevent unlimited growth of the line height, the ratio $h_{new}/h_{old}$ is measured after each line expansion, where $h_{old}$ and $h_{new}$ are the line heights before any expansion and after the last expansion, respectively. If this ratio is less than or equal to $T_{growth}$, the line expansion is permitted; otherwise it is not.
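The tests used during recovery can be summarized as below: the mutual vertical overlap against $T_{overlap}$, the stroke-width test of Property 4, and the bound on line growth; the parameter values follow the reconstruction of Table 1 and are assumptions.

def accept_missed(comp_box, line_box, comp_widths, w_d,
                  t_overlap=0.5, delta_w=3):
    # comp_widths: stroke widths occurring in the component; w_d: dominant width.
    _, cy, _, ch = comp_box
    _, ly, _, lh = line_box
    overlap = max(0, min(cy + ch, ly + lh) - max(cy, ly))
    if overlap < t_overlap * ch or overlap < t_overlap * lh:
        return False   # mutual vertical overlap below T_overlap
    # Property 4: at least one stroke width near the dominant one.
    return any(w_d - delta_w <= w <= w_d + delta_w for w in comp_widths)

def growth_allowed(h_before, h_after, t_growth=2.0):
    # Prevent unlimited growth of the line height.
    return h_after / h_before <= t_growth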

3.4. Vertical text detection

The detection of vertically oriented text is similar to that of horizontally oriented text, except that it does not analyze components already assigned to horizontal lines. The main changes stem from the fact that features related to 'height' are interchanged with those related to 'width'. Nevertheless, all parameters and operations introduced in Section 3.3 remain unchanged.

3.5. White text detection on black background

This step starts with connected component analysis assuming white components. The other operations are essentially the same as those described in Sections 3.3 and 3.4.

4. Experiments

To test our method, we collected 27 color images from magazines, captured with an HP ScanJet 5370C scanner, in addition to 15 binary images from the UW-I database [4]. The color images, containing text of various colors, font sizes and orientations within the same document, were binarized with different global thresholds. A typical binarized image is shown in Fig. 3 (left).

Table 2. Test statistics

Characters (total): 83,863

               count    average rate    std of rate
missed           850        1.1%           1.3%
false alarm    2,143        3.7%           5.1%

Figure 3. Binarized image (left) and text detection results (right)

The parameters used in our method are divided into fixed and adjustable ones. The list of parameters together with their values is given in Table 1. The values of the fixed parameters were taken from other sources [2, 6, 7], while the adjustable parameters were chosen based on the analysis of various cases occurring in images. In all tests, the adjustable parameters were kept constant at the values given in Table 1.

Table 1. Fixed and adjustable parameters and their values

Fixed parameters:       $h_{min}$ = 8 pixels;  $c$ = 2;  $n_{min}$ = 3
Adjustable parameters:  $\delta$ = 3 pixels;  $T_{growth}$ = 2;  $\Delta w$ = 3-5 pixels;  $T_{overlap}$ = 50%

Text detection results for one image are given in Fig. 3 (right). The contents of the bounding boxes of detected characters were copied from the original image, whereas non-text data are displayed in gray. Table 2 accumulates the statistics over all test images. The labels 'total', 'missed' and 'false alarm' stand for the total number of characters in all images, the number of missed characters, and the number of false positives, respectively. The average rates of missed characters and false alarms and their standard deviations are listed under 'average rate' and 'std of rate'. Missed characters were mainly attributed to drop caps, too-short lines (text in some table cells, mathematical formulas and plot annotations), character degradation and heavy noise. False positives, when they occurred, were caused by the similarity of non-text objects to text (texture-like noise, repeating patterns in pictures, bar codes). These causes confuse many other methods as well.

Although it is difficult to compare our method with others because of the different test sets, we consider the accuracy attained by our method to be high. For example, the accuracy reported in [7] (92%) is comparable with ours (95% in the worst case). Their method is multiresolution-based and processes from 3 to 9 images to detect text of various font sizes, whereas ours needs just one image; that is, our method is faster. In addition, we learned from the experiments that our method can tolerate small skew angles (1-2 degrees) without the need for skew correction.

5. Conclusion

We presented a method for text detection in binary document images. Although it relies on heuristics, as most other methods do, the results do not depend significantly on the values of the chosen parameters. Moreover, the number of free parameters is moderate compared to many other methods. Experiments with real images demonstrated encouraging results.

References

[1] A. Jain and B. Yu. Automatic text location in images and video frames. Pattern Recognition, 31(12):2055-2076, 1998.
[2] R. Kasturi and M. Trivedi. Image Analysis Applications. New York: Marcel Dekker, 1990.
[3] J. Liang, I. Phillips, and R. Haralick. An optimization methodology for document structure extraction on Latin character documents. IEEE Trans. on Pattern Analysis and Machine Intelligence, 23(7):719-734, 2001.
[4] I. Phillips, S. Chen, and R. Haralick. CD-ROM document database standard. In Proc. of the 2nd Int. Conf. on Document Analysis and Recognition, Tsukuba, Japan, pages 478-483, 1993.
[5] K. Sobottka, H. Kronenberg, T. Perroud, and H. Bunke. Text extraction from colored book and journal covers. Int. Journal on Document Analysis and Recognition, 2(4):163-176, 2000.
[6] I. Witten, A. Moffat, and T. Bell. Managing Gigabytes. New York: Van Nostrand, 1994.
[7] V. Wu, R. Manmatha, and E. Riseman. TextFinder: an automatic system to detect and recognize text in images. IEEE Trans. on Pattern Analysis and Machine Intelligence, 21(11):1224-1229, 1999.
