Document layout structure extraction using bounding boxes of different entities
J. Liang, J. Ha, R. Rogers, R. M. Haralick
Department of Electrical Engineering, Box 352500, University of Washington, Seattle, WA 98195
I. T. Phillips
Department of Computer Science, Seattle University, Seattle, WA 98112
Abstract

This paper presents an efficient and accurate technique for document page layout structure extraction and classification by analyzing the spatial configuration of the bounding boxes of different entities on a given image. Text, table, and non-text structures are detected on document images. Text-lines and words are extracted, and the tabular structure is further decomposed into row and column items. Finally, the document layout hierarchy is produced from these extracted entities. We develop a performance metric for document layout analysis by finding the correspondences between detected bounding boxes and the ground-truth. We evaluate our algorithms on 1600 images from the UW-III Document Image Database, and the quantitative performance measures, in terms of the rates of correct, miss, false, merging, splitting, and spurious detections, are reported. We describe a method for determining the optimal algorithm tuning parameters given the ground-truth. The results show that the average performance of the algorithms is improved by 26.6% after the optimization process.
Key words: document layout structure, bounding box, performance, optimization.
Preprint submitted to Elsevier Preprint, 19 December 1997.

1 Introduction

The goal of document understanding is to convert existing paper documents into a machine-readable form. It involves extracting the geometric page layout, recognizing the text content through an Optical Character Recognition (OCR) system, determining the logical structure, and formatting the data and information of the document in a way suitable for use by a word-processing system or by an information-retrieval system.

Let C designate the set of content types (text-block, text-line, word, table, equation, drawing, halftone, handwriting, etc.). Let D designate the set of dividers (spacing, ruling, etc.). We denote by A the set of non-overlapping homogeneous polygonal areas on the document image. Each polygonal area A ∈ A consists of an ordered pair (τ, I), where τ ∈ C specifies the content type and I is the area. A polygon is homogeneous if all of its area is of one type and there is a standard reading order for the content within the area. A layout structure is a hierarchy of polygonal areas,

    Λ = (A, D)    (1)
which specifies the geometry of the polygons, the content types of the polygons, and the spatial relations of these polygons. Each polygonal area in the hierarchy is also called an entity or object in this paper. We use a bounding box or a list of rectangular bounding boxes as the geometric representation of a polygon.

The layout analysis therefore consists of two processes: the segmentation process discovers the various objects of interest in an input document image, and the classification process identifies the detected objects' types. The classification method must complement the segmentation approach chosen. It should utilize the data and measurements produced during segmentation before performing further measurements when computing features [1].

There are two major approaches to segmentation and classification. In the first approach, the document image is decomposed into a set of zones enclosing either textual or non-textual content using global spatial properties. Statistical and textural features are extracted and used to label each zone as one of the predefined categories [20]. In the second approach, low-level segments (connected components [21], line segments [19], etc.) are extracted and classified as text or non-text, and the text segments are then grouped into text-blocks.

This paper presents a hybrid technique for page layout structure extraction and classification by analyzing the spatial configuration of the bounding boxes of different entities (connected components, words, text-lines, etc.) on a given document image. We segment the image into a list of zones by analyzing the projection profiles of the connected component bounding boxes. Then, the text-line and word entities are extracted within each zone. We classify each zone into one of the categories text, table, figure, and horizontal or vertical line, using the bounding box information of the connected component, word, and text-line entities.
By analyzing the alignment of text-line boxes and word boxes, text-blocks are extracted from text zones, and tabular row-column structures
are extracted from table zones. Since our segmentation algorithm manipulates only the bounding boxes of the different entities, and the classification algorithm utilizes the data produced during segmentation, it does not require intensive computation.

Since document understanding techniques will be moving to the consumer market, they will have to perform nearly perfectly and efficiently. This means that they will have to be proved out on significantly sized data sets, and there must be suitable performance metrics for each kind of information a document understanding technique infers [4]. However, many of the earlier techniques for document image analysis were developed on a trial-and-error basis and only gave illustrative results. There is wide agreement on the need for an automatic tool to benchmark existing layout analysis techniques.

Randriamasy and Vincent [17] proposed a pixel-level, region-based approach to compare segmentation results with manually generated regions. An overlap matching technique is used to associate each region of one of the two sets with the regions in the other set with which it has a non-empty intersection. Since the black pixels contained in the regions are counted rather than the regions themselves, this method is independent of the representation scheme of the regions. A quantitative evaluation of segmentation is derived from the cost of incorrect splittings and mergings. Working at the bit-map level, their technique involves extensive computation, and they assume there is only one text orientation for the whole page.

Kanai et al. [5] proposed a text-based method to evaluate zone segmentation performance. They compute an edit distance based on the number of edit operations (text insertions, deletions, and block moves) required to transform an OCR output into the correct text. The cost of the segmentation itself is derived by comparing the costs corresponding to manually and automatically zoned pages.
This metric can be used to test "black-box" commercial systems, but it does not help users categorize the segmentation errors. The method only deals with text regions, and it assumes that the OCR performance is independent of the segmentation performance.

In this paper, we consider the comparison of ground-truth and detected entities in terms of their bounding boxes. A quantitative metric is presented to evaluate the performance of layout analysis algorithms. A large quantity of ground-truth data, varying in quality, is required in order to give an accurate measurement of the performance of an algorithm under different conditions. The University of Washington English Document Image Database III [12] contains 1600 English document images that come with manually edited ground-truth of entity bounding boxes. These bounding boxes enclose text and non-text zones, text-lines, and words. We evaluate the performance of our layout analysis algorithms based on the UW-III data set and the given performance metric. The parameters of our algorithms are determined by optimizing their performance on the data set.
This article is organized as follows. In Section 2, our layout analysis algorithms are described in detail. We then present the performance metric in Section 3, and report the quantitative results of our algorithms on the UW-III data set in Section 4.
2 The algorithm for layout structure extraction

In this section, we present the algorithm for layout structure extraction and classification using the bounding boxes of different entities [6]. The flow diagram is shown in Figure 1. The algorithm first computes the connected components of black pixels in the input image and produces the bounding box of each connected component. Next, the algorithm performs horizontal and vertical projections of these bounding boxes. The projection profiles of the bounding boxes are then analyzed to decompose the document image into a set of rectangular zones. The connected component entities are classified into large components, normal components, vertical or horizontal lines, and noise, according to their sizes and locations on the page. The spatial configuration of the connected component bounding boxes within each zone box is analyzed to extract text-line and word entities. Each zone is labeled as textual or non-textual according to the distribution of text-line and word entities within the zone. By finding the peaks in the vertical projection profile of the word bounding boxes, tabular structure can be identified. Within a text zone, neighboring text-lines are merged to form text-blocks based upon the inter-text-line spacing statistics and on changes in the text-line justification. The tabular structure can be further decomposed into row and column items.

We now describe the page segmentation algorithm in a step-by-step manner. The input to the algorithm is a binary document image. We assume that the input document image has been correctly deskewed.

2.1 Connected component extraction
A connected-component algorithm [3] is applied to the black pixels in the binary image to produce a set of connected components. Then, for each connected component, the associated bounding box is calculated. The bounding box of a connected component is defined to be the smallest upright rectangle which circumscribes the connected component. Each bounding box is considered the smallest entity on the page. Figure 2(a) shows a segment of a document image and Figure 2(b) shows the bounding boxes produced in this step.
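This step can be sketched as a breadth-first flood fill over the black pixels. The sketch below is illustrative only: the function name, the nested-list binary image, and the choice of 8-connectivity are our own assumptions, not taken from the paper or from [3].

```python
from collections import deque

def connected_component_boxes(img):
    """Label 8-connected components of black (1) pixels and return the
    bounding box (row_min, col_min, row_max, col_max) of each component."""
    rows, cols = len(img), len(img[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if img[r][c] == 1 and not seen[r][c]:
                # breadth-first flood fill of one component
                queue = deque([(r, c)])
                seen[r][c] = True
                rmin, cmin, rmax, cmax = r, c, r, c
                while queue:
                    y, x = queue.popleft()
                    rmin, rmax = min(rmin, y), max(rmax, y)
                    cmin, cmax = min(cmin, x), max(cmax, x)
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < rows and 0 <= nx < cols
                                    and img[ny][nx] == 1 and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                boxes.append((rmin, cmin, rmax, cmax))
    return boxes
```

In practice a production system would use an optimized two-pass labeling routine, but the output, one smallest upright rectangle per component, is the same.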
Fig. 1. Flow of the layout structure extraction and classification algorithm: Connected Component Extraction & Filtering → Page Segmentation → Text-Line & Word Extraction → Tabular Structure Detection → Text-Block Extraction.
Fig. 2. (a) an English document, (b) bounding boxes of connected components of black pixels, (c) horizontal projection profile, (d) vertical projection profile.
Connected components are classified as large connected components, normal connected components, horizontal or vertical lines, or noise, according to the physical properties of the components. If the width/height (or height/width) ratio of a connected component is larger than a certain threshold, it is labeled as a horizontal (or vertical) line. Graph and figure regions usually include irregular lines, curves, or shapes which form connected components of larger size than those of individual characters. A table usually has rectangular boxes or separate lines and characters. Some text zones, such as underlined text or text with an enclosing frame, also include big connected components. We label a connected component as a large component if its size (either in the horizontal or in the vertical direction) is larger than a threshold, and otherwise as a normal component. The threshold size is empirically determined based on the size of the characters in the text. Some components are considered noise if their sizes are extremely large or small, or if they are very close to the boundary of the page. These attributes of the connected components are used in the following structure extraction steps.

2.2 Page segmentation
The document page segmentation roughly divides the image into a list of zones by analyzing the spatial configuration of the bounding boxes. The analysis can be done by projecting the bounding boxes onto a straight line [2]. Since paper documents are usually written in the horizontal or vertical direction, projections of bounding boxes onto vertical and horizontal lines are of particular interest. For brevity, the projection of bounding boxes onto a vertical straight line is called the horizontal projection; the vertical projection is defined analogously. By projecting bounding boxes onto a line, we mean counting the number of bounding boxes orthogonal to that line (see Figure 3). Hence, a projection profile is a frequency distribution of the bounding boxes on the projection line. Figures 2(c) and 2(d) show the horizontal and vertical projection profiles of the bounding boxes in Figure 2(b).

A document image can be segmented using a recursive X-Y cut [8] procedure based on the bounding boxes. We choose to work with the bounding boxes of the connected components rather than the pixels. The bounding box projection approach has several advantages over the pixel projection approach: it is less computationally intensive, and it is possible to infer from the projection profiles how bounding boxes (and, therefore, primitive symbols) are aligned and where significant horizontal and vertical gaps are present. At each step, the horizontal and vertical projection profiles are calculated. Then a zone division is performed at the most prominent valley in either projection profile. The process is repeated recursively until no sufficiently wide valley is left.
Fig. 3. Horizontal projection profile, obtained by accumulating the bounding boxes onto the vertical line. The projection profile is a histogram which shows the frequencies of the projected bounding boxes.
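The recursive cut loop can be sketched as follows. This is an illustrative reconstruction under our own assumptions, not the authors' implementation: boxes are (rmin, cmin, rmax, cmax) tuples with inclusive coordinates, the cut is taken at the widest empty run rather than a more general "most prominent valley" measure, and the fixed `min_gap` threshold stands in for the adaptive threshold discussed in Section 4.1.

```python
def xy_cut(boxes, min_gap=10):
    """Recursively split a set of bounding boxes at the widest projection
    valley (run of empty coordinates) until no valley >= min_gap remains.
    Returns a list of box groups, one group per zone."""
    if not boxes:
        return []
    for axis in (0, 1):                      # horizontal cut, then vertical cut
        lo = min(b[axis] for b in boxes)
        hi = max(b[axis + 2] for b in boxes) + 1
        prof = [0] * (hi - lo)               # projection profile of box counts
        for b in boxes:
            for t in range(b[axis], b[axis + 2] + 1):
                prof[t - lo] += 1
        # widest interior run of zeros = most prominent valley
        best_len, best_mid, run = 0, None, 0
        for i, v in enumerate(prof):
            run = run + 1 if v == 0 else 0
            if v == 0 and run > best_len:
                best_len, best_mid = run, lo + i - run // 2
        if best_len >= min_gap:
            above = [b for b in boxes if b[axis + 2] < best_mid]
            below = [b for b in boxes if b[axis] > best_mid]
            return xy_cut(above, min_gap) + xy_cut(below, min_gap)
    return [boxes]                           # no wide valley left: one zone
```

Because the profile is trimmed to the content bounds, zero-runs are always interior gaps, so page margins never trigger cuts.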
2.3 Text-line extraction
Inside each zone generated by the page segmentation process, we analyze the spatial configuration of the bounding boxes to extract text-lines. During this step, we use only normal-sized connected components (primitive symbols). By analyzing the distribution and alignment of the text-lines within a zone, we can classify the zone as textual (text or table) or non-textual.

From Figures 2(c) and 2(d), it is clear that the text-lines can be extracted by finding the distinct high peaks and deep valleys at somewhat regular intervals in the horizontal projection profile. Since the bounding boxes are represented by the coordinates of their two corner points, the bounding boxes of the text-line entities are easily extracted. The result is shown in Figure 4(f). Other important features, such as the frequency distributions of text-line heights and inter-text-line spacings, can also be deduced from the horizontal projection profile.

The above algorithm is fast and straightforward. Unfortunately, noise and skew result in poor performance. To cope with these problems, we developed another algorithm that merges a set of connected component bounding boxes into line bounding boxes. Boxes are merged if and only if the horizontal distance between them is small and they overlap vertically. In severely degraded documents, characters in vertically adjacent text-lines may be merged, which may cause the line segmentation algorithm to return bounding boxes containing multiple text-lines. To remedy this situation, the following algorithm is applied to the lines created by the line segmentation algorithm:
for each line l do
    if Count Merged Lines (l) > 1 then
        for each connected component cc in l do
            n = Count Intersecting Lines (cc, l)
            split cc vertically into n equally tall components
        re-run the line segmentation algorithm for the components in l

Fig. 4. (e) vertical projection profiles, (f) text-line bounding boxes, (g) word bounding boxes, (h) text-block bounding boxes.

The Count Merged Lines algorithm determines the number of vertically merged lines contained in a rectangular region l. It assumes that each line in the region contains at least one character which is not vertically merged. The algorithm provides correct results for regions containing text within a size range of up to twice a specified minimum height. The Count Intersecting Lines algorithm computes the number of text-lines in a rectangular region l which intersect a specified connected component cc. It makes the same assumptions as the Count Merged Lines algorithm.

According to the properties of the extracted text-line bounding boxes, we classify each zone as textual (text or table) or non-textual (line-drawing or halftone). The properties are the text-line density ratio (total text-line area / zone area), the text-line height ratio (total text-line height / zone height), the text-line height variance, and the justification of the text-lines (left, right, center, or justified). The justification of the text-lines can be found by computing the variance of the edges of the line boxes from either the left, right, or center. Textual zones usually have a regular alignment of text-lines.
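The merge criterion of the second text-line algorithm (join two boxes when the horizontal gap between them is small and they overlap vertically) can be sketched as below. The greedy repeat-until-stable loop and the `max_hgap` value are our own illustrative choices, not the paper's.

```python
def merge_into_lines(boxes, max_hgap=8):
    """Greedily merge component boxes (rmin, cmin, rmax, cmax) into line
    boxes: two boxes join when their row intervals overlap and the gap
    between their column intervals is small. Repeats until stable."""
    lines = [list(b) for b in boxes]
    merged = True
    while merged:
        merged = False
        out = []
        for b in lines:
            for o in out:
                v_overlap = min(o[2], b[2]) >= max(o[0], b[0])
                h_gap = max(o[1], b[1]) - min(o[3], b[3])
                if v_overlap and h_gap <= max_hgap:
                    # grow the existing line box to cover b
                    o[0], o[1] = min(o[0], b[0]), min(o[1], b[1])
                    o[2], o[3] = max(o[2], b[2]), max(o[3], b[3])
                    merged = True
                    break
            else:
                out.append(b)
        lines = out
    return [tuple(l) for l in lines]
```

A negative `h_gap` means the boxes already overlap horizontally, which the `<=` test accepts as well.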
Fig. 5. Histogram of inter-character and inter-word spacings.
2.4 Word extraction
The spatial configuration of the bounding boxes within each of the extracted text-lines is analyzed to extract words. The vertical projection profile of each text-line is computed. The algorithm considers each such profile as a one-dimensional gray-scale image and thresholds it at 1 to produce a binary image. During the binarization, a symbol (or a broken symbol) with multiple bounding boxes may be merged into one. Consequently, adjacent symbols whose bounding boxes overlap each other are also merged. However, this causes no problem, since we merge the symbols' bounding boxes to form words anyway. Figure 4(e) shows the projection profiles within text-lines.

The binarization is followed by a morphological closing with a structuring element of appropriate length to close the spaces between symbols, but not between words. The length is determined by analyzing the distribution of the run-lengths of 0's in the binarized profile. In general, such a run-length distribution is bi-modal: one mode corresponds to the inter-character spacings within words, and the other to the inter-word spacings (see Figure 5). An algorithm which minimizes the within-group variance [9] is used to find the best threshold. The bottom of the valley between these modes gives the desired structuring element length. If there are more than two types of spacing, this method can be applied recursively to the extracted segments. Figure 4(g) shows the results of this step.

2.5 Tabular structure detection
For each textual zone, we detect whether it has a tabular structure. A table is a systematic arrangement of data, usually in rows and columns, for ready reference. The LaTeX tabular environment, for example, produces a box (visible or invisible) consisting of a sequence of rows of items, aligned vertically in columns. Each column has a list of items which are left-aligned, right-aligned, or centered. A single item can span a number of columns.

The tabular structure can be identified by analyzing the projection profile of the word bounding boxes. In this, our method differs from [18], which detects peaks in a white-space density graph. A table image is shown in Figure 6. We extract the text-line and word boxes using the method described in Section 2.3. From Figures 6(a) and 6(b), it is clear that the table columns and column separators can be detected by finding the distinct high peaks and deep valleys in the vertical projection profile of the word boxes. For each textual zone, if such a structure is found, the zone is a table; otherwise, it is a regular text region. We then detect the row and column structure of the table by applying the X-Y cut algorithm to the projection profiles of the word bounding boxes (see Figure 6(c)).
Fig. 6. (a) a table image and the extracted word bounding boxes, (b) vertical projection profile of the word boxes, (c) table column items.
The itemized list is another example of tabular structure; this algorithm can be applied to detect the list structure, too.
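A toy version of the column test described above: project the word boxes vertically and look for a wide interior valley, i.e. a vertical white channel separating column items. The function name, thresholds, and box representation are our own assumptions; the paper's actual peak/valley analysis is more elaborate.

```python
def looks_tabular(word_boxes, zone_cmin, zone_cmax, min_valley=15):
    """Heuristic table test: build the vertical projection profile of word
    boxes (rmin, cmin, rmax, cmax) and report whether a wide interior run
    of empty columns (a candidate column separator) exists."""
    prof = [0] * (zone_cmax - zone_cmin + 1)
    for b in word_boxes:
        for c in range(b[1], b[3] + 1):
            prof[c - zone_cmin] += 1
    # trim empty page margins, then measure the widest run of empty columns
    first = next(i for i, v in enumerate(prof) if v)
    last = len(prof) - 1 - next(i for i, v in enumerate(reversed(prof)) if v)
    run = best = 0
    for v in prof[first:last + 1]:
        run = run + 1 if v == 0 else 0
        best = max(best, run)
    return best >= min_valley
```

A real detector would also require the valley to persist across most rows of the zone; this sketch only shows the projection idea.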
2.6 Text-block extraction
For all textual zones which do not have tabular structure, we merge text-lines into text-blocks. The beginning of a text-block, such as a paragraph, math zone, or section heading, is usually marked either by a change in the justification of the current text-line, by extra space between two text-lines, or by a change in font size or style. So when a significant change in text-line heights, inter-text-line spacings, or justification occurs, we say that a new text-block begins. The distributions of text-line heights and inter-text-line spacings, together with the horizontal projection profile, give us cues for text-block segmentation. The font size can be estimated roughly from the height of the text-line bounding box. Figure 4(h) shows the text-block bounding boxes for our example text.
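The spacing cue alone might be sketched as follows. The median-gap statistic and the `gap_factor` multiplier are stand-ins we invented for the paper's inter-text-line spacing statistics; the justification and font-size cues are omitted here.

```python
def lines_to_blocks(line_boxes, gap_factor=1.8):
    """Group vertically sorted text-line boxes (rmin, cmin, rmax, cmax)
    into blocks: a new block starts when the inter-line gap jumps well
    above the median gap."""
    lines = sorted(line_boxes)
    gaps = [b[0] - a[2] for a, b in zip(lines, lines[1:])]
    if not gaps:
        return [lines]
    med = sorted(gaps)[len(gaps) // 2]       # typical inter-line spacing
    blocks, cur = [], [lines[0]]
    for gap, line in zip(gaps, lines[1:]):
        if gap > gap_factor * max(med, 1):   # significant spacing change
            blocks.append(cur)
            cur = []
        cur.append(line)
    blocks.append(cur)
    return blocks
```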
3 Performance evaluation of layout analysis

Document image analysis algorithms have to be proved out on significantly sized data sets, and there must be suitable performance metrics for each kind of information a document analysis technique infers. Publicly available document image data sets whose correct layout structure is specified for each image are needed. An example of a significantly sized, publicly available English document image database is that of [10]. This series of data sets has 1600 English document images from scientific and technical articles which are labeled with their layout structure.

Performance evaluation serves two purposes. One is to compare the ground-truth with the algorithm output and produce a quantitative overall measure of the algorithm. The other is to help developers analyze the results and find out when and why their algorithm works or fails before trying to improve it.

In document image analysis algorithms, tuning parameters are often used. These parameters have traditionally been chosen by trial-and-error. In this section we discuss a methodology for estimating the parameters by optimizing the algorithm's performance over the training data.

3.1 Generation of layout structure ground-truth
A large quantity of ground-truth data, varying in quality, is required in order to give an accurate measurement of the performance of an algorithm under different conditions. Sponsored by the Advanced Research Projects Agency (ARPA), we have produced the University of Washington Document Image Database [10,11] series of ground-truth databases on CD-ROM for the document image analysis and recognition communities. The first two databases in the series, UW-I and UW-II, provide a variety of ground-truth information about their constituent documents, including zone and page bounding boxes, attributes, and ASCII text. In order to facilitate the training and evaluation of document layout analysis algorithms, UW-III [12] includes text-line and word bounding box ground-truth, in addition to the ground-truth information provided in the previous releases. UW-III contains a total of 1600 English document images randomly selected, copied, and scanned from scientific and technical journals. Each page contains a hierarchy of manually verified page, zone, text-line, and word entities with bounding boxes, in the correct reading order. The text ground-truth and zone attributes are tagged to each zone entity. The zone attributes include the type of the zone (i.e., text, figure, table, etc.), the dominant font size, the dominant font style, and the justification. Based on these ground-truth data, we can evaluate the performance of document analysis algorithms.

The protocol used for ground-truthing the layout structure and other information on the document images is illustrated in Figure 7. We used a double-entry and triple-verification procedure. First, independent ground-truth layout structures are generated either manually (text and non-text blocks) or by different algorithms (text-lines and words). Then, the two ground-truth files are compared and the errors are corrected until the files agree with each other. This verification is done independently by two individuals. Finally, the verified pairs are compared and corrected by a third group.
Several automatic tools have been developed in order to efficiently ground-truth a large number of document images with very high accuracy. One compares two ground-truth files and produces the common and differing boxes between them; the matching criterion is described in Section 3.2. An erroneous-word-bounding-box flagging algorithm makes use of the database's existing ground-truth information to detect errors in the word bounding boxes [13].

The Document Attribute Format Specification (DAFS) [14] is used as the representation of the layout structure in the UW-III database. DAFS provides a format for breaking down documents into standardized entities, defining entity boundaries and attributes, and labeling their contents (text values) and attribute values. DAFS permits the creation of parent, child, and sibling relationships between entities, providing an easy representation of the hierarchical structures of a document. The DAFS Library [15] is a set of routines for developers to work with DAFS files. A graphical user interface called Illuminator [16]
Fig. 7. Layout structure ground-truthing protocol: (a) generation of ground-truth (double data entry), (b) verification and correction (triple verification).
enables the users to view and edit the DAFS structure. Both the DAFS Library and Illuminator are public-domain software. Through DAFS, developers are able to exchange training and test documents and work together on shared problems.

3.2 Matching of detected layout structure with ground-truth
There are two problems in making the evaluation. The first is one of correspondence: which entities of the ground-truth set correspond to which entities of the automatically produced set. Once this correspondence is determined, a comparison of the detected entities with the ground-truth entities can proceed.

Suppose we are given two sets G = {G_1, G_2, ..., G_M} of ground-truthed bounding boxes and D = {D_1, D_2, ..., D_N} of detected bounding boxes. The comparison of G and D can be made in terms of the following two kinds of measures:

    σ_ij = Area(G_i ∩ D_j) / Area(G_i)  and  τ_ij = Area(G_i ∩ D_j) / Area(D_j)    (2)

where 1 ≤ i ≤ M, 1 ≤ j ≤ N, and Area(A) represents the area of A. The measures in the above equation constitute two matrices Σ = (σ_ij) and T = (τ_ij). Notice that σ_ij indicates how much of G_i is occupied by D_j, and τ_ij indicates how much of D_j is occupied by G_i. Our strategy of performance evaluation is to analyze these matrices to determine the correspondence between the two sets of bounding boxes:

- one-to-one match (σ_ij ≈ 1 and τ_ij ≈ 1);
- one-to-zero match (σ_ij ≈ 0 for all 1 ≤ j ≤ N);
- zero-to-one match (τ_ij ≈ 0 for all 1 ≤ i ≤ M);
- one-to-many match (σ_ij < 1 for all j, and Σ_{j=1..N} σ_ij ≈ 1);
- many-to-one match (τ_ij < 1 for all i, and Σ_{i=1..M} τ_ij ≈ 1);
- many-to-many match (otherwise).
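As an illustration of this matching step, the sketch below computes the two overlap matrices for axis-aligned boxes and classifies one ground-truth box by rules of the kind listed above. The half-open (rmin, cmin, rmax, cmax) box representation, the tolerance `eps`, and all function names are our own assumptions, not the paper's.

```python
def area(b):
    # b = (rmin, cmin, rmax, cmax), half-open rectangle
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])

def overlap_matrices(G, D):
    """sigma[i][j]: fraction of G_i covered by D_j;
    tau[i][j]: fraction of D_j covered by G_i."""
    sigma = [[0.0] * len(D) for _ in G]
    tau = [[0.0] * len(D) for _ in G]
    for i, g in enumerate(G):
        for j, d in enumerate(D):
            x = (max(g[0], d[0]), max(g[1], d[1]),
                 min(g[2], d[2]), min(g[3], d[3]))
            a = area(x) if x[2] > x[0] and x[3] > x[1] else 0
            sigma[i][j] = a / area(g)
            tau[i][j] = a / area(d)
    return sigma, tau

def classify_ground_truth(i, sigma, tau, eps=0.05):
    """Classify G_i: one-to-zero (miss), one-to-one (correct),
    one-to-many (split), or other (merge / many-to-many)."""
    row = sigma[i]
    js = [j for j, s in enumerate(row) if s > eps]
    if not js:
        return "one-to-zero"
    if len(js) == 1 and row[js[0]] >= 1 - eps and tau[i][js[0]] >= 1 - eps:
        return "one-to-one"
    if all(row[j] < 1 - eps for j in js) and sum(row) >= 1 - eps:
        return "one-to-many"
    return "other"
```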
3.3 Performance measure
Once the matching between detected entities and ground-truth entities is established, a performance measure can be computed. A one-to-one match means an object G_i is correctly identified by the segmentation process as D_j. The classification process assigns a content type to each detected entity. Let g be the true content type of an entity and let d be the assigned type. The performance of the classification process is defined as

    f_type = Σ_{g ∈ C, d ∈ C} W(g, d) P(g, d)    (3)

where P(g, d) is the probability of an assignment, and W(g, d) is the weight given to the different kinds of assignments. f_type is the misclassification rate if we only consider the cases g ≠ d.

A one-to-zero match is the case when a certain object G_i is not detected by the segmentation (misdetection), and vice versa for the zero-to-one match (false alarm). If an entity G_i matches a number of detected entities, we call it a splitting detection. It is a merging detection when two or more objects in G are identified as a single object D_j.

Let us denote the probability of a matching of type i (G_i ⊆ G is identified as D_i ⊆ D) as

    P(G_i, D_i) = (#of G_i + #of D_i) / (#of G + #of D)    (4)

Then, the performance measure of the segmentation process is defined as

    f_spatial = Σ_{i ∈ M} W_i P(G_i, D_i)    (5)

where M is the set of possible matchings, and W_i is the weight assigned to each type of matching. Again, f_spatial is the cost when we only consider the segmentation errors.
Each kind of matching can be further divided into different categories. For example, merging text across columns should clearly be penalized more than merging two adjacent text-blocks within the same column. Splitting a paragraph into two blocks will not hurt the text flow, but splitting a single picture will make the final output look strange. In the UW databases, each zone is labeled with a set of properties, e.g. column number, content type, reading direction, font information, alignment, etc. Therefore, a many-to-one matching between ground-truth zones and detected zones can be divided into the following possible detection errors:

- merging text and non-text;
- merging text across columns;
- merging text zones with different reading directions;
- merging adjacent homogeneous text.

Equation (5) still applies, with M now being the set of categories defined by the user.

3.4 Parameter selection
When a representative sample of a domain is available, and a quantitative performance metric is defined, we can tune the parameter values of an algorithm and select a set which produces the optimal performance on the input population. Suppose the performance measure f is the cost; then the best tuning parameter vector is X* where

    X* = arg min_X f(X)    (6)

and X = (X_1, X_2, ..., X_N) is the vector of parameters.

We use an iterative first-order search method to find the optimal parameters,

    X_q = X_{q-1} + k_q S_q    (7)

Here q is the iteration number, S_q is the search direction, and k_q is a scalar multiplier determining the amount of change in X for the iteration. The search direction is taken as the negative of the gradient of the function,

    S_q = -∇f(X_{q-1}).    (8)

After finding the search direction, we use the golden section algorithm to find the best scalar k_q.
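This parameter search can be sketched as plain steepest descent with a golden-section line search. Everything below is our own illustrative choice rather than the authors' implementation: the numerical central-difference gradient, the fixed step interval [0, 1], the iteration count, and the toy quadratic cost in the usage example.

```python
def golden_section(phi, a, b, tol=1e-5):
    """Golden-section search for the minimizer of a unimodal phi on [a, b]."""
    g = (5 ** 0.5 - 1) / 2
    c, d = b - g * (b - a), a + g * (b - a)
    while b - a > tol:
        if phi(c) < phi(d):
            b, d = d, c
            c = b - g * (b - a)
        else:
            a, c = c, d
            d = a + g * (b - a)
    return (a + b) / 2

def tune(f, x, iters=20, h=1e-4):
    """First-order parameter search: step along the negative numerical
    gradient, with the step size k chosen by golden-section search."""
    for _ in range(iters):
        grad = [(f(x[:i] + [x[i] + h] + x[i + 1:]) -
                 f(x[:i] + [x[i] - h] + x[i + 1:])) / (2 * h)
                for i in range(len(x))]
        s = [-gi for gi in grad]                      # search direction S_q
        k = golden_section(lambda t: f([xi + t * si   # line search for k_q
                                        for xi, si in zip(x, s)]), 0.0, 1.0)
        x = [xi + k * si for xi, si in zip(x, s)]
    return x
```

For the real system, f(X) would be the segmentation cost measured over the training images, which is expensive to evaluate; that is why a cheap line search such as golden section is attractive here.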
4 Experimental results

In this section, we report the performance of our algorithms on the images from the UW-III Document Image Database. The parameters of each algorithm are determined by the optimization process described in Section 3.4. For the segmentation process, the weights used for computing the performance are given in Table 1. For the classification process, we assign weight 0 when d = g and weight 1 when d ≠ g. Therefore, our performance criterion is the cost of converting the detected layout structure to the ground-truth.

Table 1. Weights assigned to the different spatial matchings.

    Correct  Miss  False  Merge  Split  Spurious
    0        1     1      0.5    0.5    1

For each algorithm, we start with the default parameters obtained by observing a small number of images; we then tune the parameters by minimizing the algorithm's cost on the images in the UW-III database, until the improvement in performance is less than a threshold. The algorithms' costs after the parameter tuning, compared to their performance using the default parameters (reported in [7]), are shown in Table 2. The average improvement is 26.6% with respect to the original performance.

Table 2. Performance of the algorithms before and after the optimization process.

                 Page   Line 1  Line 2  Word   Block
    original     0.143  0.022   0.015   0.015  0.141
    optimized    0.104  0.017   0.006   0.012  0.137
    improved by  27.3%  22.7%   60%     20%    2.8%
The numbers and percentages of correct, miss, false, splitting, merging, and spurious detections of each algorithm after the optimization process are presented in the following sections.
4.1 Performance of page segmentation
Table 3 shows the numbers and percentages of correct, splitting, merging, miss-false, and spurious detections with respect to both the ground-truth zones and the algorithm output. The cost for the page segmentation is 10.2%. Since the page segmentation finds coarse homogeneous zones, we do not consider the merging of adjacent text within the same column to be an error.

Table 3
Performance of X-Y cut page segmentation with respect to the ground-truth and the algorithm output.
              Total   Correct          Splitting      Merging        Mis-False    Spurious
Ground Truth  24216   21019 (86.80%)   462 (1.91%)    2186 (9.03%)   245 (1.01%)  304 (1.25%)
Detected      14848   11346 (76.41%)   1883 (12.68%)  710 (4.78%)    592 (3.99%)  317 (2.14%)
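The X-Y cut segmentation evaluated above decides whether to apply a cut at a projection-profile valley by thresholding the valley's width. A minimal sketch of that decision follows; the blank level and width threshold are hypothetical parameters, not the paper's tuned values.

```python
def cut_positions(profile, min_valley_width, blank=0):
    """Return (start, end) index spans of projection-profile valleys
    that are wide enough to cut at. A valley is a maximal run of bins
    whose count is at or below the blank level."""
    cuts, start = [], None
    for i, v in enumerate(profile):
        if v <= blank:
            if start is None:
                start = i          # valley begins
        else:
            if start is not None and i - start >= min_valley_width:
                cuts.append((start, i))
            start = None
    # valley running to the edge of the zone
    if start is not None and len(profile) - start >= min_valley_width:
        cuts.append((start, len(profile)))
    return cuts

# A wide gap qualifies for a cut; a one-bin gap does not:
cut_positions([5, 4, 0, 0, 0, 3, 2, 0, 6], min_valley_width=2)
# -> [(2, 5)]
```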
This algorithm is restricted to bi-directional X-Y cuttable layouts. It is also sensitive to severe page skew, so deskewing has to be performed before applying the projection. To decide whether to apply a cut at a projection-profile valley, a threshold on the width of the valley is used. Instead of using a global value, the threshold might be determined adaptively by considering the width and depth of the projection-profile valley, the size of nearby connected components, the aspect ratio of the current zone, etc.

4.2 Performance of text-line segmentation
The numbers and percentages of correct, splitting, merging, miss-false, and spurious detections of the two text-line extraction algorithms are shown in Table 4. The cost of the first text-line extraction algorithm (projection profile cut) is 1.73%. The advantages of this algorithm are that it is very simple and fast and produces a very low misdetection rate, but it is sensitive to skew (global or local), smearing, warping, and noise. If the inter-text-line spacing is very small and the superscripts, subscripts, or the ascenders and descenders of adjacent lines overlap or touch each other, the text-lines are usually merged.

The cost of the second text-line extraction algorithm (merging and splitting of connected components) is 0.62%. The splitting procedure is able to split touching connected components and warped or skewed text-lines, but it may also cause some splitting errors, e.g. a superscript or subscript being split from its text-line.

Table 4
Performance of text-line extraction algorithms: (a) projection profile cut; (b) merging and splitting of connected components.
(a)
              Total    Correct           Splitting    Merging       Mis-False    Spurious
Ground Truth  105439   100471 (95.39%)   124 (0.12%)  4543 (4.31%)  157 (0.15%)  33 (0.03%)
Detected      102494   100471 (98.03%)   383 (0.37%)  1504 (1.47%)  4 (0.00%)    132 (0.13%)
(b)
              Total    Correct           Splitting    Merging       Mis-False    Spurious
Ground Truth  105439   104250 (98.87%)   148 (0.14%)  657 (0.62%)   335 (0.32%)  49 (0.05%)
Detected      105107   104250 (99.19%)   390 (0.37%)  287 (0.27%)   1 (0.00%)    179 (0.17%)

4.3 Performance of word extraction
Table 5 shows the numbers and percentages of correct, splitting, merging, miss-false, and spurious detections with respect to both the ground-truth words and the algorithm output. The cost of this algorithm is 1.15%.

Table 5
Performance of word extraction.
              Total    Correct           Splitting       Merging        Mis-False     Spurious
Ground Truth  828201   810208 (97.83%)   5192 (0.63%)    11259 (1.35%)  1327 (0.16%)  215 (0.03%)
Detected      828296   810208 (97.82%)   13380 (1.62%)   4333 (0.52%)   192 (0.02%)   183 (0.02%)
This algorithm is robust across different layouts and conditions since the word spacing is determined adaptively, and small skew and warping are tolerable. One limitation is that the text-line is required as input. The algorithm is also sensitive to noise, italic fonts, in-line mathematical formulas, underlined text, etc. A possible improvement is to find the baseline and x-height first and use only the portion of each connected component between those two lines for computing the spacing.

4.4 Performance of text-block extraction
The performance of the text-block extraction algorithm is shown in Table 6. Its total cost is 13.76%.
Table 6
Performance of text-block extraction with respect to the ground-truth and the algorithm output.
              Total   Correct          Splitting      Merging        Mis-False  Spurious
Ground Truth  21738   16680 (76.73%)   1670 (7.68%)   3014 (13.87%)  2 (0.01%)  372 (1.71%)
Detected      23302   16680 (71.58%)   5191 (22.28%)  1094 (4.69%)   0 (0.00%)  337 (1.45%)
This algorithm uses paragraph formatting attributes as cues for text-block segmentation. Instead of global thresholds for the inter-line spacing and text size, local thresholds could be determined adaptively within each homogeneous region.

4.5 Performance of textual structure detection
We used a decision tree classifier to assign a text or nontext label to each structure. The decision tree was trained and tested using cross-validation. Table 7 gives the results of textual and nontextual structure classification, where rows are the true categories and columns are the assigned categories. The misclassification rate is 2.04%.

Table 7
Contingency table of textual and nontextual structure classification.
          text    nontext
text      20523   411
nontext   21      204
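As a check, the reported 2.04% can be reproduced from the counts in Table 7 as the off-diagonal fraction of the confusion matrix:

```python
def misclassification_rate(confusion):
    """Fraction of off-diagonal entries in a square confusion matrix
    (rows: true class, columns: assigned class)."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return (total - correct) / total

# Counts from Table 7 (text/nontext classification):
rate = misclassification_rate([[20523, 411], [21, 204]])
# (411 + 21) / 21159, which is about 0.0204, i.e. the reported 2.04%
```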
4.6 Performance of tabular structure detection
We tested the tabular structure detection algorithm on 190 tables and 9049 text-blocks (rejecting blocks with fewer than three text-lines). If more than one peak is detected in the projection profile of the word boxes, we say the block has a tabular structure. Table 8 gives the results of tabular structure detection. The misclassification rate is 0.9%.

Table 8
Contingency table of tabular structure detection.
              regular text   table
regular text  9009           30
table         53             136
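The peak-counting rule above can be sketched as follows. Treating a peak as a maximal run of nonzero projection counts, and thresholding against a blank level, are our assumptions for illustration; the paper does not specify how a peak is delimited.

```python
def has_tabular_structure(profile, blank=0):
    """Flag a block as tabular when the projection profile of its word
    bounding boxes (e.g. projected onto the x-axis) contains more than
    one peak, where a peak is a maximal run of counts above the blank
    level (a hypothetical parameter)."""
    peaks, in_peak = 0, False
    for v in profile:
        if v > blank and not in_peak:
            peaks += 1          # a new run of word-box mass begins
            in_peak = True
        elif v <= blank:
            in_peak = False
    return peaks > 1

# Two separated column clusters -> tabular; one cluster -> regular text:
has_tabular_structure([3, 5, 4, 0, 0, 2, 6, 1])   # -> True
has_tabular_structure([1, 2, 3, 2, 1])            # -> False
```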
5 Conclusions and future work

In this paper, we presented a heuristic algorithm for extracting layout structure from document images. A small set of rules, derived from typographical properties such as the spacing between different kinds of entities, can be applied to general documents. The advantages of this algorithm are that it is very fast and the idea is straightforward, so it is easy to analyze and predict its performance for a given application.

Near-perfect performance is required in order to transfer scientific results to practical document analysis systems. Significantly sized data sets with very accurate ground-truth and appropriate quantitative performance metrics are necessary for the performance evaluation of document image analysis algorithms and systems. Our work on producing the UW document image database series serves this purpose. We have presented a method for matching the detected layout structure with the ground-truth, from which a quantitative performance measure can be derived. We evaluated our algorithms on 1600 images from the UW-III Document Image Database and reported quantitative performance measures in terms of the rates of correct, miss, false, merging, splitting, and spurious detections. We also presented a method to tune the parameters (thresholds) on a given set of ground-truth, so that optimal performance for a specific application and data population can be pursued. The results show that the average performance of the algorithms improved by 26.6% after the optimization process.

For future work, we will characterize our algorithms' performance and the properties of various document structures. By analyzing the performance characteristics of an algorithm analytically and empirically, we can understand when and why it works or fails. We will then be able to predict an algorithm's performance for a given application and know how to improve its accuracy.
References

[1] A. Antonacopoulos and R.T. Ritchings, Representation and classification of complex-shaped printed regions using white tiles, in: Proceedings Third International Conference on Document Analysis and Recognition (Montreal, Que., Canada, 1995) 1132-1135.
[2] J. Ha, R.M. Haralick, and I.T. Phillips, Document page decomposition using bounding boxes of connected components of black pixels, in: L.M. Vincent and H.S. Baird, eds., Proceedings Document Recognition II (San Jose, CA, 1995) 140-151.
[3] R.M. Haralick and L.G. Shapiro, Computer and Robot Vision (Addison-Wesley, 1992) 37-48.
[4] R.M. Haralick, Document image understanding: geometric and logical layout, in: Proceedings 1994 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Seattle, WA, 1994) 385-390.
[5] J. Kanai, S.V. Rice, T.A. Nartker, and G. Nagy, Automated evaluation of OCR zoning, IEEE Trans. on PAMI 17 (1995) 86-90.
[6] J. Liang, J. Ha, R.M. Haralick, and I.T. Phillips, Document layout structure extraction using bounding boxes of different entities, in: Proceedings Third IEEE Workshop on Applications of Computer Vision (Sarasota, FL, 1996) 278-283.
[7] J. Liang, I.T. Phillips, and R.M. Haralick, Performance evaluation of document layout analysis on the UW data set, in: L.M. Vincent and J.J. Hull, eds., Proceedings Document Recognition IV (San Jose, CA, 1997) 149-160.
[8] G. Nagy and S. Seth, Hierarchical representation of optically scanned documents, in: Proceedings Seventh ICPR (Montreal, Canada, 1984) 347-349.
[9] N. Otsu, A threshold selection method from gray-level histograms, IEEE Transactions on Systems, Man, and Cybernetics SMC-9 (1979) 62-66.
[10] I.T. Phillips, S. Chen, and R.M. Haralick, CD-ROM English document database standard, in: Proceedings Second International Conference on Document Analysis and Recognition (Japan, 1993) 478-483.
[11] I.T. Phillips, S. Chen, J. Ha, and R.M. Haralick, English document database design and implementation methodology, in: Proceedings Second International Conference on Document Analysis and Recognition (Japan, 1993) 65-104.
[12] I.T. Phillips, User's reference manual for the UW English/technical document image database III, UW-III English/Technical Document Image Database Manual, 1996.
[13] R. Rogers, I.T. Phillips, and R.M. Haralick, Semiautomatic production of highly accurate word bounding box ground-truth, in: Proceedings ICPR Workshop on Document Analysis Systems (Malvern, PA, 1996) 375-387.
[14] RAF Technology, Inc., DAFS: Document Attribute Format Specification, 1995.
[15] RAF Technology, Inc., DAFS Library: Programmer's Guide and Reference, 1995.
[16] RAF Technology, Inc., Illuminator User's Manual, 1995.
[17] S. Randriamasy and L. Vincent, Benchmarking page segmentation algorithms, in: Proceedings 1994 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Seattle, WA, 1994) 411-416.
[18] D. Rus and K. Summers, Using white space for automated document structuring, in: Proceedings of the Workshop on Principles of Document Processing (Seeheim, 1994).
[19] S. Tsujimoto and H. Asada, Major components of a complete text reading system, Proceedings of the IEEE 80 (1992) 1133-1149.
[20] F.M. Wahl, K.Y. Wong, and R.G. Casey, Block segmentation and text extraction in mixed text/image documents, Computer Graphics and Image Processing 20 (1982) 375-390.
[21] S.Y. Wang and T. Yagasaki, Block selection: a method for segmenting page image of various editing styles, in: Proceedings Third International Conference on Document Analysis and Recognition (Montreal, Que., Canada, 1995) 128-135.