Shape Codebook based Handwritten and Machine Printed Text Zone Extraction

Jayant Kumar (b), Rohit Prasad (a), Huaigu Cao (a), Wael Abd-Almageed (b), Premkumar Natarajan (a), David Doermann (b)

(a) Raytheon BBN Technologies, Cambridge, USA
(b) Institute of Advanced Computer Studies, University of Maryland, College Park, USA

ABSTRACT

In this paper, we present a novel method for extracting handwritten and printed text zones from noisy document images with mixed content. We use Triple-Adjacent-Segment (TAS) based features which encode local shape characteristics of text in a consistent manner. We first construct two codebooks of the shape features extracted from a set of handwritten and printed text documents, respectively. We then compute the normalized histogram of codewords for each segmented zone and use it to train a Support Vector Machine (SVM) classifier. The codebook based approach is robust to the background noise present in the image, and TAS features are invariant to translation, scale and rotation of text. In experiments, we show that a pixel-weighted zone classification accuracy of 98% can be achieved for noisy Arabic documents. Further, we demonstrate the effectiveness of our method for document page classification and show that a high precision can be achieved for the detection of machine printed documents. The proposed method is robust to the size of zones, which may contain text content at the line or paragraph level.

Keywords: zone classification, zone segmentation, page classification, noisy documents, handwriting, Arabic

1. INTRODUCTION

Many real world documents contain handwritten text, logos, figures and background patterns along with traditional machine-printed content. Since optical character recognition (OCR) methodologies differ for machine print and handwritten text, it is necessary to separate these two types of text before feeding them to their respective OCR systems.1 This separation is also important for other pre-processing tasks such as text line extraction, since the technique used may differ depending on the type of text.2 Another application of this classification is in predicting whether a given document image has predominantly printed or handwritten content: component-level or region-wise knowledge of the different text areas can serve as a useful metric for overall page classification.

Although the document research community has made significant progress on the analysis and recognition of clean, structured documents, analyzing noisy documents with mixed content remains a difficult task.3 Noise present in monochromatic document images can be independent of or dependent on the text data. Ink blobs, salt-and-pepper noise4 and marginal noise5 are independent of the location, size or other properties of the text in the document image.6 On the other hand, blur and bleed-through7 are examples of text-dependent noise. Large noise components like marginal strips can be removed reliably with simple heuristics; it is, however, difficult to discriminate noise from comparably sized text. Page segmentation and zone classification methods based on connected components often fail to distinguish noise from content. Many properties based on texture8, 9 and character alignment do not work well when the noise is spatially close to or mixed with the content. When present in a document with mixed content, detection and removal of noise becomes more difficult due to the irregular variation in its shape, size and nature.10 Figure 1 shows examples of such document images from our data set. Often the noise removal process either removes or degrades the text area, making the subsequent pre-processing steps error-prone and difficult.

This work was done at Raytheon BBN Technologies. (Send correspondence to Jayant Kumar, E-mail: [email protected])

Handwriting on a document often indicates corrections, additions, or other supplemental information3 and is usually written very close to the relevant printed text. Handwritten/printed text separation becomes more difficult when the printed form of the script is cursive in nature. For example, discriminating handwritten and printed Arabic text has been reported to be a difficult task, and the complexity of the problem is greatly increased by noise and the variability of handwriting between users.11 In this context, it becomes important to use features which are robust to noise and background clutter. Features should work even if the printed and handwritten text is similar in nature, for example, cursive. Some existing methods12 for text extraction work only when there is sufficient text content in the zone, and their performance degrades quickly with sparser text content. To address these problems, we use a family of local contour based features called k-adjacent segments (kAS) for discriminating machine printed text from handwriting. These features, first proposed by Ferrari et al.,13 are small groups of connected, approximately straight contour segments. The segments in a kAS form a path of length k through a network of contour segments covering the document image. kAS are able to cleanly encode fragments of a character boundary without including nearby clutter. Triple-Adjacent-Segment (TAS) features, a special case of kAS (k = 3), have been shown to be reliable in capturing the local shape properties of a given object and have been successfully used for object detection13, 14 and script identification.15 In this work, we first construct two codebooks of TAS features, one for handwritten and one for printed Arabic text. We then compute a normalized histogram for each zone based on the codebooks, to classify it as machine printed or handwritten. Zones in our experiment are obtained using the Voronoi++ method.16 Through our experiments, we show that a high classification accuracy can be achieved using these features. Further, we apply our method for machine printed page detection on a large dataset and report a high precision.

Figure 1. Sample images (a)-(e) from our dataset

2. RELATED WORK

There has been a lot of work on text/nontext classification in the last 20 years.12 Unfortunately, the features used for text/nontext separation do not work well for printed/handwritten text discrimination. Furthermore, most of the recent work on machine-print and handwriting separation has focused on English, and much less work has been done on Arabic. In this section, we first give a brief overview of existing approaches for machine-printed and handwritten text extraction. We then give a short overview of previous work on Arabic documents.

The classification of machine-printed and handwritten text is typically performed at the block or zone,17 text line,1, 18, 19 word,3, 20 or character level.21 Fan et al.17 proposed a method for classifying machine-printed and handwritten text blocks. In their approach, they used spatial features and character block layout variance as the main features. Machine printed text lines are typically arranged regularly with a straight baseline, while handwritten text lines are irregular with a varying baseline. Srihari et al.18 implemented a text line-based approach using this characteristic and achieved a classification accuracy of 95 percent. One main advantage of their approach is that it can be used across different scripts such as Chinese and English with little or no modification. However, this approach relies heavily on accurate text line extraction, which for mixed, noisy documents may not be trivial. In our data set, handwritten lines are present side by side with printed lines and the two tend to be grouped together. Guo and Ma19 proposed an approach based on the vertical projection profile of the segmented words. They used a Hidden Markov Model (HMM) as the classifier and achieved a classification accuracy of 97 percent, but problems introduced by noise in word segmentation and feature extraction are not addressed. Zheng et al.3 first extract the connected components and merge them at the word level based on spatial proximity, then extract several categories of features, such as pixel density, aspect ratio and Gabor filter based features, for classification. They use trained Fisher classifiers to classify each word as machine printed text, handwriting, or noise. Finally, in a post-processing step, contextual information is incorporated into Markov Random Field (MRF) models to refine the classification results. However, their data set contained mostly printed text and the handwritten content was minimal. Also, as can be seen from their results, the envelopes of printed and handwritten text regions do not overlap spatially and are separated by a large margin. They also assumed that the text lines in the given document image are horizontal. These assumptions are not valid in our case. Chanda et al.20 proposed a method for word-level separation which is useful for sparse data. They first estimate the orientation of each connected component using a variation of a PCA-based method. Then, they rotate the word image in the direction of the first eigenvector. For feature extraction, they divide the bounding box of the rotated component into 7x7 blocks, and for each block they use the histogram of the chain code to train the classifier. It is not clear whether the proposed method will work for Arabic, where printed and handwritten text are similar in nature.

The analysis and recognition of documents containing Arabic script has not received as much attention as other scripts, in spite of the fact that Arabic characters serve as the script for several languages such as Arabic, Farsi, Urdu and Uyghur, covering more than thirty countries.22 Until recently, much less work had been done on the segmentation and recognition of handwritten Arabic text, and even less on noisy Arabic documents. Due to the unique nature of the script, existing methods do not always prove to be the most effective. Most Arabic character recognition systems assume clean documents and start by segmenting the text sequentially at the line, word and/or character levels. Zahour et al.23 developed a segmentation method suited for Arabic historical manuscripts, based on horizontal projections, which first extracts text blocks before text line segmentation and segments the document image into three classes: text, graphics and background.

In Section 3, we explain the steps involved in extracting the TAS features, constructing the shape codebook and computing the zone descriptor for two-class classification. We describe our experiments and review results in Section 4. Finally, we conclude the paper in Section 5 with pointers to future work.

3. HANDWRITTEN AND MACHINE PRINTED TEXT DISCRIMINATION

We first extract zones present in the document using an improved zone segmentation method based on Voronoi segmentation.16 Second, we manually select a subset of representative printed and handwritten text zones to create a shape codebook for each of printed and handwritten Arabic. Finally, we train a two-class ν-SVM classifier on features derived from the distribution of codewords in the codebooks. Figure 2(a) shows a sample document image from our dataset with segmented zones.

3.1 Constructing the Shape Codebook

Using a Canny edge detector,24 we first obtain a list of edges present in the image. We then obtain a corresponding list of line segments by fitting a line to each edge segment with a specified tolerance, and group neighboring segments according to the underlying connected components (CCs). Every triplet of segments within a CC forms one of the four basic TAS types shown in Figure 3(a). The first segment is the one whose midpoint is closest to the triplet centroid; the second and third segments take positions 2 and 3 and are ordered from left to right. An example ordering of a typical TAS can be seen in Figure 2(d). Once the order of the three segments has been established, the descriptor of the TAS is composed of 10 values (4k - 2) given by Equation 1:

( r2x/Nd , r2y/Nd , r3x/Nd , r3y/Nd , θ1 , θ2 , θ3 , l1 , l2 , l3 )    (1)

where ri = (rix, riy) denotes the vector going from the midpoint of s1 to the midpoint of si, and θi and li represent the orientation and length of segment si. The distance Nd between the two farthest midpoints is used as a normalization factor, making the descriptor scale-invariant. With such a descriptor, only the three segments are described, and not other nearby edgels. In this fashion, we can cleanly encode a portion of a character boundary without including the inner/outer clutter, as shown in Figure 2(c). SimpleKMeans, available in the Weka data mining package,25 is used to cluster the TASs extracted from the zones using the above features. From each cluster we select an exemplary codeword, which is the TAS instance closest to the center of the cluster. In addition, each exemplary codeword is associated with a cluster radius, defined as the maximum distance from the cluster center to the other TASs within the cluster. The final codebook C is composed of all exemplary TAS codewords. Through clustering, translated, scaled and rotated versions of the TAS feature types are grouped together.
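To make the descriptor and codebook construction concrete, the following Python sketch computes the 10-value TAS descriptor of Equation 1 and clusters descriptors into exemplar codewords with radii. It is an illustrative sketch only: the paper uses Weka's SimpleKMeans, for which scikit-learn's KMeans is substituted here, and the per-segment input format (midpoint, orientation, length) is an assumption.

import numpy as np
from sklearn.cluster import KMeans


def tas_descriptor(segments):
    """Compute the 10-value TAS descriptor of Equation 1 for a triplet of
    approximately straight segments.

    Each segment is a tuple (midpoint_xy, orientation, length); this input
    format is assumed for illustration.
    """
    mids = np.array([s[0] for s in segments], dtype=float)
    centroid = mids.mean(axis=0)
    # s1: segment whose midpoint is closest to the triplet centroid
    first = int(np.argmin(np.linalg.norm(mids - centroid, axis=1)))
    # s2, s3: remaining segments ordered left to right by midpoint x
    rest = sorted((i for i in range(3) if i != first), key=lambda i: mids[i][0])
    order = [first] + rest

    m = mids[order]
    theta = np.array([segments[i][1] for i in order])
    length = np.array([segments[i][2] for i in order])

    r2, r3 = m[1] - m[0], m[2] - m[0]   # vectors from the midpoint of s1
    # Nd: distance between the two farthest midpoints (scale normalization)
    nd = max(np.linalg.norm(m[a] - m[b]) for a in range(3) for b in range(a + 1, 3))
    nd = nd if nd > 0 else 1.0
    # lengths are kept unnormalized, matching the form of Equation 1
    return np.hstack([r2 / nd, r3 / nd, theta, length])


def build_codebook(descriptors, num_codewords):
    """Cluster TAS descriptors; return exemplar codewords and cluster radii."""
    descriptors = np.asarray(descriptors)
    km = KMeans(n_clusters=num_codewords, n_init=10, random_state=0).fit(descriptors)
    codewords, radii = [], []
    for c in range(num_codewords):
        members = descriptors[km.labels_ == c]
        dists = np.linalg.norm(members - km.cluster_centers_[c], axis=1)
        codewords.append(members[np.argmin(dists)])   # TAS closest to cluster center
        radii.append(float(dists.max()))               # maximum distance to a member
    return np.array(codewords), np.array(radii)

One codebook would be built from the printed training zones and one from the handwritten training zones, giving the two sets of exemplars and radii used in the next subsection.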

Figure 2. (a) Sample image from our dataset with segmented zones using Voronoi++ (b) Color contour image of a text zone (c) Contours grouped to form different TASs (d) Order of segments in a TAS

Figure 3. (a) Basic TAS types (b) Histogram based zone descriptor

3.2 Computing the Zone Descriptor

For each segmented zone, we construct a descriptor that captures the frequency of occurrence of each TAS codeword. For each detected TAS feature, we increment the count of the codebook entry nearest to it, but only when the distance between the TAS feature and that nearest entry is less than the corresponding cluster radius (Equation 2):

D(Ta, Ck) < rk    (2)

where rk is the radius of the cluster for which Ck is the exemplar. We concatenate the two normalized histograms (Figure 3(b)), obtained using the printed and handwritten codebooks, into a single feature vector for each zone.
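A minimal sketch of this step, continuing the hypothetical codebook representation from the sketch in Section 3.1: each TAS in a zone votes for its nearest codeword only when Equation 2 holds, and the two normalized histograms are concatenated.

import numpy as np


def zone_descriptor(zone_tas, printed_codebook, handwritten_codebook):
    """Concatenate normalized codeword histograms for one zone.

    zone_tas: array of TAS descriptors extracted from the zone.
    Each codebook is a (codewords, radii) pair as returned by build_codebook.
    """
    def histogram(codewords, radii):
        hist = np.zeros(len(codewords))
        for t in zone_tas:
            dists = np.linalg.norm(codewords - t, axis=1)
            k = int(np.argmin(dists))
            if dists[k] < radii[k]:       # Equation 2: D(Ta, Ck) < rk
                hist[k] += 1.0
        total = hist.sum()
        return hist / total if total > 0 else hist

    return np.hstack([histogram(*printed_codebook),
                      histogram(*handwritten_codebook)])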

3.3 Zone Classification

The purpose of zone classification is to label each segmented zone as one of a set of predefined types, such as text, image, graphic or table.26 We use a variant of SVM called ν-SVM, available in the libSVM package,27 to train a two-class classifier for machine-printed and handwritten zones. The main advantage of ν-SVM is that the user-chosen error penalty parameter C is replaced by the parameter ν, which allows us to control the number of support vectors and helps achieve better generalization on test data. The ν-SVM formulation is given by Equation 3:

minimize   F′ = (1/2) ∥w′∥² − νρ′ + (1/l) Σi ξ′i    (3)

with respect to w′, b′, ρ′ and ξ′i, subject to: yi(w′ · xi + b′) ≥ ρ′ − ξ′i, ρ′ ≥ 0, ξ′i ≥ 0. Here w′ is the normal to the separating hyperplane, xi ∈ RN, i = 1, 2, ..., l, is the mapped input data and yi ∈ {−1, +1} is the corresponding label; b′ and ρ′ are scalars, and the ξ′i are slack variables. Since the value of ν lies between 0 and 1, it is easier to search for its optimal value than for the parameters of a standard SVM. The decision function used to assign labels is then f′(x) = sgn(w′ · x + b′).
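For completeness, the training step can be sketched with scikit-learn's NuSVC, which wraps the same libSVM ν-SVC solver the paper uses; the RBF kernel and the ν grid below are illustrative assumptions rather than values reported by the authors.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import NuSVC


def train_zone_classifier(features, labels):
    """Train a two-class nu-SVM on concatenated codeword histograms.

    features: (n_zones, 2 * codebook_size) array of zone descriptors.
    labels:   0 for machine-printed zones, 1 for handwritten zones.
    Kernel choice and the nu grid are illustrative, not taken from the paper.
    """
    search = GridSearchCV(NuSVC(kernel="rbf", gamma="scale"),
                          {"nu": [0.1, 0.2, 0.3, 0.4, 0.5]},
                          cv=10)   # 10-fold cross validation, as in Section 4.2
    search.fit(np.asarray(features), np.asarray(labels))
    return search.best_estimator_

# usage (hypothetical data):
# clf = train_zone_classifier(zone_features, zone_labels)
# predicted = clf.predict(test_features)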

3.4 Page Classification

Page classification is based on zone-wise voting of printed and handwritten text zones. We use the count of edge pixels in each zone to decide whether the page has primarily printed content. Using edge pixels as the voting weight minimizes the effect of any clutter or non-text components present in a zone, because text regions have a relatively higher edge-pixel density than non-text elements such as blobs, logos and figures. We also experimented with the count of TAS structures extracted from each zone as the voting weight. Since only those TAS structures which lie inside the pre-determined radius of a codeword are considered valid, the codebook acts as a filter on TASs extracted from non-text regions; hence the count of valid TASs in a zone is also a good voting metric. We first classify all zones present in the document image to obtain their labels. If the percentage of edge pixels (or TASs) that falls in printed zones is greater than a fixed threshold, the page is classified as printed.
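The page-level decision reduces to a weighted vote. The sketch below uses the edge-pixel count as the weight, with the valid-TAS count as a drop-in alternative; the per-zone record and its field names are a hypothetical illustration, while the 0.5 threshold mirrors Section 4.3.

def classify_page(zones, threshold=0.5, weight="edge_pixels"):
    """Vote at page level: 'printed' if printed zones hold more than a fixed
    fraction of the chosen weight (edge pixels or valid-TAS count).

    zones: list of dicts such as
        {"label": "printed" or "handwritten",
         "edge_pixels": int, "valid_tas": int}
    (a hypothetical per-zone record, for illustration only).
    """
    zones = list(zones)
    printed = sum(z[weight] for z in zones if z["label"] == "printed")
    total = sum(z[weight] for z in zones)
    if total == 0:
        return "non-printed"
    return "printed" if printed / total > threshold else "non-printed"


# Example with hypothetical zone records:
# classify_page([{"label": "printed", "edge_pixels": 12000, "valid_tas": 410},
#                {"label": "handwritten", "edge_pixels": 3000, "valid_tas": 90}],
#               weight="valid_tas")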

4. EXPERIMENTAL RESULTS AND EVALUATION

4.1 Dataset and Protocol

We manually annotated a data set of 10,946 Arabic document images to obtain images with primarily machine-printed content. Subjects were asked to visually evaluate each image and classify it as printed if 70% or more of the total content was printed Arabic. A total of 3774 printed pages were obtained. Non-print images typically consisted of handwritten zones along with signatures, logos and stamps. We randomly selected 732 images for training and 625 images for the test set. A relatively clean subset of 62 images from the training set was used to obtain a total of 310 codewords of printed and handwritten text.

Table 1. Machine-print and Handwritten Text Zone Classification Results

         #Print Zones   #Handwritten Zones   Zone-level Accuracy   Pixel-level Accuracy
Train    1162           1122                 91%                   -
Test     1038           1838                 -                     98.2%

Zones were manually labeled using the ground-truthing tool GEDI28 as Printed-Arabic, Handwritten-Arabic or Mixed-type; zones containing both printed and handwritten text were labeled Mixed-type. There were 182 Mixed-type zones in total, and the counts of printed and handwritten zones are given in Table 1. To assess the robustness of the method to the size of text zones, we divided the test data into smaller sets based on the number of text lines (Table 2). We report both zone-level and pixel-level accuracies.

4.2 Zone Classification Results

Table 1 shows the results of our experiments for zone classification. The accuracy for the training set refers to the average percentage of zones correctly classified in the held-out data using 10-fold cross validation during ν-SVM training. The accuracy for the test set refers to the percentage of correctly classified pixels in the test images; hence, a larger zone with more text content carries more weight than a smaller zone in our evaluation. Figure 4 shows the accuracies obtained on the partitioned data sets. We can see that the proposed method is reasonably robust down to a single text line but performs poorly when the zone contains only a few characters. When a zone has only a few characters, very few TASs are extracted, which makes the corresponding histogram sparse and less informative, and the accuracy drops.


Figure 4. Accuracies on different partitions based on zone size

Table 2. Size of zone based on text lines for partitions 1-5

Partitions   1   2   3   4   5
Textlines                    8

4.3 Printed Page Detection Results

Table 3 shows the page-level classification accuracy for printed documents using the two metrics discussed in Section 3.4. The threshold for both counts was empirically set to 0.5. Since our current system does not automatically label mixed-type zones, we experimented both with including and with excluding them in the voting process. The first two rows of Table 3 give the results when mixed-type zones were automatically labeled as either printed or handwritten by our method and included in the voting; the last two rows give the results when these zones did not participate in the classification. The slightly lower recall in both cases is mainly due to borderline documents where the dominance of printed content was quite subjective. The higher precision in the first case can be attributed to our method's ability to classify a mixed zone according to its dominant content. The average time taken to classify a single page (dimensions: 5000x4000) is 2.5 seconds on a P4 machine with 3 GB RAM; as expected, computation of the TAS features takes most of the time.

Table 3. Machine Print Page Detection Results

                      Metric   #Print Pages   #Non-print Pages   Precision   Recall   F1 score
Mixed-Type Included   Edge     307            318                98.13       70.2     81.84
Mixed-Type Included   TAS      307            318                98.15       71.4     82.66
Mixed-Type Excluded   Edge     307            318                96.2        71.0     81.7
Mixed-Type Excluded   TAS      307            318                96.3        70.0     81.07

5. CONCLUSION AND FUTURE WORK

In this paper we presented a novel approach for extracting handwritten and machine-printed text in noisy monochromatic documents that may contain other types of content. This separation is a prerequisite to any text-line or word segmentation in the processing of such documents. We used features extracted from local shape properties of the script, which are invariant to scale and rotation, and we discussed and demonstrated the advantages of using a shape codebook for text extraction. The main advantage of our approach is that it is robust to noise and to the size of zones. Since the training and test data are partitioned at the document level in our experiments, the results also show that the TAS features are robust to variation in handwriting across writers. Our results with two different metrics for page classification show that our approach is effective for the triage of large data sets. In the future, we plan to encode the spatial constraints of these features as additional information to further improve the classification results. We would also like to demonstrate that the proposed method is applicable to text extraction for other scripts.

ACKNOWLEDGMENTS

The partial support of this research by DARPA through BBN/DARPA Award HR0011-08-C-0004 under subcontract 9500009235, and by the US Government through NSF Award IIS-0812111, is gratefully acknowledged.

REFERENCES

[1] Pal, V. and Chaudhuri, B., "Machine-printed and handwritten text lines identification," Pattern Recognition Letters 22, 431–441 (2001).
[2] Kumar, J., Abd-Almageed, W., Kang, L., and Doermann, D., "Handwritten Arabic text line segmentation using affinity propagation," in [Proc. Document Analysis Systems], 135–142 (2010).
[3] Zheng, Y., Li, H., and Doermann, D., "Machine printed text and handwriting identification in noisy document images," IEEE Trans. PAMI 26(3), 337–353 (2004).
[4] Chinnasarn, K., Rangsanseri, Y., and Thitimajshima, P., "Removing salt-and-pepper noise in text/graphics images," in [Asia-Pacific Conference on Circuits and Systems], 459–462 (1998).
[5] Fan, K. C., Wang, Y. K., and Lay, T. R., "Marginal noise removal of document images," in [Proc. Sixth Intl. Conf. Document Analysis and Recognition], 317–321 (2001).
[6] Agarwal, M. and Doermann, D., "Clutter noise removal in binary document images," in [Proc. Intl. Conf. on Document Analysis and Recognition], 556–560 (2009).
[7] Wang, Q. and Tan, C. L., "Matching of double-sided document images to remove interference," in [Proc. Intl. Conf. on Comp. Vision and Pattern Recognition], 1, 184–189 (2001).
[8] Etemad, K., Doermann, D., and Chellappa, R., "Multiscale document page segmentation using soft decision integration," IEEE Trans. Pattern Analysis and Machine Intelligence 19(1), 92–96 (1997).
[9] Jain, A. and Bhattacharjee, S., "Text segmentation using Gabor filters for automatic document processing," Machine Vision and Applications 5, 169–184 (1992).
[10] Ali, M., "Background noise detection and cleaning in document images," in [Proc. 13th Intl. Conf. Pattern Recognition], 3, 758–762 (1998).
[11] Abuhaiba, I., Mahmoud, S., and Green, R., "Recognition of handwritten cursive Arabic characters," IEEE Trans. PAMI 16, 664–672 (1994).
[12] Jain, A. and Yu, B., "Document representation and its application to page decomposition," IEEE Trans. Pattern Analysis and Machine Intelligence 20(3), 294–308 (1998).
[13] Ferrari, V., Fevrier, L., Jurie, F., and Schmid, C., "Groups of adjacent contour segments for object detection," IEEE Trans. PAMI 30, 36–51 (2008).

[14] Yu, X., Li, Y., Fermuller, C., and Doermann, D., "Object detection using shape codebook," in [British Machine Vision Conference], 1–10 (2007).
[15] Zhu, G., Yu, X., Li, Y., and Doermann, D., "Unconstrained language identification using a shape codebook," in [ICFHR], 13–18 (2008).
[16] Agarwal, M. and Doermann, D., "Voronoi++: A dynamic page segmentation approach based on Voronoi and Docstrum features," in [Proc. Intl. Conf. on Document Analysis and Recognition], 1011–1015 (2009).
[17] Fan, K., Wang, L., and Tu, Y., "Classification of machine-printed and handwritten texts using character block layout variance," Pattern Recognition 31(9), 1275–1284 (1998).
[18] Srihari, S., Shim, Y., and Ramanprasad, V., "A system to read names and addresses on tax forms," Technical Report, CEDAR (1994).
[19] Guo, J. and Ma, M., "Separating handwritten material from machine printed text using hidden Markov models," in [Proc. Intl. Conf. Document Analysis and Recognition], 439–443 (2001).
[20] Chanda, S., Franke, K., and Pal, U., "Structural handwritten and machine print classification for sparse content and arbitrary oriented document fragments," in [Proc. of ACM Symposium on Applied Computing], 18–22 (2010).
[21] Kuhnke, K., Simoncini, L., and Kovacs-V., Z., "A system for machine-written and hand-written character distinction," in [Proc. Intl. Conf. Document Analysis and Recognition], 811–814 (1995).
[22] Amin, A., "Off-line Arabic character recognition: The state of the art," Pattern Recognition 31, 517–530 (1998).
[23] Zahour, A., Likforman-Sulem, L., Boussellaa, W., and Taconet, B., "Text line segmentation of historical Arabic documents," in [Proc. Intl. Conf. Document Analysis and Recognition], 1, 138–142 (2007).
[24] Canny, J., "A computational approach to edge detection," IEEE Trans. Pattern Analysis and Machine Intelligence 8(6), 679–697 (1986).
[25] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H., "The WEKA data mining software: An update," SIGKDD Explorations 11 (2009).
[26] Wang, Y., Phillips, I. T., and Haralick, R. M., "Document zone content classification and its performance evaluation," Pattern Recognition 39(1), 57–73 (2006).
[27] Chang, C.-C. and Lin, C.-J., "LIBSVM: A library for support vector machines," (2001). Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[28] Doermann, D., Zotkina, E., and Li, H., "GEDI - a groundtruthing environment for document images," in [Intl. Workshop on Document Analysis Systems (DAS 2010)], (2010).