A Density-based Approach for Text Extraction in Images

Fang Liu, Xiang Peng, Tianjiang Wang, Songfeng Lu
School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, 430074, China
Email: [email protected]

Abstract

In this paper we describe a new approach for distinguishing and extracting text from images containing various objects and complex backgrounds. The goal of our approach is to present the characters in an image on a clean background, free of other objects. The proposed approach consists of two main steps. First, a density-based clustering method segments candidate characters by integrating the spatial connectivity and color features of character pixels. In most images, the pixel colors within a single character are non-uniform due to noise, so a new histogram segmentation method is proposed in this step to obtain the color thresholds of characters. Second, prior knowledge and a texture-based method are applied to the candidate characters to filter out non-characters. Experimental results show that the proposed approach achieves a good character extraction rate.

1. Introduction

Findings in [1] show that 42% of 256 randomly selected web images contain text, and 78% of all the words in those images are non-stop words. Many images therefore contain abundant text, and this text can be a valuable source of high-level semantics for image indexing and retrieval. However, it is difficult to accurately recognize characters in complex images by directly applying conventional OCR systems, so text extraction techniques have been proposed to preprocess images containing text. The goal of such a technique is to provide a standard OCR system with clean binary images that contain only characters. Existing text detection and segmentation methods fall into two main categories: texture-based and CC-based (connected component based)

methods. Texture-based approaches rely on the fact that text regions have a distinctive texture. In general, texture-based methods are more robust than CC-based methods, but the high complexity of texture segmentation is the main problem when processing large images [2], and inaccurate text location is another drawback of these methods [3]. CC-based approaches include edge-based methods and color analysis. Edge-based methods analyze the geometrical arrangement of the edges that belong to characters, but they cannot effectively handle images with complex backgrounds [4, 5]. Color analysis examines the homogeneous color/grayscale pixels that belong to characters. Different color-based clustering methods are employed in [6]; their drawback is that they are sensitive to noise and character size [7]. Other researchers focus on text segmentation [7, 8], but these methods only process text-block images.

The aim of our approach is to split the original image into several parts, each containing only characters rendered in white on a black background. Our approach is based on color clustering, and we employ two techniques to address the drawbacks mentioned above. (1) In most cases a character appears visually uniform in color, yet it is composed of pixels with non-uniform color due to noise. We propose a histogram segmentation method that eliminates the influence of noise by determining the grayscale thresholds of a character. (2) Most color clustering methods fail to find characters of varying size. We address this by exploiting the density property of a character: each character can be viewed as an arbitrarily shaped region composed of spatially connected pixels with similar grayscale. Therefore, an algorithm based on DBSCAN [9] (a density-based clustering algorithm) is proposed to ensure that all connected parts are clustered into one region. Experimental results show that it works well when the font size is not too small.

The remainder of the paper is organized as follows. Section 2 introduces our method in detail, Section 3 presents the experimental results, and conclusions are given in Section 4.

This work was partially supported by HTRDP (Hi-Tech Research and Development Program of China) 2007AA01Z161.

978-1-4244-2175-6/08/$25.00 ©2008 IEEE

2. Description of the Proposed Approach

2.1 Candidate Characters Extraction

DBSCAN is a spatial clustering algorithm. We employ it in image analysis because an image can be regarded as a set of data points in 2-D space. However, a pixel carries not only a spatial position but also a color feature.

Pixel_clustering(Eps, Minpts)
 1. ColorLabel();
 2. read the first pixel p = I(1,1);
 3. if p is not a core pixel, then label p as 'noisy';
 4. if p is a core pixel, then label p with a new class ID C(p);
 5. for each pixel q within radius Eps of p
 6.     if ColorL(q) = ColorL(p) then
 7.         label q with C(p) and
 8.         add q to list A
 9. for each q in list A
10.     if q is a core pixel
11.         for each o within radius Eps of q
12.             if ColorL(o) = ColorL(q) then
13.                 if o is labeled 'noisy', then relabel o with C(p)
14.                 if o is unlabeled, then label o with C(p) and add it to list A
15. if there exists an unlabeled pixel p, read it and go to step 2
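To make the listing concrete, the following is a minimal Python sketch of the same density-based expansion, not the authors' implementation. The square (2*Eps+1) neighborhood window, the row-major scan order, and the grid representation of ColorL labels are our assumptions; the paper does not fix them.

```python
from collections import deque

def neighbors(r, c, eps, h, w):
    # All pixels within a (2*eps + 1) square window around (r, c); the window
    # shape is an assumption, since the paper only speaks of "radius Eps".
    for i in range(max(0, r - eps), min(h, r + eps + 1)):
        for j in range(max(0, c - eps), min(w, c + eps + 1)):
            if (i, j) != (r, c):
                yield i, j

def pixel_clustering(color_l, eps, minpts):
    """DBSCAN-style pixel clustering. color_l is a 2-D list of ColorL labels.
    Returns a grid of class IDs (>= 1); -1 marks noisy pixels."""
    h, w = len(color_l), len(color_l[0])
    labels = [[0] * w for _ in range(h)]            # 0 = unlabeled
    next_id = 1
    for r in range(h):
        for c in range(w):
            if labels[r][c] != 0:
                continue
            same = [(i, j) for i, j in neighbors(r, c, eps, h, w)
                    if color_l[i][j] == color_l[r][c]]
            if len(same) < minpts:                  # p is not a core pixel
                labels[r][c] = -1                   # tentatively noisy
                continue
            labels[r][c] = next_id                  # new class ID C(p)
            queue = deque()
            for i, j in same:                       # steps 5-8 of the listing
                if labels[i][j] == 0:
                    queue.append((i, j))
                if labels[i][j] <= 0:
                    labels[i][j] = next_id
            while queue:                            # steps 9-14: expand core pixels
                qr, qc = queue.popleft()
                sq = [(i, j) for i, j in neighbors(qr, qc, eps, h, w)
                      if color_l[i][j] == color_l[qr][qc]]
                if len(sq) >= minpts:               # q is a core pixel
                    for i, j in sq:
                        if labels[i][j] == 0:
                            queue.append((i, j))
                        if labels[i][j] <= 0:       # relabel noisy or unlabeled
                            labels[i][j] = next_id
            next_id += 1
    return labels
```

On a toy binary image, every spatially connected run of same-ColorL pixels ends up in a single cluster regardless of its size, which is the property the paper needs for characters of varying size.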

Figure 1. A magnified binary image (40*40) containing one Chinese character. Pixel p is labeled ColorL='b', the same as the other black pixels; p is a core pixel for the current Eps if Minpts is 10.

The terms used in pixel clustering are illustrated in Fig. 1.
(1) Eps: one parameter of the pixel clustering algorithm, representing the search radius around a given pixel.
(2) ColorL: in our pixel clustering algorithm, pixels with similar color are assigned the same color label, called ColorL. The ColorL of a pixel p is denoted ColorL(p) (ColorL(p)='b' in Fig. 1). Color clustering of pixels is performed by a histogram analysis method, described in Section 2.2.
(3) Minpts: the other parameter of the algorithm. If the number of pixels with ColorL(p) within radius Eps of p is equal to or larger than Minpts, then p is a core pixel (p in Fig. 1 is a core pixel).

Given an image I of size m*n, the DBSCAN-based pixel clustering algorithm is listed above. The function ColorLabel() assigns a ColorL to each pixel; its details are described in Section 2.2.

2.2 Color Labelling

In the candidate character finding process, the ColorLabel() function is implemented with a color histogram segmentation method. For histogram segmentation, a non-parametric algorithm does not need to estimate the underlying number of segments in advance. MeanShift [10] is a commonly used one, but it tends to find too many false peaks in noisy images. FTC [11] is another excellent non-parametric method that segments the histogram properly, but its iterative process is time-consuming. In this paper, we propose a non-parametric histogram segmentation algorithm that is efficient and achieves good-quality results.
Determining peaks and valleys is a difficult problem in histogram segmentation because of noise. To avoid finding false peaks, the proposed algorithm performs the segmentation with the help of the gradient of the 1-D histogram. There are many false (noisy) peaks in the decreasing part of the uni-modal gray-level histogram shown in Fig. 2, but the noise does not affect the gradient of the histogram: on the interval [z, b], the gradient curve stays below the zero line. (In this paper, the gradient of the 1-D histogram is computed with the mask [-1, -2, -3, -4, 0, 4, 3, 2, 1].) By observing histograms (H) and their 1-D gradients (HG), we conclude the following.
(1) The gray-level of a peak (p) in H must lie between the gray-levels of a positive peak (p1) and the next negative valley (v1) to its right in HG. The peak can be precisely located by searching for the maximum of H on that interval.

(2) Valleys of H lie between two peaks and can be located by searching for the minimum of H between the two peaks.
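Facts (1) and (2) can be sketched in Python as follows. The mask is the one quoted in the text; truncating the mask at the histogram boundaries and the exact local-extremum tests used to pick p1 and v1 are our assumptions, since the paper does not specify them.

```python
def histogram_gradient(hist):
    # 1-D gradient of the histogram via the mask from the text,
    # truncated at the boundaries (an assumption).
    mask = [-1, -2, -3, -4, 0, 4, 3, 2, 1]
    half = len(mask) // 2
    grad = []
    for i in range(len(hist)):
        g = 0
        for k, m in enumerate(mask):
            j = i + k - half
            if 0 <= j < len(hist):
                g += m * hist[j]
        grad.append(g)
    return grad

def find_peaks(hist):
    """Fact (1): a peak of H lies between a positive peak p1 of the gradient
    and the next negative valley v1; locate it as the maximum of H there."""
    grad = histogram_gradient(hist)
    n = len(hist)
    peaks = []
    i = 1
    while i < n - 1:
        if grad[i] > 0 and grad[i] >= grad[i - 1] and grad[i] > grad[i + 1]:
            p1 = i
            j, v1 = p1 + 1, None
            while j < n - 1:                      # next negative local minimum
                if grad[j] < 0 and grad[j] <= grad[j - 1] and grad[j] < grad[j + 1]:
                    v1 = j
                    break
                j += 1
            if v1 is not None:
                seg = hist[p1:v1 + 1]
                peaks.append(p1 + seg.index(max(seg)))
                i = v1                            # skip past this mode
        i += 1
    return peaks

def find_valleys(hist, peaks):
    # Fact (2): a valley is the minimum of H between two consecutive peaks.
    valleys = []
    for a, b in zip(peaks, peaks[1:]):
        seg = hist[a:b + 1]
        valleys.append(a + seg.index(min(seg)))
    return valleys
```

The gradient mask spans nine bins, so small noisy wiggles in H are averaged out before any peak decision is made, which is exactly why the method is less sensitive to noise than peak-picking directly on H.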

Figure 2. Gray-level histogram of an image with noise, and the gradient of the histogram.


Figure 3. (a) Segmentation result comparison between FTC and our algorithm; (b) segmentation result comparison between MeanShift and our algorithm. In (a) and (b), solid lines represent segmentation points found by both algorithms; dashed lines represent segmentation points found only by our algorithm. Dotted lines represent segmentation points found only by FTC in (a) and only by MeanShift in (b).

Fig. 3 compares the segmentation results of MeanShift, FTC, and our algorithm on the image Lena (512 × 512). Fig. 3(a) shows that FTC and our algorithm obtain almost the same results, while Fig. 3(b) shows that MeanShift over-segments the histogram. We ran the three algorithms on 53 images. MeanShift finds the largest number of peaks among the three and FTC the smallest. The segmentation points found by FTC and our algorithm lie at almost the same gray-levels, with an average distance within [0, 3] gray-level units.

2.3 Characters Filtering

This step selects real characters from the candidate characters. It first applies simple heuristics to remove obvious non-characters. The rules are applied to the MBRs (Minimum Bounding Rectangles) of the character regions:
(1) the ratio of the width to the length of the MBR is larger than 1.2;
(2) the ratio of text pixels to background pixels in the MBR is less than 0.5;
(3) the number of pixels in the MBR is less than 150.
All these parameters are selected empirically. After applying the above rules, we use a texture-based method to refine the result. The Haar wavelet is chosen for its effectiveness shown in [12]; the third-order wavelet moment of each candidate character region is checked to verify whether it is a character or not.
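The three rules above can be sketched as a single predicate. Treating the rules as disjunctive (any one firing discards the region) and taking "number of pixels in the MBR" to mean the MBR area are our readings of the text, not stated explicitly by the authors.

```python
def is_obvious_non_character(width, length, text_pixels, background_pixels):
    """Heuristic MBR filter; returns True for regions to discard.
    The thresholds 1.2, 0.5 and 150 are the paper's empirical values."""
    if length > 0 and width / length > 1.2:        # rule (1): aspect ratio
        return True
    if background_pixels > 0 and text_pixels / background_pixels < 0.5:  # rule (2)
        return True
    if width * length < 150:                       # rule (3): MBR too small
        return True
    return False
```

For example, a 10x20 MBR with 150 text pixels over 50 background pixels passes all three rules, while a long thin 30x10 MBR is discarded by rule (1).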

3. Experimental Results

Our algorithm is implemented in C++ under Windows XP on a PC with a 2.5 GHz Celeron and 512 MB of memory. The test dataset contains 120 color images, all collected from the web. Among them, 80 images are downloaded from the web site of Huazhong University of Science and Technology; most of them are logos and icons. Another 20 images are from internet bookshops; they are covers of CDs and books. The image resolutions range from 50*90 to 962*200 pixels. All images are converted to grayscale beforehand. Fig. 4 and Fig. 5 show input images and the corresponding results.
To measure accuracy, the recall rate of characters and the false alarm rate of regions are calculated to evaluate the algorithm's performance. Table 1 gives the details with Eps = min(image width, image height)/50 ("small font" refers to characters with fewer than 200 pixels). The two rates are defined as follows:
RRC (Recall Rate of Characters) = number of extracted characters / number of characters in images * 100%
FRR (False alarm Rate of Regions) = number of non-character regions / number of total regions * 100%

Table 1. Extraction performance for characters

             Total characters  Extracted characters  RRC   Character regions  Extracted regions  FRR
Total              963                720             75%        159                166           4.2%
Large font         857                698             81%         -                  -             -
Small font         106                 22             21%         -                  -             -
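The two rates follow directly from the definitions above; a quick sanity check in Python reproduces the Table 1 figures. Reading the 166 - 159 = 7 surplus extracted regions as the non-character (false alarm) regions is our interpretation of the table.

```python
def rrc(extracted_characters, total_characters):
    # Recall Rate of Characters, per the definition in the text
    return 100.0 * extracted_characters / total_characters

def frr(non_character_regions, total_regions):
    # False alarm Rate of Regions, per the definition in the text
    return 100.0 * non_character_regions / total_regions

# Table 1, "Total" row: 720 of 963 characters extracted; of 166 extracted
# regions, 159 are character regions, leaving 7 false alarms.
```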


Figure 4. Examples of extraction results


Figure 5. (a) Original image; (b) result with Eps = min(image width, image height)/20; (c) result with Eps = min(image width, image height)/50. Characters are mixed with other objects when Eps is large.

4. Conclusion

The proposed approach extracts characters from images with complex backgrounds and other objects. A density-based clustering method and a new histogram segmentation method are employed to find and segment characters. Experimental results show that the approach performs well when the font size is normal or large. Developing a method that can handle small fonts is our future work.

References

[1] T. Kanungo, C. H. Lee, and R. Bradford, "What Fraction of Images on the Web Contain Text?", in Proceedings of the First International Workshop on Web Document Analysis, A. Antonacopoulos and J. Hu, eds., World Scientific Publishing Co., New Jersey, 2001, pp. 43-46.
[2] H. Li, D. Doermann, and O. Kia, "Automatic text detection and tracking in digital video", IEEE Transactions on Image Processing, 9(1), 2000, pp. 147-156.
[3] Y. Zhong, K. Karu, and A. K. Jain, "Locating text in complex color images", Pattern Recognition, 28(10), 1995, pp. 1523-1536.
[4] J. Xi, X.-S. Hua, X.-R. Chen, et al., "A Video Text Detection and Recognition System", in Proc. of ICME 2001, Waseda University, Japan, 2001, pp. 1080-1083.
[5] Q. Yuan and C. L. Tan, "Page Segmentation and Text Extraction from Grey-Scale Images in Micro Film Format", SPIE Proc. on Document Recognition and Retrieval, 4(2), 2000, pp. 323-332.
[6] C. Garcia and X. Apostolidis, "Text detection and segmentation in complex color images", in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, vol. 4, 2000, pp. 2326-2329.
[7] Y. Zhan, W. Wang, and W. Gao, "A Robust Split-and-Merge Text Segmentation Approach for Images", in Proceedings of the International Conference on Pattern Recognition, vol. 2, 2006, pp. 1002-1005.
[8] M. Yokobayashi and T. Wakahara, "Binarization and Recognition of Degraded Characters Using a Maximum Separability Axis in Color Space and GAT Correlation", in Proceedings of the International Conference on Pattern Recognition, vol. 2, 2006, pp. 885-888.
[9] M. Ester, H.-P. Kriegel, J. Sander, et al., "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases", in Proceedings of the International Conference on Knowledge Discovery and Data Mining, AAAI Press, 1996, pp. 226-231.
[10] D. Comaniciu and P. Meer, "Mean Shift: A Robust Approach toward Feature Space Analysis", IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5), 2002, pp. 603-619.
[11] J. Delon, A. Desolneux, J. Lisani, et al., "A Nonparametric Approach for Histogram Segmentation", IEEE Transactions on Image Processing, 16(1), 2007, pp. 253-261.
[12] Y. Liu, S. Goto, and T. Ikenaga, "A Robust Algorithm for Text Detection in Color Images", in Proceedings of the International Conference on Document Analysis and Recognition, vol. 1, 2005, pp. 399-403.
