A New Method for Handwritten Scene Text Detection in Video
Palaiahnakote Shivakumara (a), Anjan Dutta (b), Umapada Pal (c) and Chew Lim Tan (a)

(a) School of Computing, National University of Singapore, Singapore
(b) Computer Vision Center, Universitat Autònoma de Barcelona, Barcelona, Spain
(c) Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, India

(a) {shiva,tancl}@comp.nus.edu.sg, (b) [email protected], (c) [email protected]

Abstract

There are many video images in which handwritten text may appear. Handwritten scene text detection in video is therefore essential and useful for many applications, such as efficient indexing and retrieval. Moreover, in many video frames the text lines are multi-oriented in nature. To the best of our knowledge, there is no prior work on detecting handwritten text in video that is multi-oriented in nature. In this paper, we present a new method based on maximum color difference and a boundary growing method for the detection of multi-oriented handwritten scene text in video. The method computes the maximum color difference on the average of the R, G and B channels of the original frame to enhance the text information. The output of the maximum color difference is fed to a K-means algorithm with K=2 to separate text and non-text clusters. Text candidates are obtained by intersecting the text cluster with the Sobel edge map of the original frame. To tackle the fundamental problem of different orientations and skews of handwritten text, a boundary growing method based on a nearest neighbor concept is employed. We evaluate the proposed method on our own handwritten text database and on publicly available video data (Hua's data). Experimental results obtained from the proposed method are promising.
1. Introduction

Handwritten text appears in many video images, for example on cards, walls and paper. To understand handwritten text in video, it is necessary to detect the text as the first step and then localize it so that it can be sent to an OCR engine for recognition. Compared with graphic and scene text in video, and with handwritten text in camera images, handwritten text in video poses an even higher level of challenge, mainly because of problems such as small size, slant, low resolution, complex background and broken characters, which are often skewed and in multiple orientations.

Handwritten text recognition and extraction in document images captured by high-resolution cameras or scanners has been done successfully, according to the literature. Zheng et al. [1] treat machine-printed text, handwritten text and noise as separate classes and use a Fisher classifier for this purpose. Automatic recognition of handwritten medical forms for search engines is proposed by Milewski et al. [2]. A robust two-level classification algorithm for text localization, based on invariant moments along with an SVM classifier, is proposed by Kandan et al. [3]. In the same vein, Milewski and Govindaraju [4] present a method for extracting handwritten text from images of carbon copies of medical forms using mask operations at different levels to separate the foreground text from its background. The above methods work on document images with plain backgrounds; the images are of higher resolution than video and the backgrounds are much simpler too. A method closer to our present work is the handwritten text extraction from video by Tang and Kender [5]. However, this method also assumes a plain background with clear spaces between text lines. Hence, none of the above methods is really suitable for general handwritten text detection in video.

Methods for graphics and scene text detection in video [6, 7] are mainly categorized as connected component based [8], texture based [9-12], and edge and gradient based [13-16]. Connected component based methods are generally simple but are not robust in the presence of cluttered background, low resolution and small fonts, as they rely on the geometrical properties of connected components. On the other hand, texture based methods handle cluttered backgrounds well, but they are sensitive to text size and to text-like features in the background, as they treat text appearance as a special texture and use expensive classifiers trained on a large number of samples to separate text and non-text regions.
Finally, edge and gradient based methods are becoming popular in terms of speed and accuracy in text detection. However, they easily pick up high-contrast edges as false positives, because these methods assume text pixels to be of high contrast and use many thresholds to control the contrast variation when separating text and non-text regions. From the above discussion, we can understand why there is very little reported work that deals with handwritten text in video: the existing methods generally assume large fonts of graphics and scene text in horizontal alignment.

In this paper, we propose a new method based on Maximum Color Difference (MCD) and a Boundary Growing Method (BGM) for the detection of multi-oriented handwritten text in video. Before computing the MCD, we take the average of the R, G and B channels of the original frame to recover missing information and to increase the contrast of text pixels. Since the text is handwritten, the color of the text pixels across an entire text line does not vary as much as it does for scene text in video. In addition, the MCD further widens the gap between text and non-text pixels, as it is defined as the difference between the maximum and minimum pixel values in a particular window over an image. The concept of maximum gradient difference has already been implemented successfully in [13] for the classification of text and non-text regions; this inspires us to propose the maximum color difference for handwritten text detection in video. Rather than using conventional projection profile analysis for fixing bounding boxes of text lines, we propose a BGM based on the nearest neighbor, because projection profile analysis works only for text lines in the horizontal direction.
2. Proposed Methodology

The proposed text detection methodology is divided into five subsections. Text enhancement by averaging the R, G and B channels of the original frame and computing its MCD is described in Section 2.1. In Section 2.2, the text cluster is obtained by a K-means clustering algorithm from the enhanced frame produced in the preceding section. In Section 2.3, we discuss how to obtain text candidates from the text cluster. Section 2.4 describes the boundary growing method for fixing bounding boxes of handwritten text lines in video. Finally, in Section 2.5, false positives are eliminated using the heuristics proposed in [17].
2.1. Text Enhancement

Accurate text detection in video without enhancement is difficult because of low contrast and low resolution [18]. In addition, since this work considers handwritten text detection in video, the color of the text pixels remains fairly constant throughout a text line. This observation leads to the proposed use of color channels: we average the R, G and B channels of the original frame to recover missing text information, if any, and to increase the contrast of text pixels. This is illustrated in Figure 1, where (a) shows the original frame, (b)-(d) show the R, G and B channels of (a) respectively, and (e) shows the average of all three channels. Note that averaging the three channels, as shown in Figure 1(e), sharpens the text edges compared to the individual channels in Figure 1(b)-(d). To restore low-contrast text pixels, we further propose the MCD criterion, which takes the difference between the maximum and minimum pixel values in a sliding window of size 1×11 over the averaged color channel, as shown in Figure 1(f), where all text pixels are highlighted. The window size is chosen based on the observation that the width of a character stroke does not vary too much. In this way, text enhancement widens the gap between text and non-text pixels.
Figure 1. Text enhancement process: (a) original frame; (b) R channel; (c) G channel; (d) B channel; (e) average of the three channels; (f) MCD output.
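For illustration, the following is a minimal sketch of the averaging and MCD steps, assuming OpenCV and NumPy; the morphological max/min trick and the function name are our own choices, not part of the original implementation.

```python
import cv2
import numpy as np

def maximum_color_difference(frame_bgr, win=11):
    """Average the R, G, B channels, then take the max-min pixel
    difference in a 1 x win sliding window (the MCD of Section 2.1)."""
    avg = frame_bgr.astype(np.float32).mean(axis=2)   # average channel
    kernel = np.ones((1, win), np.uint8)              # 1 x 11 window
    local_max = cv2.dilate(avg, kernel)               # window maximum
    local_min = cv2.erode(avg, kernel)                # window minimum
    return local_max - local_min                      # MCD matrix
```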
2.2 Text Cluster

From the text enhancement step, we observe that there is a sufficient difference between text and non-text pixel values in the MCD matrix, so the two can easily be separated by a clustering algorithm. We use the well-known K-means clustering algorithm with K=2 to identify the text cluster, rather than binarization, because clustering does not require a threshold as binarization does. Since the K-means clustering algorithm is unsupervised, we choose the cluster with the higher mean value as the text cluster. This is valid because text pixels have high contrast values and are therefore grouped together. The result of K-means clustering on the MCD output of Figure 1(f) is shown in Figure 2, where all text pixels have been classified into the text cluster.
Figure 2. Example of a text cluster.
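A hedged sketch of this clustering step, assuming scikit-learn's KMeans (the paper does not specify an implementation):

```python
import numpy as np
from sklearn.cluster import KMeans

def text_cluster(mcd):
    """Cluster MCD values into two groups (K=2) and keep the cluster
    with the higher mean as the text cluster."""
    values = mcd.reshape(-1, 1).astype(np.float32)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(values)
    labels = km.labels_.reshape(mcd.shape)
    text_label = int(np.argmax(km.cluster_centers_))  # brighter cluster
    return (labels == text_label).astype(np.uint8)    # 1 = text pixel
```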
2.3 Text Candidates

As observed in Figure 2, the text patches corresponding to the text lines of the frame in Figure 1(a) are connected to each other due to text-like features in the background; in other words, there is no proper spacing between text lines that would allow them to be separated or detected. Unlike existing methods [13, 14] that use thresholding to obtain text candidates, we use a simple criterion: the intersection of the Sobel edge map of the original frame with the text cluster. This process not only yields text candidates but also helps eliminate false positives, because Sobel detects only high-contrast text pixels. This is illustrated in Figure 3, where (a) shows the Sobel edge map of the frame in Figure 1(a) and (b) shows the result of the intersection, whose components are taken as text candidates. As Figure 3(a) and (b) show, most non-text pixels are eliminated while text pixels are retained. This is the advantage of our proposed method.
Figure 3. Illustration of text candidate selection: (a) Sobel edge map; (b) text candidates.
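A minimal sketch of the intersection step; the gradient-magnitude threshold of 100 is an assumption for illustration, since the paper does not give one.

```python
import cv2
import numpy as np

def text_candidates(frame_bgr, text_mask):
    """Intersect the Sobel edge map of the frame with the text cluster
    so that only high-contrast text pixels survive."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    edges = (cv2.magnitude(gx, gy) > 100).astype(np.uint8)  # assumed threshold
    return edges & text_mask                                # intersection
```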
2.4 Boundary Growing Method (BGM)

Fixing boundaries for handwritten text is not as easy as fixing bounding boxes for graphics and scene text lines because of the varying skews and orientations. The conventional projection profiles used by existing methods therefore may not work for handwritten text detection in video. Hence, we introduce the Boundary Growing Method (BGM), based on the nearest neighbor concept. The basis for this method is that the characters and words of a text line always appear with regular spacing in one direction, so a boundary can grow along the orientation of the text.

The method scans the text candidate frame (Figure 3(b)) from the top-left pixel to the bottom-right pixel. When it finds a component during scanning, it fixes a bounding box for that component and allows the bounding box to grow until it reaches a pixel of a neighboring component, where an adjacent bounding box is formed. This process continues to the end of the text line; the end of the line is determined either by a threshold greater than the spacing between characters and words or by the edge of the image. Figure 4 illustrates the boundary growing procedure for the text lines of Figure 1(a), where (a)-(g) show the process of growing bounding boxes along the text direction. The procedure is shown on the last word of the second line of Figure 4(i), marked by an oval. The BGM fixes bounding boxes for all text lines in the frame, as shown in Figure 4(h); it also fixes bounding boxes for non-text regions, as shown in Figure 4(i), where a few false positives can be seen.

Figure 4. Illustration of boundary growing: (a)-(g) growing bounding boxes along the text direction; (h) bounding boxes for all text lines; (i) result including a few false positives.
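The paper gives no pseudocode for BGM; below is a rough, hedged approximation in which each component's box repeatedly absorbs its nearest neighbor within an assumed gap threshold, so a box can follow a skewed or slanted line. The function names and the default gap value are ours.

```python
from scipy import ndimage

def _close(a, b, gap):
    """True if bounding boxes a and b are within `gap` pixels."""
    (t1, b1, l1, r1), (t2, b2, l2, r2) = a, b
    dy = max(t1 - b2, t2 - b1, 0)      # vertical gap (0 if overlapping)
    dx = max(l1 - r2, l2 - r1, 0)      # horizontal gap (0 if overlapping)
    return max(dx, dy) <= gap

def boundary_growing(candidates, gap=15):
    """Sketch of BGM: start from each connected component and keep
    merging components that lie within `gap` pixels, so a box grows
    along the orientation of the text line."""
    labeled, _ = ndimage.label(candidates)
    boxes = [(s[0].start, s[0].stop, s[1].start, s[1].stop)
             for s in ndimage.find_objects(labeled)]
    merged = True
    while merged:                      # repeat until no boxes merge
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if _close(boxes[i], boxes[j], gap):
                    t1, b1, l1, r1 = boxes[i]
                    t2, b2, l2, r2 = boxes[j]
                    boxes[i] = (min(t1, t2), max(b1, b2),
                                min(l1, l2), max(r1, r2))
                    del boxes[j]
                    merged = True
                    break
            if merged:
                break
    return boxes                       # (top, bottom, left, right) per line
```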
2.5 False Positive Elimination

False positive elimination [17] is a difficult problem in video text detection because of cluttered backgrounds and text-like features in the background. We therefore use the concept of intrinsic and extrinsic edges proposed in [17] to eliminate false positives. This criterion removes edges that cross from the text area into the background, called extrinsic edges, and small edges within the text area, called intrinsic edges. It then uses edge density to differentiate text blocks from non-text blocks. A sample illustration is shown in Figure 5: (a) shows the text detection result without false positive elimination and (b) shows the final result after false positives have been eliminated.
Figure 5. False positive elimination: (a) initial text detection; (b) final text detection.
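Putting the pieces together, a minimal end-to-end sketch of Sections 2.1-2.4 using the helper functions drafted above (false positive elimination, which depends on the heuristics of [17], is omitted):

```python
def detect_handwritten_text(frame_bgr):
    """End-to-end sketch: enhancement, clustering, candidate
    extraction and boundary growing, in that order."""
    mcd = maximum_color_difference(frame_bgr)    # Section 2.1
    mask = text_cluster(mcd)                     # Section 2.2
    cand = text_candidates(frame_bgr, mask)      # Section 2.3
    return boundary_growing(cand)                # Section 2.4
```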
3. Experimental Results

As there is no standard database for handwritten text detection in video, we created our own dataset of 112 frames, which includes sample frames of different varieties, to evaluate the performance of the proposed method. We also consider a publicly available dataset of 45 frames as a benchmark in this work, and the proposed method works well for graphics and scene text detection in video too. For an objective comparison, we use detection rate, false positive rate, misdetection rate and inaccurate boundary rate as metrics. Detected text blocks are represented by their bounding boxes. To judge the correctness of the detected text blocks, we manually count the Actual Text Blocks (ATB) in the frames of the dataset. This procedure is based on the metrics given in [14]. If a detected text block misses more than 20% of the actual text block, we consider it a misdetection; otherwise it is considered a truly detected text block. A detected block that covers two or more text lines is considered to have an inaccurate boundary. Based on these block counts, the following metrics are calculated to evaluate the performance of the method (a small computational sketch follows the list):

• Detection rate (DR) = Number of truly detected text blocks / Number of actual text blocks.
• False positive rate (FPR) = Number of falsely detected text blocks / Number of truly and falsely detected text blocks.
• Misdetection rate (MDR) = Number of misdetected text blocks / Number of truly detected text blocks.
• Inaccurate boundary rate (IBR) = Number of text blocks with inaccurate boundary / Number of truly detected text blocks.

The performance of the proposed method based on the above metrics on handwritten text is summarized in Table 1.
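The metric definitions translate directly into code; a brief sketch, where the function and argument names are ours:

```python
def detection_metrics(actual, truly, falsely, misdetected, inaccurate):
    """Compute DR, FPR, MDR and IBR from the block counts defined above."""
    dr = truly / actual                    # detection rate
    fpr = falsely / (truly + falsely)      # false positive rate
    mdr = misdetected / truly              # misdetection rate
    ibr = inaccurate / truly               # inaccurate boundary rate
    return dr, fpr, mdr, ibr
```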
Figure 6. Results on some handwritten scene images.
Table 1. Performance on the handwritten text dataset (in %)

              DR      FPR    MDR    IBR
Handwritten   89.15   2.32   8.6    3.2
3.1 Experiment on Handwritten Dataset

As this experiment is solely on handwritten text in video, it is not meaningful to compare against existing methods, which are meant for graphics and scene text. Figure 6 shows that the proposed method detects text lines well when there is clear space between them; otherwise it detects two text lines together as one. For our dataset, the proposed method gives promising results in terms of DR, FPR, MDR and IBR for handwritten text detection in video, as reported in Table 1.
3.2 Experiment on Hua's Dataset

For a comparative study with existing work, we have chosen four methods: (a) the "edge based method [16]", which uses different directional Sobel maps and a set of texture features to detect text; (b) the "gradient based method [13]", which is based on the maximum gradient difference and identifies potential line segments and text lines; (c) the "Sobel-Color based method [18]", which applies Sobel to color channels and uses masks to control contrast variation; and (d) the "uniform text color based method [19]", which works by hierarchical clustering, as the state of the art. We chose these methods because they are unsupervised and make fewer assumptions than other methods. Since these methods target graphics and scene text rather than handwritten text in video, we report the comparison on Hua's dataset [20], an independent dataset comprising 45 different frames. While the dataset is small, it provides an objective test of our method against the four existing methods.

Experimental results are shown in Figure 7, where (a) shows the results of the proposed method for different frames, and (b)-(e) show the results of the edge based, gradient based, Sobel-Color based and uniform-color based methods on the frames shown in (a), respectively. The proposed method gives good results compared to the other methods for all three frames, although it produces false positives for the first frame in (a); among the existing methods, the edge based method performs better than the other three.

The results of the proposed and existing methods are reported in Table 2, which shows that the proposed method outperforms the existing methods in terms of DR and MDR, while the gradient based method is best in terms of FPR. However, though our method is only second best in terms of FPR, its DR is much higher than that of the gradient based method. This shows that the proposed method is also good for graphics and scene text detection in video, owing to the text enhancement and boundary growing steps. The reasons for the poorer performance of the existing methods are as follows: the edge and gradient based methods suffer from constant thresholds for classifying text and non-text regions, while the color based methods make several assumptions in their mask operations to control contrast variation. These parameters may fit their own datasets well but do not transfer to other datasets.
Figure 7. Results on Hua's dataset: (a) proposed method; (b) edge based method; (c) gradient based method; (d) Sobel-Color based method; (e) uniform-color based method.
Table 2. Performance on Hua's Dataset (in %)

Method                 DR      FPR     MDR
Edge based [16]        75.4    45.8    16.3
Gradient based [13]    50.8    25.3    12.9
Sobel-Color [18]       68.8    57.1    13.0
Uniform-Color [19]     46.7    56.1    43.8
Proposed               89.67   35.34   5.44
4. Conclusion and Future Work

In this paper, we have proposed a new method based on maximum color difference and a boundary growing method for detecting multi-oriented handwritten text in video. To the best of our knowledge, this is the first work that attempts to detect challenging handwritten text in video amidst complex backgrounds. Thanks to the text enhancement step introduced in the proposed method, it is able to detect small and low-contrast handwritten text in video. Further, we have tested our method on publicly available data (Hua's data), on which the proposed method also works well. In the future, we plan to differentiate machine-printed text from handwritten text in the scene. This is essential because the recognition techniques for each category differ; hence, there is scope for identifying different classes of text in video. The same technique can also be extended to identifying different scripts in our future work.
Acknowledgment

This work was done jointly by NUS and ISI, Kolkata, India. This research is supported in part by the MDA grant R252-000-325-279 and the A*STAR grant R252-000-402-305.
References

[1] Y. Zheng, H. Li and D. Doermann, Machine Printed Text and Handwriting Identification in Noisy Document Images, IEEE Trans. on PAMI, Vol. 26, No. 3, 2004, pp 337-353.
[2] R. J. Milewski, V. Govindaraju and A. Bhardwaj, Automatic Recognition of Handwritten Medical Forms for Search Engines, IJDAR, 2009, pp 203-218.
[3] R. Kandan, N. K. Reddy, K. R. Arvind and A. G. Ramakrishnan, A Robust Two Level Classification Algorithm for Text Localization in Documents, LNCS 4842, 2007, pp 96-105.
[4] R. Milewski and V. Govindaraju, Extraction of Handwritten Text from Carbon Copy Medical Form Images, LNCS 3872, 2006, pp 106-116.
[5] L. Tang and J. R. Kender, A Unified Text Extraction Method for Instructional Videos, in proceedings of ICIP, 2005, pp 1216-1219.
[6] J. Zhang and R. Kasturi, Extraction of Text Objects in Video Documents: Recent Progress, in proceedings of DAS, 2008, pp 5-17.
[7] K. Jung, K. I. Kim and A. K. Jain, Text Information Extraction in Images and Video: A Survey, Pattern Recognition, Vol. 37, 2004, pp 977-997.
[8] A. K. Jain and B. Yu, Automatic Text Location in Images and Video Frames, Pattern Recognition, Vol. 31, No. 12, 1998, pp 2055-2076.
[9] H. Li, D. Doermann and O. Kia, Automatic Text Detection and Tracking in Digital Video, IEEE Trans. on Image Processing, Vol. 9, No. 1, 2000, pp 147-156.
[10] K. I. Kim, K. Jung and J. H. Kim, Texture-Based Approach for Text Detection in Images Using Support Vector Machines and Continuously Adaptive Mean Shift Algorithm, IEEE Trans. on PAMI, Vol. 25, No. 12, 2003, pp 1631-1639.
[11] V. Wu, R. Manmatha and E. M. Riseman, TextFinder: An Automatic System to Detect and Recognize Text in Images, IEEE Trans. on PAMI, Vol. 21, No. 11, 1999, pp 1224-1229.
[12] P. Shivakumara, T. Q. Phan and C. L. Tan, A Robust Wavelet Transform Based Technique for Video Text Detection, in proceedings of ICDAR, 2009, pp 1285-1289.
[13] E. K. Wong and M. Chen, A New Robust Algorithm for Video Text Extraction, Pattern Recognition, Vol. 36, 2003, pp 1397-1406.
[14] D. Chen, J. M. Odobez and J. P. Thiran, A Localization/Verification Scheme for Finding Text in Images and Video Frames Based on Contrast Independent Features and Machine Learning Methods, Signal Processing: Image Communication, Vol. 19, 2004, pp 205-217.
[15] T. Q. Phan, P. Shivakumara and C. L. Tan, A Laplacian Method for Video Text Detection, in proceedings of ICDAR, 2009, pp 66-70.
[16] C. Liu, C. Wang and R. Dai, Text Detection in Images Based on Unsupervised Classification of Edge-based Features, in proceedings of ICDAR, 2005, pp 610-614.
[17] J. Zhang, D. Goldgof and R. Kasturi, A New Edge-Based Text Verification Approach for Video, in proceedings of ICPR, 2008.
[18] M. Cai, J. Song and M. R. Lyu, A New Approach for Video Text Detection, in proceedings of ICIP, 2002, pp 117-120.
[19] V. Y. Mariano and R. Kasturi, Locating Uniform-Colored Text in Video Frames, in proceedings of ICPR, 2000, pp 539-542.
[20] X. S. Hua, L. Wenyin and H. J. Zhang, An Automatic Performance Evaluation Protocol for Video Text Detection Algorithms, IEEE Trans. on CSVT, Vol. 14, No. 4, 2004, pp 498-507.