STROKE FILTER FOR TEXT LOCALIZATION IN VIDEO IMAGES

Qifeng Liu, Cheolkon Jung, Sangkyun Kim, Youngsoo Moon and Ji-yeun Kim

Computing Lab, Samsung Advanced Institute of Technology
{qifeng.liu, cheolkon.jung, skkim77, mys66, jiyeun.kim}@samsung.com

ABSTRACT
In this paper, we propose a stroke filter for text localization in video images. First, we give a definition of text based on the analysis of previous methods and the intrinsic characteristics of text. Second, this definition is realized in the form of a stroke filter, which is carefully designed based on local region analysis. We also discuss the relationship between the proposed stroke filter and other related filters. Furthermore, the stroke filter can be implemented in a fast way without convolution operations. The effectiveness and efficiency of the stroke filter are validated by extensive experiments on a challenging database.
Index Terms— Text localization, SVM, video analysis

1. INTRODUCTION

Text in images and video always carries rich, useful information, which can help computers understand the content of images and video. Text localization is therefore very important for the automatic annotation, indexing and parsing of images and video. So far, researchers have made great progress on text localization [1]. The related methods can be briefly classified into four categories: (1) intensity-similarity CCA (Connected Component Analysis) based [2], (2) edge based [3], (3) corner based [9], and (4) texture based methods [4]. More recently, many researchers have paid attention to applying pattern classification to text localization based on elaborately selected features [5].

However, there are still many problems with all these methods. Video images often have complex backgrounds with strong edge or texture clutter, and it is very difficult for any method to detect graphic or scene text with high accuracy. For example, from the Canny edge image shown in Fig. 1(a), it is difficult to find the text even for human eyes. Besides the high complexity of text localization, we believe the fundamental reason for these problems is that no one has found truly distinctive features of text. That is to say, no one has answered the basic question: what on earth is text? Intensity-similarity CCA, edge, corner or texture features are only necessary conditions of text, as shown in Fig. 1(b). Although AdaBoost has recently been used to select optimal features [6], the selected features are still combinations of traditional features.

Figure 1. Illustration of problems in text localization: (a) Canny edge image; (b) relationship of the feature spaces (CCA, Edge, Corner and Texture, with Text lying in their intersection).

In this paper, we attempt to answer this basic question. Based on the observation of text (see Fig. 2), we propose that a sub-image is text if and only if:
- local constraint - there are many stroke-like structures in the sub-image, and
- global constraint - the stroke-like structures have a specific spatial distribution,
where a stroke is defined as a straight line or arc used as a segment of a character.
Figure 2. Enlarged text image.

Figure 3. Flowchart of our method: source image → stroke filtering → stroke filter features → spatial-similarity CCA → text candidates → SVM verification → text lines.
According to the local constraint in the text definition, we design the stroke filter based on local region analysis, and according to the global constraint, we use spatial-similarity CCA to localize text regions. The overall flowchart of our text localization method is shown in Fig. 3, where the stroke filtering module is the focus of this paper; the other modules can be found in our previous work [7]. The source image is first filtered by the stroke filter and then analyzed by spatial-similarity CCA. The output is treated as text candidates and further verified by an SVM (Support Vector Machine) classifier (the feature is the gray-level intensity of a scanning sub-window [7]).

The remainder of this paper is organized as follows. In Sec. 2, we describe the stroke filter in detail, and we compare the stroke filter with other related filters in Sec. 3.
The quick stroke filter is proposed in Sec. 4. Finally, we present experiments in Sec. 5 and conclude the paper in Sec. 6.
2. STROKE FILTER

First, we define that a local image region is a stroke-like structure if and only if, in terms of intensity:
- it is different from its lateral regions,
- its lateral regions are similar to each other, and
- it is nearly homogeneous.
Figure 3. Local regions for the stroke filter: three rectangular regions (1), (2), (3) centered around a pixel (x, y), with orientation α ∈ {0, π/4, π/2, 3π/4} and scale d ∈ {1, 3, 5}; µ1, µ2, µ3 denote the mean intensities of the central and two lateral regions, and σ the intensity standard deviation of the central region.
For each pixel in the source image, we compute its stroke filter response. As shown in Fig. 3, the central point denotes an image pixel (x, y), around which there are three rectangular regions. The orientation and scale of these local regions are determined by α and d. According to the definition of a stroke-like structure, we define the stroke filter response of the pixel (x, y) as

$$R_{\alpha,d}(x,y)=\frac{|\mu_1-\mu_2|+|\mu_1-\mu_3|-|\mu_2-\mu_3|}{\max(\sigma,10)},\qquad(1)$$

where the terms on the right-hand side correspond to the constraints in the definition of a stroke-like structure, and the meanings of µ and σ are shown in Fig. 3. The more likely the pixel (x, y) belongs to a stroke-like structure, the higher its response. Eq. (1) is sufficient for the actual implementation. However, to see the analytic form of the stroke filter, the following derivation is helpful. Eq. (1) can be rewritten as the so-called bright stroke response $R^{B}$,

$$R^{B}_{\alpha,d}(x,y)=\begin{cases}\dfrac{2\mu_1-2\mu_2}{\max(\sigma,10)} & \text{if }\mu_2>\mu_3\\[4pt]\dfrac{2\mu_1-2\mu_3}{\max(\sigma,10)} & \text{if }\mu_2\le\mu_3\end{cases}
=\frac{2\mu_1-2\max(\mu_2,\mu_3)}{\max(\sigma,10)}\quad\text{if }(\mu_1>\mu_2\ \&\ \mu_1>\mu_3),\qquad(2)$$

and the dark stroke response $R^{D}$,

$$R^{D}_{\alpha,d}(x,y)=\begin{cases}\dfrac{2\mu_3-2\mu_1}{\max(\sigma,10)} & \text{if }\mu_2>\mu_3\\[4pt]\dfrac{2\mu_2-2\mu_1}{\max(\sigma,10)} & \text{if }\mu_2\le\mu_3\end{cases}
=\frac{2\min(\mu_2,\mu_3)-2\mu_1}{\max(\sigma,10)}\quad\text{if }(\mu_1<\mu_2\ \&\ \mu_1<\mu_3).\qquad(3)$$

Note that: (1) $R_{\alpha,d}(x,y)=0$ if µ2 ≤ µ1 ≤ µ3 or µ3 ≤ µ1 ≤ µ2; (2) $R^{B}_{\alpha,d}(x,y)$ ($R^{D}_{\alpha,d}(x,y)$) is high when the pixel (x, y) is text with a bright (dark) color. So the corresponding filters are called the bright and dark stroke filters, as shown in Fig. 4.

Figure 4. Stroke filter profile: (a) bright stroke filter; (b) dark stroke filter.

The final response, orientation and scale of the stroke filter are

$$\begin{cases}R(x,y)=\max_{\alpha,d}R_{\alpha,d}(x,y)\\ O(x,y)=\arg\max_{\alpha}R_{\alpha,d}(x,y)\\ S(x,y)=\arg\max_{d}R_{\alpha,d}(x,y)\end{cases}\qquad(4)$$
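To make Eqs. (1) and (4) concrete, the following is a minimal NumPy sketch, not the authors' implementation: it evaluates Eq. (1) at one pixel for a given orientation α and scale d, then maximizes over the orientation and scale sets of Fig. 3 as in Eq. (4). The square windows, the displacement of the lateral regions by 2d perpendicular to the stroke direction, and the border clipping are our simplifying assumptions; the paper uses the rectangular regions of Fig. 3.

```python
import numpy as np

def region_stats(img, cx, cy, half):
    """Mean and standard deviation of a square window around (cx, cy),
    clipped to the image borders (a simplification of the rectangles in Fig. 3)."""
    y0, y1 = max(cy - half, 0), min(cy + half + 1, img.shape[0])
    x0, x1 = max(cx - half, 0), min(cx + half + 1, img.shape[1])
    patch = img[y0:y1, x0:x1].astype(np.float64)
    if patch.size == 0:
        return 0.0, 0.0
    return patch.mean(), patch.std()

def stroke_response(img, x, y, alpha, d):
    """Eq. (1) at pixel (x, y) for orientation alpha and scale d."""
    # Lateral regions (2) and (3) lie on either side of the central region (1),
    # displaced perpendicular to the stroke direction alpha (assumed offset: 2d).
    dx = int(round(2 * d * np.sin(alpha)))
    dy = int(round(2 * d * np.cos(alpha)))
    mu1, sigma = region_stats(img, x, y, d)        # central region (1)
    mu2, _ = region_stats(img, x - dx, y - dy, d)  # lateral region (2)
    mu3, _ = region_stats(img, x + dx, y + dy, d)  # lateral region (3)
    return (abs(mu1 - mu2) + abs(mu1 - mu3) - abs(mu2 - mu3)) / max(sigma, 10.0)

def stroke_filter_pixel(img, x, y,
                        alphas=(0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4),
                        scales=(1, 3, 5)):
    """Eq. (4): final response R, orientation O and scale S at pixel (x, y)."""
    return max(((stroke_response(img, x, y, a, d), a, d)
                for a in alphas for d in scales), key=lambda t: t[0])
```

The bright and dark responses of Eqs. (2) and (3) can be obtained from the same region statistics by additionally testing whether µ1 is larger or smaller than both lateral means.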
Some intermediate results are shown in Fig. 5. We find that the stroke filter suppresses most step-like edges while the text parts are well enhanced. Although some non-text parts between two strokes are also detected, this does not affect text candidate detection; on the contrary, it helps the CCA to group the detected strokes.
Figure 5. Features extracted by the stroke filter: (a) source image; (b) stroke response magnitude; (c) binarized version of (b); (d) stroke width map; (e) stroke orientation map. Bright parts in (d) and (e) denote large scale and large orientation angle, respectively.
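As a rough illustration of how a binarized response map (Fig. 5(c)) can be grouped into text candidates, the sketch below uses a fixed threshold and SciPy's generic connected-component labeling; both are stand-ins, not the spatial-similarity CCA of [7], and the threshold and minimum-area values are arbitrary.

```python
import numpy as np
from scipy import ndimage

def candidate_boxes(response, threshold=1.0, min_area=20):
    """Binarize a stroke response map and group connected pixels into
    candidate bounding boxes (x0, y0, x1, y1)."""
    mask = response > threshold          # cf. Fig. 5(b) -> Fig. 5(c)
    labels, _ = ndimage.label(mask)      # generic CCA; [7] uses spatial-similarity CCA
    boxes = []
    for ys, xs in ndimage.find_objects(labels):
        # keep only components whose bounding box is large enough
        if (ys.stop - ys.start) * (xs.stop - xs.start) >= min_area:
            boxes.append((xs.start, ys.start, xs.stop, ys.stop))
    return boxes
```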
3. COMPARISON WITH RELATED FILTERS
There are several related filters, such as the Canny edge filter [12], the Gabor filter [11], the Haar-like line filter [8] and the ratio edge filter [10]. On one hand, the stroke filter is similar to them because all of them are based on local image differences between central and lateral regions. On the other hand, the stroke filter is distinctively different from them, as summarized in Table 1. Note that:
- Icl denotes the interval between the central and lateral regions, which is marked by d2 in the stroke filter. This interval means that the image part between the central and lateral regions is ignored, since it is often unstable because of blurred edges, as shown in Fig. 2.
- Scl denotes the difference between the central and lateral regions, which is evaluated by |µ1-µ2|+|µ1-µ3| in the stroke filter. We think that for text strokes, Scl should be large.
- Sll denotes the similarity between the two lateral regions, which is evaluated by |µ2-µ3| in the stroke filter. We think that for text strokes, Sll should be small.
- Hc denotes the central region homogeneity, which is measured by σ in the stroke filter. We think that for text strokes, Hc (i.e., σ) should be small.
- Rs denotes the computational time (ms) for a 720*480 image.

Table 1. Comparison between related filters. A check (cross) mark means the corresponding factor is (is not) considered.
Filter   Canny   Gabor   Haar   Ratio   Stroke
Icl      ×       √       ×      √       √
Scl      √       √       √      √       √
Sll      ×       ×       ×      ×       √
Hc       ×       ×       ×      ×       √
Rs       14      34599   105    354     124
In addition, we compare the five filters on a typical image containing several kinds of edges, such as step edges, stroke edges (i.e., the left four vertical lines), bright edges and dark edges. Because the four factors Icl, Scl, Sll and Hc are all taken into account by the stroke filter, it outperforms the other filters in stroke edge detection. Moreover, from Fig. 6(f) we can see that the stroke response of the right line is weaker than the others because its lateral regions are not similar. In summary, compared with the other filters, the stroke filter is the most suitable for text localization.

Figure 6. Comparison of responses of some related filters: (a) source image; (b) Canny edge filter; (c) Gabor filter; (d) Haar-like line filter; (e) ratio edge filter; (f) stroke filter.

4. QUICK STROKE FILTER

Efficiency is very important for a practical system. One advantage of the stroke filter is that it enables us to design a smart scheme for fast implementation without loss of precision. Here we use three strategies:
- Edge mask – Since text must have rich edges, we only perform stroke filtering on edge pixels and their neighbors. The filter mask is the dilation of the Canny edge map of the source image.
- Quick filling – Once we detect a pixel with a strong response, its neighbors are assigned the same response, orientation and scale. This is based on the local spatial consistency of text features.
- Integral image – This is used for quick Haar-like filtering in [8] (a sketch appears after Table 2).

On a typical video image (720*480) (shown at the top-left of Fig. 7), we test the stroke filter with different combinations of the above three strategies and compare their computational costs, as shown in Table 2. "E", "Q" and "I" denote the three strategies, respectively. We find that the final computational cost of the quick stroke filter is only 5.3% of the original one.

Table 2. Computational cost of stroke filtering (one 1.4 GHz CPU, debug mode in VC++ 6.0; running in release mode would be much faster).
Time (ms)   Filtering   Canny   Dilation   Integral   Total
original    4038        0       0          0          4038
E           694         14      5          0          713
E+Q         647         14      5          0          666
E+Q+I       73          14      5          22         114
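As an illustration of the integral-image strategy ("I") described above, the sketch below precomputes a summed-area table so that any rectangular region mean needed by the stroke filter costs four lookups instead of a full summation; the function names are ours, and the edge-mask and quick-filling strategies are not shown.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with an extra zero row/column, in the spirit of [8]."""
    return np.pad(img.astype(np.float64), ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)

def region_mean(ii, x0, y0, x1, y1):
    """Mean of img[y0:y1, x0:x1] in constant time from the integral image ii."""
    total = ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
    return total / ((y1 - y0) * (x1 - x0))

# Hypothetical usage: the central-region mean mu1 of half-width d around (x, y):
#   ii = integral_image(img)
#   mu1 = region_mean(ii, x - d, y - d, x + d + 1, y + d + 1)
```

Combined with the edge mask and quick filling, this corresponds to the "E+Q+I" row of Table 2.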
5. EXPERIMENTS
We perform experiments with one CPU (1.4 GHz) on Windows 2000 and VC++ 6.0. Our ground-truth database contains news images from about 4 hours of video from South Korean TV channels, with an image size of 720*480 (text height varies from 10 to 250 pixels). Of these images, 435 are used for testing and 357 for training. We compare the proposed method, "Stroke filter+CCA", with our previous method, "Canny+CCA" [7]. The results are shown in Table 3. Note that, to detect large text of up to about 250 pixels in height, we perform the same algorithm on 3 pyramid levels of the source image. We find that "Stroke filter+CCA" is more accurate, but more costly, than "Canny+CCA". However, the main advantage of the stroke filter is not limited to higher accuracy. Because the stroke filter captures intrinsic features of text, including response, orientation and scale, it is potentially powerful in subsequent processes such as SVM verification and even segmentation. Once the image is filtered in the text candidate detection stage, only a little computational cost remains in the subsequent processes. In addition, we give some results of our whole text localization method, i.e., "stroke filter + CCA + SVM + post-processing", as shown in Fig. 7. The recall and precision rates are 97.5% and 95.5%, respectively, and the computational cost is about 2.3 s/frame. The method works well in various complex conditions.
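A minimal sketch of the three-pyramid-level processing mentioned above, assuming OpenCV's pyrDown for downsampling; the detector is passed in as a function, and the rescaling of detected boxes back to the original resolution is our own illustration rather than the paper's exact procedure.

```python
import cv2

def detect_on_pyramid(image, detect, levels=3):
    """Run a single-scale text detector on `levels` pyramid levels and map the
    resulting boxes (x0, y0, x1, y1) back to original image coordinates."""
    boxes, level_img, scale = [], image, 1
    for _ in range(levels):
        boxes += [tuple(v * scale for v in b) for b in detect(level_img)]
        level_img = cv2.pyrDown(level_img)  # halve resolution so larger text fits the filter scales
        scale *= 2
    return boxes
```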
Table 3. Comparison of two methods for text candidate detection (regarding the time cost, see the discussion in Sec. 5).

Method              Recall rate   Precision rate   Time (ms/frame)
Canny+CCA           0.943         0.407            138
Stroke filter+CCA   0.970         0.424            466
6. CONCLUSIONS AND FUTURE STUDY

In this paper, we propose a stroke filter for text localization in video images. The novelty of the stroke filter is that it captures the intrinsic characteristics of text by analyzing the relationships between local regions, so it is more suitable than other filters for text localization. Experiments demonstrate the impressive performance of the stroke filter. We believe that the stroke filter can play an important role in text localization and other fields. Further efforts should be made: the stroke filter could be applied to text verification and segmentation, or extended to other line-like structure detection tasks, such as road detection from remote sensing images.

7. REFERENCES
[1] K. Jung, et al., "Text Information Extraction in Images and Video: A Survey," Pattern Recognition, vol. 37, pp. 977-997, 2004.
[2] A. K. Jain, et al., "Automatic Text Location in Images and Video Frames," Pattern Recognition, vol. 31, pp. 2055-2076, 1998.
[3] M. Lyu, et al., "A Comprehensive Method for Multilingual Video Text Detection, Localization, and Extraction," IEEE Trans. on CSVT, vol. 15, no. 2, 2005.
[4] V. Wu, et al., "TextFinder: An Automatic System to Detect and Recognize Text in Images," IEEE Trans. on PAMI, vol. 21, no. 11, pp. 1224-1229, 1999.
[5] Y. Zheng, et al., "Machine Printed Text and Handwriting Identification in Noisy Document Images," IEEE Trans. on PAMI, vol. 26, no. 3, pp. 337-353, 2004.
[6] X. Chen, et al., "Detecting and Reading Text in Natural Scenes," IEEE Conf. CVPR, 2004.
[7] Q. F. Liu, et al., "Text Localization Based on Edge-CCA and SVM in Images," Samsung Technology Conference, 2005.
[8] R. Lienhart and J. Maydt, "An Extended Set of Haar-like Features for Rapid Object Detection," IEEE Conf. Image Processing, vol. 1, pp. 900-903, Sep. 2002.
[9] X. Hua, et al., "Automatic Location of Text in Video Frames," Proc. ACM Multimedia 2001 Workshop: Multimedia Information Retrieval, Canada, 2001.
[10] F. Tupin, et al., "Detection of Linear Features in SAR Images: Application to Road Network Extraction," IEEE Trans. on GeoScience and Remote Sensing, vol. 36, pp. 434-453, 1998.
[11] D. Chen, et al., "Text Enhancement with Asymmetric Filter for Video OCR," Proc. of the 11th International Conference on Image Analysis and Processing, 2001.
[12] J. F. Canny, "A Computational Approach to Edge Detection," IEEE Trans. on PAMI, pp. 679-698, 1986.
Figure 7. Final results of text localization based on stroke filter. Black or white boxes indicate text regions.