A TREE-STRUCTURE-BASED CAPTION TEXT DETECTION APPROACH

Miriam León, Sergio Mallo and Antoni Gasull
Image Processing Group, Department of Signal Theory and Communications
Technical University of Catalonia
UPC-Campus Nord, Jordi Girona 1-3, 08034 Barcelona (Spain)
email: mleon, [email protected]

ABSTRACT
Nowadays superimposed text in both images and video sequences provides useful information about their contents. The aim of this paper is to introduce a method that allows us to extract this kind of information and that works as independently as possible of the content, quality or font. Some pre-processing tools can be applied in order to reduce the number of false positives as well as the computational cost. The input image is represented by means of a Max-tree. This structure allows us to perform text localisation as a tree pruning. The pruning is performed by applying connected operators based on geometric features of the letters. As a result, a set of potential text regions is obtained. The output of this first stage shows promising results. A second stage will be necessary to extract text as a whole, i.e. a set of unconnected regions with a unique meaning, allowing us to discard those regions that do not fulfil text features.
KEY WORDS
Text detection, text localisation, Max-tree representation.

1 Introduction

Nowadays the volume of video data is increasing. This fact makes it necessary to develop tools that allow us to deal with the information contained in video sequences without human supervision. Caption text, as well as scene text, can be considered semantic content, which is usually easier to handle linguistically but more difficult to analyse automatically than perceptual content. However, its extraction becomes a useful feature, providing very relevant information for semantic analysis: on the one hand, caption text is usually synchronised with, and consequently related to, the contents in the scene; on the other hand, its computational cost is lower than the cost of extracting other semantic content, such as objects, events or their relationships. The detection and recognition of text can be employed in many applications, such as indexing, content-based image retrieval, summarisation, etc.

The aim of this paper is to provide a caption or superimposed text extraction method. Caption text presents some features which make it easily detectable for humans: contrast between text and the background, spatial cohesion, its textured appearance, colour homogeneity, stroke thickness, temporal uniformity and redundancy, its position in the frame, its movement, etc. [1]. Therefore, the lack of some of these features due to image compression, low resolution (quality), the complexity and/or similarity of the background and the wide range of fonts and letter sizes makes its detection difficult.

Much research work has been carried out in the area of text detection in both single images and video sequences. The algorithms for text detection can be classified into two categories: those working in the compressed domain and those working in the spatial domain [1]. Compressed-domain methods include algorithms in both the compressed and the semi-compressed domain. The former analyse macro-blocks belonging to P-frames (MPEG-4) [2] in order to decide whether they contain text or not, whereas the latter analyse not the macro-blocks but the Discrete Cosine Transform (DCT) [2][3][4] or Discrete Wavelet Transform (DWT) [5][6] components, detecting the textured appearance of text by checking the AC coefficients in the DCT or the High-Low (HL), LH and HH subbands in the DWT. These methods provide satisfactory results working as a pre-processing step. On the other hand, methods that work with the pixel values and positions are called spatial-domain methods, and they can be classified according to the following image features. Edge-based methods [7], [2], [8] and [9] focus on the search for areas having a high contrast between text and background. Connected-Components-based methods [10] and [11] use a bottom-up approach, iteratively merging sets of connected pixels using a homogeneity criterion, which leads to the creation of flat zones or Connected Components (CC). Both edge-based and CC-based methods could be included under the same group, region-based methods. Texture-based methods use the property that text in images has distinct textural characteristics that distinguish it from the background; examples are methods using the Gabor filter [12] or Gaussian filters [13], or those based on the colour and shape of the regions [14] and [15]. Correlation-based methods [16] use some kind of correlation in order to decide whether a pixel belongs to a character or not. Finally, there are methods that use temporal information as their main feature [17].
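As a toy illustration of the semi-compressed-domain idea, the Python sketch below flags image blocks whose DWT detail-subband (HL, LH and HH) energy is high. The block size and energy threshold are illustrative assumptions; this is not a reimplementation of the methods cited above.

import numpy as np
import pywt

def textured_blocks(gray, block=8, thresh=500.0):
    """Boolean map (in subband coordinates) of blocks whose
    detail-subband energy suggests a textured, possibly textual, area."""
    # One-level 2-D Haar DWT: cA is the low-pass band, (cH, cV, cD)
    # are the HL/LH/HH detail subbands mentioned in the text.
    _, (cH, cV, cD) = pywt.dwt2(gray.astype(float), 'haar')
    energy = cH**2 + cV**2 + cD**2
    out = np.zeros((energy.shape[0] // block, energy.shape[1] // block), dtype=bool)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            blk = energy[i*block:(i+1)*block, j*block:(j+1)*block]
            out[i, j] = blk.sum() > thresh   # high detail energy -> textured
    return out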
Some of the previous methods have been tested and even improved in some aspects. As they are usually application oriented, they work well for a particular set of images or video sequences, e.g. sequences containing static text or sports events. But the fact is that none of them works satisfactorily when detecting text in an arbitrary image or video sequence. Our method attempts to provide a more robust algorithm, working as independently as possible of the content, quality or font. According to the classification presented in [1], our technique belongs to the spatial-domain category and can be further sub-classified as a region-based method. That is due to the fact that our approach is focused on features related to region characteristics, such as shape, perimeter and area, or colour homogeneity. All these features are easier to analyse in the spatial domain than in the compressed domain. The paper is organised as follows: Section 2 presents the text extraction approach, describing in its subsections the Max-tree, its advantages when dealing with connected operators as well as how to prune the tree, the decision criteria, the features utilised to discriminate characters from other regions, and how the thresholds are defined; Section 3 shows some results; and finally, in the last section, some conclusions are drawn leading to the future work.
Figure 1. A synthetic image and its Max-tree representation
2 Text detection method

The text detection method we propose works with gray level images. Some pre-processing tools can be applied in order to simplify the input image and reduce the computational cost of the method. One of the main assumptions for text detection is that the text must be readable by the viewers, which means that a certain contrast between letters and background should exist. Moreover, by analysing letter characteristics as well as the state of the art presented in [1], it can be found that regions representing characters are homogeneous in colour. Consequently, an initial segmentation can be performed in order to eliminate noise as well as to avoid working with single pixels instead of flat zones. The segmentation reduces the number of regions in the image to a fixed number by merging those regions with a similar gray level [18]. Given that contrast exists and letter candidate regions are homogeneous in colour, the merging modifies neither text regions nor other contrasted regions.

2.1 Image representation

The input image is a simplified gray level image and is represented by means of a Max-tree or a Min-tree [19]. A Max-tree (Min-tree) is a graph, specifically a tree, where each node is associated with a binary connected component resulting from thresholding the image at every gray level present in it. The terminal nodes or leaves of the Max-tree (Min-tree) can represent both single pixels and flat zones whose gray levels are the highest (lowest) in the frame, whereas the root includes the whole image and corresponds to the lowest (highest) gray level. This means that if letters are darker than the background a Min-tree should be created instead of a Max-tree, since we are interested in recovering the leaves of the tree. Therefore, if a frame contains letters both lighter and darker than their neighbourhood, the combination of both trees should be used. The links of the tree show how the flat zones may be merged. Figure 1 shows a synthetic image and its representation by means of a Max-tree. The letters F and H are the leaves numbered 10 and 3, respectively. The other leaves correspond to the two regions at the corners (nodes 7 and 8) and, finally, to the region in the centre with the same gray level as F (node 11).

2.1.1 Connected operators and Max-tree (Min-tree)

Connected operators preserve edges and are therefore suitable for the extraction of characters. One way to implement connected operators on functions (gray level images) is through binary operators and the 'stacking' method. The stacking implementation leads to a high computational cost, because a binary operator is applied at every possible gray level. For example, if the connected operator is applied to a 256-gray-level image, 256 binary images must be created by thresholding the image at each gray level. Then, a binary connected operator is applied to all the binary images. Finally, the output image results from stacking the 256 filtered images. The Max-tree (Min-tree), in contrast, makes it easy to implement connected operators at a low computational cost. The way to apply connected operators with a Max-tree (Min-tree) is as follows: first the tree is created; then a filtering or pruning of the tree is done; as a result of the filtering the tree is restored; and the last step is to convert the restored tree into a gray level image [20], see Figure 2.
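The Python sketch below makes the cost of the stacking method concrete, using an area criterion as the binary connected operator (the minimum area amin is an assumed parameter). It visits every pixel once per gray level; a Max-tree implementation produces the same output far more cheaply.

import numpy as np
from scipy import ndimage

def area_opening_stacking(gray, amin=50):
    """Naive 'stacking' area opening on an 8-bit gray level image."""
    out = np.zeros_like(gray, dtype=np.uint8)
    for t in range(256):
        binary = gray >= t                   # threshold at level t
        labels, n = ndimage.label(binary)    # binary connected components
        if n == 0:
            continue
        areas = np.bincount(labels.ravel())  # area of each component
        keep = areas >= amin                 # binary criterion: area
        keep[0] = False                      # label 0 is the background
        preserved = keep[labels]
        out[preserved] = t                   # stack: highest preserved level
    return out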
2.1.2 Filtering or 'pruning'

To filter, it is necessary to define a criterion and its measurement in each node. The criteria utilised in the caption text extraction method are introduced in the next subsection.
Figure 2. Filtering strategy
Figure 3. Complexity values of Courier font
Figure 4. Compactness values of Courier font
Figure 5. Complexity versus Compactness
Once the measurement of the criterion has been calculated for each node, a decision about preserving or removing the node is taken. There are different kinds of decision depending on whether the criterion is increasing or not. A criterion is increasing if its measurement in a node is always lower than its measurement in the node's father; in this case the decision is straightforward, obtained by thresholding the measurements. If the criterion is not increasing, the decision is less straightforward. One of the best-performing methods handles the decision as an optimisation problem solved with the Viterbi algorithm [20]. Once the decision is taken, removing a node of the tree means that the pixels represented by this node become pixels represented by its father node. In caption text detection, letters are usually represented by leaves and some of their consecutive fathers. The pruning does not allow us to preserve nodes whose father has to be removed. Therefore, when the pruning is done, we try to remove the nodes representing letters. The removed regions are then recovered by means of a top-hat between the original image and the filtered image, which presumably contains no letters.
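As a minimal sketch of this optimisation view (the full formulation is in [20]), the code below takes a tree stored as a parent array together with a raw per-node criterion decision, and computes preserve/remove decisions that flip as few raw decisions as possible while respecting the pruning constraint: a removed node drags all its descendants with it. The parent-array representation and the unit flip cost are illustrative assumptions.

def prune_decisions(parent, raw):
    """parent[v] is the father of v (parent[root] == root);
    raw[v] is True if node v satisfies the criterion on its own.
    Returns a list of booleans, True = preserve."""
    n = len(parent)
    children = [[] for _ in range(n)]
    root = 0
    for v, p in enumerate(parent):
        if p == v:
            root = v
        else:
            children[p].append(v)
    keep_cost = [0] * n   # min flips if v is preserved
    rem_cost = [0] * n    # min flips if v is removed
    order, stack = [], [root]
    while stack:          # pre-order; reversed below gives children first
        v = stack.pop()
        order.append(v)
        stack.extend(children[v])
    for v in reversed(order):
        keep_cost[v] = 0 if raw[v] else 1
        rem_cost[v] = 1 if raw[v] else 0
        for c in children[v]:
            keep_cost[v] += min(keep_cost[c], rem_cost[c])
            rem_cost[v] += rem_cost[c]   # a removed father removes its children
    decision = [False] * n
    decision[root] = keep_cost[root] <= rem_cost[root]
    for v in order:                      # top-down backtracking
        for c in children[v]:
            decision[c] = decision[v] and keep_cost[c] <= rem_cost[c]
    return decision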
2.2 Character geometric features

In the previous subsections filtering has been presented in a general way. In this subsection, the criteria and measurements chosen for letter localisation are presented. After testing other operators, such as opening and closing, and other criteria, such as the volume, the most discriminative criteria turned out to be complexity and compactness, which are related to two aspects of region geometry: perimeter and area.

• Complexity: it is defined as CX = P/A, where P is the perimeter and A the area of the region. This criterion expresses how complex a region is: if the perimeter of a region is large in comparison with its area, the region can be considered complex. We could also work with the simplicity, defined as the inverse of the complexity.

• Compactness: it is defined as CC = A/P², where P is the perimeter and A the area of the region. This criterion expresses how compact a region is and is dimensionless, which means that the measurement does not depend on the size. In some of the literature this feature is normalised and defined as the circularity, 4πA/P²: the circularity of a circle is one and remains constant as its size increases, and for any other region it is smaller than one.

The analysis of the alphabet in lower and upper case for 16 fonts has led to the determination of the thresholds used in the filtering process, see [21]. Four text styles are taken into account for each font: normal, italics, bold and italics plus bold, in a range of sizes from 20 to 250 points (a point is 1/72 inch) or pixels.
For each single letter, its perimeter and area have been calculated. The perimeter is the number of edge pixels surrounding the letter (understanding a pixel as a unit square with four edges) and the area is the number of pixels belonging to the letter. Once perimeter, area, complexity and compactness are computed, some conclusions can be drawn. The perimeter increases linearly with the size of the region, whereas the area increases quadratically with it. Therefore, the complexity varies with the size, whereas the compactness remains almost constant for each letter as its size increases or decreases. Figures 3 and 4 present the complexity and the compactness of the Courier font for different sizes: Figure 3 shows how the complexity decreases when the size increases, independently of the letter, whereas Figure 4 shows how the compactness remains almost constant as the size increases. The relationship between both criteria has also been plotted in order to be able to apply them at the same time, see Figure 5. The following ranges include all the points in the graphic: first CC ∈ (0.0025 : 0.05) with CX ∈ (0.1 : 1.5), and second CC ∈ (0.0025 : 0.022) with CX ∈ (1.5 : 3.85). Depending on the letter size, the complexity ranges can be adjusted. When the tree is constructed, each node is associated with the complexity and compactness measurements of the region it represents. If these values are analysed, it cannot be assumed that the measurements of the criteria are increasing or decreasing from a child node to its father node. Therefore, a straightforward decision cannot be applied to prune the tree. As mentioned in the previous subsection, the best option for pruning with a non-increasing criterion is the Viterbi algorithm, which searches for the path of the tree with the minimum cost in order to preserve or remove a set of nodes. In the next section some promising results are shown, illustrating the usefulness of the method and the selected criteria. Notice that only colour, complexity and compactness have been developed; if we want to provide very robust results, the combination of the existing criteria with further criteria should be considered. Currently, a contrast criterion between connected regions is being developed: leaves representing regions whose contrast with the region represented by their father node is higher than a threshold are removed.
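As a minimal sketch of these measurements, the following function computes the perimeter and area of a binary letter mask exactly as defined above (each pixel a unit square with four edges) and derives CX = P/A and CC = A/P². It assumes a NumPy boolean mask as input.

import numpy as np

def complexity_compactness(mask):
    mask = mask.astype(bool)
    area = int(mask.sum())
    padded = np.pad(mask, 1)
    # An edge is exposed when a foreground pixel meets background
    # in one of the four axis directions.
    perimeter = sum(
        int((padded[1:-1, 1:-1] & ~np.roll(padded, s, axis=a)[1:-1, 1:-1]).sum())
        for a, s in ((0, 1), (0, -1), (1, 1), (1, -1))
    )
    cx = perimeter / area          # complexity CX = P / A
    cc = area / perimeter ** 2     # compactness CC = A / P^2
    return cx, cc

For a solid 10x10 square this gives P = 40 and A = 100, hence CX = 0.4 and CC = 0.0625, consistent with the linear and quadratic growth noted above.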
Figure 6. a) Original image, and b) its filtered version with 500 regions
Figure 7. a) Output image after applying the complexity connected operator in the range (0.6 : 3), b) output image after applying the compactness connected operator in the range (0.0025 : 0.022), and c) output image after applying both criteria (0.6 : 3) & (0.0025 : 0.022)
3 Results

First, complexity and compactness have been tested separately in order to assess how restrictive they are and the consequences of varying the threshold range. If the whole range is selected, many non-letter regions are reconstructed; as the ranges become narrower, fewer false positives appear. That could seem good news, but it is not if our aim is to develop a method working as independently as possible of the input image, since narrowing the ranges is just the opposite. At this point, the necessity of handling more than one criterion at the same time becomes clear: overlapping criteria is equivalent to narrowing each individual range. This fact can be observed in Figure 7. The input images are shown in Figure 6: a) is the original image and b) is the result of segmenting the original image into 500 regions. Figures 7 a) and 7 b) are the outputs for the complexity and compactness ranges (0.6 : 3) and (0.0025 : 0.022) respectively, whereas 7 c) shows the output image after applying both criteria. Clearly, the output image improves, but it also proves that with wide ranges the criteria are not discriminative enough. If our method were application oriented, the results would be very promising simply by narrowing both criterion ranges: Figures 8 a) and b) show how the output images improve simply by tightening the complexity range.
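A sketch of the intersection of both criteria used for Figure 7 c), reusing the complexity_compactness function from the sketch in Section 2.2. For simplicity it applies the ranges to the connected components of a binary mask rather than to Max-tree nodes, which is enough to illustrate how intersecting the ranges narrows the selection; the ranges are the ones quoted above.

import numpy as np
from scipy import ndimage

CX_RANGE = (0.6, 3.0)       # complexity range from Figure 7 a)
CC_RANGE = (0.0025, 0.022)  # compactness range from Figure 7 b)

def filter_regions(binary):
    """Keep only components whose CX and CC both fall in range."""
    labels, n = ndimage.label(binary)
    out = np.zeros_like(binary, dtype=bool)
    for lab in range(1, n + 1):
        mask = labels == lab
        cx, cc = complexity_compactness(mask)
        if CX_RANGE[0] <= cx <= CX_RANGE[1] and CC_RANGE[0] <= cc <= CC_RANGE[1]:
            out |= mask
    return out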
Figure 8. Output images with different compactness and complexity ranges
Figure 10. Original image
Figure 9. Output image after filtering with the three criteria, without tightening the complexity and compactness ranges
But there are some regions, such as the badge on the T-shirt worn by the referee, which are almost impossible to eliminate from the output image by applying both criteria alone: the badge has a shape and a contrast that fit those of a letter. In the previous section another criterion was presented: contrast. Figure 9 shows some preliminary results, which correspond to the complexity and compactness ranges of Figure 7 c).
Figures 11 a) and b) present further results obtained by applying complexity and compactness at the same time to an image on which no pre-processing has been performed. In this case, we want to show how the approach works when the same image contains both letters brighter than the background, for example the time at the top left corner or the phrase 'CANAL EN PERIODE DE PROVES' centred at the bottom of the image in Figure 10, and letters darker than the background, for example the temperature or the relative humidity, also in Figure 10. In this case, two trees are created, a Max-tree and a Min-tree. The Max-tree allows us to recover the bright letters, because they are usually represented by the leaves of the tree, and the Min-tree allows us to recover the dark letters, because in this case it is the dark objects that are represented by the leaves. The output images preserve the letters quite well, but some other small regions have also been preserved. This is due to the fact that, when regions are small, their complexity and compactness values tend to be similar to those of the letters. In a second stage, when text features are applied, all these isolated regions will disappear.
Figure 11. a) Output image, whose input image has been represented by means of a Max-tree, and b) output image, whose input image has been represented by means of a Min-tree
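A minimal sketch of this dual-polarity strategy: since a Min-tree of an image is equivalent to a Max-tree of its negative, dark letters can be recovered by running the same bright-letter pipeline on the inverted image and merging both candidate masks. Here detect_bright_text stands in for the whole Max-tree filtering stage; it is an assumed callable, not part of the paper's code.

import numpy as np

def detect_text_both_polarities(gray, detect_bright_text):
    bright = detect_bright_text(gray)        # Max-tree: letters lighter
    dark = detect_bright_text(255 - gray)    # Min-tree via inversion
    return bright | dark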
4 Conclusions and further work
The aim of this approach is to provide a tool to extract text information from data in image or video format by filtering the Max-tree representation of an input image. The filtering is done by analysing two letter features: complexity and compactness. Together, these criteria provide satisfactory results, but they are not discriminative enough to avoid false positives and, at the same time, they are too strict to avoid false negatives. The reason is that the ranges are too wide, and some regions that do not represent letters also have a complexity and a compactness inside the ranges. Moreover, the decision step, implemented with the Viterbi algorithm, helps to provide robust and accurate results. In conclusion, a set of criteria has been tested to detect character regions in images and the results are very promising. Other features could be combined with those already utilised, such as a contrast-based criterion or a stroke-thickness-homogeneity criterion, leading to a more robust approach. Once this stage is more stable, the next step will be to develop tools to handle the set of letter candidate regions as a whole, being capable of discarding false positives by applying text features such as horizontal direction, temporal redundancy, etc.
5 Acknowledgements

This material is based upon work partially supported by the IST programme of the EU through the NoE IST-2000-32795 SCHEMA and TEC2004-01914 of the Spanish Government.
6 References

[1] M. León, A. Gasull, Text detection in images and video sequences, IADAT International Conference on Multimedia, Image Processing and Computer Vision, Madrid (Spain), March 2005
[2] L. Agnihotri, N. Dimitrova, M. Soletic, Multi-layered Videotext Extraction Method, IEEE International Conference on Multimedia and Expo (ICME), Lausanne (Switzerland), August 26-29, 2002
[3] D. Chen, J.-M. Odobez, H. Bourlard, Text Detection and Recognition in Images and Video Frames, Pattern Recognition, accepted 20 June 2003
[4] Y. Zhong, H. Zhang, A.K. Jain, Automatic Caption Localization in Compressed Video, IEEE Trans. PAMI, April 2000, vol. 22, no. 4, pp. 385-393
[5] H. Li, D. Doermann, O. Kia, Automatic Text Detection and Tracking in Digital Video, Univ. of Maryland, College Park, Tech. Reps. LAMP-TR-028, CAR-TR-900, 1998
[6] H. Li, D. Doermann, O. Kia, Automatic Text Detection and Tracking in Digital Video, IEEE Trans. on Image Processing, January 2000, vol. 9, no. 1, pp. 147-155
[7] L. Agnihotri, N. Dimitrova, Text Detection for Video Analysis, IEEE Workshop on CBAIVL, 1999, pp. 109-113
[8] X.-S. Hua, X.-R. Chen, et al., Automatic Location of Text in Video Frames, Intl. Workshop on Multimedia Information Retrieval (MIR 2001, in conjunction with ACM Multimedia 2001), 2001
[9] M.A. Smith, T. Kanade, Video Skimming and Characterization through the Combination of Image and Language Understanding Techniques, IEEE Computer Vision and Pattern Recognition, 1997, pp. 775-781
[10] R. Lienhart, F. Stuber, Automatic Text Recognition in Digital Videos, Proceedings of SPIE Image and Video Processing IV 2666, 1996, pp. 180-188
[11] URL: http://www.informatik.uni-mannheim.de/informatik/pi4/projects/MoCA/ProjecttextSegmentationAndRecognition.html
[12] A.K. Jain, S. Bhattacharjee, Text Segmentation using Gabor Filters for Automatic Document Processing, Machine Vision and Applications, 1992, vol. 5, pp. 169-184
[13] V. Wu, R. Manmatha, E.M. Riseman, Automatic Text Detection and Recognition, Proceedings of Image Understanding Workshop, 1997, pp. 707-712
[14] V. Wu, R. Manmatha, TEXTFINDER: An Automatic System to Detect and Recognize Text in Frames, IEEE Transactions on Pattern Analysis and Machine Intelligence, November 1999, vol. 21, no. 11, pp. 1224-1229
[15] H.-K. Kim, Efficient Automatic Text Location Method and Content-based Indexing and Structuring of Video Database, Journal of Visual Communication and Image Representation, December 1996, vol. 7, no. 4, pp. 336-344
[16] E.K. Wong, M. Chen, A Robust Algorithm for Text Extraction in Color Video, Proceedings of IEEE International Conference on Multimedia and Expo, 2000, vol. 2, pp. 797-800
[17] X. Tang et al., Video Text Extraction using Temporal Feature Vectors, Proc. of IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland, August 2002
[18] P. Salembier, L. Garrido, D. Garcia, Auto-dual connected operators based on iterative merging algorithms, Fourth Int. Symposium on Mathematical Morphology, Amsterdam, The Netherlands, June 1998
[19] P. Salembier, A. Oliveras, L. Garrido, Anti-extensive Connected Operators for Image and Sequence Processing, IEEE Transactions on Image Processing, 7(4):555-570, April 1998
[20] P. Salembier, L. Garrido, Connected operators based on region-tree pruning strategies, 15th International Conference on Pattern Recognition, ICPR'2000, vol. 3, pp. 371-374, Barcelona, Spain, September 2000
[21] S. Mallo, Detección de caracteres mediante operadores conexos, PFC ETSETB, 2004