TEXT DETECTION IN IMAGES AND VIDEO SEQUENCES
Miriam León, Antoni Gasull
Image Processing Group, Department of Signal Theory and Communications, Technical University of Catalonia
UPC - Campus Nord, C/ Jordi Girona 1-3, 08034 Barcelona, Catalonia (Spain)
E-mail: {mleon, gasull}@gps.tsc.upc.es
Abstract
Caption text or superimposed text provides valuable information about the contents of images and video sequences. In this paper we present, on the one hand, a general overview of text features and a classification of text extraction methods, and, on the other hand, our tree-structure-based bottom-up approach to text extraction, showing some promising results. The purpose of this work is to develop a framework that detects text as independently as possible of the content, quality or font present in the image or video sequence.
Keywords Text detection, text localization, character features, max-tree structure.
1. INTRODUCTION
Nowadays, the volume of data stored in video format makes it necessary to create tools that extract information from video sequences in order to classify them (e.g. indexing) or to analyse them (e.g. content-based image retrieval, CBIR) without human supervision. Contents can be perceptual, such as colour, shapes or textures, or semantic, such as objects, text, events and their relationships. Perceptual contents are easier to analyse automatically, while semantic contents are easier to handle linguistically. Caption or superimposed text is a semantic content whose computational cost is lower than that of other semantic contents. Since text is usually synchronized with and related to the scene, its extraction provides very relevant information for semantic analysis.

Although text is easily detected by humans, even in the case of a foreign language, to our knowledge there is no method capable of extracting it from any kind of video sequence. This is due to the wide range of text formats (size, style, orientation, ...), the low resolution of the images and the complexity of the background. Despite these difficulties, text lines present some homogeneity features that make them detectable, such as contrast, spatial cohesion, textured appearance, colour homogeneity, stroke thickness, temporal uniformity, movement along the sequence and position in the frame.

The aim of this paper is therefore twofold. Section 2 presents a general overview of text features and text localization methods. Section 3 introduces our approach to text extraction and shows some promising results. Finally, the last section draws some conclusions and outlines future work.
2. OVERVIEW ON TEXT DETECTION
Much research work has been carried out in the area of text detection and localization in both images and video. For a better understanding of the different methods, the main character features are described first. Although hardly ever are all of them taken into account in the same method, some of them, such as contrast or colour homogeneity, are always present.
2.1. Character and text features
Some of the main caption text features are the following:
Contrast between text and background. Contrast is an important feature, since in most cases text must be readable: it cannot be blurred or occluded. Very often a high contrast and a steady brightness are required. One of the main problems when localizing text is, unsurprisingly, low contrast combined with a complex background, which turns detection into an almost impossible task. In these cases, some enhancement tool must be employed as a pre-processing step.

Spatial cohesion. All the features included under this heading relate to geometric aspects (a minimal sketch of these checks is given at the end of this subsection): Typography, which refers to the font type used; Size, where the minimal height and width allowing viewers to read the text are approximately 15 and 7 pixels, respectively, although the width-to-height ratio can take values up to 0.9; Word and sentence length, which can be used to separate words given their height and spacing; Compactness (or fill factor), which, if a bounding box containing the letter is built, is the ratio between the pixels belonging to the letter and those belonging to the background. This feature can be applied both to a single character and to an entire word; values in the interval [0.1, 1] are usually allowed [14], although this depends on the authors' criteria; Direction, since text possesses a certain orientation to make it more easily readable and is generally displayed horizontally.

Textured appearance. The two previous features, contrast and spatial cohesion, can turn text detection into texture segmentation. Considered as a whole entity, text has enough features to be detectable as a texture. Mismatching problems can appear when image textures are very similar to text, such as the leaves of a tree.

Colour homogeneity. Characters are usually monochrome. Some papers take colour homogeneity as the main feature, because colour segmentation preserves contours better than, for example, contrast segmentation, which may blur some edges [14]. Polychrome characters can also be found, but they serve artistic rather than informative purposes, so some authors tend to discard them.

Stroke thickness and density. These contribute to the textured appearance of the text, because the stroke is almost always uniform. Thickness usually remains constant, except for some typographies. Other stroke-related attributes are the number of strokes in a character and their density, see [17].

Temporal uniformity and redundancy. People need time to read a sentence. This means that if 25 frames are displayed every second, the same caption text will be overlaid on as many frames as needed to make the sentence readable. Vision research has determined that humans need between two and three seconds to understand or process a complex image. Temporal uniformity affects not only the visualization time but also the variation of the text size and its movement throughout the video sequence, which cannot change sharply from frame to frame.

Movement on the frame. Related to the previous feature, this describes the most common text behaviour on the screen over time: Static text, where characters present no movement; Scrolling and crawling text, which normally moves linearly, either horizontally from right to left (crawling) or vertically from bottom to top (scrolling); Flying text, the least common kind of movement, which describes free movement on the frame.
When text candidates present neither static nor linear motion, the candidate text blocks can be discarded under the assumption that flying text is usually intended not to give information but to attract attention. Velocity is another discarding criterion: when candidate regions move too fast, they are not intended to be read.

Position in the frame. Caption text is normally superimposed on the same frame area, typically centred at the bottom of the screen. However, it is placed wherever it does not occlude the video content (e.g. a football match scoreboard in the top left/right corner).

Text detection algorithms can be classified into two categories: those working in the compressed domain and those working in the spatial domain. Every method takes into account different features depending on the type of sequence (e.g. sports event or news) and its quality.
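The spatial cohesion constraints listed above reduce to a handful of numeric tests. The following is a minimal sketch in Python, ours rather than from any cited method; the (width, height, n_pixels) region description and the function name are hypothetical, while the thresholds are the ones given in this subsection.

```python
# Minimal sketch (our own, not from the cited papers) of the spatial
# cohesion checks: the 15 px / 7 px size thresholds, the width-to-height
# ratio bound and the fill-factor interval [0.1, 1] follow Section 2.1.

MIN_HEIGHT = 15    # minimal character height readable by viewers (pixels)
MIN_WIDTH = 7      # minimal character width (pixels)
MAX_RATIO = 0.9    # maximal width-to-height ratio

def is_text_candidate(width: int, height: int, n_pixels: int) -> bool:
    """Accept a candidate region only if it satisfies the geometric features."""
    if height < MIN_HEIGHT or width < MIN_WIDTH:
        return False                               # too small to be readable
    if width / height > MAX_RATIO:
        return False                               # aspect ratio out of range
    fill_factor = n_pixels / (width * height)      # compactness / fill factor
    return 0.1 <= fill_factor <= 1.0               # interval used in [14]

# Example: a 10x20 px region with 90 foreground pixels is accepted.
print(is_text_candidate(width=10, height=20, n_pixels=90))   # True
```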
2.2. Compressed and semi-compressed domain
In this group we include algorithms working both in the compressed and in the semi-compressed domain.

Compressed domain: These methods are based on the localization of static characters over a moving background, taking into account the macroblocks belonging to P-frames (MPEG-4) [2]. Moreover, they assume that text has horizontal geometry, that it does not occupy the whole frame and that it appears in at least three frames. These three assumptions allow the algorithm to isolate macroblocks and to determine whether they are candidates to contain text. Both recall and precision are high in sequences with a moving background and static text, such as sports sequences (e.g. the score in a football match), but the approach cannot be used in sequences containing moving text or a static background.

Semi-compressed domain: This group is called semi-compressed because the algorithms do not work directly with macroblocks but analyse the DCT (Discrete Cosine Transform) coefficients [2], [4] and [29]. The DCT and motion compensation are used in the MPEG video compression standard to reduce spatial redundancy within a frame and temporal redundancy between consecutive frames, respectively. DCT coefficients represent spatial and directional periodicity, so low-level features can be extracted directly from compressed images. AC coefficients of the horizontal harmonics reflect horizontal intensity variations and will therefore be high when a text line is present; AC coefficients of the vertical harmonics reflect vertical intensity variations and will be high when there is more than one text line (a small numerical sketch of this cue is given below). A detailed explanation of the interpretation of the DCT coefficients can be found in [20].

The DCT block size, the character size and their ratio are important. For instance, if each letter is bigger than the block size, we will be evaluating the intensity variation of a single letter stroke, rather than the intensity variation of the text relative to the background. Likewise, if the letters are too small, any texture could resemble text and be easily confused with it (e.g. a grass field).

In [2] the algorithms are classified into edge-based and correlation methods. Edge-based methods exploit the contrast between text and background, which is why they first compute the horizontal intensity variation from the DCT coefficients. The correlation method [5] is applied only when a shot changes, so shot detection must be performed beforehand. To decide whether the new shot contains text, the increment of intra-coded blocks is computed over the B- and P-frames. Once the candidate blocks are chosen, some text features are searched for to localize the text, for example stroke structure, colour homogeneity and horizontal geometry. However, in [29] this method is analysed and discarded because of its vulnerability to scene changes.

Other transforms can also be used, such as the DWT (Discrete Wavelet Transform). The DWT gives more information than the DCT because spatial information is not lost by the transform; therefore, areas with high values at high frequencies are found more easily. Both [12] and [11] suggest the wavelet transform because of its capability to preserve spatial information: the text boxes are found through a hybrid of wavelet transform and neural network. The wavelet output provides relevant statistical features; in particular, the most discriminative are the mean and the second- and third-order moments computed from the HL, LH and HH subbands. These vectors are used as input to a neural network (a sketch of such a feature vector follows the DCT example below). In [3] other transforms, such as the DHT (Discrete Haar Transform), the DFT (Discrete Fourier Transform) and the WHT (Walsh-Hadamard Transform), are described in addition to the DWT. As a pre-processing tool, in [27] this kind of algorithm is used in a first step to localize candidate areas.
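To illustrate the DCT cue, here is a minimal sketch, not taken from any of the cited papers: it computes the 8x8 block DCT with SciPy and sums the energy of the horizontal AC harmonics, which should be high for blocks crossed by a text line. The function name and the synthetic test blocks are our own.

```python
# Minimal sketch (assumed, not from the cited papers) of the DCT-based
# cue: the energy of the horizontal AC harmonics of an 8x8 block is a
# rough indicator of horizontal intensity variation, high for text lines.
import numpy as np
from scipy.fftpack import dct

def horizontal_ac_energy(block: np.ndarray) -> float:
    """Energy of the horizontal AC coefficients of an 8x8 grayscale block."""
    # 2-D DCT-II, applied separably along columns and rows.
    coeffs = dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')
    # First row, columns 1..7: horizontal harmonics (DC excluded).
    return float(np.sum(coeffs[0, 1:] ** 2))

# Example: a block with vertical stripes (strong horizontal variation,
# as a text stroke pattern would produce) vs. a flat block.
stripes = np.tile([0.0, 255.0] * 4, (8, 1))
flat = np.full((8, 8), 128.0)
print(horizontal_ac_energy(stripes) > horizontal_ac_energy(flat))   # True
```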
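Similarly, a hedged sketch of the wavelet feature vector suggested by [11] and [12]: the mean and the second- and third-order central moments of the three detail subbands of a window, computed here with the PyWavelets package. The window size, the 'haar' wavelet and the helper name are our assumptions, as the papers do not fix them in this overview.

```python
# Minimal sketch (assumed, following the idea in [11], [12]) of the
# wavelet statistics used as input to a neural network classifier.
import numpy as np
import pywt

def wavelet_features(window: np.ndarray) -> np.ndarray:
    """9-dimensional feature vector from one wavelet decomposition level."""
    _, details = pywt.dwt2(window, 'haar')        # three detail subbands
    feats = []
    for band in details:
        mu = band.mean()
        feats += [mu,
                  np.mean((band - mu) ** 2),      # second-order central moment
                  np.mean((band - mu) ** 3)]      # third-order central moment
    return np.array(feats)
```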
2.3. Spatial domain
Methods that work directly with pixel values and positions are called spatial domain methods, and they can be classified according to the following image features:

Edge-based [1], [2], [6] and [21]. Methods in this group focus on searching for areas with a high contrast between text and background. In this way, the edges of the letters are identified and merged; once these regions are recognised, spatial cohesion features are applied in order to discard false positives.

Connected-components-based [13] and [15]. These methods follow a bottom-up approach, iteratively merging sets of connected pixels under a homogeneity criterion, which leads to the creation of flat zones or connected components (CC). At the end of the iterative procedure all the flat zones are identified, and again spatial cohesion features are applied (see the sketch after this list). Both edge-based and CC-based methods could be grouped under the same heading of region-based methods.

Texture-based [8], [10], [12], [25] and [26]. Many of the existing methods belong to this group: they exploit the property that text in images has distinct textural characteristics that distinguish it from the background, for example methods using Gabor filters [8] or Gaussian filters [26], and methods based on the colour and shape of the regions [25] and [10]. Had we not classified the methods into spatial and compressed domain, those based on wavelets or the FFT would also exploit these textural properties.

Correlation-based [24]. These methods use some kind of correlation in order to decide whether a pixel belongs to a character or not.

Others [22]. All the methods mentioned so far either do not use temporal information or use it only as a complementary tool. In [22], temporal information is the main feature. After applying a shot detection technique, a vector collects, for each image pixel, its values along a fixed number of frames. The authors prove that, by computing the PCA of these vectors, the feature vectors related to the background can be separated from those related to text. The main drawback of this method is that it can only be applied when the sequence has static text and a moving background.
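The following is a minimal sketch of the CC-based idea, again an assumption rather than an implementation from the cited papers: threshold a grey-level image, label the connected components and keep those whose bounding boxes satisfy the readability thresholds of Section 2.1. The threshold value and function name are hypothetical.

```python
# Minimal sketch (our own) of the connected-components approach:
# label the flat zones of a thresholded image and filter them by size.
import numpy as np
from scipy import ndimage

def text_candidate_components(image: np.ndarray, threshold: int = 128):
    """Return bounding boxes of components that could contain characters."""
    binary = image > threshold                    # bright text on dark background
    labels, _ = ndimage.label(binary)             # 4-connected flat zones
    candidates = []
    for box in ndimage.find_objects(labels):      # one slice pair per component
        height = box[0].stop - box[0].start
        width = box[1].stop - box[1].start
        if height >= 15 and width >= 7:           # readability thresholds (Sec. 2.1)
            candidates.append(box)
    return candidates
```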
3. TREE STRUCTURE-BASED APPROACH
Even though some of the methods presented in the overview have been tested, and even improved in some aspects, the fact is that none of them works satisfactorily at extracting superimposed text in every case, because they are usually application-oriented. Our aim is to provide a more robust method, working as independently as possible of the content, quality or font of the image or sequence.

We have decided to represent the input image by means of a Max-tree. In this tree structure each node is associated with a connected component resulting from thresholding the image at every grey level. This structure allows the image to be filtered easily by pruning the tree with connected operators based on geometric features [19]. A priori, some geometric properties of the text have been modelled statistically. The model is based on the geometric features of a set of fonts, such as compactness (A/P^2) and complexity (P/A), where A and P denote the area and perimeter of the component. The most probable values of these character features are then used as thresholds to filter the tree. As the chosen criteria are non-increasing, the pruning decision is not taken directly but by means of the Viterbi algorithm. This filtering constitutes a first stage and results in a set of candidate text regions, see Fig. 1. Since some non-text regions might not be pruned, a second stage may be needed. In this second stage, which is future work, the candidate regions are analysed in order to discard false positives and to find the text as a whole, i.e. as an object constituted by a set of sub-regions.
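The two descriptors can be computed per connected component as follows. This is a minimal sketch under our own conventions, not the Max-tree/Viterbi implementation itself: the perimeter is estimated by counting foreground pixels with at least one 4-connected background neighbour.

```python
# Minimal sketch (assumed) of the geometric descriptors used to model
# fonts: compactness A/P^2 and complexity P/A of a binary component.
import numpy as np

def compactness_and_complexity(component: np.ndarray):
    """component: 2-D boolean mask of a single connected component."""
    area = float(component.sum())
    padded = np.pad(component, 1)                 # background border, no edge effects
    # Interior pixels: all four 4-neighbours are also foreground.
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    perimeter = float((component & ~interior).sum())
    return area / perimeter ** 2, perimeter / area   # (compactness, complexity)

# Example: a solid 10x10 square has 36 boundary pixels.
square = np.ones((10, 10), dtype=bool)
print(compactness_and_complexity(square))        # ~ (0.077, 0.36)
```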
Fig. 1. a) Input image. b) First-stage output image: detected text, with two examples of false positive regions marked with red circles.
4. CONCLUSIONS AND FURTHER WORK
According to the classification presented in Section 2, our method is a region-based approach in the spatial domain. As it consists of selecting candidate regions using prior statistical models of the fonts and then grouping those candidates that fulfil certain geometric features (e.g. spatial cohesion), our approach can be described as bottom-up. The selection of candidates shown in this paper provides quite satisfactory results. However, complexity and compactness are not selective enough, which leads to false positives, i.e. non-text regions that are not filtered out, such as the regions marked with red circles in Fig. 1 b). In order to reduce these false positives, we are working on more robust criteria (e.g. a stroke thickness homogeneity criterion). The introduction of the second stage into our framework is current work. Another advantage of using the Max-tree in this framework is that it can easily be extended to deal with video sequences.
5. ACKNOWLEDGEMENTS
This material is based upon work partially supported by the IST programme of the EU through the NoE IST-2000-32795 SCHEMA and by the project TEC2004-01914 of the Spanish Government.
References
[1] Agnihotri, L., Dimitrova, N., "Text Detection for Video Analysis", IEEE Workshop on CBAIVL, 1999, pp. 109-113.
[2] Agnihotri, L., Dimitrova, N., Soletic, M., "Multi-layered Videotext Extraction Method", IEEE International Conference on Multimedia and Expo (ICME), Lausanne, Switzerland, August 26-29, 2002.
[3] Chaddha, N., Gupta, A., "Text Segmentation Using Linear Transforms", Proc. of Asilomar Conference on Circuits and Computers, 1996, pp. 422-427.
[4] Chen, D., Odobez, J.-M., Bourlard, H., "Text Detection and Recognition in Images and Video Frames", Pattern Recognition, accepted 20 June 2003.
[5] Gargi, U., Antani, S., Kasturi, R., "Indexing Text Events in Digital Video Databases", ICPR, August 1998, pp. 916-918.
[6] Hua, X.-S., Chen, X.-R., et al., "Automatic Location of Text in Video Frames", Intl. Workshop on Multimedia Information Retrieval (MIR 2001, in conjunction with ACM Multimedia 2001), 2001.
[7] Hua, X.-S., Yin, P., Zhang, H.-J., "Efficient Video Text Recognition Using Multiple Frame Integration", IEEE International Conference on Image Processing (ICIP 2002), Rochester, NY, USA, September 22-25, 2002.
[8] Jain, A.K., Bhattacharjee, S., "Text Segmentation Using Gabor Filters for Automatic Document Processing", Machine Vision and Applications, 1992, vol. 5, pp. 169-184.
[9] Jain, A.K., Yu, B., "Automatic Text Location in Images and Video Frames", Pattern Recognition, 1998, vol. 31, no. 12, pp. 2055-2076.
[10] Kim, H.-K., "Efficient Automatic Text Location Method and Content-Based Indexing and Structuring of Video Database", Journal of Visual Communication and Image Representation, December 1996, vol. 7, no. 4, pp. 336-344.
[11] Li, H., Doermann, D., Kia, O., "Automatic Text Detection and Tracking in Digital Video", Univ. of Maryland, College Park, Tech. Reports LAMP-TR-028, CAR-TR-900, 1998.
[12] Li, H., Doermann, D., Kia, O., "Automatic Text Detection and Tracking in Digital Video", IEEE Trans. on Image Processing, January 2000, vol. 9, no. 1, pp. 147-155.
[13] Lienhart, R., Stuber, F., "Automatic Text Recognition in Digital Videos", Proceedings of SPIE, Image and Video Processing IV, vol. 2666, 1996, pp. 180-188.
[14] Lienhart, R., Effelsberg, W., "Automatic Text Segmentation and Text Recognition for Video Indexing", Technical Report TR-98-009, Praktische Informatik IV, University of Mannheim, May 1998.
[15] URL: http://www.informatik.uni-mannheim.de/informatik/pi4/projects/MoCA/ProjecttextSegmentationAndRecognition.html, May 1998.
[16] Myers, G.K., Bolles, R.C., Luong, Q.-T., Herson, J.A., "Recognition of Text in 3-D Scenes", Fourth Symposium on Document Image Understanding Technology, Columbia, Maryland, April 2001.
[17] Rosenfeld, A., Doermann, D., DeMenthon, D., "Video Mining", Kluwer Academic Publishers (KAP), USA, 2003.
[18] URL: http://citeseer.nj.nec.com/sato99video.html
[19] Salembier, P., Garrido, L., Garcia, D., "Auto-dual Connected Operators Based on Iterative Merging Algorithms", Fourth Int. Symposium on Mathematical Morphology, Amsterdam, The Netherlands, June 1998.
[20] Shen, B., Sethi, I.K., "Direct Feature Extraction from Compressed Images", SPIE, Storage and Retrieval for Image and Video Databases IV, 1996, vol. 2670.
[21] Smith, M.A., Kanade, T., "Video Skimming and Characterization through the Combination of Image and Language Understanding Techniques", IEEE Computer Vision and Pattern Recognition, 1997, pp. 775-781.
[22] Tang, X., et al., "Video Text Extraction Using Temporal Feature Vectors", Proc. of IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland, August 2002.
[23] Tekinalp, S., Alatan, A.A., "Utilization of Texture, Contrast and Color Homogeneity for Detecting and Recognizing Text from Video Frames", IEEE International Conference on Image Processing (ICIP 2003), Barcelona, Spain, September 14-17, 2003.
[24] Wong, E.K., Chen, M., "A Robust Algorithm for Text Extraction in Color Video", Proceedings of IEEE International Conference on Multimedia and Expo, 2000, vol. 2, pp. 797-800.
[25] Wu, V., Manmatha, R., "TextFinder: An Automatic System to Detect and Recognize Text in Images", IEEE Transactions on Pattern Analysis and Machine Intelligence, November 1999, vol. 21, no. 11, pp. 1224-1229.
[26] Wu, V., Manmatha, R., Riseman, E.M., "Automatic Text Detection and Recognition", Proceedings of Image Understanding Workshop, 1997, pp. 707-712.
[27] Zhang, D., Rajendran, R.K., Chang, S.-F., "General and Domain-Specific Techniques for Detecting and Recognizing Superimposed Text in Video", IEEE International Conference on Image Processing (ICIP 2002), Rochester, NY, USA, September 22-25, 2002.
[28] Zhong, D.X., "Color Space Analysis and Color Image Segmentation", VIP2000 Pan-Sydney Area Workshop on Visual Information Processing, December 2000.
[29] Zhong, Y., Zhang, H., Jain, A.K., "Automatic Caption Localization in Compressed Video", IEEE Trans. PAMI, April 2000, vol. 22, no. 4, pp. 385-393.