Deliverable D2.2: Semantics extraction from visual content tools – The state-of-the-art Final Version 4.0

I. Pratikakis1, A. Bolovinou1, B. Gatos1, M. Anthimopoulos1, S. Perantonis1, C. Seibert2, S. Eickeler2, T. Merten2

Distribution: Public

BOEMIE – Bootstrapping Ontology Evolution with Multimedia Information Extraction

1 National Centre for Scientific Research “Demokritos” (NCSR)
2 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. (FHG/IMK)
University of Milano (UniMi)
Centre for Research and Technology Hellas (CERTH)
Hamburg University of Technology (TUHH)
Tele Atlas (TA)

Project ref. no. FP6-027538

Project acronym BOEMIE

Project full title Bootstrapping Ontology Evolution with Multimedia Information Extraction

Security (distribution level): Public
Contractual date of delivery: M12
Actual date of delivery: 20-Mar-07
Deliverable number: D2.2
Deliverable name: Semantics extraction from visual content tools – The state-of-the-art
Type: Report
Status & version: Final v4.0
Number of pages: 92
WP contributing to the document: WP2
WP / Task responsible: NCSR
Other contributors: –
Author(s): Ioannis Pratikakis, Anastasia Bolovinou, Bassilios Gatos, Marios Anthimopoulos, Stavros Perantonis, Christoph Seibert, Stefan Eickeler, Thorsten Merten

Quality Assurance:
EC Project Officer: J. Hagman
Keywords: Semantics extraction, visual content, knowledge

Abstract (for dissemination): This deliverable presents the state of the art on “semantics extraction from visual content”. We first give an overview of OCR methodologies dealing with visual content, namely image and video, that provides textual information. Although this kind of analysis does not directly provide any semantics, it is a critical step involving visual content, which feeds BOEMIE’s module dedicated to semantics extraction from text. The remainder of the document is dedicated to semantics extraction from images and videos, for which an exhaustive survey is presented for different types of knowledge as well as different content representations. Related R&D projects that have semantics extraction from visual content among their main objectives, existing systems that include dedicated modules for semantics extraction, and evaluation efforts that test the performance of semantics extraction are also presented.

Table of Contents

Executive Summary
1. Introduction
2. Image and Video OCR
   2.1 Introduction
   2.2 Text Extraction from Video or Images
      2.2.1 Text Detection
      2.2.2 Text Segmentation
      2.2.3 Text Recognition
   2.3 Evaluation methods
   2.4 Open Issues
3. Semantics extraction from images and video
   3.1 Semantics extraction basic pipeline
   3.2 Use of Implicit Knowledge in low-level representation
      3.2.1 Low-level features learning
      3.2.2 Words/image joint distribution learning
   3.3 Use of Implicit Knowledge in intermediate representation
      3.3.1 Bag of words model
      3.3.2 Words/image joint distribution learning
   3.4 Use of Implicit Knowledge in semantic representation
   3.5 Use of Explicit Knowledge (Ontology-driven approaches)
   3.6 Use of Explicit / Implicit knowledge during a segmentation / recognition interplay
   3.7 A “bird’s eye view” on the state of the art
4. Related projects
   ACEMEDIA, AIM@SHAPE, CARETAKER, DIRECT-INFO, FAETHON, K-SPACE, LIVE, MediaCampaign, MESH, MUMIS, POLYMNIA, REVEAL-THIS, SHARE, VACE – Phase III, X-MEDIA
5. Related systems
   MediaMill (University of Amsterdam), MARVEL (IBM), IMKA (Columbia University), CuVid (Columbia University)
6. Related evaluation challenges
   TRECVID, MediaMill Challenge, PASCAL Visual Object Classes (VOC) Challenge
7. Concluding remarks
REFERENCES


Executive Summary

In BOEMIE, semantics extraction from multimedia content involves both visual (images, video) and non-visual content (text, audio/speech). This report focuses on extraction from visual content; it aims to present the state-of-the-art methods and systems, and to discuss the current trends and open issues that motivate the relevant research work in the context of the project. Furthermore, related R&D projects that have semantics extraction from visual content among their main objectives are presented, and evaluation efforts that test the performance of semantics extraction are discussed.

The state-of-the-art methodologies are presented in two distinct parts that correspond to Sections 2 and 3, respectively. In the first part, we give an overview of OCR methodologies dealing with visual content that provides textual information, namely image and video. Although this kind of analysis does not directly provide any semantics, it is a critical step involving visual content, which feeds BOEMIE’s module dedicated to semantics extraction from text. In the second part, we discuss the state-of-the-art methodologies that use visual content to directly extract semantics of different granularity.

Regarding image and video OCR, the reported survey is developed upon the main analysis components which lead to OCR, namely text detection, text segmentation and text recognition. A further categorization is discussed with respect to whether the complete OCR process proceeds in a bottom-up or top-down manner. The challenge in BOEMIE is threefold: first, an effective binarisation has to be developed, since we treat situations in which complex backgrounds appear; second, both artificial and scene text have to be recognized; and third, effective text tracking for non-linear text movement is required.

Concerning semantics extraction from visual content, methodologies are mainly developed by taking into account alternatives along two main paths: (i) visual content representation and (ii) use of knowledge. In the literature, the visual content representation used has been built upon three basic alternatives. In the first, we have a low-level representation that can be supported by settings which can be: (i) region-based; (ii) patch-based; (iii) block-based; and (iv) the entire image. In the second, we have an intermediate representation that can be supported by visual vocabularies, and finally, the third alternative concerns relationships among semantic objects. In the case of knowledge acquisition, there are two main trends that depend upon the type of knowledge which is used: knowledge is involved in a semantics extraction process either in an implicit or in an explicit way. In this deliverable, we will discuss existing approaches that use either implicit or explicit knowledge at different levels of representation to address semantics extraction from visual content. Furthermore, we will also discuss an important trend leading to improved semantics extraction that is based upon an interplay between segmentation and recognition approaches.

The aim of this survey is to identify the state-of-the-art methodologies having the best performance with respect to the BOEMIE requirements. This further implies that an optimal methodology for BOEMIE may not be any single one of the state-of-the-art approaches presented, but rather a mixture of them, leading to a hybrid methodology.
The measurable objective of this project is to significantly improve the performance of existing single modality approaches, in terms of scalability and precision.


1. Introduction

Nowadays, multimedia data, i.e. text, images, diagrams, music, speech, sound and video documents, as well as composite documents that contain fragments of data of different types, possibly with temporal synchronization, are widespread and are continuously produced at an increasing rate. In this respect, it becomes a major challenge to succeed in extracting information that may have explicit semantics, enabling humans and machines to make better use of information and to better integrate available data.

Toward the goal of automatic semantics extraction, the major obstacle is the machine’s difficulty in interpreting/understanding what is depicted in digitized visual content. Visual content understanding is inherently an ill-posed problem, as higher-level semantics (e.g. sky at dawn) and low-level syntactic image descriptors (images with predominant blue, white and orange colours) may be disconnected. This is called the semantic gap. In [Gevers-04], “the semantic gap is the lack of coincidence between the information that one can extract from the visual data and the interpretation that the same data have for a user in a given situation”. The semantic gap becomes even harder to bridge considering that the objects in a broad-domain image collection may be captured from different viewpoints, occluded (where essential parts telling two objects apart may be out of sight) and cluttered, with poor lighting, focus and exposure. This fact of “missing” features induces uncertainty in the image interpretation task and is called the sensory gap. In a more formal way, “the sensory gap is the gap between the object in the world and the information in a (computational) description derived from a recording of that scene (e.g. shadows on an object due to varying illumination conditions during capture)” [Gevers-04]. To cope with the sensory gap, low-level image descriptions should be invariant to sensorial variations in order to remain representative for the repertoire of images in a database [Pratikakis-06a]. It is evident that a mapping is required from the low-level features to high-level concepts [Chandrasekaran-88], so that each ‘visual concept’ can be handled under large intra-class variability as well as inter-class similarity [Pinz-06].

Semantics extraction relies on several dependencies that range from domain knowledge to contextual information. In general, the use of a priori knowledge bases is a critical requirement for the guidance of the extraction process. This has long been acknowledged by the various knowledge-based vision systems that have been suggested. Example systems are the following: VISIONS [Hanson-78], SCHEMA [Draper-89], SIGMA [Matsuyama-90], SYGAL and PROGAL [Thonnat-89]. An excellent survey of knowledge-based image understanding systems is given by Crevier and Lepage [Crevier-97]. One strong criticism made by Draper et al. in [Draper-92] is that most knowledge-based systems are tailored to one application domain, and thus a brand new knowledge base has to be constructed for each new application domain. The closed-world assumption entailed by a priori knowledge modelling is also often criticized. Another weakness of knowledge-based systems is that the use of explicit knowledge is not really suited to modelling the variability, the changes and the complexity of the world. Draper also emphasizes the knowledge acquisition bottleneck: it is hard to scale knowledge-based approaches up to large problems.

The major bottleneck regarding the construction of new knowledge bases for new application domains was treated by the ARPA Knowledge Sharing Effort [Neches-91], which envisioned the construction of new knowledge-based systems by assembling reusable components. The envisaged systems would require only the specific knowledge and reasoners that did not exist before. The means to fulfil this expectation was a form of explicit knowledge representation called ontologies. Nonetheless, there is another community of researchers that advocates the use of implicit knowledge representations, which can support accurate and scalable semantics extraction up to a satisfactory level.

In this report, we aim to cover all the basic trends in semantics extraction from visual content (images and video), dealing with different levels of representation, knowledge types, as well as the learning involved. In this way, we can get a clear picture of where BOEMIE’s methodology is situated with respect to advancing the state of the art in semantics extraction from visual content.

The remainder of this report is structured as follows. In Section 2, we give an overview of OCR methodologies dealing with visual content providing textual information, namely image and video. Although these methodologies do not directly provide any semantics, their application is a critical step involving visual content, which feeds BOEMIE’s module that is dedicated to semantics extraction from text. The main focus of this deliverable is semantics extraction from images and videos, which is detailed in Section 3. In Section 4, we present related R&D projects that have semantics extraction from visual content as one of their main objectives, while in Section 5, we report on particular systems that include dedicated modules for semantics extraction. For evaluation purposes, certain competitions/challenges have been organised which aim to test the performance in semantics extraction among the corresponding participants; those challenges are reported in Section 6, while, finally, concluding remarks are given in Section 7.


2. Image and Video OCR

2.1 Introduction

Text detection and recognition in images and videos is a research area which attempts to develop computer systems with the ability to automatically read, from images and videos, text content visually embedded in complex backgrounds. In particular, such a system should answer two typical questions, “where and what”: “where is a text string?” and “what does the text string say?” in an image or a video. In other words, using such a system, text embedded in a complex background can be automatically detected and each character or word can be recognized.

The investigation of text detection and recognition in complex backgrounds is motivated by cutting-edge applications of digital multimedia. Today, more and more audio and visual information is captured, stored, delivered and managed in digital form. The wide usage of digital media files raises many new challenges in mobile information acquisition and large multimedia database management. Among the most prominent are:
(i) Automatic broadcast annotation: creates a structured, searchable view of archives of broadcast content.
(ii) Digital media asset management: archives digital media files for efficient media management.
(iii) Video editing and cataloguing: catalogues video databases on the basis of content relevance.
(iv) Library digitizing: digitizes covers of journals, magazines and various videos using advanced image and video OCR.
(v) Mobile visual sign translation: extracts and translates visual signs or foreign languages for tourists, for example a handheld translator that recognizes and translates Asian signs into English or French.

An image and video text detection and recognition system aims at extending the capability of OCR and text-based retrieval technologies for accessing images and videos. However, text characters contained in images and videos can be:
• Of any color and grayscale value (not always white or black)
• Of low resolution
• Of variable size
• Of unknown font
• Embedded in complex backgrounds

Two types of text are found in images and video, namely artificial and scene text, defined in the following. Artificial (superimposed, graphic, caption, overlay) text is artificially superimposed on images or video frames at the time of editing. Artificial text usually underscores or summarizes the video’s content. This makes artificial text particularly useful for building keyword indexes (e.g. Figure 1). Artificial text usually has the following characteristics:
• Text is always in the foreground and is never occluded.
• The lighting conditions are usually stable and scene-independent.
• The values of text pixels are distributed according to limited rules.
• Size, font, spacing and orientation are constant within a text region.
• Text is usually aligned horizontally.
• The same text appears in several consecutive frames.
• Text usually has a good contrast with the background.

Figure 1: Examples of artificial text

Scene text naturally occurs in the field of view of the camera during video capture. Scene text occurring in signs, banners, etc. may also give keywords that describe the content of a video sequence (e.g. Figure 2).

Figure 2: Examples of scene content

Scene text can be described by the following characteristics:
• Variability of size, font, colour, orientation, style, alignment, even within words
• Part of the text might be occluded
• Complex movement in the scene
• Variability of lighting conditions
• Variability of transformations
• Deformation if on a flexible surface


2.2 Text Extraction from Video or Images

Figure 3: Stages of text extraction from video or images: (a) video/image input; (b) text detection; (c) text segmentation; (d) text recognition; (e) text output.

Text extraction from video or images can generally be divided into three basic stages, as shown in Figure 3. The input of the method is a video sequence or a single image (a). Text detection (b) involves locating the text regions. The output of this stage is a set of tight bounding boxes around all text instances (spatial detection). In the case of video, text detection also includes text tracking (temporal detection), which involves following a text event as it moves or changes over time; in this case the output is a set of bounding boxes in different frames for every text event. Text segmentation (c) includes text enhancement and binarization. Text enhancement involves multi-frame integration on the results of text tracking and resolution enhancement via interpolation. Text recognition (d) involves the character recognition of the enhanced and binarized text instances. The output from this stage consists of ASCII characters (e).

2.2.1 Text Detection

The term “text detection” in images or video frames is often used with different meanings in the literature. Some researchers refer to “text detection” as the determination of the presence of text in a given frame, while “text localization” is used to describe the process of determining the location of text in the image and generating bounding boxes around it. Distinguishing detection from localization might be useful for video text retrieval if a very fast and accurate detection algorithm can be found; in this way the localization algorithm can be omitted for the frames that do not contain text. However, in many cases various localization algorithms have the ability to spatially locate the text at almost the same speed as text detection algorithms. In order to make text localization fast, many researchers have proposed splitting it into two stages: localization and refinement. In this way, a fast and approximate localization algorithm with very high recall is applied first, and afterwards any detected text box is refined by a slower, more complex and more accurate algorithm. Here, we will use “text detection” to describe the whole procedure that finally results in a text box (or a set of text boxes) for every text line of an image. The result will be more than one text box for every text line when the input of the algorithm is a video sequence; in this case, text tracking is also included in text detection. However, in some cases we will use the terminology proposed by the authors of the specific work.

Text detection can generally be classified into two categories: (i) bottom-up methods and (ii) top-down methods. Top-down methods can be further divided into two categories: (a) heuristic methods and (b) machine learning methods.


2.2.1.1 Bottom-up methods (Connected-Components (CC) based methods)

These methods segment images into regions and group “character” regions into words; to some degree, they can avoid performing explicit text detection. Due to the difficulty of developing efficient segmentation algorithms for text in complex backgrounds, the methods are not robust for detecting text in many camera-based images and videos. Due to their relatively simple implementation, CC-based methods are widely used. Nearly all CC-based methods have four processing stages:
(i) preprocessing, such as color clustering and noise reduction;
(ii) CC generation (split-merge algorithm, region-grow algorithm);
(iii) filtering out non-text components;
(iv) component grouping.
A CC-based method could segment a character into multiple CCs, especially in the case of polychrome text strings and low-resolution, noisy video images. Further, the performance of a CC-based method is severely affected by the component grouping, such as projection profile analysis or text line selection. In addition, several threshold values are needed to filter out the non-text components, and these threshold values depend on the image/video database.

Lienhart et al. [Lienhart-95] regard text regions as CCs with the same or similar color and size, and apply motion analysis to enhance the text extraction results for a video sequence. The input image is segmented based on the monochromatic nature of the text components using a split-and-merge algorithm. Segments that are too small or too large are filtered out. After dilation, motion information and contrast analysis are used to enhance the extracted results. A block-matching algorithm using the mean absolute difference criterion is employed to estimate the motion. Blocks missed during tracking are discarded.

Figure 4: Intermediate stages of processing in the method by Lienhart et al [Lienhart-95]. (a) original video frame; (b) image segmentation using split-merge algorithm; (c) after size restriction; (d) after binarization and dilation; (e) after motion analysis; and (f) after contrast analysis and aspect ratio restriction.

Their primary focus is on caption text, such as pre-title sequences, credit titles, and closing sequences, which exhibit a higher contrast with the background. This makes it easy to use the contrast difference between the boundary of the detected components and their background in the filtering stage. Finally, a geometric analysis, including the width, height, and aspect ratio, is used to filter out any non-text components. Based on experiments using 2247 frames, their algorithm extracted 86% to 100% of all the caption text. Figure 4 shows an example of their text extraction process.

Sobottka et al. [Sobottka-99] use a region growing method in order to detect homogeneous regions. Beginning with a start region, pixels are merged if they belong to the same cluster. Then the regions are grouped to form text lines, assuming that text lines consist of more than three regions having a small horizontal distance and a large vertical overlap with each other (Figure 5).

Figure 5: Example for a text line composed of regions Ri

For each pair of regions Ri and Rj it is tested whether their horizontal distance d is smaller than a predefined threshold Θ = 5 * max(w(Ri), h(Ri)), where w(.) and h(.) denote the width and height of a region. Furthermore, it is required that the difference between the upper and lower borders of Ri and Rj is within a certain tolerance interval, defined depending on the height h of a region: the upper or lower border of a region may vary by no more than (2/3)*h (Figure 6). A simple implementation of this grouping test is sketched after Figure 6.

Figure 6: Features for grouping of regions
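As an illustration only (not the authors' code), the pairwise grouping test described above could be implemented as in the following sketch; the Region structure and the use of the first region as the reference for the threshold are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Region:
    x: int  # left border
    y: int  # upper border
    w: int  # width
    h: int  # height

def can_group(ri: Region, rj: Region) -> bool:
    """Pairwise grouping test in the spirit of [Sobottka-99]."""
    # Horizontal distance between the two regions (0 if they overlap horizontally).
    d = max(rj.x - (ri.x + ri.w), ri.x - (rj.x + rj.w), 0)
    # Distance threshold depends on the size of the reference region.
    theta = 5 * max(ri.w, ri.h)
    if d >= theta:
        return False
    # Upper and lower borders must agree within a tolerance of (2/3)*h.
    tol = (2.0 / 3.0) * ri.h
    upper_diff = abs(ri.y - rj.y)
    lower_diff = abs((ri.y + ri.h) - (rj.y + rj.h))
    return upper_diff <= tol and lower_diff <= tol
```

Regions passing this test for more than three neighbours would then be chained into a candidate text line, as described above.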

Text regions are also detected using a top-down approach and merged independently, so that finally the results of the two methods are combined. The method has promising results for book and text covers, for which it was initially designed. However, it is vulnerable to the size and colour of the text and to text embedded in complex backgrounds.

2.2.1.2 Top-down methods

These methods first detect text regions in images using filters and then perform bottom-up techniques inside the text regions. They are able to process more complex images than bottom-up approaches. Top-down methods are further divided into two categories:

• Heuristic methods

These methods usually use heuristic filters in order to detect text.

Malobabic et al. [Malobabic-04] propose an edge-based method for detecting text regions. The magnitude of the symmetrical horizontal difference is calculated for each pixel. Smoothing and blurring filters are applied, followed by morphological operations, in order to connect edges of text regions and discard non-text edges. Finally, the edge map is binarized and bounding boxes are created according to geometrical constraints. The method works for artificial horizontal text.

Zhong et al. [Zhong-00] use the DCT coefficients of compressed JPEG or MPEG files in order to distinguish the texture of textual regions from non-textual regions. The horizontal intensity variation of each 8x8 DCT block is calculated and then thresholding is applied in order to obtain potential text blocks. Morphological operations help to remove isolated noisy blocks and merge isolated text blocks. Finally, text verification and filtering are applied based on the vertical intensity variation. By manipulating features directly in the compressed domain, the resources (computation time and storage) needed for decompressing video sequences are saved. Moreover, this approach proves to be relatively robust to the font of the text, the image resolution, the background complexity and the presence of moving objects. However, since the algorithm uses a local texture measure in DCT blocks, it is vulnerable to the size of characters and the spacing between them.

Xi et al. [Chen-01a] propose an edge-based text detection algorithm that consists of the following steps: (1) edge map generation with the Sobel operator; (2) smoothing and morphological opening in order to connect potential text areas; (3) region decomposition with vertical and horizontal projections, so that one text box is created for each text line; and (4) text block identification using geometrical constraints. The method works for artificial horizontal text.

Perantonis et al. [Perantonis-03, Perantonis-04] propose an edge extraction technique that facilitates the detection of all text as well as all inverted text regions in WEB images. A conditional dilation technique helps to choose text and inverted text objects among all objects. The criterion is the thickness of the objects, which in the case of characters has a restricted value.

Sato et al. [Sato-98] use a simple edge-based text detection algorithm. They apply a 3x3 horizontal differential filter to the entire image with appropriate binary thresholding. If a bounding region detected by the horizontal differential filtering technique satisfies size, fill factor and horizontal-vertical aspect ratio constraints, it is selected for recognition as a text region. The method works for artificial horizontal text.

Du et al. [Du-03] propose a methodology that makes use of four modules to implement a series of processes to extract text regions from video images. The first module, called the multistage pulse code modulation (MPCM) module, is used to locate potential text regions in colour video images. It converts a video image to a coded image, with each pixel encoded by a priority code ranging from 7 down to 0 in accordance with its priority, and further produces a binary thresholded image, which segments potential text regions from the background. The second module, called the text region detection module, applies a sequence of spatial filters to remove noisy regions and eliminate regions that are unlikely to contain text. The third module, called the text box finding module, merges text regions and produces boxes that are likely to contain text. Finally, the fourth module, called the optical character recognition (OCR) module, eliminates the text boxes that produce no OCR output.

Crandall et al. [Crandall-03] propose a method based on the DCT coefficients. Video frames are divided into 8x8 blocks and the DCT coefficients are computed. Some of them are summed and the result is thresholded in order to classify the block as text or non-text; the threshold is related linearly to the average contrast of the frame (a rough sketch of such a block-wise DCT energy test is given at the end of this subsection). Then a text localization step is applied in order to find the minimum bounding boxes. For this purpose, Crandall et al. propose an iterative greedy algorithm. First, connected component analysis is performed on the blocks detected as text and orthogonal bounding rectangles are computed for each component. Then, the bounding rectangles are iteratively refined. Each iteration of the greedy algorithm attempts to increase the criterion G = Pt * (1 - Pnt), where Pt is the percentage of the detected text pixels that lie underneath the rectangle, and Pnt is the fraction of the rectangle’s area covering non-text pixels. During each iteration, each rectangle is visited and one of the following actions is taken:
• leave the rectangle unchanged;
• increment or decrement the rectangle height or width by one block;
• shift the rectangle one block horizontally or vertically;
• rotate by 15 degrees clockwise or counter-clockwise.
The algorithm is designed to detect artificial text with special effects as well as scene text.

Gatos et al. [Gatos-05a, Gatos-05b] propose an efficient binarization and enhancement technique followed by a suitable connected component analysis procedure in order to detect text areas in indoor/outdoor scene images. The image binarization successfully processes scene images having shadows, non-uniform illumination, low contrast and large signal-dependent noise. Connected component analysis is used to define the final binary images that mainly consist of text regions.
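The following sketch illustrates the general idea behind such block-wise DCT texture tests, in the spirit of [Zhong-00] and [Crandall-03] but not a faithful reimplementation of either; the use of total AC energy and the value of thresh_scale are assumptions:

```python
import cv2
import numpy as np

def dct_text_blocks(gray: np.ndarray, thresh_scale: float = 4.0) -> np.ndarray:
    """Mark 8x8 blocks whose AC energy suggests text-like texture.

    Returns a boolean map with one entry per 8x8 block.
    """
    h, w = gray.shape
    h8, w8 = h // 8, w // 8
    energy = np.zeros((h8, w8), dtype=np.float32)
    img = np.float32(gray)
    for by in range(h8):
        for bx in range(w8):
            block = img[by * 8:(by + 1) * 8, bx * 8:(bx + 1) * 8]
            coeffs = cv2.dct(block)
            # Sum of absolute AC coefficients (the DC term is excluded) as texture energy.
            energy[by, bx] = np.abs(coeffs).sum() - abs(coeffs[0, 0])
    # Threshold proportional to the average block energy of the frame,
    # a stand-in for Crandall's contrast-dependent threshold.
    return energy > thresh_scale * energy.mean()
```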

• Machine learning methods

These methods use machine learning techniques in order to detect text.

Li et al. [Li-00a] propose a method based on neural networks. A network is trained to distinguish between text and non-text regions using wavelets as features. The NN classifies the pixels of a sliding window of 16x16 pixels. In the detected area a connected component analysis is applied, so that the text box’s centre, width and height are computed. The skew is also estimated and corrected. This method works for artificial and scene text.

Most methods base the detection algorithm on features related to edges or texture and, as a second stage, apply geometrical constraints to enhance the result. Wolf et al. [Wolf-04] propose a method which takes the geometrical constraints into account directly in the detection phase. The algorithm consists of two steps:


(i) Perform a coarse detection to emphasize text candidates without taking geometrical features into account. This detection is based on the detection of areas containing a high density of vertical strokes.
(ii) For each pixel, calculate geometrical features of its neighbourhood based on the detection results from step 1. Use these features together with the features calculated in step 1 and perform a new, refined detection.

The features used are: (i) the horizontally accumulated first derivative; (ii) the width of the detected mode; (iii) the height of the detected mode; (iv) the difference of the heights of the mode to the left and to the right border; (v) the mean of the gradient values in the mode; (vi) the standard deviation of the gradient values in the mode; (vii) the baseline presence (accumulated differences of the mode widths across several modes in a horizontal neighbourhood). The 7 features are calculated for the four principal orientations of the image, so 4*7 = 28 features are created. An SVM is used for the classification of text or non-text regions. The method works for artificial and scene text; however, covering a wider variety of scene text cases (text in any orientation and skew, etc.) is stated by the authors as future work.

Yan et al. [Yan-03] introduce a new colour image edge operator (CED) to segment the image and obtain the elementary candidate text blocks. First, elementary text detection is applied based on the edges:
i. The original image I1 is processed by CED to get the greyscale edge image I2.
ii. I2 is processed by a longitudinal Sobel operator to get the binary edge image I3.
iii. I3 is processed by morphological methods to get the image I4. Considering the horizontal features of text in video images, the open operator is used to dilate I3 in the horizontal direction and then the close operator is used to erode it in the morphological direction.

Some heuristic rules are applied to enhance the result. Finally, text blocks are located more accurately by a back-propagation artificial neural network trained on Gabor edge features. The method works for artificial horizontal text and is vulnerable to small characters or blurred text.

Ye et al. [Ye-05] propose a coarse-to-fine algorithm to detect text in complex backgrounds. First, in the coarse detection, after the wavelet energy feature is calculated to locate all possible text pixels, a density-based region growing method is developed to connect these pixels into regions, which are further separated into candidate text lines by structural information. Secondly, in the fine detection, with four kinds of texture features extracted to represent the texture pattern of a text line, a forward search algorithm is applied to select the most effective features. Finally, an SVM classifier is used to identify true text from the candidates based on the selected features.

Wu et al. [Wu-05] propose a system of two co-trained SVMs for text detection. The two separate text detectors are trained on disparate feature sets. To avoid the hard labelling task for text detection, they develop an approach which can take advantage of both co-training and text recognition feedback.


The algorithm extracts two feature sets, edge features and colour features, for each image patch. Second, two margin-based classifiers are trained on these two feature sets using only labelled data. Next, the co-training algorithm comes into play with the guidance of Optical Character Recognition (OCR) output: a data point is added to the positive set only when the classifier predicts it as “positive” and OCR also finds text in it. Similarly, a sample is added to the negative set only when it is predicted as “negative” and OCR does not find any text in it. For the edge features a simple Canny operator is used, while the colour features are obtained from the colour distribution computed by the EM algorithm. Features are computed for a sliding window of size 40x20 with 50% overlap.

Li et al. [Li-99a] propose a neural network based algorithm for detecting text. As features they use the mean and the second- and third-order central moments of the wavelet decompositions of frames. A 16x16 window slides over the image 4 pixels at a time; a feature vector is calculated and the window is classified by an ANN as text or non-text. Then, as a post-processing step, Li et al. use the edge projection profiles in order to separate different text lines bounded by the same box. This is very useful for the tracking algorithm. The detection algorithm works for artificial and scene text.

Lienhart et al. [Lienhart-02] propose a method based on neural network classification. As features they use the complex-valued edge orientation image E(x,y), which is defined as follows:

E(x, y) = \left( \sum_{c \in \{r,g,b\}} \frac{\partial I_c(x, y)}{\partial x}, \; \sum_{c \in \{r,g,b\}} \frac{\partial I_c(x, y)}{\partial y} \right)    (1)

where I(x,y) = (Ir(x,y), Ig(x,y), Ib(x,y)) is the RGB input image. E maps all edge orientations between 0 and 90 degrees, and thus distinguishes only between horizontal, diagonal and vertical orientations. Lienhart et al. propose a fixed-scale text detector which consists of a complex-valued neural network with a hyperbolic tangent activation function, fed by a 20x10 edge orientation region in E (one link from each complex-valued edge pixel to each complex-valued input neuron). The hidden layer is aggregated into one real-valued output neuron with range [-1, 1]. This is effectively a confidence value of being text or non-text, referring to the 20x10 window of the image. The raw fixed-scale text detector is applied at all scales using a scale-down factor of 1.5. In order to recover initial text bounding boxes, the response images at the various scales must be integrated into one saliency map of text. Text locations stick out as correct hits at multiple scales, while false alarms appear less consistently over multiple scales. The extraction of the bounding boxes includes:
• a region-grow algorithm demanding saliency (S) values over a threshold;
• use of horizontal and vertical projections to improve the result (Figure 7);
• estimating text colour and background colour, using:
  o a colour histogram covering the four centre rows of the text box;
  o a colour histogram covering the two rows directly above and underneath the text box (four rows together).

Figure 7: Horizontal and vertical projections
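A possible computation of the edge orientation image of Eq. (1) is sketched below; the use of a 3x3 Sobel kernel as the derivative operator and the BGR channel ordering are assumptions, not details given in the text:

```python
import cv2
import numpy as np

def edge_orientation_image(bgr: np.ndarray) -> np.ndarray:
    """Complex-valued edge orientation image in the spirit of Eq. (1):
    real part = sum of horizontal derivatives over the colour channels,
    imaginary part = sum of vertical derivatives."""
    img = np.float32(bgr)
    dx = sum(cv2.Sobel(img[:, :, c], cv2.CV_32F, 1, 0, ksize=3) for c in range(3))
    dy = sum(cv2.Sobel(img[:, :, c], cv2.CV_32F, 0, 1, ksize=3) for c in range(3))
    return dx + 1j * dy
```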

Chen et al. [Chen-01a] first apply a 3×3 horizontal and vertical Sobel filter (Figure 8) on a video frame to obtain two edge maps of the frame: vertical and horizontal. Non-maxima suppression is then used to thin the edges. Isolated edge points in the edge maps are then filtered out by a de-noising process.

Figure 8: (a) Horizontal Sobel mask; (b) Vertical Sobel mask

Several heuristic rules are employed to embody the edge density:
(i) each scan line in the region should contain at least 6 edge points;
(ii) the edge density on each scan line in the region should be larger than 6 to 20;
(iii) the height of the region should be larger than 6 pixels.
Some false candidates are eliminated by using a horizontal edge alignment confirmation, since horizontally aligned text strings have many horizontal edges along the upper and bottom boundaries of the text area. After the above process, two kinds of problems still remain: false candidates and inaccurate left and right text area boundaries, both arising from non-text textures. For this, Chen et al. apply a shape suppression using a Bayesian classifier. Each feature vector consists of four elements:

v = (N/H, \; D/H, \; V_N/H, \; H_N/H)^T    (2)

where H is the pixel height, N is the pixel number of the sample, D is the height of the gravity centre of the edge, and V_N and H_N are the pixel numbers of the sample’s vertical and horizontal projections.

Clark et al. [Clark-00] present five statistical measures (Figure 9) geared towards identifying specific properties of visible text that can differentiate it from most other parts of an image. The measures are applied to each input image and a neural network uses them to determine likely text regions in the image.

Measure M1: The variance of the grey level histogram H over a circular neighbourhood of radius 3 (total area N = 29 pixels) at each pixel is used as a measure of how much local information there is:

M_1 = \sum_{i=1}^{N} (H(i) - \bar{H})^2    (3)

where \bar{H} is the mean intensity of the histogram H.

Measure M2: Text regions have a high density of edges. This density is measured in a circular neighbourhood of radius 6 centred at each pixel by summing all edge magnitudes located with a Sobel filter:

M_2 = \sum_{i=1}^{M} E(i)    (4)

where E(i) is the edge magnitude at pixel i, and M = 113 is the number of pixels in the window.

Measure M3: Chen and Chen’s continuous spatial density assumption (given a flatbed scanner view of a document) states that the ratio of text to non-text intensity grey levels should not vary greatly as we pass over a text region. The distance between the histogram H and its eight-connected neighbouring histograms Gi is computed as:

M_3 = \sum_{i=1}^{8} \sum_{j=1}^{B} (H(j) - G_i(j))^2    (5)

where B is the number of histogram bins. By evaluating the difference between one region and its neighbours, the stability of the spatial density is found. Measure M4: In text regions there is a tendency for the magnitude of edges in one direction to be matched by edges in the opposite direction of equal magnitude. More specifically, each edge of a character is likely to be accompanied by an edge in the opposite direction, found on the opposite side of the text character or stroke. We draw the hypothesis that over a text region the histogram of the edge angles has rotational symmetry. Hence, M4 is a measure of the strength of asymmetry using a localised edge angle histogram, A:

M_4 = \frac{1}{E} \sum_{\theta=0}^{\pi} (A(\theta) - A(\theta + \pi))^2    (6)

where A(θ) is the total magnitude of edges in direction θ, and E is the overall edge magnitude, which normalises the result. θ is incremented in steps of π/8, which was found to be an adequate resolution. This is performed across the image in a circular neighbourhood of radius 16 centred at each pixel.

Measure M5: We also expect edges in a text region to be well distributed. The first four measures respond in the same way to straight image features as to coarse or curved features. This measure is employed to reject those areas of the image with tight distributions of edges corresponding to straight ramps, canals or ridges in the image. It examines how evenly spread the edge magnitudes are over all the directions:

M_5 = \sum_{\theta=0}^{2\pi} (A(\theta) - \bar{A})^2    (7)

where \bar{A} is the average magnitude over all the directions.


Figure 9: Example image and visualization of the results of applying each of the five statistical measures. (a) Original Image; (b) Measure M1 output; (c) Measure M2 output; (d) Measure M3 output; (e) Measure M4 output and (f) Measure M5 output
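For illustration, the first two of these measures could be computed roughly as follows; the square windows approximating the circular neighbourhoods, the histogram bin count and the sampling step are simplifications and assumptions, not details of [Clark-00]:

```python
import cv2
import numpy as np

def clark_m1_m2(gray: np.ndarray, bins: int = 32):
    """Approximate computation of Clark's M1 (local histogram variance)
    and M2 (local edge density) on a coarse pixel grid."""
    h, w = gray.shape
    edges = np.abs(cv2.Sobel(gray, cv2.CV_32F, 1, 0)) + np.abs(cv2.Sobel(gray, cv2.CV_32F, 0, 1))
    m1 = np.zeros((h, w), np.float32)
    m2 = np.zeros((h, w), np.float32)
    step = 4
    for y in range(16, h - 16, step):
        for x in range(16, w - 16, step):
            # M1: variance of the local grey-level histogram (radius ~3).
            patch = gray[y - 3:y + 4, x - 3:x + 4].ravel()
            hist, _ = np.histogram(patch, bins=bins, range=(0, 256))
            m1[y, x] = np.sum((hist - hist.mean()) ** 2)
            # M2: sum of edge magnitudes in a radius ~6 neighbourhood.
            m2[y, x] = edges[y - 6:y + 7, x - 6:x + 7].sum()
    return m1, m2
```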

Chen et al. [Chen-03] propose a learning-based approach which is based on two stages, text region extraction and text verification. Text region extraction consists of the following steps:
(i) multiple intensity frame integration is performed by computing the average image of consecutive frames;
(ii) edges are detected in the vertical and horizontal orientations respectively with Canny operators;
(iii) temporal edge information is integrated by keeping only edge points that appear in two consecutive average images;
(iv) vertical and horizontal edges are dilated respectively into clusters (vertical operator 5×1, horizontal operator 3×6);
(v) vertical and horizontal edge clusters are integrated by keeping the pixels that appear in both the vertical and horizontal dilated edge images.
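A rough sketch of this extraction stage is given below; at least three consecutive greyscale frames are assumed, a single Canny pass replaces the separate vertical/horizontal edge detection of the original, and the Canny thresholds are arbitrary:

```python
import cv2
import numpy as np

def chen_candidate_regions(frames):
    """Simplified candidate-region extraction in the spirit of [Chen-03].
    `frames` is a list of consecutive greyscale (uint8) frames."""
    # (i) average consecutive frames in pairs
    averages = [cv2.addWeighted(frames[i], 0.5, frames[i + 1], 0.5, 0)
                for i in range(len(frames) - 1)]
    # (ii)+(iii) detect edges and keep only those present in two consecutive averages
    edge_maps = [cv2.Canny(a, 100, 200) for a in averages]
    stable = cv2.bitwise_and(edge_maps[0], edge_maps[1])
    # (iv) dilate vertically (5x1) and horizontally (3x6)
    vert = cv2.dilate(stable, np.ones((5, 1), np.uint8))
    horiz = cv2.dilate(stable, np.ones((3, 6), np.uint8))
    # (v) keep pixels present in both dilated edge images
    return cv2.bitwise_and(vert, horiz)
```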


Further refinement is applied on each segmented region by an iterative procedure consisting of the following heuristic process:
(i) Fill-factor check. If the fill-factor (the density of the region in its smallest rectangular boundary) of the given region is less than 70%, this region is refined in step 2.
(ii) Segmentation refinement, by first finding a line x using Otsu’s thresholding method on the Y-axis projection of the region to be refined. If the length of line x is less than 65% of the longest line in this region, the region is segmented at line x and the procedure goes back to step 1.
(iii) Baseline refinement, for accurately locating the top and bottom baselines of a text string. The top and bottom baselines are moved towards the centre until the fill-factor is equal to or greater than 70%.
For the verification of the text regions a more complex method is applied, usually involving multi-layer perceptrons or a support vector machine. The features used are: (i) greyscale spatial derivative features; (ii) distance map features; (iii) constant gradient variance features; and (iv) DCT coefficients. Examples of text detection with Chen’s method are shown in Figure 10.

Figure 10: Text detection examples with Chen’s method.
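For the verification stage, any off-the-shelf classifier can be plugged in once the region features have been computed; the following sketch assumes precomputed feature vectors and uses an SVM with assumed kernel parameters (it is not the classifier configuration of [Chen-03]):

```python
import numpy as np
from sklearn.svm import SVC

def train_text_verifier(features: np.ndarray, labels: np.ndarray) -> SVC:
    """Train a text/non-text verifier on precomputed region features
    (e.g. greyscale derivatives, distance map, constant gradient variance
    and DCT coefficients stacked into one vector per candidate region)."""
    clf = SVC(kernel="rbf", C=10.0, gamma="scale", probability=True)
    clf.fit(features, labels)
    return clf

# Usage sketch: keep only the candidate boxes the verifier accepts.
# verified = [box for box, f in zip(boxes, feats) if clf.predict([f])[0] == 1]
```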


2.2.1.3 Text tracking

Tracking of text in video is a research area which has not been studied extensively, although it is very useful for text enhancement, detection verification and speed-up. In the tracking procedure a text occurrence is tracked through different frames, so that finally a set of text images is collected for one single text occurrence. Having this information, we can reject the text if it is contained in only one frame (assuming that the tracking process can successfully track the existing kinds of text), or enhance its quality using the redundant temporal information from the different frames. Moreover, in cases where text is occluded in different frames, text tracking can help recover the original image. Finally, the whole procedure is sped up, since the stage of text detection can be performed periodically, followed by the tracking of the found text boxes, which is considered to be faster.

Lienhart et al. (1995) [Lienhart-95] considered the text as either stationary or linearly moving; even stationary text may move by some pixels around its original position from frame to frame. Consequently, Lienhart performed motion analysis in order to detect corresponding character candidate regions in consecutive frames. Motion is estimated by means of block matching, since block matching is suitable for rigid objects and characters are assumed to be rigid, changing neither their shape, orientation nor colour. Moreover, block matching is very popular and is used for motion compensation in international video compression standards such as H.261 and MPEG. The matching criterion is the minimum mean absolute difference. The mean absolute difference (MAD) is defined as:

MAD(d_1, d_2) = \frac{1}{|R|} \sum_{(x,y) \in R} \left| g(x, y) - g(x + d_1, y + d_2) \right|    (8)

where g is the input image and R specifies the block for which the translation vector has to be calculated. The displacement estimate (d1, d2) for block R is given as the displacement where the MAD value is minimal. The search area is restricted to |d1|, |d2| ≤ search_range and is derived from the speed of fast-scrolling credit titles.

The block R is selected by means of the following algorithm. The input image is binarized (background = black, rest = white) and each white pixel is dilated by a circle of a specified radius. Each connected cluster is framed in a rectangle and defined as a block R. If the fill factor is above a certain threshold, the block is used for motion analysis. If the fill factor is below the threshold, the block is divided recursively into smaller blocks until the fill factor for a resulting block exceeds the threshold. For each resulting block which meets the required fill factor, block-matching motion analysis is performed. Blocks without an equivalent in the subsequent frame are discarded. Also, blocks which have an equivalent in the subsequent frame but show a significant difference in their average grey tone intensity are discarded. The resulting image is passed on to the next segmentation step. An example of this algorithm is given in Figure 11.

Antani et al. (1999) [Antani-99] and Gargi et al. (1999) [Gargi-99] utilize motion vectors in an MPEG-1 bit stream in the compressed domain for text tracking, based on the methods of Nakajima et al. (1998) [Nakajima-98] and Pilu (1998) [Pilu-98]. This method is implemented on the P and I frames of MPEG-1 video streams. Some pre-processing, such as checking for spatial inconsistency and checking the number of significant edges in the motion vector, is first performed. The original bounding box is then moved by the sum of the motion vectors of all the macro-blocks that correspond to the current bounding box. The matching results are refined using a correlation operation over a small neighbourhood of the predicted text box region.
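As an illustration of Eq. (8), a brute-force block-matching search could be written as in the following NumPy-based sketch; the search range and the handling of frame borders are assumptions:

```python
import numpy as np

def mad(block: np.ndarray, frame_next: np.ndarray, x: int, y: int, d1: int, d2: int) -> float:
    """Mean absolute difference of Eq. (8) between a block at (x, y) and the
    candidate position (x+d1, y+d2) in the next frame."""
    h, w = block.shape
    if y + d2 < 0 or x + d1 < 0:
        return float("inf")
    cand = frame_next[y + d2:y + d2 + h, x + d1:x + d1 + w]
    if cand.shape != block.shape:       # candidate falls outside the frame
        return float("inf")
    return float(np.mean(np.abs(block.astype(np.float32) - cand.astype(np.float32))))

def best_displacement(block, frame_next, x, y, search_range=8):
    """Exhaustive search returning the (d1, d2) that minimises the MAD."""
    best, best_d = float("inf"), (0, 0)
    for d2 in range(-search_range, search_range + 1):
        for d1 in range(-search_range, search_range + 1):
            score = mad(block, frame_next, x, y, d1, d2)
            if score < best:
                best, best_d = score, (d1, d2)
    return best_d
```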

Figure 11: Example of the different steps of Lienhart’s tracking algorithm: (a) image after the split-merge algorithm; (b) image after binarization and dilation; (c) result after applying motion analysis to two consecutive frames

Li et al. (1999) [Li-99b] also use temporal information in order to enhance the image quality using multi-frame integration. Every pixel is modelled as a random process: if the observation is a background pixel, it is recorded as a random process B(t), and if it is a text pixel, as T(t). In digital video, noise is often introduced by digitization and by anti-aliasing when subsampling. Therefore, the observed process actually consists of two parts: one is the desired pixel value (T’(t) or B’(t)) and the other is the noise N(t):

T(t) = T’(t) + N(t)
B(t) = B’(t) + N(t)

The noise N(t) can be viewed as a Gaussian random process satisfying N(t) ~ N(0, σ²). Then the means of T(t) and B(t) can be calculated as:

m_T(t) = E(T(t)) = E(T’(t) + N(t)) = E(T’(t))
m_B(t) = E(B(t)) = E(B’(t) + N(t)) = E(B’(t))


Generally, all text pixels in the same text string have the same intensity value. After the noise is smoothed out, m_T(t) = E(T’(t)) = μ_T, where μ_T is a constant. The distribution of the background pixels is much more complex, since there is no prior knowledge about it. However, when text moves over the background (or the text is static while the background moves), the background can be regarded as a Gaussian random process, B’(t) ~ N(μ_B, σ_1²). Therefore, m_B(t) = E(B’(t)) = μ_B. μ_T and μ_B have to be different so that the text is readable. Therefore, if the different occurrences of a text are averaged, we obtain a picture with almost two colours, μ_T and μ_B (Figure 12).

Figure 12: (a) Frame 325; (b) Frame 337; (c) Frame 349; (d) Frame 360; (e) averaging over multiple frames; (f) Text separation using global binarization
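A minimal sketch of this multi-frame integration, assuming the text blocks have already been registered and stacked into a single array, is:

```python
import cv2
import numpy as np

def average_and_binarize(text_blocks: np.ndarray) -> np.ndarray:
    """Average the registered occurrences of a text block over time and
    separate text from background with a global (Otsu) threshold, as
    illustrated in Figure 12(e)-(f). `text_blocks` is a (T, H, W) uint8 array."""
    avg = np.uint8(text_blocks.astype(np.float32).mean(axis=0))
    _, binary = cv2.threshold(avg, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```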

Li uses the affine model in order to register the different text blocks. The affine model described in the next equation can be used for image registration when displacement, rotation and zooming transformations are present:

\vec{P} = D\,\vec{Q} + \vec{d}    (9)

G G where P = ( X , Y ) , is the point in the reference text block, Q = (U,V) is the ⎡a b ⎤ corresponding point in the current text block, D = ⎢ ⎥ is the deformation matrix and ⎣c d ⎦ G ⎡ΔX⎤ G d = ⎢ ⎥ is the displacement vector. In order to estimate the 6 parameters in D and d , we ⎣ΔY⎦ need to find at least 3 corresponding points between the two text blocks. However, when the text movement is complex selection and tracking of feature points is very difficult. Therefore the method is simplified by setting the deformation matrix D to be the Identity matrix. Then the affine model becomes a pure translational model, in which only two parameters (ΔX, ΔY) are to be estimated. The pure translational model can register text blocks well when text has the simple motion (rigid, linearly). Under the assumption of pure translational model, Li uses SSD (Sum of Square Difference) based image matching to register text blocks. Suppose the intensity matrix of the reference text block is I(x,y). Its corresponding position can be found in the current frame by searching

23

over a range W and the position having the minimum SSD will be returned as the matched position. If the motion of the text is more complex, text tracking will not work. Therefore a confidence measure is developed in order to decide whether the tracking process was successful. Intuitively, Mean Square Error (MSE) between two text blocks can be regarded as a dissimilarity measure.The mean of all MSE between tracked text blocks and the reference block (usually the first block) is used as confidence measure: mr=E(MSE) (10) To use this measure two thresholds are considered th1 and th2. If mr is very small, mr th2 then we can say text is not correctly tracked and no averaging is conducted. For the mr is between th1and th2, other confidence measures need to be developed, as motion trail. The motion trail of text in video is defined as the temporal ordering of the center points of the tracked text blocks: (X1,Y1), (X2,Y2),…, (XN,YN) where (Xk,Yk) is the centre point coordinates for the kth text block. To quantitatively measure the confidence of the motion trail, we use a straight line y = a * x + b to approximate the motion trail. The Least Mean Square error is used as a measure for approximation: 1 N err = ∑ (Yi − a * X i − b) 2 (11) N i =1 The whole decision rule can be described as: If (mr < th1 or (mr < th2 and err < th3)) Do averaging operation else Give up averaging operation. In [Li-99a] Li uses a stabilization process in order to refine the position of the text boxes tracked through complex motion. The stabilization process can be implemented efficiently in the following way: (i) To matched text position s = (x1, y1, x2, y2), generate a slightly larger text block s0 = (x1-δ, y1-δ, x2+δ, y2+δ). The real text position will be included within s0. (ii) Generate the edge map of s0 by calculating the Canny edges. Τhe edge map is used instead of thresholding the image to avoid the difficulties of identifying normal text and inverse text. (iii)Apply a horizontal smearing process so the edge map can be grouped to form a text block. (iv) Extract connected components and their positions s’ = (x1’, y1’, x2’, y2’) to represent the refined text position. The scheme described above works well when text is moving on a relatively clean background. When the background becomes more complex the stabilization process has to stop. When a large line has to be tracked the computational cost of the SSD based module is considerable. To reduce the computational complexity Li et al. perform matching from coarse to fine in a hierarchical fashion on a Gaussian image pyramid (Figure 13). For a frame It of size w x h, a Gaussian pyramid Gtl is formed by combining


To reduce the computational complexity, Li et al. perform matching from coarse to fine in a hierarchical fashion on a Gaussian image pyramid (Figure 13). For a frame $I_t$ of size $w \times h$, a Gaussian pyramid $G_t^l$ is formed by combining several reduced-resolution Gaussian images of frame $I_t$, where $t$ is the frame number and $l \in \{0, 1, 2, \ldots, N\}$ denotes the level in the pyramid. The size of the frame at level $i$ is $\frac{w}{2^i} \times \frac{h}{2^i}$. Matching starts at the coarsest resolution (level $N$), and each level contributes to determining the position of the match at the next finer level. The search for the minimum SSD starts at level $N$ over a window of size $S = (2s+1) \times (2s+1)$. Suppose the matching point found is $P_N(x, y)$. Then, at level $N-1$, the search for the minimum SSD is conducted around pixel $P_{N-1}(2x, 2y)$ over the same window size $S = (2s+1) \times (2s+1)$. This process continues until the finest resolution (level 0) is reached. Although the maximum displacement supported by the search at each level is only $s$, which is much smaller than what a one-step search would require, the displacement is doubled after each level, so the total displacement reached at the finest level is $s \cdot 2^N$. The number of pyramid levels depends on the size of the text block: if the text line is small enough, no pyramid is formed and matching is conducted directly at the original image scale.

Figure 13: Multi-resolution based image matching. (a) Text block pyramid in block 650, (b) image pyramid formed with frame 651
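A minimal sketch of this coarse-to-fine search, under the assumption that the ssd_match() helper from the previous listing is available, and using simple 2×2 averaging as a crude stand-in for proper Gaussian filtering:

import numpy as np

def downsample(image):
    # Halve the resolution by averaging 2x2 blocks (a crude substitute for
    # Gaussian low-pass filtering followed by subsampling).
    h, w = (image.shape[0] // 2) * 2, (image.shape[1] // 2) * 2
    img = image[:h, :w].astype(np.float64)
    return (img[0::2, 0::2] + img[1::2, 0::2] +
            img[0::2, 1::2] + img[1::2, 1::2]) / 4.0

def build_pyramid(image, levels):
    # pyramid[0] is the finest level, pyramid[levels] the coarsest.
    pyramid = [image.astype(np.float64)]
    for _ in range(levels):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid

def coarse_to_fine_match(ref_block, frame, x0, y0, levels, s):
    # Search over a (2s+1) x (2s+1) window at the coarsest level, then
    # refine around the doubled position at each finer level.
    ref_pyr = build_pyramid(ref_block, levels)
    frm_pyr = build_pyramid(frame, levels)
    x, y = x0 // (2 ** levels), y0 // (2 ** levels)
    for level in range(levels, -1, -1):
        (dx, dy), _ = ssd_match(ref_pyr[level], frm_pyr[level], x, y, s)
        x, y = x + dx, y + dy
        if level > 0:
            x, y = 2 * x, 2 * y
    return x, y   # matched top-left corner at the original image scale

With N levels and window parameter s, the sketch can recover displacements of up to roughly s * 2**N pixels, while each individual search window stays small.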

Li et al. also propose the use of temporal tendency information in order to reduce the computational complexity. If, for example, the text lines are found to be scrolling up (or down), it can be deduced that new text lines will appear at the bottom (or top) of the video frame. When text crosses the frame horizontally, there is no way to track the whole text line, since the virtual text line is much larger than the frame. In this case, the line is separated into words and the words are tracked.

2.2.2 Text Segmentation

Although most text with a simple background and high contrast can be correctly localized and extracted, poor-quality text can be difficult to extract. Text segmentation includes quality enhancement and image binarization. Text enhancement techniques can be divided into two categories: single frame-based and multiple frame-based. Many thresholding techniques have been developed for still images; however, these methods do not work well for video sequences. Based on the fact that text usually spans several
frames, various approaches that use a tracking operation over consecutive frames have been proposed to enhance the text quality. Such enhancement methods can be effective when the background movement differs from the text movement. Sato et al. [Sato-98] used a linear interpolation technique to magnify small text to a higher resolution for commercial OCR software. Text regions are detected and localized using Smith and Kanade's method [Smith-95], and sub-pixel linear interpolation is applied to obtain higher-resolution images. Based on the fact that complex backgrounds usually include some movement, while video captions are relatively stable across frames, the image quality is improved by multi-frame integration of the resolution-enhanced frames. Assuming that captions have high intensity values, such as white pixels, Sato employs a technique that minimizes the variation of the background by using a time-based minimum pixel value search (Figure 14). (For black characters, the same technique can be employed using a time-based maximum search.) With this technique, an enhanced image is built from the minimum pixel value at each location over the frames containing the caption. This method is unable to clean the background when both the text and the background are moving at the same time.

Figure 14: Improving image quality by multi-frame integration
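A minimal sketch of the time-based minimum (or maximum) pixel value search, assuming grayscale frames that have already been resolution-enhanced and aligned on the caption (names are illustrative):

import numpy as np

def integrate_caption_frames(frames, bright_text=True):
    # Per-pixel minimum over the frames containing the caption suppresses a
    # moving background behind bright (white) captions; for dark characters
    # the per-pixel maximum is used instead.
    stack = np.stack([f.astype(np.float64) for f in frames], axis=0)
    return stack.min(axis=0) if bright_text else stack.max(axis=0)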

After the image enhancement stages, four specialized character-extraction filters based on correlation are applied, and a recognition-based segmentation method is used for character segmentation. This means that intermediate character recognition results are used to improve the character segmentation (Figure 15). The method takes about 120 seconds to process a 352×242 frame on a MIPS R4400 200 MHz processor, and almost doubles the recognition rate of a conventional OCR technique that performs binarization based on simple thresholding, character extraction using projection analysis, and matching by correlation. Li et al. [Li-99a], [Li-00b] presented several approaches for text enhancement. For example, they use the Shannon interpolation technique to increase the resolution of video images (Figure 16, Figure 17). The resolution is increased using an extension of the Nyquist sampling theorem, and it is determined whether the text is normal or inverse by comparing it with a global threshold and the background colour.
Niblack's [Niblack-86] adaptive thresholding method is then used to filter out non-text regions. They also investigated the relationship between OCR accuracy and image resolution, and found that the best OCR results were obtained with an interpolation factor of 4.
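Niblack's method thresholds each pixel at T(x, y) = m(x, y) + k * s(x, y), where m and s are the local mean and standard deviation. The following unoptimised sketch uses a 15×15 window and k = −0.2, which are common choices rather than values taken from the works cited here:

import numpy as np

def niblack_binarize(image, window=15, k=-0.2):
    # Pixel-wise threshold T = local mean + k * local standard deviation,
    # computed over a (window x window) neighbourhood.
    img = image.astype(np.float64)
    pad = window // 2
    padded = np.pad(img, pad, mode='reflect')
    out = np.zeros(img.shape, dtype=bool)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            patch = padded[y:y + window, x:x + window]
            out[y, x] = img[y, x] > patch.mean() + k * patch.std()
    return out   # True = bright (text) pixels, False = background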

Figure 15: Examples of the approach of Sato et al.: (a) original image, (b) binary image, (c) character extraction result using the conventional technique, (d) result of sub-pixel interpolation and multi-frame integration, (e) integration of the four character-extraction filters, (f) binary image, (g) result of character segmentation.

Figure 16: Comparison of OCR results with (a) no enhancement, (b) zeroth-order interpolation, (c) Shannon interpolation.

Li and Doermann [Li-99c] used a multi-frame text enhancement technique. To ensure that the text blocks are correctly tracked, the mean square errors between consecutive text blocks and the motion trail information are used. An anti-aliasing process is performed by low-pass filtering the image followed by subsampling. This lowers the resolution but increases the number of colours, giving an impression of contrast and detail. The resolution is then increased using an extension of the Nyquist sampling theorem, in which a sampled image is expressed as a weighted sum of delayed Sinc functions. In order to avoid this time-consuming procedure, Li uses a frequency-based approach: the image resolution is increased by copying each pixel to its neighbouring pixels by the amount of the desired increase in resolution. The two-dimensional DFT is then computed and multiplied by a mask matrix that zeros the high-frequency components.
An inverse two-dimensional DFT renders the image at the higher resolution. Figure 17 shows an example of the method with a factor of 4.

Figure 17: (a) Anti-aliased image showing blurred edges and a large number of foreground colours, (b) image zoomed by Shannon interpolation.
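A minimal sketch of this frequency-based resolution increase (the replication factor and the exact shape of the low-pass mask are illustrative assumptions):

import numpy as np

def frequency_upscale(image, factor=4):
    img = image.astype(np.float64)
    # 1. Copy each pixel to its neighbours, i.e. replicate it factor x factor times.
    replicated = np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)
    # 2. Two-dimensional DFT, with the zero frequency shifted to the centre.
    spectrum = np.fft.fftshift(np.fft.fft2(replicated))
    # 3. Mask that zeros the high-frequency components introduced by replication.
    h, w = spectrum.shape
    ch, cw = h // 2, w // 2
    mask = np.zeros((h, w))
    mask[ch - h // (2 * factor):ch + h // (2 * factor),
         cw - w // (2 * factor):cw + w // (2 * factor)] = 1.0
    # 4. Inverse two-dimensional DFT renders the image at the higher resolution.
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * mask)))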

After resolution enhancement, image binarization has to be performed. Before that, the text must be classified as normal text or inverse text. A global threshold Th is calculated and compared with Bg, the dominant peak of the image histogram, which is taken to be the colour of the background. Specifically, if Th > Bg then it is normal text; if Th