

Improved Text Overlay Detection in Videos Using a Fusion-Based Classifier

Belle L. Tseng*, Ching-Yung Lin*, Dongqing Zhang** and John R. Smith*
*: IBM T.J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532
**: Dept. of Electrical Engineering, Columbia Univ., 500 W120th St., NY, NY 10027

ABSTRACT

In this paper, classifier fusion is adopted to demonstrate improved performance of our text overlay detection in the NIST TREC-2002 Video Retrieval Benchmark. A normalized ensemble fusion is explored to combine two text overlay detection models. The fusion incorporates normalization of confidence scores, aggregation via a combiner function, and an optimized selection. The proposed fusion classifier performed best among the 11 detectors submitted to the NIST text overlay detection benchmark, and its average precision is 227% of that of the second best detector.

1. INTRODUCTION

The growing amount of digital video is driving the need for more effective methods for indexing, searching, and retrieving video based on its content. Recent advances in content analysis, feature extraction, and classification are improving capabilities for effectively searching and filtering digital video content. Videotext detection has been identified as one of the key contributing components. Text can appear in video as either scene text or text overlay. Scene text appears as part of the captured scene, like store signs, street tags, and shirt logos. Text overlays are superimposed on top of the scene, and include video titles, news anchor names, and sport scores.

There has been much prior work on text overlay detection and extraction. Sato et al. [7] use a spatial differential filter to localize the text region, and size/position constraints to refine the detected area. Their algorithm targets a specific domain, such as CNN news. Lienhart et al. [6] give an approach using texture, color segmentation, contrast segmentation, and motion analysis. In their system, color segmentation by region merging is done on the global video frame without a localization process, so the accuracy of the extraction may rely heavily on the segmentation accuracy. Li et al. [5] use the Haar wavelet to decompose the video frame into subband images, and a neural network classifies image blocks as text or non-text based on the subband images. Zhong et al. [14] employ DCT coefficients to localize text regions in MPEG video I-frames. They do not use color information or layout verification, so false alarms are inevitable for texture-like objects, such as buildings and crowds. Bertini et al. [1] present a text location method using salient corner detection; without a good verification process, salient corner detection may produce false positives in cluttered backgrounds. In summary, we regard the following modules as essential to an accurate videotext extractor: a texture model to localize the text regions; a color composition model to avoid background disturbance; layout or shape analysis to verify and


refine the localized candidate regions; and motion models to verify temporal consistency.

We propose a text overlay classifier that combines two algorithms, one by Zhang and Chang [13] and one by Shim, Dorai and Bolle [3][8], and merges their detection results using a normalized ensemble fusion method to generate the final classifier output. The first algorithm uses texture and motion energy features to generate candidate text overlay regions. For each candidate region, color layering is used to generate several hypothetical binary images. This method addresses the limitations of previous systems by avoiding background disturbance and thus reducing false alarms. The second algorithm extracts and analyzes candidate regions that may contain text characters. By separating each character region from its surroundings, the technique employs successive refinement steps to retain only those regions that contain characters. The final classification output is produced by fusing the two detectors.

The paper is organized as follows. Section 2 describes the detection system, which includes the two text overlay detectors and the classifier fusion framework. Section 3 gives the detection experiments and fusion results. Section 4 concludes with a summary.

2. SYSTEM DESCRIPTION

The proposed system comprises three main components: two overlay text detectors and a fusion module. The block diagram of the proposed system is shown in Figure 1. The details of the fusion method are illustrated in Figure 2.

Figure 1: Text Overlay Detection System Block Diagram Combining Two Detector Methods and a Fusion Module.




2.1 Texture and Motion Analysis Method

Zhang and Chang proposed a texture-and-motion analysis detector [13]. It first conducts macroblock-based texture and motion energy extraction from compressed-domain features, including DCT coefficients and motion vectors. The texture energies in the horizontal and vertical directions are extracted from the 8x8 DCT coefficient blocks of I-frames. A motion energy map, which represents the amplitude of the motion vectors of each macroblock in B- and P-frames, is used to characterize the motion intensity of regions. These features are used to generate candidate text overlay regions, which are then filtered in the next stage. For each candidate region, color layering is performed in the HSV color space to generate several hypothetical binary images; afterwards a grouping algorithm is applied to each binary image to group the character blocks, which are extracted using connected component analysis. Finally, a layout analysis algorithm is applied to verify the layout of these character blocks, and a text region is identified if the character blocks can be aligned to form words or sentences.
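To make the compressed-domain feature extraction concrete, the following sketch (our illustration, not the code of [13]) computes horizontal and vertical texture energies from an 8x8 block of DCT coefficients and a motion energy map from macroblock motion vectors; the exact energy definitions, array layouts, and thresholds are illustrative assumptions.

import numpy as np

def texture_energy(dct_block):
    """Horizontal/vertical texture energy of one 8x8 DCT coefficient block.
    Illustrative definition: sum of squared AC coefficients along the first
    row (horizontal frequencies) and the first column (vertical frequencies)."""
    dct_block = np.asarray(dct_block, dtype=float)
    horiz = float(np.sum(dct_block[0, 1:] ** 2))
    vert = float(np.sum(dct_block[1:, 0] ** 2))
    return horiz, vert

def motion_energy_map(motion_vectors):
    """Amplitude of each macroblock motion vector in a B- or P-frame.
    motion_vectors has shape (rows, cols, 2), holding (dx, dy) per macroblock."""
    return np.linalg.norm(np.asarray(motion_vectors, dtype=float), axis=-1)

def candidate_text_mask(horiz, vert, motion, t_texture=1000.0, t_motion=2.0):
    """Flag macroblocks that are strongly textured yet nearly static;
    the two thresholds are placeholders, not values from the paper."""
    return (np.asarray(horiz) + np.asarray(vert) > t_texture) & (np.asarray(motion) < t_motion)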

2.2 Region Boundary Analysis Method

Shim, Dorai, and Bolle [3][8] present an automatic technique for extracting text from video. The system extracts candidate regions containing text, analyzes the text regions into blocks, lines, words, and characters, and finally recognizes the characters using OCR. The first step is to locate the regions in an image that may contain text characters. They developed a fast and efficient chain-code-based algorithm, called the generalized region labeling algorithm, which extracts boundary and other region features and yields non-overlapping homogeneous regions from each input image. The second step verifies the existence of characters within the candidate regions, aiming to remove background within the regions while preserving the text: a local thresholding operation is applied in each candidate region to separate the text from its surroundings, resulting in a positive and a negative image. Subsequently, region boundary analysis is performed to sharpen and separate character region boundaries. In the third step, the candidate character regions are subjected to a verification step where they are tested for typical text characteristics, which further reduces many noisy non-text regions. Next, consistency between neighboring text regions is verified to eliminate false positive regions; this includes ensuring that adjacent regions on one line form a sequential text string. Finally, in the refinement step, text regions from five consecutive frames are analyzed together to add characters missing in individual frames and to delete incorrect regions posing as text. These steps result in a robust algorithm general enough to detect and extract text with variations in font size, style, and contrast, and against complex backgrounds.
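As an illustration of the local thresholding step above (a minimal sketch under our own assumptions, not the published implementation), each grayscale candidate region can be binarized against a locally computed threshold to produce the positive and negative images; using the region mean as the threshold is an assumed choice.

import numpy as np

def local_threshold(region):
    """Split one grayscale candidate region into a positive image (bright
    text on dark background) and a negative image (the reverse), using the
    region's mean intensity as an illustrative local threshold."""
    region = np.asarray(region, dtype=float)
    t = region.mean()
    positive = region > t    # candidate bright-text pixels
    negative = region <= t   # candidate dark-text pixels
    return positive, negative

# Example: a synthetic 20x60 region with a bright "text" stripe.
region = np.full((20, 60), 40, dtype=np.uint8)
region[8:12, 10:50] = 220
positive, negative = local_threshold(region)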

2.3 Classifier Fusion

Since no single classifier is powerful enough to encompass all aspects of visual concept detection, classifier fusion is investigated for the detection process by retaining soft decisions from independent models and fusing their outputs. As Figure 2 illustrates, we use multiple classifiers and their associated confidence scores. While the classifiers can

be combined in many ways, we explore normalized ensemble fusion to improve the overall classification results. The normalized ensemble fusion process consists of the three major steps shown in Figure 2. The first step is the normalization of the confidence scores produced by each classifier. The second step is the aggregation of the normalized confidence scores via a combiner function. The third step is an optimization over the multiple score normalizations and combiner functions to attain the best performance.

Figure 2: Three-stage normalized classifier fusion, which includes score normalization, combination, and selection.

The individual classifiers are trained and optimized against a training set. To determine the best normalized ensemble fusion parameters, an independent validation set is used. Each classifier generates a confidence score for each item in the validation set. These confidence scores are normalized to the range [0, 1]; the normalization schemes include (1) rank, (2) range, and (3) Gaussian normalization. After score normalization, the combiner function operates on the normalized scores of a selected subset of classifiers, so that we essentially identify the high-performing and complementary subsets of classifiers. The combiner functions we tried include (1) minimum, (2) maximum, (3) average, and (4) product. Subsequently, the best performing normalized ensemble fusion is selected by evaluating the performance measure against the Feature Validate ground truth. During the optimization stage, we favor fusion methods that generalize well in order to avoid over-fitting the validation set.
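A minimal sketch of the normalized ensemble fusion search described above (our illustration, not the system's code): each detector's validation scores are normalized by rank, range, or a Gaussian scheme (here a z-score squashed onto [0, 1], an assumed formulation), the normalized scores are combined with the minimum, maximum, average, or product, and the combination scoring best under a caller-supplied evaluation measure (e.g., average precision) is kept.

import numpy as np

def rank_norm(s):
    """Rank normalization: worst score maps to 0, best score to 1."""
    s = np.asarray(s, dtype=float)
    ranks = np.argsort(np.argsort(s))
    return ranks / max(len(s) - 1, 1)

def range_norm(s):
    """Range normalization onto [0, 1]."""
    s = np.asarray(s, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

def gaussian_norm(s):
    """Assumed Gaussian normalization: z-score mapped through a sigmoid."""
    s = np.asarray(s, dtype=float)
    z = (s - s.mean()) / (s.std() + 1e-12)
    return 1.0 / (1.0 + np.exp(-z))

NORMALIZERS = {"rank": rank_norm, "range": range_norm, "gaussian": gaussian_norm}
COMBINERS = {
    "min": np.minimum,
    "max": np.maximum,
    "avg": lambda a, b: (a + b) / 2.0,
    "prod": lambda a, b: a * b,
}

def select_fusion(scores1, scores2, labels, evaluate):
    """Exhaustively try normalization pairs and combiner functions on the
    validation scores of two detectors; keep the best-scoring combination."""
    best = None
    for name1, f1 in NORMALIZERS.items():
        for name2, f2 in NORMALIZERS.items():
            a, b = f1(scores1), f2(scores2)
            for cname, combine in COMBINERS.items():
                score = evaluate(combine(a, b), labels)
                if best is None or score > best[0]:
                    best = (score, name1, name2, cname)
    return best  # (validation score, normalization 1, normalization 2, combiner)

For the text overlay task, this kind of search selected rank normalization of both detectors combined by averaging (see Section 3.2.1).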

3. EXPERIMENTAL RESULTS

3.1 Overview of the NIST TREC-2002 Concept Detection Task

The goal of the TREC-2002 Video Track was to promote progress in content-based retrieval from digital video via open, metrics-based evaluation [9]. The track used 73.3 hours of publicly available digital video (MPEG-1) from the Internet Archive and the Open Video Project. The material comprised advertising, educational, industrial, and amateur films produced between the 1930s and the 1970s. Sixteen teams participated in one or more of three tasks: shot boundary determination, concept detection, and search. Results were scored by NIST using manually created truth data. The concept detection benchmark was as follows. 23.26 hours (96 videos containing 7891 standard shots) were randomly chosen from the corpus, to be used solely for the development of concept detectors. 5.02 hours (23 videos containing 1848 standard shots) were randomly chosen from the remaining material for use as a Feature Test Set. (The concept detection task is officially called "Feature Extraction" by NIST.) Given a standard set of shot boundaries for the Feature Test set




and a list of concept definitions, participants were to return for each concept a ranked list of at most the top 1000 video shots, ordered by the estimated likelihood that the concept is present. The ground truth for each concept was assumed to be binary, i.e., the concept is either present or absent in a video shot. Ten concepts were defined in this benchmark: Outdoors, Indoors, Face, People, Cityscape, Landscape, Text Overlay, Speech, Instrumental Sound, and Monologue. They are defined by textual descriptions; e.g., the definition of text overlay is: "Video segments that contain superimposed text large enough to be read." Seven groups participated in the Text Overlay detection benchmark, and each group could submit one or more runs of detection results. In total, eleven runs were submitted to NIST for the text overlay benchmark.

3.2 Comparison of the Text Overlay Concept Detectors

Training and validation of the models were done using the NIST feature training data set. We randomly partitioned the NIST feature training collection into a 19-hour Feature Train (FT) set and a 5-hour Feature Validate (FV) set. We used the FT set to construct the classifiers and the FV set to select model and fusion parameters and to evaluate the detection. The validation process helps avoid overfitting to the feature training data.

3.2.1 Fusion Improvements on Detection Results

We experimented with the normalized ensemble fusion of the texture and motion analysis method (referred to as detector 1) and the region boundary analysis method (referred to as detector 2). The main step is to determine the optimal fusion parameters and then apply these selections to the test data.

Detectors 1 and 2 output an associated set of confidence scores for the validation set. These confidence scores are re-normalized to the range [0, 1] according to (1) rank, (2) range, and (3) Gaussian normalization. Next, all possible normalized score pairs are combined by the (1) minimum, (2) maximum, (3) average, and (4) product functions. Subsequently, the optimal performer according to the NIST average precision measure [9] is selected. The resulting optimal fusion combination is to average the rank-normalized scores of detector 1 and detector 2. Figure 3 shows our individual and fused detector performances on the FV set, where a clear detection improvement is evident in the precision-recall curves.

Figure 3: Precision-Recall Curves on the Feature Validate Set. The average precision of detector 1 (DQ) is 0.4164, that of detector 2 (CD) is 0.3271, and that of the combined fusion detector (B*) is 0.5018.

Using the resulting fusion selection, the same fusion is applied to the NIST Feature Test data, and the corresponding precision-recall curves are plotted in Figure 4.

Figure 4: Precision-Recall Curves on the NIST Feature Test Set. The average precision of detector 1 (DQ) is 0.2941, that of detector 2 (CD) is 0.3324, and that of the combined fusion detector (B*) is 0.4181.

Figure 5: An example of the detection result. The first 16 returns of the classifier fusion results on the NIST Feature Test set.
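For reference, here is a minimal sketch of the non-interpolated average precision measure used above (our illustration of the TREC-style measure in [9]); the function name and the handling of the 1000-shot run limit are assumptions.

import numpy as np

def average_precision(scores, labels, max_returns=1000):
    """Non-interpolated average precision of a ranked shot list.
    scores: detector confidences (higher means more likely text overlay)
    labels: 1 if the shot truly contains a text overlay, else 0
    max_returns: at most this many top-ranked shots are scored (run limit)"""
    order = np.argsort(scores)[::-1][:max_returns]   # rank shots by descending score
    ranked = np.asarray(labels)[order]
    total_relevant = int(np.sum(labels))             # all positives in the ground truth
    hits, precision_sum = 0, 0.0
    for i, rel in enumerate(ranked, start=1):
        if rel:
            hits += 1
            precision_sum += hits / i                # precision at this relevant rank
    return precision_sum / total_relevant if total_relevant else 0.0

Under this definition, positives that never appear in the returned list contribute zero, so a detector that returns all relevant shots near the top of its ranking approaches an average precision of 1.0.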

3.2.2 Comparison of All Detectors Submitted to the NIST TREC Text Overlay Detection Benchmark

Various text overlay detectors were explored by the seven participants in the NIST text overlay detection benchmark. The Carnegie Mellon University Informedia Project [2] uses a heuristic text detector that utilizes a set of low-level image features such as HSV color histograms, textures, EDH edge features, aggregated line features, and MPEG motion vectors. The Eurecom team [10] uses supervised classification via Gaussian mixture models of MPEG DCT macroblock coefficients. The text detector of the Fudan University team [12] includes three steps: text block detection, text enhancement, and binarization; they apply a neural network to the Haar wavelet coefficients for text block detection, and a bootstrap method is used in training. IBM submitted two sets of results generated by the proposed detector and an SVM-based




detector. The MediaMill group trained models by active learning; they use an i-Notation system for annotation and build the detectors based on maximum entropy. The Microsoft Research Asia team [4] detects videotext on keyframes: candidate text regions are formed by merging, candidate text lines are detected, and edge maps are then used for final verification. The University of Maryland team [11] developed a classifier that takes the OCR output from each detected text box and classifies it either as a positive text result or a false alarm; this classifier partitions the set of possible characters into four groups to verify the type of the detection output.

Table 1 shows the official testing results of the eleven submitted detectors evaluated by NIST and measured in average precision [9]. Our proposed detector performs best among all the detectors: its average precision of 0.418 is 127% better than the second best, 0.184. According to the NIST ground truth, there are 110 positive examples in total among the 1848 shots. Therefore, the expected average precision of the most naïve detector, which selects shots at random, is about 0.06. All detectors perform better than that, and our proposed detector has the best performance. Figure 6 shows the precision-recall curves of these detectors; for groups that submitted more than one set of detection results, we show only their best detector according to average precision. Our proposed detector has the highest precision values except in the recall range of 0.57 to 0.75.

Symbol of Group       Average Precision of the individual detectors
R1 / R2 / R3          0.092 / 0.130 / 0.082
OM1                   0.184
Sys1                  0.118
M1 (IBM B*) / M2      0.418 / 0.181
L1 / L2               0.079 / 0.082
RA                    0.151
D1                    0.156

Table 1: Average precision of the text overlay detectors submitted to NIST Video TREC-2002.
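As a back-of-the-envelope check of the 0.06 baseline quoted above (our addition, using the common approximation that a random ranking's expected average precision is close to the fraction of positive shots):

\[
  \mathbb{E}\!\left[\mathrm{AP}_{\text{random}}\right] \;\approx\; \frac{R}{N} \;=\; \frac{110}{1848} \;\approx\; 0.06,
\]

where $R$ is the number of shots containing a text overlay and $N$ is the total number of Feature Test shots.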


Figure 6: Precision-Recall curves of seven results submitted to NIST TREC-2002 Video Text-Overlay detection benchmark.

4. CONCLUSION

We presented a fusion-based text overlay classifier that combines two individual detectors using a normalized ensemble fusion module. The system exploits the distinctive features of the two text detectors and thus generates improved results over either detector alone. Through the rigorous testing procedure of the NIST TREC-2002 benchmark, we find that the proposed classifier works well on different video data sets. Moreover, the fused classifier performs best among the 11 classifiers from seven teams in the benchmark. We observed that fusion of dissimilar detectors improves upon the performance of the individual detectors. In the future, we plan to explore more fusion methods and test the effect of fusion across various baseline detectors.

5. REFERENCES

[1] M. Bertini, C. Colombo and A. Del Bimbo, "Automatic caption localization in videos using salient points," Proc. IEEE International Conference on Multimedia and Expo (ICME 2001), Tokyo, Japan, August 2001.
[2] CMU Informedia Digital Video Understanding Project, http://www.informedia.cs.cmu.edu/
[3] C. Dorai, "Enhancements to Videotext Detection Algorithms, Confidence Measures, and TREC 2002 Performance," IBM T.J. Watson Research Center, Yorktown Heights, New York, September 2002.
[4] X.-S. Hua, P. Yin, H.-J. Zhang, "Efficient Video Text Recognition Using Multiple Frame Integration," International Conference on Image Processing (ICIP 2002), Rochester, New York, Sept. 22-25, 2002.
[5] H. Li, D. Doermann, O. Kia, "Automatic text detection and tracking in digital video," IEEE Transactions on Image Processing, Vol. 9, No. 1, January 2000.
[6] R. Lienhart, W. Effelsberg, "Automatic text segmentation and text recognition for video indexing," Multimedia Systems, 2000.
[7] T. Sato, T. Kanade, E. Hughes, and M. Smith, "Video OCR: Indexing Digital News Libraries by Recognition of Superimposed Captions," Multimedia Systems, Vol. 7, 1999.
[8] J.C. Shim, C. Dorai, and R. Bolle, "Automatic text extraction from video for content-based annotation and retrieval," Proc. IEEE International Conference on Pattern Recognition, 1998.
[9] A. F. Smeaton and P. Over, "The TREC-2002 Video Track Report," Proc. Text Retrieval Conference, Gaithersburg, MD, Nov. 2002.
[10] F. Souvannavong, B. Merialdo and B. Huet, "Semantic Feature Extraction using MPEG Macro-Block Classification," Proc. Text Retrieval Conference, Gaithersburg, MD, Nov. 2002.
[11] C. Wolf, D. Doermann, and M. Rautiainen, "Video Indexing and Retrieval at UMD," Proc. Text Retrieval Conference, Gaithersburg, MD, Nov. 2002.
[12] L. Wu, X. Huang, J. Niu, Y. Xia, Z. Feng, Y. Zhou, "FDU at TREC 2002: Filtering, Q&A, Web and Video Tasks," Proc. Text Retrieval Conference, Gaithersburg, MD, Nov. 2002.
[13] D. Zhang, B. L. Tseng, C.-Y. Lin, and S.-F. Chang, "Accurate Overlay Text Extraction for Digital Video Analysis," Int'l Conf. on Information Technology: Research and Education (ITRE 2003), Newark, New Jersey, Aug. 2003.
[14] Y. Zhong, H. Zhang, and A. K. Jain, "Automatic Caption Localization in Compressed Video," IEEE Trans. on PAMI, Vol. 22, No. 4, April 2000.
