A Fast Anchor Shot Detection Algorithm on Compressed Video

WeiQiang Wang 1, Wen Gao 1,2

1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100080
2 Department of Computer Science, Harbin Institute of Technology, 150001
Email: {wqwang, wgao}@ict.ac.cn

Abstract. Detecting anchor shots accurately is very important for automatically parsing news video and extracting news items. This paper presents a fast anchor shot detection algorithm based on background chrominance and skin tone models. The algorithm involves only simple computation, yet is robust. Moreover, it operates in the MPEG compression domain, which makes detection very fast. The algorithm was evaluated on a large test set containing more than 480,000 frames of news video from two different TV stations. More than 98.9% accuracy and 100% recall were obtained. The experimental results also show that the system has an average detection speed of 77.55 f/s. These statistics indicate the algorithm is fast and effective.

1. Introduction

To index, browse and retrieve the increasing amount of audio-visual information available, parsing tools are required to automatically extract structure and content features from the audio-visual data, from which a description of the content is generated. Owing to the current state of the art in computer vision and audio signal analysis, parsing general video automatically has not yet been well realized. In almost all research on parsing news video, the technique of detecting anchor shots is studied as an integral topic. Identification of anchor shots is crucial for extracting news items, since the items are often introduced and/or concluded by anchor shots. Merlino et al. [1] detect text tags in a broadcast's closed-caption transcript, such as '>>' (speaker change) and "I'm", as textual cues; combined with an a priori model of news structure, anchor shots are located. Zhang et al. [2] exploit image analysis techniques to identify anchor shots: for each shot generated by the shot segmentation module, the parser uses histogram and pair-wise pixel metrics to find anchor shot candidates, and further verifies them based on region models. Qi et al. [3] integrate audio and visual cues to detect anchorperson shots: after audio segments characterized by different speakers and shots represented by key frames are clustered, anchorperson candidates are selected based on the heuristic that the proportion of anchorperson speech/images is higher and their distribution is more dispersed; integration of the audio and visual channels then identifies the true ones.

This paper presents a fast anchor shot detection algorithm based on background chrominance and skin tone models. Compared with the aforementioned algorithms, the whole detection process can run online, and shot segmentation is not required. The speed is very fast since the algorithm operates in the compression domain. The rest of the paper is organized as follows. In Section 2, our algorithm is presented in detail. Section 3 gives the results of evaluation experiments on CCTV and JXTV news. Section 4 concludes the paper.

2. Anchor Shot Detection Algorithm

In news video produced by different broadcast corporations, an anchorperson frame always has a similar structure, made up of the anchorperson and the background, as in Fig. 1. Anchorpersons may differ, or wear different clothes on different days, but some region B in the background usually remains unchanged for a long period and has distinctive content. We assume there exists a sub-region T within region B that has consistent color texture; we call it a feature region. Region T's color model can then be constructed through statistical approaches and exploited to locate the start and end frame positions of anchor shots in news video.

Fig. 1. Typical anchor shots in CCTV news (showing the background, the feature region, and the anchorperson's face region)

2.1 Model Construction

Before detection, the system needs to construct models for all types of anchor shots in advance. This can be done with a semi-automatic tool. Each model is formalized as a 3-tuple G = <L, D, F>, where L is the feature region, D is the dynamic distribution of the DC values in L, and F is the anchorperson's face region. D is formalized as a 6-tuple <r_v^cb, r_avg^cb, r_sd^cb, r_v^cr, r_avg^cr, r_sd^cr>, where r_v^cb is the distribution interval of the DC values in L for the component Cb, and r_avg^cb, r_sd^cb are the distribution intervals of the average and standard deviation of the DC values in L. r_v^cr, r_avg^cr and r_sd^cr have the corresponding meanings for the component Cr.

The frames in anchor shots can be labeled manually through a learning tool, and the regions L and F are then chosen visually. Let f_i (i = 1, 2, ..., M) denote the frames in the training set, let D_ij (j = 1, 2, ..., N) denote the DC values in the region L of frame f_i, and let r_v^cb = [v_s^cb, v_e^cb]. Then

    v_s^cb = min_i { min_j D_ij } - tau_v^cb,    v_e^cb = max_i { max_j D_ij } + tau_v^cb    (1)

Define r_avg^cb = [avg_s^cb, avg_e^cb] and r_sd^cb = [sd_s^cb, sd_e^cb]. Then

    E_i = (1/N) * sum_{j=1..N} D_ij    (2)

    avg_s^cb = min_i { E_i } - tau_avg^cb,    avg_e^cb = max_i { E_i } + tau_avg^cb    (3)

    V_i = sqrt( (1/N) * sum_{j=1..N} (D_ij - E_i)^2 ),
    sd_s^cb = min_i { V_i } - tau_sd^cb,    sd_e^cb = max_i { V_i } + tau_sd^cb    (4)

where tau_v^cb, tau_avg^cb and tau_sd^cb are relaxation factors. Similarly, for the component Cr, the intervals r_v^cr, r_avg^cr and r_sd^cr can be calculated from the data in the training set. Experiments can help to select an appropriate model.

2.2 Algorithm Description

The algorithm in [4] can extract a DC image for each frame in MPEG compressed video with minimal decoding. The DC value of each block equals the average of the pixels in that block. By interactively choosing a feature region T (Fig. 1(b)), we can construct region T's color model through statistics. Since anchor shots generally last more than three seconds, the algorithm exploits two different resolution granularities to increase the detection speed: first, only the I frames are checked in display order; once frames belonging to an anchor shot are found, the start and end positions of the shot are refined at frame resolution. The detailed detection algorithm is as follows.

① Initialization. Open a video stream file fp, and obtain the sequence number CurFrmNum of the first I frame and the GOP length gl of the MPEG stream.

② For the chrominance components Cb and Cr, extract DC images X^cb = {x_ij^cb} and X^cr = {x_ij^cr} from the frame with sequence number CurFrmNum.

③ Let T denote the feature region and b_ij the block corresponding to the element with index (i, j) in the DC images X^cb and X^cr. Let C^cb = {c_ij^cb | c_ij^cb = x_ij^cb, b_ij ∈ T} and C^cr = {c_ij^cr | c_ij^cr = x_ij^cr, b_ij ∈ T}. Calculate the feature vector Q = (r_s^cb, r_e^cb, avg^cb, sd^cb, r_s^cr, r_e^cr, avg^cr, sd^cr), where

    r_s^cb = min_{d ∈ C^cb} d,    r_e^cb = max_{d ∈ C^cb} d,
    r_s^cr = min_{d ∈ C^cr} d,    r_e^cr = max_{d ∈ C^cr} d,
    avg^cb = (1/|C^cb|) * sum_{d ∈ C^cb} d,    avg^cr = (1/|C^cr|) * sum_{d ∈ C^cr} d,
    sd^cb = sqrt( (1/|C^cb|) * sum_{d ∈ C^cb} (d - avg^cb)^2 ),
    sd^cr = sqrt( (1/|C^cr|) * sum_{d ∈ C^cr} (d - avg^cr)^2 ),

and |C| denotes the cardinality of a set C.
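As a concrete illustration, the interval construction of Eqs. (1)-(4) and the per-frame feature test of steps ③ and ④ can be sketched as follows for a single chrominance component. This is a minimal sketch, assuming the DC values of the feature region have already been extracted from the stream; the function names and the default relaxation values are illustrative, not taken from the paper.

```python
import math

def build_component_model(dc_frames, tau_v=2.0, tau_avg=1.0, tau_sd=1.0):
    """Eqs. (1)-(4): distribution intervals for one chrominance component.
    dc_frames[i] holds the DC values inside the feature region L for
    training frame f_i; tau_* are the relaxation factors."""
    # Eq. (1): interval of raw DC values, widened by the relaxation factor
    v_s = min(min(f) for f in dc_frames) - tau_v
    v_e = max(max(f) for f in dc_frames) + tau_v
    # Eq. (2): per-frame mean E_i
    means = [sum(f) / len(f) for f in dc_frames]
    # Eq. (3): interval of the per-frame means
    avg_s, avg_e = min(means) - tau_avg, max(means) + tau_avg
    # Eq. (4): per-frame standard deviation V_i and its interval
    sds = [math.sqrt(sum((d - m) ** 2 for d in f) / len(f))
           for f, m in zip(dc_frames, means)]
    sd_s, sd_e = min(sds) - tau_sd, max(sds) + tau_sd
    return (v_s, v_e), (avg_s, avg_e), (sd_s, sd_e)

def frame_matches(dc_values, rv, ravg, rsd):
    """Steps 3-4 for one component: compute (r_s, r_e, avg, sd) over the
    DC values falling inside the feature region T, then test containment
    in the model intervals."""
    n = len(dc_values)
    r_s, r_e = min(dc_values), max(dc_values)
    avg = sum(dc_values) / n
    sd = math.sqrt(sum((d - avg) ** 2 for d in dc_values) / n)
    return (rv[0] <= r_s and r_e <= rv[1]
            and ravg[0] <= avg <= ravg[1]
            and rsd[0] <= sd <= rsd[1])
```

A frame is declared an anchorperson frame only when `frame_matches` holds for both the Cb and the Cr component.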
④ If [r_s^cb, r_e^cb] ⊆ [v_s^cb, v_e^cb], avg^cb ∈ [avg_s^cb, avg_e^cb], sd^cb ∈ [sd_s^cb, sd_e^cb], [r_s^cr, r_e^cr] ⊆ [v_s^cr, v_e^cr], avg^cr ∈ [avg_s^cr, avg_e^cr] and sd^cr ∈ [sd_s^cr, sd_e^cr] all hold, then the frame CurFrmNum is an anchorperson frame, where [v_s^cb, v_e^cb], [avg_s^cb, avg_e^cb], [sd_s^cb, sd_e^cb], [v_s^cr, v_e^cr], [avg_s^cr, avg_e^cr] and [sd_s^cr, sd_e^cr] are the dynamic ranges of the different components of the feature vector, obtained through the model construction process described above. Besides the features in Q, relational features can also be found and utilized to improve the detection accuracy; for example, we observed the relation avg^cb ≥ avg^cr in CCTV news.

⑤ To eliminate noise, the system declares an appearance event of an anchor shot if and only if it consistently detects anchorperson frames in Ws consecutive I frames. Similarly, a disappearance event is declared only when the system fails to detect anchorperson frames in We consecutive I frames. Ws and We are system parameters. Define SAnchorFrmNum and EAnchorFrmNum as the start and end frames of the anchor shot at GOP resolution.

⑥ Refine the start and end positions of the anchor shot at frame resolution, by applying the computation of steps ③ and ④ to the frames between SAnchorFrmNum - gl and SAnchorFrmNum, as well as between EAnchorFrmNum and EAnchorFrmNum + gl.

⑦ Set CurFrmNum = CurFrmNum + gl. If the end of the stream has not been reached, go to ②; otherwise close the stream file fp, and the whole detection process ends.

After the above computation, some false anchor shots may appear in the resulting clip set. To filter them out, a face skin tone model is exploited to refine the results. Though skin tone differs between individuals and ethnic groups, it falls within a specific region of the Cb-Cr plane; this fact has been applied by many systems [5-7]. Our system exploits it in a simpler form to help filter out the false claims. As in Fig. 1(b), let F be the anchorperson face region, and let dc_mn^Cb, dc_mn^Cr be the DC values of the blocks in region F for Cb and Cr respectively, where (m, n) is the position index of a block in F. We define a function to examine skin tone:

    Skin(m, n) = 1  if dc_mn^Cb ∈ [smin^Cb, smax^Cb] and dc_mn^Cr ∈ [smin^Cr, smax^Cr]
                 0  otherwise    (5)

where [smin^Cb, smax^Cb] and [smin^Cr, smax^Cr] are the distribution ranges of face skin tone in the Cb-Cr plane, which we constructed by sampling 80 different faces. If

    ( sum_{block (m,n) ∈ F} Skin(m, n) ) / BlockNum > µ    (6)

then region F contains a face and the clip is declared an anchor shot clip, where BlockNum is the number of blocks in F and µ is a predefined threshold.
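The skin tone verification of Eqs. (5)-(6) reduces to a few lines. In the minimal sketch below, the default Cb/Cr ranges are common literature values for skin tone, not the ranges the authors fitted from their 80 sampled faces; the default threshold 0.17 is the value reported in Section 3.

```python
def skin(dc_cb, dc_cr, cb_range=(77, 127), cr_range=(133, 173)):
    """Eq. (5): 1 if the block's (Cb, Cr) DC pair falls inside the skin
    tone region of the Cb-Cr plane, 0 otherwise."""
    return int(cb_range[0] <= dc_cb <= cb_range[1]
               and cr_range[0] <= dc_cr <= cr_range[1])

def contains_face(blocks, mu=0.17):
    """Eq. (6): accept the face region F when the fraction of skin-toned
    blocks exceeds the threshold mu. blocks is a list of (Cb, Cr) DC
    pairs, one pair per block in F."""
    hits = sum(skin(cb, cr) for cb, cr in blocks)
    return hits / len(blocks) > mu
```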

3. Evaluation Experiment

We randomly chose 11 days of CCTV news from a video database as test data to evaluate the algorithm. Three of them were used to construct the background model and the face skin tone model; the remaining data forms the test set. Before testing, we manually labeled all the anchor shots as a ground-truth reference. Table 1 tabulates the experimental results when only the background chrominance model is used to identify anchor shots. The detection accuracy is P = 1 - E/D = 1 - 8/96 = 91.7%, and the recall is R = 1 - U/S = 1 - 0/88 = 100%. Eight false claims occur in the results; our observation suggests that the blue tone in their feature regions is very similar to that of the anchor shots, which confuses the system.

Table 1. Experimental results using only the background model (Ws = We = 3)

Video    Frames    Anchor shots (S)    Output (D)    Error (E)    Missed (U)
News0     44965            9                9             0            0
News1     44770           11               15             4            0
News2     45106           15               17             2            0
News3     44913           13               15             2            0
News4     36339           13               13             0            0
News5     59610           12               12             0            0
News6     44447           10               10             0            0
News7     49775            5                5             0            0
Total    369925           88               96             8            0
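The accuracy and recall figures can be checked directly from the Table 1 totals; a small sketch of the computation used in this section:

```python
def accuracy_and_recall(S, D, E, U):
    """P = 1 - E/D (accuracy) and R = 1 - U/S (recall), where S is the
    number of true anchor shots, D the number of detected shots, E the
    false claims and U the missed shots."""
    return 1 - E / D, 1 - U / S

# Totals from Table 1 (background model only): S=88, D=96, E=8, U=0
P, R = accuracy_and_recall(S=88, D=96, E=8, U=0)
# P is about 0.917 (91.7%) and R is 1.0 (100%), matching the text
```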

For the obtained results, the face skin tone model is then applied as refinement. Keeping the recall at 100%, the system filters out seven of the eight false claims, and the accuracy increases to 98.9%, with µ equal to 0.17. These statistics indicate the algorithm is very effective. An additional experiment measured detection time to assess the speed of the algorithm; the results are tabulated in Table 2. Though the decoding engine embedded in the system decodes the MPEG-2 streams in the test set at only 8 f/s, Table 2 indicates the algorithm has an average detection speed of 77.55 f/s, faster than real time. This results from several factors: first, the algorithm operates entirely in the compression domain and involves only very simple computation; second, two resolutions are applied, and a large number of P and B frames are skipped in the coarse-resolution state.

Table 2. Detection time for the programs in the test set

News ID    Time (m:s)
0           9:40
1           8:44
2           9:20
3           9:39
4           8:28
5          12:36
6           9:22
7          11:41

Table 3. Experimental results for JiangXi Satellite TV news

Video       Frames    Anchor shots (S)    Output (D)    Error (E)    Missed (U)
JXTV0210     29600           9                9              0            0
JXTV0212     29484           9                9              0            0
JXTV0213     25990           8                8              0            0
JXTV0214     29500           8                8              0            0
Total       114574          34               34              0            0

To further evaluate the validity and robustness of the algorithm, 4 days of JiangXi Satellite TV news were chosen as another group of test data, to verify that the algorithm can also be applied to other TV stations' news. The corresponding experimental results are tabulated in Table 3. The accuracy and recall are R = 1 - U/S = 1 - 0/34 = 100% and P = 1 - E/D = 1 - 0/34 = 100%. The very high accuracy and recall demonstrate that the algorithm can be applied to news programs of different TV stations; high accuracy and recall are its prominent advantages.

4. Conclusion

This paper presents a fast anchor shot detection algorithm based on the background feature region's color model and a face skin tone model. We evaluated it on a large data set of news from two TV stations. The experimental results indicate excellent performance: 100% recall, more than 98.9% accuracy, and an average detection speed of 77.55 f/s. The algorithm is therefore fast and effective, and can be applied not only to CCTV news but also to other TV news, provided the anchorperson frames in the news video satisfy the assumptions in Section 2. All that needs to be done is to choose the background feature region and construct its color model anew, and to specify the anchorperson face region; this can be accomplished semi-automatically through an interactive tool.

Acknowledgment. This work has been supported by the National Science Foundation of China (contract number 69789301), the National Hi-Tech Development Programme of China (contract number 863-306-ZD03-01-2), and the 100 Outstanding Scientists Foundation of the Chinese Academy of Sciences.

References

[1] A. Merlino, D. Morey and M. Maybury, "Broadcast News Navigation Using Story Segmentation," Proc. ACM Multimedia 97, Seattle, USA, Nov. 1997, pp. 381-391.
[2] H. J. Zhang, S. Y. Tan, S. W. Smoliar and Y. Gong, "Automatic Parsing and Indexing of News Video," Multimedia Systems, 2: 256-266, 1995.
[3] W. Qi, L. Gu, H. Jiang, X. R. Chen and H. J. Zhang, "Integrating Visual, Audio and Text Analysis for News Video," IEEE ICIP-2000, Vancouver, Canada, Sept. 2000.
[4] B. L. Yeo and B. Liu, "Rapid Scene Analysis on Compressed Videos," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 5, No. 6, December 1995, pp. 533-544.
[5] H. Wang and S.-F. Chang, "A Highly Efficient System for Automatic Face Region Detection in MPEG Video," IEEE Transactions on Circuits and Systems for Video Technology, Special Issue on Multimedia Technology, Systems, and Applications, Vol. 7, No. 4, August 1997.
[6] K. Sobottka and I. Pitas, "A Novel Method for Automatic Face Segmentation, Facial Feature Extraction and Tracking," Signal Processing: Image Communication, Vol. 12, No. 3, pp. 263-281, June 1998.
[7] H. M. Zhang, D. B. Zhao, W. Gao and X. L. Chen, "Combining Skin Color Model and Neural Network for Rotation Invariant Face Detection," Int. Conf. Multimodal Interfaces 2000, Beijing, 2000.
