

Semantic Principal Video Shot Classification via Mixture Gaussian

Hangzai Luo, Jianping Fan, Jing Xiao
Dept. of Computer Science, University of North Carolina, Charlotte, NC 28223, USA
{hluo, jfan, xiao}@uncc.edu

ABSTRACT
As digital cameras become more affordable, digital video now plays an important role in medical education and healthcare. In this paper, we propose a novel framework to facilitate semantic classification of surgery education videos. Specifically, the framework includes: (a) semantic-sensitive video content characterization via principal video shots; (b) semantic video classification via a mixture Gaussian model to bridge the semantic gap between low-level visual features and semantic visual concepts in a specific surgery education video domain.

1. Introduction
In order to support content-based video access, several content-based video retrieval systems have been proposed [1-4], but they still suffer from the following challenging problems: (a) Semantic-Sensitive Video Content Characterization: Three approaches are widely used for characterizing video contents in a database: manual text annotation, shot-based, and object-based (or region-based). Manual text annotation is too tedious and expensive for large-scale video collections. The main weakness of the shot-based approach is that shot-based global visual features are too general to discriminate among various semantic concepts. The object-based approach is more effective in supporting semantic video classification [2-4], but automatic semantic video object extraction in the general case is very difficult if not impossible [5-6]. Clearly, more study is needed. A major difficulty in video content characterization is the lack of means to relate syntactic visual features to the semantic concepts that capture the meanings of the video clips. (b) Semantic Video Classification: The essential goal of semantic video classification is to enable convenient video retrieval based on semantics and to support naive users in specifying and evaluating their query concepts more effectively and efficiently. Bridging the semantic

0-7803-7965-9/03/$17.00 ©2003 IEEE

Xingquan Zhu Dept. of Computer Science University of Vermont Burlington, VT 05401, USA [email protected]

gap may be the biggest challenge that we face in supporting content-based video retrieval (CBVR) [7-13]. Obviously, an effective semantic video classification technique should incorporate domain-specific concepts heavily, and its performance largely depends on the effectiveness of the underlying framework for video content characterization. To address these problems, we have developed the following techniques: (a) semantic-sensitive video content characterization via principal video shots; (b) semantic video classification via a mixture Gaussian model to bridge the semantic gap in the specific domain of surgery education video. In the rest of the paper, we describe these techniques (Sections 2 and 3), discuss the performance of our system (Section 4), and provide some conclusions (Section 5).

2. Video Content Characterization
For each MPEG video in the database, our automatic shot detection technique [4] first detects all of its physical video shots; audio information is also used to enhance the performance of shot detection [15]. However, physical video shots and their low-level visual features are too general to discriminate among various semantic medical concepts. It is thus necessary and desirable to understand the principal visual patterns that medical experts use, and how they combine these patterns to decide whether two video shots are semantically similar. We focus on the specific domain of surgery education video. In this domain, the presence of semantic medical concepts pre-defined by domain experts, such as presentation, dialog, surgery, and diagnosis, is implicitly related to the relevant domain-dependent and concept-driven salient objects, such as slides, sketch drawings, human faces, blood-red regions, gastrointestinal regions, and skin regions. Thus we define principal video shots as the specific physical video shots that contain these concept-driven salient objects. As illustrated in Fig. 1, characterizing video contents via principal video shots helps bridge the semantic gap.
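As a minimal illustration of this definition (the Shot structure and object labels are our own naming, not from the paper), principal video shots can be selected by keeping only the physical shots in which at least one concept-driven salient object was detected:

```python
from dataclasses import dataclass

# Concept-driven salient objects named in the text above.
SALIENT_OBJECTS = {"slide", "sketch_drawing", "human_face",
                   "blood_red_region", "gastrointestinal_region", "skin_region"}

@dataclass
class Shot:
    shot_id: int
    detected_objects: set  # labels emitted by the salient object detectors

def principal_video_shots(shots):
    """Keep only the physical shots containing at least one salient object."""
    return [s for s in shots if s.detected_objects & SALIENT_OBJECTS]

# Example: three physical shots, two of which contain salient objects.
shots = [Shot(0, {"human_face"}),
         Shot(1, set()),
         Shot(2, {"skin_region", "blood_red_region"})]
print([s.shot_id for s in principal_video_shots(shots)])  # → [0, 2]
```

The set intersection makes the selection rule explicit: a shot is "principal" exactly when its detected objects overlap the concept-driven vocabulary.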

II - 189

ICME 2003



Table 1: Average performance data (precision versus recall) for some of our salient object detection functions:

            face    skin    slide   blood-skin  intestine
precision   87.5%   91.6%   89.6%   86.7%       88.7%
recall      12.8%   8.7%    10.8%   13.6%       11.5%

Figure 1: Detecting principal video shots helps bridge semantic gaps. (The figure maps physical video shots, described by low-level visual features, to principal video shots, and principal video shots to semantic visual concepts; each "small" semantic gap is bridgeable via label-based and semantic classification, respectively.)

Figure 2: Results of extracting the salient object gastrointestinal regions.

Figure 3: Results of extracting the salient object human faces.

We have designed a set of salient object detection functions; each function detects one type of the selected salient objects. As an example, consider the detection of the salient object gastrointestinal regions. Our detection function for gastrointestinal regions consists of three components: (a) low-level video analysis to obtain image regions that are homogeneous in color or texture; (b) classification of the relevant image regions as the semantic-sensitive salient object gastrointestinal regions; (c) temporal object tracking. Fig. 2 and Fig. 3 show a few example results of extracting the salient objects gastrointestinal regions and human faces. The average performance of our salient object detection functions is summarized in Table 1. The data are obtained by averaging the precision and recall rates for one type of salient object over 5000 surgical video clips. These rates are defined as:

precision = (1 + η) / (1 + ρ)    (1)

recall = (1 + ζ) / (1 + ρ)    (2)

where ρ indicates the set of all testing video clips that contain the specific salient object, ζ is the subset of ρ containing the testing video clips in which the salient object is present but not detected, and η is the subset of ρ containing the video clips in which the salient object is present and is detected correctly, i.e., η = ρ − ζ.

Principal video shots can be characterized by a set of low-level perceptional features. Shot-based visual features for surgical content representation include the 32-bin shot-based histograms of average HSV colors and their variances within a shot, and the 9-bin edge histograms as texture features. Shot-based motion features were not used because they did not show a strong impact on surgical video content characterization and semantic classification; this is very different from other video domains such as news and films. Object-based visual features include density (i.e., the coverage ratio between the object region and its enclosing rectangular box), height-width ratio, relative locations, dominant colors and their variances, and Tamura texture features. Shot-based auditory features include loudness, frequencies, pitch, fundamental frequency, and frequency transition ratio. Shot-based image-textual features include average length, average width, relative locations, and coverage within a shot.
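The shot-based color and edge features described above can be sketched as follows (a minimal NumPy illustration under our own assumptions: frames are already in HSV with values in [0, 1], only the hue channel is binned for brevity, and the function names are ours, not the paper's):

```python
import numpy as np

def shot_color_histogram(hsv_frames, bins=32):
    """32-bin histogram of per-pixel average hue over a shot, plus HSV variances.

    hsv_frames: array of shape (num_frames, H, W, 3), values in [0, 1].
    """
    mean_hsv = hsv_frames.mean(axis=0)            # average each pixel over the shot
    var_hsv = hsv_frames.var(axis=0)              # per-pixel variance over the shot
    # Histogram the averaged hue channel into 32 bins.
    hist, _ = np.histogram(mean_hsv[..., 0], bins=bins, range=(0.0, 1.0))
    hist = hist / hist.sum()                      # normalize to a distribution
    return hist, var_hsv.mean(axis=(0, 1))        # histogram + mean HSV variance

def edge_direction_histogram(gray_frame, bins=9):
    """9-bin edge-direction histogram used as a simple texture feature."""
    gy, gx = np.gradient(gray_frame.astype(float))
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx)                    # orientations in [-pi, pi]
    # Weight each orientation bin by edge magnitude so strong edges dominate.
    hist, _ = np.histogram(angle, bins=bins, range=(-np.pi, np.pi),
                           weights=magnitude)
    total = hist.sum()
    return hist / total if total > 0 else hist

# Example: a synthetic 4-frame "shot" of 32x32 HSV frames.
rng = np.random.default_rng(0)
frames = rng.random((4, 32, 32, 3))
color_hist, color_var = shot_color_histogram(frames)
edge_hist = edge_direction_histogram(frames[0].mean(axis=-1))
print(color_hist.shape, edge_hist.shape)
```

Concatenating such per-shot histograms (together with the object-based, auditory, and image-textual features listed above) would yield the feature vector X used by the classifier in Section 3.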

3. Semantic Video Classification
Classifying principal video shots into a set of pre-defined semantic medical concepts is very important for organizing large-scale video collections [12]. In this section, we propose a novel learning-based semantic video classification technique, where labeled training examples are






Figure 5: Results of principal video shot classification of a test video which consists of two semantic visual concepts: Traumatic Surgery and Diagnosis.

Figure 4: Results of principal video shot classification of a test video which consists of four semantic visual concepts: Presentation, Dialog, Traumatic Surgery, and Diagnosis.

selected and labeled by medical experts. Each training example is a pair (X, C(X)) consisting of a set of object-based and shot-based perceptional features X and a semantic label C(X). We model semantic video classification, i.e., the mapping from principal video shots to semantic medical concepts, by utilizing a set of object-based and shot-based perceptional features X = (x1, x2, ..., xm) and probabilities. We assume that the data distribution for the pre-defined semantic medical concepts can be described by a mixture Gaussian function in their perceptional feature space. The mixture probability function of the principal video shots residing in Ns semantic medical concepts is given by:

P(X) = Σ_{k=1}^{Ns} P(S_k) P(X | S_k, θ_{sk})    (3)

where P(X | S_k, θ_{sk}) is the conditional probability that a principal video shot with feature vector X belongs to the semantic medical concept S_k in its perceptional feature space, θ_{sk} is the set of Gaussian parameters associated with S_k, and P(S_k) is the fraction of principal video shots assigned to S_k. The posterior probability P(S_k | X, θ_{sk}) for a principal video shot with feature value X to be assigned to S_k can be obtained from:

P(S_k | X, θ_{sk}) = P(X | S_k, θ_{sk}) P(S_k) / P(X)    (4)

To build this Bayesian classifier, we must address two key problems: (a) the classifier structure (i.e., how many Gaussian components, and their percentages); (b) the classifier parameters (i.e., the Gaussian parameters). In our current experiments, we use a mixture Gaussian model to approximate the data distribution, and the Gaussian parameters are determined via an adaptive Expectation-Maximization (EM) algorithm. Our adaptive EM algorithm takes the following steps to determine the classifier structure: (1) The initial Gaussian components for each semantic medical concept are determined from its related principal video shots. (2) The EM algorithm is used to determine the Gaussian parameters θ_{sk} = (µ_{sk}, σ_{sk}) for each semantic medical concept by maximizing:

Σ_{p=1}^{N} Σ_{k=1}^{Ns} P(S_k | X_p, C_p) log P(S_k | X_p, θ_{sk})    (5)

where the training set of N examples with known semantic labels is denoted χ = {(X_p, C_p)}, p = 1, ..., N. (3) If the estimation error for step (2) is less than a pre-defined threshold δ, or the number of iterations is greater than T, stop and output the results. Otherwise, go to step (4). (4) Add a new Gaussian component and perform step (2) again.

After the classifier is obtained, the task of principal video shot classification is to assign an unlabeled principal video shot to the relevant semantic medical concept according to the maximum posterior probability rule. Fig. 4 and Fig. 5 show some results of principal video shot classification in the domain of surgery education video.

4. Performance Analysis
The average performance of our semantic surgical video classification technique is given in Table 2. The data are the average precision and recall rates for the same semantic medical concept over 5000 surgical video clips. The precision and recall rates here are defined as:


precision = (1 + ν) / (1 + θ + η)    (6)

recall = (1 + γ + η) / (1 + θ + η)    (7)

where θ is the set of all principal video shots that are classified into the specific semantic medical concept by the classifier, γ is the subset of θ consisting of irrelevant principal video shots (i.e., shots classified into the concept that do not belong to it), η is the set of relevant principal video shots that are misclassified (i.e., not classified into the corresponding semantic medical concept), and ν is the subset of relevant principal video shots in θ that are correctly classified, i.e., ν = θ − γ.

Table 2: Performance (precision versus recall) comparison for two different video content characterization frameworks, via principal video shots and via physical video shots:

                          present   surgery   dialog   diagnosis
principal    precision    81.6%     82.4%     79.8%    77.2%
video shots  recall       18.5%     18.1%     20.6%    23.4%
physical     precision    63.4%     64.8%     67.4%    56.7%
video shots  recall       37.1%     35.6%     33.2%    44.2%
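The mixture Gaussian fitting and maximum-posterior classification of Section 3 can be sketched as follows (our own simplified illustration: diagonal covariances, a fixed number of components rather than the paper's adaptive component-adding loop, and all function names ours):

```python
import numpy as np

def fit_gmm_em(X, k=2, iters=50, seed=0):
    """Fit a k-component diagonal-covariance Gaussian mixture to X via EM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, k, replace=False)]          # init from data points
    variances = np.tile(X.var(axis=0) + 1e-6, (k, 1))
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each sample.
        log_p = -0.5 * (((X[:, None, :] - means) ** 2) / variances
                        + np.log(2 * np.pi * variances)).sum(axis=2)
        log_p += np.log(weights)
        log_p -= log_p.max(axis=1, keepdims=True)       # numerical stability
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        variances = (resp.T @ (X ** 2)) / nk[:, None] - means ** 2 + 1e-6
    return weights, means, variances

def log_likelihood(X, model):
    """log P(X | S_k, theta_sk) under one concept's fitted mixture."""
    weights, means, variances = model
    log_p = -0.5 * (((X[:, None, :] - means) ** 2) / variances
                    + np.log(2 * np.pi * variances)).sum(axis=2)
    log_p += np.log(weights)
    m = log_p.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_p - m).sum(axis=1, keepdims=True))).ravel()

def classify(X, models, priors):
    """Assign each shot to the concept with maximum posterior, as in Eq. (4)."""
    scores = np.stack([np.log(pr) + log_likelihood(X, m)
                       for m, pr in zip(models, priors)], axis=1)
    return scores.argmax(axis=1)

# Toy example: two "concepts" with well-separated feature distributions.
rng = np.random.default_rng(1)
a = rng.normal(0.0, 0.5, (100, 4))
b = rng.normal(3.0, 0.5, (100, 4))
models = [fit_gmm_em(a), fit_gmm_em(b)]
labels = classify(np.vstack([a, b]), models, priors=[0.5, 0.5])
print((labels[:100] == 0).mean(), (labels[100:] == 1).mean())
```

Since P(X) in Eq. (4) is the same for every concept, the sketch compares only log P(S_k) + log P(X | S_k, θ_{sk}) and takes the argmax, which is equivalent to the maximum posterior probability rule.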

Comparing the performance of our semantic surgical video classifier under the two different video content characterization frameworks, via principal video shots and via physical video shots, it is clear that the classifier based on principal video shots performs better than the same classifier based on physical video shots. This shows that the accuracy of a semantic video classifier largely depends on the success of the underlying video content characterization and representation strategy. Characterizing video content via principal video shots not only makes the semantic gap bridgeable but also enables more effective semantic video classification.

5. Conclusion and Future Works
We have proposed a novel framework for supporting more effective video content characterization and semantic classification. As our results show, this work is attractive for bridging the semantic gap faced in CBVR systems. We are now working on: (a) feature dimension reduction, to reduce the size of the labeled training data set, because large-scale labeled training data are very expensive to obtain; (b) using unlabeled data to obtain more accurate estimation, because large-scale unlabeled training data are cheap to obtain and also carry information about the joint probability.

This project is supported by the National Science Foundation under IIS-0208539 and a fund from the AO Research Foundation, Switzerland.

References
[1] H.J. Zhang, J. Wu, D. Zhong and S. Smoliar, "An integrated system for content-based video retrieval and browsing", Pattern Recognition, pp. 643-658, 1997.
[2] Y. Deng and B.S. Manjunath, "NeTra-V: Toward an object-based video representation", IEEE Trans. on CSVT, vol. 8, pp. 616-627, 1998.
[3] S.-F. Chang, W. Chen, H.J. Meng, H. Sundaram and D. Zhong, "A fully automatic content-based video search engine supporting spatiotemporal queries", IEEE Trans. on CSVT, vol. 8, pp. 602-615, 1998.
[4] J. Fan, W.G. Aref, A.K. Elmagarmid, M.-S. Hacid, M.S. Marzouk and X. Zhu, "MultiView: Multi-level video content representation and retrieval", Journal of Electronic Imaging, vol. 10, no. 4, pp. 895-908, 2001.
[5] J. Fan, D.K.Y. Yau, A.K. Elmagarmid and W.G. Aref, "Image segmentation by integrating color edge detection and seeded region growing", IEEE Trans. on Image Processing, vol. 10, no. 10, pp. 1454-1466, 2001.
[6] S.-F. Chang, W. Chen and H. Sundaram, "Semantic visual templates: linking visual features to semantics", Proc. ICIP, 1998.
[7] M.R. Naphade and T.S. Huang, "A probabilistic framework for semantic video indexing, filtering, and retrieval", IEEE Trans. on Multimedia, vol. 3, pp. 141-151, 2001.
[8] J. Li, J.Z. Wang and G. Wiederhold, "SIMPLIcity: Semantic-sensitive integrated matching for picture libraries", VISUAL, Lyon, France, 2000.
[9] Z. Yang and C.-C. J. Kuo, "Learning image similarities and categories from content analysis and relevance feedback", ACM Multimedia, 2000.
[10] A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta and R. Jain, "Content-based image retrieval at the end of the early years", IEEE Trans. on PAMI, vol. 22, pp. 1349-1380, 2000.
[11] J. Huang, S.R. Kumar and R. Zabih, "An automatic hierarchical image classification scheme", ACM Multimedia, pp. 219-228, 1998.
[12] D. Zhong, H.J. Zhang and S.-F. Chang, "Clustering methods for video browsing and annotation", Proc. SPIE, pp. 239-246, 1996.
[13] Z. Su, S. Li and H.J. Zhang, "Extraction of feature subspace for content-based retrieval using relevance feedback", ACM Multimedia, pp. 98-106, 2001.
[14] B. Li, E. Chang and Y. Wu, "Discovery of a perceptual distance function for measuring image similarity", Proc. ACM Multimedia, 2002.
[15] Y. Wang, Z. Liu and J.-C. Huang, "Multimedia content analysis", IEEE Signal Processing Magazine, pp. 12-36, 2000.

