INTEGRATION OF MULTIMODAL FEATURES FOR VIDEO SCENE CLASSIFICATION BASED ON HMM

J. Huang, Z. Liu, Y. Wang
Dept. of EE, Polytechnic University, 6 MetroTech Center, Brooklyn, NY 11201
jhuang, zhul, [email protected]

Y. Chen, E. K. Wong
Dept. of CIS, Polytechnic University, 6 MetroTech Center, Brooklyn, NY 11201
ychen, [email protected]

Abstract - With the advance of multimedia and Internet technology, a huge amount of data, including digital video and audio, is generated daily. Tools for efficient indexing and retrieval are indispensable. Because such data carry multi-modal information, effective integration of the modalities is necessary and remains a challenging problem. In this paper, we present four different methods for integrating audio and visual information for video classification based on Hidden Markov Models (HMMs). Our results show significant improvement over using a single modality.

INTRODUCTION

With the advancement of multimedia and Internet technology, the amount of digital data, including TV programs, conference archives, and movies, grows exponentially. For efficient access, understanding, and retrieval of digital video, tools that can automatically understand the semantic content of a video are becoming indispensable. In this paper, we consider the classification of a video sequence into one of a few predetermined scene types. As a proof of concept, we focus on several typical scene categories in TV programs: news reports, weather forecasts, commercials, basketball games, and football games. Previously, we successfully applied Hidden Markov Models (HMMs) to this task using audio features [1]. To further improve the classification accuracy, we have recently explored the use of visual information in addition to audio, motivated by the fact that audio and visual features are effective at discriminating different scene classes. How to combine audio and visual information effectively for scene classification is a challenging problem. Integration of audio and visual features in an HMM classifier has been studied previously for speech recognition [2, 3] and for speech-to-lip-movement synthesis [4]. In this paper, we examine different techniques for integrating audio and visual information for video classification based on HMMs.

AUDIOVISUAL FEATURES

We consider two general types of features for video classification. The first type is audio features, which are computed from low-level acoustic properties. The second is visual features, which include color and motion.

Audio Features

The audio signal is segmented into overlapping 1.5-second clips, each shifted by 0.5 second from the previous clip. Each clip is then divided into 512-sample frames, each shifted by 256 samples from the previous frame. Eight features are computed for each frame: root mean square volume, zero crossing rate, pitch period, frequency centroid, frequency bandwidth, and the energy ratios in three subbands (0–630 Hz, 630–1720 Hz, and 1720–4400 Hz). Based on these frame-level features, 14 clip-level audio features are extracted: non-silence ratio, volume standard deviation, standard deviation of zero crossing rate, volume dynamic range, volume undulation, 4 Hz modulation energy, standard deviation of pitch period, smooth pitch ratio, non-pitch ratio, frequency centroid, frequency bandwidth, and the energy ratios of the three subbands. For more detailed descriptions and the selection of audio features, see [1, 5].
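A minimal sketch of this frame-level stage, assuming a mono signal sampled at 22.05 kHz as in the experiments; only two of the eight frame features (RMS volume and zero crossing rate) are shown, and the function and variable names are illustrative rather than the authors' code:

    import numpy as np

    def frame_features(x, frame_len=512, hop=256):
        """Return per-frame RMS volume and zero crossing rate."""
        feats = []
        for start in range(0, len(x) - frame_len + 1, hop):
            frame = x[start:start + frame_len]
            rms = np.sqrt(np.mean(frame ** 2))                   # root mean square volume
            zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2   # sign changes per sample
            feats.append((rms, zcr))
        return np.array(feats)

    def clip_frames(x, fs=22050, clip_len=1.5, clip_hop=0.5):
        """Split the signal into overlapping 1.5 s clips shifted by 0.5 s."""
        n_clip, n_hop = int(clip_len * fs), int(clip_hop * fs)
        return [x[s:s + n_clip] for s in range(0, len(x) - n_clip + 1, n_hop)]

    # Clip-level statistics (e.g. volume standard deviation) are then computed
    # from the frame-level values of each clip.
    fs = 22050
    x = np.random.randn(10 * fs)          # stand-in for a real audio track
    clip_stats = [frame_features(c).std(axis=0) for c in clip_frames(x, fs)]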

Visual Features

Color is the most widely used visual feature in video retrieval. Color features include the color histogram, dominant color, and the mean and standard deviation of colors. Motion is another useful visual cue; in theory, it is invariant to changes in color and lighting. Motion features include the motion histogram or phase correlation, dominant motion, and model parameters for global motion description. In the simulations presented below, the visual features are the most dominant color, the most dominant motion vector, and the mean and variance of the motion vectors. The video data were digitized at 10 frames per second. We adaptively quantize the colors of each video frame into 64 colors and determine the most dominant color and its percentage in the frame. We generate motion vectors with a hierarchical block matching algorithm for every pair of consecutive frames. We then calculate the mean and variance of the motion vectors and determine the most dominant motion vector and its percentage for every frame.
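A minimal sketch of the dominant-color feature, assuming a 240 x 180 RGB frame stored as a (H, W, 3) uint8 array. The paper quantizes each frame adaptively into 64 colors; for brevity this sketch uses a fixed uniform 4x4x4 quantization instead, and the block-matching motion features are omitted:

    import numpy as np

    def dominant_color(frame_rgb, levels=4):
        """Quantize to levels**3 colors; return (color index, fraction of pixels)."""
        q = frame_rgb.astype(np.uint16) * levels // 256            # per-channel level 0..levels-1
        idx = (q[..., 0] * levels + q[..., 1]) * levels + q[..., 2]
        counts = np.bincount(idx.ravel(), minlength=levels ** 3)
        best = int(np.argmax(counts))
        return best, counts[best] / idx.size

    frame = np.random.randint(0, 256, (180, 240, 3), dtype=np.uint8)  # stand-in frame
    color_id, pct = dominant_color(frame)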

INTEGRATION METHODOLOGIES

Hidden Markov Models (HMMs) are effective tools for modeling time-varying patterns. A discrete HMM is characterized by λ = (A, B, π), where A is the state transition probability matrix, B is the observation symbol probability matrix, and π is the initial state distribution. In a previous paper [1], we presented classification results for HMMs based on audio features only. We first train one model λ_i for each class i = 1, 2, ..., C, where C is the number of classes. For each observation sequence O, we compute P(O | λ_i) for every class and assign the sequence to the class with the maximum likelihood. In this paper, we explore how to combine audio and visual information using HMMs to enhance the classification accuracy. Four strategies are described below.
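A minimal sketch of this maximum-likelihood decision rule, assuming each class already has a trained discrete HMM (A, B, π); the scaled forward algorithm computes log P(O | λ_i) and the classifier picks the largest value (illustrative code, not the authors' implementation):

    import numpy as np

    def log_likelihood(O, A, B, pi):
        """Scaled forward algorithm for a discrete observation sequence O."""
        alpha = pi * B[:, O[0]]
        loglik = np.log(alpha.sum())
        alpha /= alpha.sum()
        for o in O[1:]:
            alpha = (alpha @ A) * B[:, o]
            loglik += np.log(alpha.sum())
            alpha /= alpha.sum()
        return loglik

    def classify(O, models):
        """models: list of (A, B, pi) tuples, one per scene class."""
        scores = [log_likelihood(O, *m) for m in models]
        return int(np.argmax(scores))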

Direct Concatenation

Concatenating the feature vectors from different modalities into one super vector is a simple way of combining audio and visual information; this approach has been reported in [2, 3] for speech recognition. As described in the previous section, the time scale of the visual features differs from that of the audio features. To synchronize the multi-modal features, we calculate the mean and variance of the most dominant color and its percentage, and the mean and variance of the most dominant motion vector and its percentage, over every 1.5 seconds, which corresponds to the time scale of an audio clip. In general, this approach improves classification results. However, as the feature dimension increases, more data are needed for training; with limited data, the resulting HMM tends to be unreliable.
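A minimal sketch of the synchronization step described above: visual features computed at 10 frames/s are summarized over each 1.5 s clip (15 frames, hopping by 5 frames) by their mean and variance and concatenated with the 14 clip-level audio features into one super vector per clip. Array shapes and names here are assumptions:

    import numpy as np

    def concatenate_features(audio_clip_feats, visual_frame_feats,
                             frames_per_clip=15, clip_hop_frames=5):
        """audio_clip_feats: (n_clips, 14); visual_frame_feats: (n_frames, d)."""
        super_vectors = []
        for k, a in enumerate(audio_clip_feats):
            v = visual_frame_feats[k * clip_hop_frames:
                                   k * clip_hop_frames + frames_per_clip]
            if len(v) < frames_per_clip:     # drop incomplete clips at the end
                break
            super_vectors.append(np.concatenate([a, v.mean(axis=0), v.var(axis=0)]))
        return np.array(super_vectors)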

Product HMM

We have observed that the features of different modalities, such as audio, color, and motion, are not highly correlated. Assuming the features are independent of each other, we train one HMM for each of the audio, color, and motion modalities separately, and feed the observation sequence of each modality into the corresponding HMM. The final observation probability is computed as P(O | λ_i) = P(O^a | λ_i^a) P(O^c | λ_i^c) P(O^m | λ_i^m), where λ^a = (A^a, B^a, π^a), λ^c = (A^c, B^c, π^c), and λ^m = (A^m, B^m, π^m) are the audio, color, and motion models. The three modules can have different numbers of states and observation symbols. This approach can easily accommodate a new modality if its features are independent of the existing ones. Note that in this and the following approaches, the features of different modalities can be calculated on different time scales.
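A minimal sketch of the product rule: under the independence assumption, the per-class score is the sum of the log-likelihoods from the audio, color, and motion HMMs, each evaluated on its own symbol sequence and time scale. The containers and names are illustrative, and log_likelihood() is reused from the earlier sketch:

    import numpy as np

    def classify_product(obs_by_modality, models_by_modality):
        """obs_by_modality: e.g. {'audio': O_a, 'color': O_c, 'motion': O_m};
        models_by_modality[m][i] is the (A, B, pi) model of class i for modality m."""
        n_classes = len(next(iter(models_by_modality.values())))
        scores = np.zeros(n_classes)
        for m, O in obs_by_modality.items():
            scores += [log_likelihood(O, *models_by_modality[m][i])
                       for i in range(n_classes)]
        return int(np.argmax(scores))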

2-Stage HMM

Experiments show that one modality can be effective for distinguishing certain types of scenes, while other modalities may be more effective for other scene types. We have observed that audio alone can effectively separate video into three broad scene categories: (1) commercials, (2) basketball and football games, and (3) news reports and weather forecasts; each category has distinct audio characteristics. On the other hand, visual information, such as color and motion, can distinguish basketball games from football games, and news reports from weather forecasts. Based on these observations, we propose a two-stage HMM classifier. An audio-based HMM classifier is first used to assign the input video sequence to one of the three broad categories: commercials, football/basketball games, and news reports/weather forecasts. In the second stage, visual-based HMM classifiers separate basketball games from football games, and weather forecasts from news reports.
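A minimal sketch of the two-stage decision: an audio HMM bank first picks one of the three broad categories, and for the two ambiguous categories a visual HMM bank makes the final choice. The class groupings follow the text; the function names and model containers are assumptions, and classify() is reused from the earlier sketch:

    def classify_two_stage(O_audio, O_visual, audio_models, visual_models):
        """audio_models: 3 HMMs for {ad, game, news/weather};
        visual_models: {'game': [bskb, ftb HMMs], 'news_wth': [news, wth HMMs]}."""
        broad = classify(O_audio, audio_models)        # stage I, audio only
        if broad == 0:                                 # commercials: decided at stage I
            return 'ad'
        if broad == 1:                                 # games: stage II on visual features
            return ['bskb', 'ftb'][classify(O_visual, visual_models['game'])]
        return ['news', 'wth'][classify(O_visual, visual_models['news_wth'])]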

Integration by Neural Network

Figure 1 shows the HMM/neural network architecture for integrating multiple modalities. For every modality, we train one HMM per video class. A three-layer perceptron is then used to combine the outputs of all modalities. Neural networks are powerful at realizing complex nonlinear mappings; here, the network learns the typical combinations of the per-modality classification results for each video class. Unlike the product HMM method, this approach does not assume that the features are independent. As in the product HMM approach, new modalities are easy to accommodate: we only need to train the HMMs for the new modality and then retrain the neural network, while the HMMs of the existing modalities remain unchanged.
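A minimal sketch of the combination step: for each sequence, the per-class log-likelihoods from every modality's HMMs are stacked into one input vector, and a three-layer perceptron (one hidden layer) maps that vector to the scene class. Using scikit-learn here is an assumption for brevity, not the authors' setup; log_likelihood() is reused from the earlier sketch:

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def hmm_scores(obs_by_modality, models_by_modality):
        """Stack log P(O_m | lambda_i^m) over all modalities m and classes i."""
        return np.array([log_likelihood(O, *model)
                         for m, O in obs_by_modality.items()
                         for model in models_by_modality[m]])

    # X: one row of stacked HMM scores per training sequence; y: class labels.
    # mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000).fit(X, y)
    # predicted = mlp.predict(hmm_scores(test_obs, models).reshape(1, -1))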

SIMULATION RESULTS

We examined the four integration methods in classifying five types of television programs: news reports, weather forecasts, commercials, live basketball games, and live football games. We collected about 20 minutes of video from TV broadcasts for each class. All video was digitized at 10 frames per second and 240 x 180 pixels per frame, and the audio was sampled at 22.05 kHz with 16 bits per sample.

Figure 1: Block diagram of the HMM/neural network integration. The input features of each modality drive that modality's HMM classifiers, and a three-layer perceptron combines the outputs of all the classifiers.

For the concatenation approach, we calculated the clip-level audio, color, and motion features described in the previous section. The feature vectors from every 20 successive clips form one observation sequence (11 seconds long) for the HMM classifier, and each sequence is shifted from the previous one by one clip. For the other three approaches, we determined the color and motion features for every frame; every 110 frames are grouped into an observation sequence, and the next sequence is obtained by shifting 5 frames from the previous one. We separated the data into a training set and a testing set so that each scene class has about 1000 sequences (corresponding to 10 minutes of video) in each set.

We used discrete ergodic HMMs [6], in which any state can be visited from any other state. The highest classification accuracy is usually achieved with 256 observation symbols, and the accuracy drops if more symbols are used; the number of states does not greatly influence the accuracy. Hence, all results below were obtained with 5-state HMMs and 256 observation symbols.

Table 1 gives the results using audio features only. The results using visual features only, which are not presented here, are inferior to those using audio features. Tables 2–5 show the results of the four proposed integration strategies. By integrating visual features with audio features, the classification accuracy improves by 7–12% over using audio features alone. In general, the improvements in identifying commercials and football games are more significant than in the other classes, possibly because of the repeated visual patterns in those two classes. The product HMM method gives the best average result, and its classification accuracy is uniformly good across classes; this confirms that the different modal features we used are not highly correlated. The concatenation method is the most accurate in classifying weather forecasts, possibly because the dynamic range of the audio and visual features is small in this scene type, so a small number of observation symbols for the super feature vector is sufficient.
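Throughout these experiments the HMMs are discrete, so each continuous feature vector must first be mapped to one of the 256 observation symbols. The paper does not detail this step; a common choice, sketched here purely as an assumption, is a k-means codebook learned on the training feature vectors:

    from sklearn.cluster import KMeans

    def build_codebook(train_feats, n_symbols=256):
        """train_feats: (n_vectors, d) array of clip- or frame-level features."""
        return KMeans(n_clusters=n_symbols, n_init=4, random_state=0).fit(train_feats)

    def to_symbols(codebook, feats):
        """Map feature vectors to discrete observation symbols 0..n_symbols-1."""
        return codebook.predict(feats)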

CONCLUSIONS AND DISCUSSIONS

We have shown that using joint audio and visual information can significantly improve the accuracy of scene classification over using audio or visual information alone, because multimodal features can resolve ambiguities that are present in a single modality. Direct concatenation offers a simple implementation, but it can suffer from limited training data when the feature dimension is large. The 2-stage HMM is application dependent and may not be effective for all classification tasks; it is difficult to decide which features should be used in each stage. With the product and neural network approaches, small modules can be trained reliably for the different modal features, and new features can be accommodated easily by building a new HMM for the new feature set. The product method achieves the best results in our study. However, we also observed that the concatenation approach can perform better when features are highly correlated.

ACKNOWLEDGMENTS

This work was supported in part by the National Science Foundation (NSF) through its STIMULATE program under Grant No. IRI-9619114.

References

[1] Z. Liu, J. Huang, and Y. Wang, "Classification of TV Programs Based on Audio Information Using Hidden Markov Model," IEEE Workshop on Multimedia Signal Processing, Los Angeles, CA, pp. 27–32, Dec. 7–9, 1998.
[2] M. T. Chan, Y. Zhang, and T. S. Huang, "Real-Time Lip Tracking and Bimodal Continuous Speech Recognition," IEEE Workshop on Multimedia Signal Processing, Los Angeles, CA, pp. 65–70, Dec. 7–9, 1998.
[3] G. Potamianos and H. P. Graf, "Discriminative Training of HMM Stream Exponents for Audio-Visual Speech Recognition," Proc. Intl. Conf. on Acoustics, Speech, and Signal Processing, Seattle, WA, Vol. 6, pp. 3733–3736, May 12–15, 1998.
[4] R. Rao, R. Mersereau, and T. Chen, "Using HMM's in Audio-to-Visual Conversion," IEEE Workshop on Multimedia Signal Processing, Princeton, NJ, pp. 19–24, June 23–25, 1997.
[5] Z. Liu, Y. Wang, and T. Chen, "Audio Feature Extraction and Analysis for Scene Classification," Journal of VLSI Signal Processing, pp. 61–79, Oct. 1998.
[6] L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, Vol. 77, No. 2, pp. 257–286, Feb. 1989.

                 Output class:    ad      bskb      ftb     news      wth
  True class
  ad                            75.66     7.36     0.38    15.66     0.94
  bskb                           1.46    91.79     5.29     1.46     0.00
  ftb                            1.82    13.28    83.64     1.26     0.00
  news                           0.00     0.19     4.58    57.55    37.68
  wth                            0.00     0.00     0.00    10.08    89.92

Table 1: Accuracy (%) using audio features only (average accuracy: 79.71%).

                 Output class:    ad      bskb      ftb     news      wth
  True class
  ad                            91.23     7.08     0.00     1.60     0.09
  bskb                           2.55    86.13     8.21     3.10     0.00
  ftb                            1.58     1.34    94.31     2.77     0.00
  news                           2.63     1.66     3.02    64.95    27.75
  wth                            0.00     0.00     0.00     4.17    95.83

Table 2: Accuracy (%) by direct concatenation (average accuracy: 86.49%).

                 Output class:    ad      bskb      ftb     news      wth
  True class
  ad                            93.58     0.47     0.00     5.38     0.57
  bskb                           6.39    93.34     0.27     0.00     0.00
  ftb                            0.00     0.00   100.00     0.00     0.00
  news                           7.30     2.14     0.29    83.54     6.72
  wth                            0.39     0.00     0.00    13.08    86.53

Table 3: Accuracy (%) by the product HMM (average accuracy: 91.40%).

  Stage I (audio)      Output:   ad    bskb/ftb  news/wth
  ad                           82.26      8.40      9.34
  bskb/ftb                      3.43     95.30      1.27
  news/wth                      0.00      2.77     97.23

  Stage II (visual)    Output:  bskb       ftb        Output:  news       wth
  bskb                         90.69      9.31       news     81.89     18.11
  ftb                           2.72     97.28       wth      16.18     82.85

  Overall accuracy (%):  ad 82.26   bskb 86.42   ftb 92.66   news 79.66   wth 80.60

Table 4: Accuracy (%) by the 2-stage HMM (average accuracy: 87.39%).

                 Output class:    ad      bskb      ftb     news      wth
  True class
  ad                            88.40     5.66     0.00     2.26     3.68
  bskb                           2.83    95.80     1.37     0.00     0.00
  ftb                            0.00     0.00   100.00     0.00     0.00
  news                           5.55    11.68     7.69    62.51    12.56
  wth                            0.00     0.87     0.58     4.46    94.09

Table 5: Accuracy (%) by the HMM/neural network method (average accuracy: 88.16%).