16th IPPR Conference on Computer Vision, Graphics and Image Processing (CVGIP 2003)
2003/8/17~19, Kinmen, ROC
Movie Scene Classification Using Hidden Markov Model

Yuan-Kai Wang (王元凱) and Chih-Yao Chang (張智堯)
Department of Electronic Engineering, Fu Jen Catholic University
510 Chung Cheng Rd., Hsin-Chuang, Taipei Hsien, 24205, Taiwan, R.O.C.
Tel: 886-2-29031111 ext 2101
E-mail: [email protected]

Abstract
Movies are a kind of complex video with rich content. The analysis of movies is more complicated than that of other types of videos, such as surveillance footage, sport games, and documentaries. In this paper, a statistical approach using hidden Markov models to classify movie scenes is proposed. Two important kinds of movie scenes, dialogue and fighting scenes, are classified. Color and motion features are extracted for each frame. The features of all frames within a scene are regarded as a time series of observations that is statistically modeled by a Gaussian mixture ergodic hidden Markov model. Two movies with 41 dialogue scenes and 15 fighting scenes are used in the experiments. The highest accuracy rate achieves 80%.

1. Introduction

Movie films are made up of a series of image frames. The fundamental syntactic unit of movies beyond the frame is the shot, which consists of a set of consecutive frames with consistent visual content. However, what really attracts and stays with the audience is the semantic scene. A semantic scene is usually a set of successive shots with consistent semantics, such as a two-person dialogue scene with alternating close-up shots. For video retrieval and indexing, the scene could be one of the most valuable query types for users.

Most earlier research on video structure put emphasis on shot change detection [1, 2, 3], also known as video segmentation. For example, Boreczky and Wilcox [4] segmented videos into shots and classified shot boundaries into cut, fade, pan, zoom, or dissolve. Other researchers [5, 6, 7] proposed shot classification approaches for the structural analysis of videos. Huang and Chang [8] extracted motion and color features for each shot and utilized hidden Markov models to perform shot classification. These papers work well on the syntactic, or shot-level, structure of videos because the video contents they analyze are confined to sport games, such as baseball, basketball, and football. Although these sport game videos may present multiple camera viewpoints, their backgrounds are a single sport field, and their structure is a single, straight story line on that single field. Shot classification is enough to extract the structure of such videos.

From shot to scene, similarities or correlations between shots are usually adopted as clues for scene boundary detection [9]. However, such detection is of little help for interpreting the meanings of scenes and shots. Yeung and Yeo [10] proposed a time-constrained clustering approach to find story units in films. A three-level video event detection algorithm proposed by Qian, Haering and Sezan [11] could detect events of wild animals in special-purpose videos.

Hidden Markov model (HMM) is an extension of the Markov chain. It adopts an embedded stochastic process to model an underlying stochastic process that is not observable (it is hidden), where each hidden state generates an observation. Because the HMM is capable of analyzing time-series data [12], it has been extensively applied in many areas, from speech recognition [13] to gene analysis [15]. Video data consists of a set of image frames with temporal relations. Since video can be regarded as abundant information distributed over a time series, there is a large literature that manipulates video with HMMs [16, 17, 18, 19]. These works treat the features of frames or shots as observations of hidden Markov models; the structure of a video is then statistically modeled as the state transitions of an HMM.

In this paper, a statistical approach using hidden Markov models is proposed for the extraction of semantic meanings in movies. Movies have rich content that is more complex than that of other kinds of videos, such as surveillance data and sport games. In particular, some complex structures, such as the dialogue scene, exist only in movies. The dialogue scene is the most common and important scene in movies, and its proportion is even higher in romance movies. Therefore, dialogue scenes are chosen for analysis in this paper. In addition to dialogue scenes, fighting scenes are also considered. Classifying fighting scenes is challenging because the characters move quickly, and the camera also moves quickly in order to capture the characters' motion. To classify dialogue scenes and fighting scenes, color and motion features of movie segments are extracted. These features are regarded as temporal observations to be modeled by HMMs. A fully ergodic Gaussian mixture HMM is proposed in our approach. The likelihood of a movie segment is calculated for each HMM, and classification is achieved by selecting the maximum likelihood.
This paper is organized as follows. Section 2 elaborates on dialogue and fighting scenes in detail. Section 3 elucidates the feature extraction performed before HMM classification. The HMM used to classify movie scenes is explained in Section 4. Experimental results are presented and thoroughly discussed in Section 5. Finally, concluding remarks are given in Section 6.
2. Definition of Scene

From the viewpoint of the audience, a movie is made up of a series of meaningful scenes, and a scene is a consciously created space. A scene may present different shot flows with different directors or cameramen. However, in order to build consistent semantics, these shot flows should contain common traits that appeal to the audience's sympathy. That is to say, there is a fixed model for the construction of shot flows. From the viewpoint of mathematics, a shot flow that makes up a meaningful scene is a shot sequence with statistical temporal relations. Moreover, the frames within a shot constitute a frame sequence with statistical temporal relations. In this paper, the temporal relations of two kinds of important and challenging movie scenes, dialogue and fighting scenes, are studied. The following subsections explain the temporal characteristics of these two kinds of scenes.

2.1 Dialogue Scene

A dialogue scene is a scene that contains two or more characters having a conversation. Dialogue scenes play an important role in a movie, because principal details have to be explained through dialogue. Figure 1 shows the shot flow of a dialogue scene.

Figure 1. The shot flow of an example dialogue scene with 9 shots.

The shot flow of a dialogue scene can be regarded as a combination of four kinds of shots: individual close-up, two-person dialogue, over-shoulder, and three-person dialogue.
1. Individual close-up: Only one person appears in the shot. This is used to emphasize the person who is talking.
2. Two-person dialogue: Two persons appear on the screen, with the characters usually posed face-to-face or back-to-back. This is one of the fundamental shots in dialogue scenes.
3. Over-shoulder: Another common two-person dialogue shot, showing one person's face and another person's shoulder.
4. Three-person dialogue: One character is in front and the other two are at the back, or vice versa, implying that three characters are having a conversation. It often appears as the first shot of a dialogue scene.

2.2 Fighting Scene

If two or more persons are fighting, the scene is called a fighting scene. Usually, the shot flow in a fighting scene changes quickly, and the average shot length is short. The audience can hardly be impressed by any single shot except the close-ups. Because no single shot leaves an impression in mind, it is considered a complicated scene. Chinese Kung-Fu movies are special action movies that contain a lot of fighting scenes. Figure 2 illustrates the shot flow of a fighting scene in a Chinese Kung-Fu movie.

Figure 2. The shot flow of an example fighting scene with 16 shots.

The fighting scene in Chinese Kung-Fu movies is a shot flow that combines four kinds of shots: flying, quick fighting, swift movement close-up, and special action close-up.
1. Flying: A person is flying or floating in the air, which is known as Qing-gong in Chinese Kung-Fu.
2. Quick fighting: A character fights with another at short distance. It is characterized by the swift actions of the characters and tremendous motion change.
3. Swift movement close-up: The camera moves with the character. It occurs when the character rushes up to attack his enemy.
4. Special action close-up: It appears very frequently in action scenes. It is used to emphasize a special action of the character, such as using secret weapons.
3. Feature Extraction

Two kinds of features are extracted, namely color features and motion features. Global characteristics of color information within a shot are considered stable; therefore, they have been used extensively for shot change detection and shot classification. A scene is composed of several shots, and the color information within a scene can be regarded as a stable transition from one global characteristic to another. These transitions appear in different fashions for different classes of scenes. Hence, color features are taken as part of our observations to be statistically modeled by the HMM.

Motion features estimate the movements of objects among consecutive frames. Different classes of scenes have different motion fashions. For example, characters in dialogue scenes are almost still or move slowly, while characters in fighting scenes move quickly. Motion features can therefore be discriminative characteristics for the classification of scenes. The next subsections give the details of the color and motion features.

3.1 Color Features

There are eleven color features: ten histogram values of quantized luminance, T(i), 1 ≤ i ≤ 10, and one absolute histogram difference D. The RGB color information of each frame is converted to luminance by the formula L = 0.3008R + 0.5859G + 0.1133B.

The luminance of a pixel ranges from 0 to 255. The range is quantized into 10 intervals, and the luminance histogram over these intervals describes the luminance distribution of the pixels in a frame. Suppose N is the total number of pixels in a frame, and H(i) denotes the number of pixels with luminance values in interval i, 1 ≤ i ≤ 10. The normalized histogram values of quantized luminance are defined as T(i) = H(i)/N, 1 ≤ i ≤ 10. The absolute histogram difference of adjacent frames, D, is defined as D = \sum_{i=1}^{10} |H_t(i) - H_{t-1}(i)|, where H_t(i) is the normalized luminance histogram of interval i for the t-th frame of a movie. T(i) and D are the color features extracted from each frame. Figure 3 shows the variation of the absolute histogram difference D over 1000 frames.

Figure 3. Absolute histogram difference D variation in 1000 frames.
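For concreteness, the eleven color features can be computed as in the following minimal sketch, assuming 8-bit RGB frames stored as NumPy arrays. The function name extract_color_features and the treatment of the first frame (D = 0) are our own illustrative choices; the paper does not specify them.

import numpy as np

def extract_color_features(frame, prev_hist=None):
    """11 color features of a frame: the 10 normalized luminance
    histogram bins T(1..10) and the absolute histogram difference D
    against the previous frame's normalized histogram."""
    # Luminance conversion with the paper's coefficients (range 0..255).
    r, g, b = (frame[..., c].astype(np.float64) for c in range(3))
    lum = 0.3008 * r + 0.5859 * g + 0.1133 * b
    # Quantize 0..255 into 10 intervals; T(i) = H(i) / N.
    hist, _ = np.histogram(lum, bins=10, range=(0.0, 255.0))
    t = hist / lum.size
    # D = sum_i |H_t(i) - H_{t-1}(i)|, taken as 0 for the first frame.
    d = 0.0 if prev_hist is None else float(np.abs(t - prev_hist).sum())
    return np.concatenate([t, [d]]), t

# Per-frame features of a segment: carry the previous histogram along.
# feats, prev = [], None
# for frame in frames:
#     f, prev = extract_color_features(frame, prev)
#     feats.append(f)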
3.2 Motion Features

Motion features are extracted by block matching. The block matching algorithm has been a widely used motion estimation method since Jain and Jain first introduced it in 1981 [21]. Each frame is divided into blocks of M*N pixels. The algorithm assumes that all pixels in the same block take the same movement; that is, a block has a motion vector d = [d_x, d_y]^T. The motion vector is estimated by searching for the best matching block within a search window of size (M+2W_x)*(N+2W_y) in the next frame. The dashed rectangle shown in Figure 4(a) is a search window. The difference between two blocks is the mean absolute difference (MAD), defined as MAD = \frac{1}{MN}\sum_{i=0}^{M-1}\sum_{j=0}^{N-1} |C_{ij} - R_{ij}|, where C_{ij} is the block in the current frame and R_{ij} is a block within the search window of the next frame. An exhaustive search is performed to find the R_{ij} with the minimum MAD. The displacement between the block and its best matching block is the motion vector. An example is shown in Figure 4(b), where the light-gray block is the best matching block in the next frame. M = N = 32 and W_x = W_y = 16 are chosen in our approach.
Figure 4. Block matching: (a) the search region; (b) the motion vector.
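The exhaustive block-matching search described above can be reproduced as in the sketch below, assuming grayscale frames as 2-D NumPy arrays. Skipping candidate blocks that fall outside the frame is our assumption, since the paper does not state how the search window is clipped at frame borders.

import numpy as np

def block_motion_vectors(cur, nxt, M=32, N=32, Wx=16, Wy=16):
    """For each M x N block C of the current frame, exhaustively search
    the (M+2Wx) x (N+2Wy) window of the next frame for the block R that
    minimizes MAD = (1/MN) * sum |C_ij - R_ij|; return one motion
    vector (dx, dy) per block."""
    H, W = cur.shape
    vectors = []
    for by in range(0, H - M + 1, M):
        for bx in range(0, W - N + 1, N):
            block = cur[by:by + M, bx:bx + N].astype(np.float64)
            best_mad, best_mv = np.inf, (0, 0)
            for dy in range(-Wy, Wy + 1):
                for dx in range(-Wx, Wx + 1):
                    y, x = by + dy, bx + dx
                    if y < 0 or x < 0 or y + M > H or x + N > W:
                        continue  # candidate block leaves the frame
                    cand = nxt[y:y + M, x:x + N].astype(np.float64)
                    mad = np.abs(block - cand).mean()
                    if mad < best_mad:
                        best_mad, best_mv = mad, (dx, dy)
            vectors.append(best_mv)
    return np.array(vectors)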
There are 20 features extracted from the motion vectors of a frame. Magnitudes and phases are calculated to form the features: a magnitude histogram, a phase histogram, the mean of the magnitudes, and the magnitude of the mean vector. The magnitude of each motion vector is quantized into 10 intervals, and the magnitude histogram is the distribution of the magnitudes of all motion vectors over these 10 intervals. The phase of each motion vector is quantized into 8 intervals ranging from 0 to 2π, and the phase histogram is the distribution of the phases of all motion vectors over these 8 intervals. The mean of magnitude is the average magnitude of all motion vectors in a frame. The magnitude of the mean vector is the magnitude of the average of all motion vectors. Figure 5 shows the motion magnitude histogram of a frame, and Figure 6 shows the motion vector phase histogram of a frame.

Figure 5. Motion magnitude histogram quantized into 10 intervals.

Figure 6. Motion vector phase histogram with 8 phase intervals from 0 to 2π.
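The 20 motion features can then be assembled from a frame's motion vectors as sketched below. The paper does not give the bin edges of the magnitude histogram, so the upper magnitude bound used here (the largest observed magnitude) is an assumption.

import numpy as np

def motion_features(vectors, mag_max=None):
    """20 motion features of a frame: a 10-bin magnitude histogram,
    an 8-bin phase histogram over [0, 2*pi), the mean of the
    magnitudes, and the magnitude of the mean vector."""
    mags = np.hypot(vectors[:, 0], vectors[:, 1])
    phases = np.arctan2(vectors[:, 1], vectors[:, 0]) % (2 * np.pi)
    if mag_max is None:
        mag_max = mags.max() + 1e-9  # assumed bound; not in the paper
    mag_hist, _ = np.histogram(mags, bins=10, range=(0.0, mag_max))
    ph_hist, _ = np.histogram(phases, bins=8, range=(0.0, 2 * np.pi))
    mean_mag = mags.mean()                         # mean of magnitude
    mag_of_mean = np.hypot(*vectors.mean(axis=0))  # magnitude of mean vector
    return np.concatenate([mag_hist / len(vectors),
                           ph_hist / len(vectors),
                           [mean_mag, mag_of_mean]])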
4. Hidden Markov Models for Movie Scene Classification

A hidden Markov model consists of states, probabilities of observations, and probabilities of state transitions. When states connect to each other only in a left-to-right topology, the model is called a left-right HMM. Left-right HMMs have been used extensively in speech recognition; the model has limited statistical modeling flexibility but efficient computation algorithms. However, in order to model the perplexing characteristics of movie scenes, an ergodic model is adopted in our approach. The difference between the ergodic model and the left-right model is shown in Figure 7. State transitions in the left-right model are restricted to a left-right fashion, whereas a state in the ergodic model can move to all states, including itself. Since the ergodic model places no restriction on state transitions, it is more suitable for complex applications such as movie scene classification.

Figure 7. Comparison between the left-right model and the ergodic model. (a) A left-right model with four states. (b) An ergodic model with four states.

Beyond the complex state transition structure, the observation probability distribution of the HMM in our approach adopts a Gaussian mixture model. Most applications of HMMs to classification problems assume that observations are discrete symbols; if the observed features are continuous, vector quantization is used to code a feature vector into a discrete symbol. However, the quantization induces inaccuracy and destroys information in the features. Therefore, a Gaussian mixture HMM is adopted in our approach to avoid this problem.

An HMM λ can be represented by a 3-tuple (A, B, π), where π is the initial state distribution, A = [a_{ij}] is the matrix of state transition probabilities, with element a_{ij} being the transition probability from state i to state j, and B = [b_j(O)] is the matrix of observation probabilities, with element b_j(O) being the probability of observing O in state j. For a Gaussian mixture HMM, b_j(O) = \sum_{m=1}^{M} c_{jm} \mathcal{N}(O; \mu_{jm}, U_{jm}), where c_{jm} ≥ 0 and \sum_{m=1}^{M} c_{jm} = 1. Here c_{jm} is the mixture coefficient of the m-th mixture in state j, and \mathcal{N}(O; \mu_{jm}, U_{jm}) is the probability of observing the vector O under the Gaussian distribution with mean vector \mu_{jm} and covariance matrix U_{jm} of the m-th mixture component in state j.
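As an illustration, b_j(O) for a single state can be evaluated as below. Diagonal covariance matrices U_jm are assumed for simplicity; the paper does not state the covariance structure it uses.

import numpy as np

def gaussian_mixture_density(o, weights, means, variances):
    """b_j(O) = sum_m c_jm * N(O; mu_jm, U_jm) for one state j.
    o: (d,) observation; weights: (M,) mixture coefficients summing
    to 1; means, variances: (M, d), diagonal covariances assumed."""
    diff = o - means                                    # (M, d)
    exponent = -0.5 * np.sum(diff * diff / variances, axis=1)
    norm = np.sqrt((2 * np.pi) ** o.size * np.prod(variances, axis=1))
    return float(np.sum(weights * np.exp(exponent) / norm))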
The training of an HMM by a set of observations means the estimation of A, B, and π. The training of the ergodic Gaussian mixture HMM is achieved by the Expectation-Maximization algorithm [14]. For a trained HMM λ, P(O|λ) is the probability of an observation sequence O given λ, and the forward-backward procedure [13] is used to compute it. Suppose an observation sequence of length T is denoted O = O_1 O_2 ... O_T. A forward variable is defined as α_t(i) = P(O_1 O_2 ... O_t, S_i | λ), i.e., the probability of the partial observation sequence up to time t (t ≤ T) and state S_i at time t, given the HMM model λ. P(O|λ) is obtained by the following calculation:
1) Initialization: α_1(i) = π_i b_i(O_1).
2) Induction: α_{t+1}(j) = [\sum_i α_t(i) a_{ij}] b_j(O_{t+1}).
3) Termination: P(O|λ) = \sum_i α_T(i).
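The three steps translate directly into code. Below is a minimal sketch in which the per-state observation densities b_j(O_t) are assumed to be precomputed, e.g., with the mixture density above; in practice the recursion should be scaled or run in log space to avoid numerical underflow on long sequences.

import numpy as np

def forward_probability(pi, a, b):
    """P(O | lambda) by the forward procedure.
    pi: (S,) initial state distribution; a: (S, S) transition matrix;
    b: (S, T) with b[j, t] = b_j(O_{t+1}) precomputed per frame."""
    S, T = b.shape
    alpha = pi * b[:, 0]               # 1) initialization: alpha_1(i)
    for t in range(1, T):
        alpha = (alpha @ a) * b[:, t]  # 2) induction: alpha_{t+1}(j)
    return float(alpha.sum())          # 3) termination: sum_i alpha_T(i)

Off-the-shelf packages such as hmmlearn's GMMHMM bundle this likelihood computation together with EM training and could serve the same purpose.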
There are two HMMs, λ_i, 1 ≤ i ≤ 2, in our approach. Each λ_i models the statistical characteristics of either the dialogue scene or the fighting scene. Given a movie segment, its features are extracted as an observation sequence O. The two probabilities P(O|λ_1) and P(O|λ_2) are estimated, and the movie segment is classified as class i, 1 ≤ i ≤ 2, if P(O|λ_i) is the greater probability.
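Classification then reduces to comparing the two likelihoods, as in the sketch below, which reuses forward_probability from the previous sketch. The model layout and the b_fn callback mapping a segment's T x 31 feature matrix (11 color plus 20 motion features per frame) to per-state densities are illustrative assumptions, not an interface given in the paper.

def classify_segment(features, models):
    """Return the label of the HMM with the larger P(O | lambda).
    models: label -> (pi, a, b_fn); b_fn turns the (T, 31) feature
    matrix into the (S, T) observation density matrix for that HMM."""
    scores = {label: forward_probability(pi, a, b_fn(features))
              for label, (pi, a, b_fn) in models.items()}
    return max(scores, key=scores.get)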
5. Experimental Results

Two movies are adopted in our experiments to verify our approach. The first is "Crouching Tiger, Hidden Dragon", a typical Chinese Kung-Fu movie. Its fighting scenes are longer than those of usual movies, but the standard deviation of its scene lengths is quite high because it also contains some shorter fighting scenes. Similarly, the dialogue scenes of the first movie take up much time and have a much higher standard deviation. The second movie is "Cat and Mouse", a drama. Since it has a more regular rhythm, the mean and standard deviation of its scene lengths are smaller. There are 25 dialogue scenes and 9 fighting scenes in the first movie, while in "Cat and Mouse" there are 16 dialogue scenes and 6 fighting scenes. A set of sample shots is illustrated in Figure 8, and the statistics of the scene lengths of the two movies are tabulated in Table 1.

Figure 8. Sample shots of the experimental movies.

Table 1. Mean and standard deviation of scene lengths of our experimental data.

                                 Dialogue    Fighting
  Movie 1   Mean (sec.)            62.96       99.33
            Standard Deviation     40.71       64.30
  Movie 2   Mean (sec.)            49.83       52.75
            Standard Deviation     21.46       20.26

Half of the dialogue scenes from each film are taken as training examples and the rest as testing examples; the same split is used for the fighting scenes. HMMs with different numbers of states are experimented with, and the classification results are shown in Table 2 and Table 3. In the experiments on dialogue scenes, better classification results are obtained by the HMMs with three states and with six states. For fighting scenes, there is an apparent accuracy improvement when the HMM has four states. When the state number is increased to five, log of zero occurs, which means that the amount of training data is insufficient.
Table 2. Classification results of dialogue scenes with different state numbers.

  State No.      2        3        4        5        6
  Accuracy     65.0%    80.0%    75.0%    55.0%    80.0%
Table 3. Classification results of fighting scenes with different state numbers.

  State No.      2        3        4        5        6
  Accuracy     28.5%    28.5%    57.4%    14.2%    14.2%
The accuracy for fighting scenes is not satisfactory. In addition to the HMM parameters, the choice of features is another reason. The motion features are obtained using block matching with exhaustive search. If the block size is large, the motion of an object will not be estimated precisely; if it is small, errors occur when the object is moving. Objects in fighting scenes, especially Chinese Kung-Fu scenes, move very quickly, so block matching may not be able to estimate object motion precisely. Another influential factor is the nonlinear movement of objects in Chinese Kung-Fu, such as object rotation. The last possible reason is that the number of training samples is insufficient for the HMMs to learn the scene features.
6. Conclusion and Future Work

Two kinds of movie scenes are modeled and analyzed by HMMs in this paper. The highest accuracy achieves 80%. Frames with more complicated object movements cannot yet be accurately modeled. In the future, other features could be proposed as discriminative observations to improve accuracy. Moreover, more movie scenes are needed for training the HMMs; a more complete analysis and statistical description of films can be constructed if more training data is utilized.
Reference

[1] P. Bouthemy, M. Gelgon, and F. Ganansia, "A unified approach to shot change detection and camera motion characterization", IEEE Transactions on Circuits and Systems for Video Technology, vol. 9, no. 7, pp. 1030-1044, 1999.
[2] S. V. Porter, M. Mirmehdi, and B. T. Thomas, "Video cut detection using frequency domain correlation", International Conference on Pattern Recognition, pp. 413-416, 2000.
[3] M. J. Pickering and S. M. Rüger, "Multi-timescale video shot change detection", Proceedings of the Tenth Text Retrieval Conference, pp. 22-25, 2001.
[4] J. Boreczky and L. Wilcox, "A hidden Markov model framework for video segmentation using audio and image features", Proceedings International Conference on Acoustics, Speech and Signal Processing, vol. 6, pp. 3741-3744, 1998.
[5] S. Porter, M. Mirmehdi, and B. Thomas, "Detection and classification of shot transitions", Proceedings of the British Machine Vision Conference, pp. 73-82, 2001.
[6] Y. Yusoff, J. Kittler, and W. Christmas, "Combining multiple experts for classifying shot changes in video sequences", International Conference on Multimedia Computing and Systems, vol. 2, pp. 700-704, 1999.
[7] M. Wu, W. Wolf, and B. Liu, "An algorithm for wipe detection", IEEE International Conference on Image Processing, vol. 1, pp. 893-897, 1998.
[8] C.-L. Huang and C.-Y. Chang, "Video summarization using hidden Markov model", Proceedings International Conference on Information Technology: Coding and Computing, pp. 473-477, April 2001.
[9] T. Lin and H.-J. Zhang, "Automatic video scene extraction by shot grouping", International Conference on Pattern Recognition, vol. 4, pp. 39-42, 2000.
[10] M. M. Yeung and B.-L. Yeo, "Time-constrained clustering for segmentation of video into story units", International Conference on Pattern Recognition, pp. 375-380, 1996.
[11] R. Qian, N. Haering, and I. Sezan, "A computational approach to semantic event detection", Proceedings Computer Vision and Pattern Recognition, vol. 1, pp. 200-206, June 1999.
[12] I. L. MacDonald and W. Zucchini, Hidden Markov and Other Models for Discrete-valued Time Series, Chapman and Hall, 1997.
[13] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition", Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.
[14] L. R. Rabiner and B. H. Juang, "An introduction to hidden Markov models", IEEE ASSP Magazine, vol. 3, no. 1, pp. 4-16, 1986.
[15] A. Krogh, I. S. Mian, and D. Haussler, "A hidden Markov model that finds genes in E. coli DNA", Nucleic Acids Research, vol. 22, no. 22, pp. 4769-4778, 1994.
[16] Z. Liu, J. Huang, and Y. Wang, "Classification of TV programs based on audio information using hidden Markov model", Proceedings of the 1998 IEEE Second Workshop on Multimedia Signal Processing, pp. 27-31, 1998.
[17] C. Taskiran, C. A. Bouman, and E. J. Delp, "Discovering video structure using the pseudo-semantic trace", Proceedings of the SPIE Conference on Storage and Retrieval for Media Databases, vol. 4315, pp. 571-578, 2001.
[18] W. Wolf, "Hidden Markov model parsing of video programs", Proceedings International Conference on Acoustics, Speech and Signal Processing, vol. 4, pp. 2609-2611, 1997.
[19] T. Liu and J. R. Kender, "A hidden Markov model approach to the structure of documentaries", Proceedings IEEE Workshop on Content-based Access of Image and Video Libraries, pp. 111-115, 2000.
[20] Y. Wang, Z. Liu, and J. Huang, "Multimedia content analysis using audio and visual information", IEEE Signal Processing Magazine, vol. 17, no. 6, pp. 12-36, 2000.
[21] J. R. Jain and A. K. Jain, "Displacement measurement and its application in interframe image coding", IEEE Transactions on Communications, pp. 1799-1808, 1981.