Motion Activity Based Semantic Video Similarity Retrieval†

Duan-Yu Chen, Suh-Yin Lee and Hua-Tsung Chen
Department of Computer Science and Information Engineering, National Chiao-Tung University, 1001 Ta-Hsueh Rd, Hsinchu, Taiwan
{dychen,sylee,huatsung}@csie.nctu.edu.tw

† This research is partially supported by the Lee & MTI Center, National Chiao-Tung University, Taiwan and the National Science Council, Taiwan.

Abstract. Semantic feature extraction from video shots and fast video sequence matching are important requirements for efficient retrieval in a large video database. In this paper, a novel mechanism of similarity retrieval is proposed: a similarity measure between video sequences that considers the spatio-temporal variation through consecutive frames. To bridge the semantic gap between low-level features and the rich meaning that users desire to capture, video shots are analyzed and characterized by the high-level feature of motion activity in the compressed domain. The extracted features of motion activity are further described by a 2D-histogram that is sensitive to the spatio-temporal variation of moving objects. In order to reduce the dimensionality of the feature vector space in sequence matching, the Discrete Cosine Transform (DCT) is exploited to map the semantic features of consecutive frames to the frequency domain while retaining the discriminatory information and preserving the Euclidean distance between feature vectors. Experiments are performed on MPEG-7 test videos, and the results of sequence matching show that a few DCT coefficients are adequate, revealing the effectiveness of the proposed mechanism of video retrieval.

1 Introduction

In the research of video sequence characterization, the most difficult task is to represent video content in a compact form while simultaneously providing enough information to describe the rich meaning of the video content. In the related literature, video shots are mainly represented by key-frames, and low-level features such as color, texture and shape are extracted from these key-frames to support indexing and retrieval. The disadvantage of such a strategy is that it ignores an inherent and significant feature: the spatio-temporal information of consecutive frames through video sequences. Therefore, some researchers take the temporal variation of video sequences into account to perform similarity matching. Wang et al. [1] propose a query-by-example system, which extracts features of color, edge and motion and performs similarity measurement of temporal patterns using dynamic programming. Lin et al. [2] segment a video shot into subshots and compute the similarity of video shots between corresponding subshots, characterized by two descriptors, dominant color histograms and spatial structure histograms. Cheung and Zakhor [3] utilize the HSV color histogram to represent the key-frames of video clips and design a video signature clustering algorithm for video similarity detection. Dimitrova et al. [4] represent video segments by color super-histograms. Roach et al. [5] identify and verify cartoon and non-cartoon videos by extracting motion features in the pixel domain. Zhao et al. [6] present an approach, the nearest feature line, for shot retrieval: lines connecting the feature points are used to approximate the variation within whole shots. Mohan [8] characterizes consecutive frames by using reduced intensity images computed from the DC-images of I-, P- and B-frames. Yeung and Liu [9] select key-frames non-linearly according to the temporal variation of I-frames and perform video sequence matching based on comparisons among the DC-images of key-frames.

In previous research on similarity matching among consecutive frames, most researchers focus on video partition, key-frame selection and low-level feature extraction from the selected key-frames [7]. In the strategy of key-frame matching, the dimensionality of the key-frame descriptors is quite high, and such high-dimensional feature vectors cause efficiency problems in indexing, searching and retrieving huge volumes of video data. Few efforts accomplish video similarity matching that takes high-level temporal variation into consideration throughout video sequences while at the same time reducing the dimensionality of the descriptors and preserving the original topology of the high-dimensional feature space. Hence, in this paper, in order to support high-level semantic retrieval of video content, the proposed motion activity descriptor, the 2D-histogram [10], is exploited to describe video segments considering the spatio-temporal relationships among video objects or moving blobs. Furthermore, to retrieve the nearest neighbors of a query and preserve the local topology of the high-dimensional space, the Discrete Cosine Transform is utilized to map the time sequences of the high-dimensional feature space to a lower-dimensional space. By applying the Discrete Cosine Transform, the original time sequence of feature vectors is transformed from the time domain to the frequency domain. Based on the energy-concentration property of the DCT coefficients, using a few DCT coefficients for indexing video segments does not affect the retrieval accuracy and is thus adequate for representing the feature of a video sequence.

The rest of the paper is organized as follows. Representation and matching of video sequences are described in Section 2. Section 3 presents the experimental results. Conclusions and future work are given in Section 4.

2 Video Sequence Matching

Video segments are characterized by the motion activity descriptors, and the Discrete Cosine Transform is applied to map the time sequence of each descriptor into the frequency domain. A few DCT coefficients are selected to represent the whole video segment, and the choice of similarity measure is based on the meaning of the DCT coefficients and the characteristics of the motion activity descriptor. The details of the representation of video sequences and the defined similarity measure are given in Subsections 2.1 and 2.2, respectively.

2.1 Representation of Video Sequences

In order to reduce the dimensionality of the feature vector space, the Discrete Cosine Transform is exploited. The algorithm of video sequence representation is described as follows.

Video Sequence Representation

Input: Consecutive P-frames {P1, P2, P3, …, PN}
Output: Sequences of reduced low-dimensional DCT coefficients {S1, S2, S3, …, Sk}
1. For each P-frame Pi, detect moving objects by clustering macroblocks that have similar motion vector magnitudes and similar motion directions.
2. For each object, compute its centroid and its size in terms of macroblocks.
3. Set the number of histogram bins to k.
4. For each P-frame Pi, compute the X-histogram and the Y-histogram according to the horizontal and vertical positions of the objects, respectively.
5. For each sequence of histogram bins [Bin_{t,j}^{Z}], where t ∈ [1, N], j ∈ [1, k] and Z ∈ {X, Y}, compute the transformed sequence [Z_{f,j}] by the Discrete Cosine Transform:

   Z_{f,j} = C(f) \sum_{t=1}^{N} Bin_{t,j}^{Z} \cos\!\left(\frac{(2t+1) f \pi}{2N}\right), \quad f \in [1, N]

6. Set the number of DCT coefficients to α.
7. For the k transformed sequences [Z_{f,j}] of DCT coefficients, select the DC coefficient and (α−1) AC coefficients to represent each transformed sequence.
8. Generate the k reduced low-dimensional sequences [Z_{f,j}], where f ∈ [1, α] and j ∈ [1, k].
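To make the steps concrete, the following Python sketch mirrors steps 3-8 under some assumptions: object detection and centroid computation (steps 1-2) are assumed done, yielding per-frame centroid arrays normalized to [0, 1]; the helper names build_histograms and represent_sequence are ours, not the paper's; and SciPy's orthonormal DCT-II stands in for the C(f)-normalized transform above (they agree up to normalization and the paper's 1-based indexing).

```python
import numpy as np
from scipy.fft import dct  # DCT-II; matches the transform above up to normalization

def build_histograms(centroids, k):
    """Steps 3-4: per-frame X- and Y-histograms of object centroids.

    centroids: list over the N P-frames; each entry is an (n_objects, 2)
    array of (x, y) centroids normalized to [0, 1].
    Returns two (N, k) arrays of bin counts.
    """
    N = len(centroids)
    hx, hy = np.zeros((N, k)), np.zeros((N, k))
    edges = np.linspace(0.0, 1.0, k + 1)
    for t, c in enumerate(centroids):
        if len(c) > 0:
            hx[t], _ = np.histogram(c[:, 0], bins=edges)
            hy[t], _ = np.histogram(c[:, 1], bins=edges)
    return hx, hy

def represent_sequence(hist, alpha):
    """Steps 5-8: DCT along the temporal axis of each bin sequence,
    truncated to the DC coefficient plus (alpha - 1) AC coefficients."""
    coeffs = dct(hist, type=2, norm='ortho', axis=0)  # shape (N, k)
    return coeffs[:alpha]                             # shape (alpha, k)
```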

2.2 Choice of Similarity Measure

By Parseval's theorem, the Euclidean distance between two transformed signals [W_f^X] and [H_f^X] of the X-histogram ([W_f^Y] and [H_f^Y] of the Y-histogram) in the frequency domain is the same as their distance in the time domain. Therefore, the L2-norm distance is used as the measure of the distance between two video sequences. Eq. (1) shows the distance measure of the j-th histogram bin between two transformed sequences in the frequency domain, where M is the number of selected DCT coefficients. The total distance of the X-histogram, Dist_X(W, H), and that of the Y-histogram, Dist_Y(W, H), are defined as the sums of the distances of the individual bins, as shown in Eq. (2). Hence, the distance between two video sequences can be defined as the sum of Dist_X(W, H) and Dist_Y(W, H).

Dist(W_j^X, H_j^X) = \sum_{f=1}^{M} (W_{f,j} - H_{f,j})^2, \quad Dist(W_j^Y, H_j^Y) = \sum_{f=1}^{M} (W_{f,j} - H_{f,j})^2        (1)

Dist_X(W, H) = \sum_{j=1}^{k} Dist(W_j^X, H_j^X), \quad Dist_Y(W, H) = \sum_{j=1}^{k} Dist(W_j^Y, H_j^Y)        (2)
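A direct NumPy transcription of Eqs. (1) and (2) might look as follows, assuming W and H are the (α × k) truncated coefficient arrays produced by the representation algorithm (the function name is ours):

```python
import numpy as np

def dist_histogram(W, H):
    """Eqs. (1)-(2): per-bin squared L2 distances between two (alpha, k)
    truncated DCT coefficient arrays, summed over all k bins."""
    return float(np.sum((W - H) ** 2))
```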

However, whether two video sequences w and h are regarded as similar is based on human perception of the spatio-temporal distribution of moving objects; i.e., w and h are considered similar if they conform to one or more of the following criteria: (1) the numbers of moving objects in w and h are similar; (2) the variations of the spatial distribution of moving objects in the horizontal direction in w and h resemble each other; (3) the variations of the spatial distribution of moving objects in the vertical direction in w and h are similar. In order to take these three criteria into account, the distance measure of video sequences defined in Eq. (2) is modified as Eq. (3), where the operator shr(n, H) denotes that each bin in the X-histogram or Y-histogram of the transformed DCT coefficients is shifted right and rotated by n bins. The rationale of Eq. (3) is that different video sequences may consist of multiple objects with similar spatial relationships but different spatial distributions.

Dist_X(w, h) = \min\big( Dist_X(W, H), Dist_X(W, shr(1, H)), Dist_X(W, shr(2, H)), \ldots, Dist_X(W, shr(k-1, H)) \big)
Dist_Y(w, h) = \min\big( Dist_Y(W, H), Dist_Y(W, shr(1, H)), Dist_Y(W, shr(2, H)), \ldots, Dist_Y(W, shr(k-1, H)) \big)        (3)
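A sketch of Eq. (3) under the same assumptions: np.roll plays the role of the shift-right-and-rotate operator shr(n, H) on the bin axis, and the minimum is taken over all k rotations (the function name is ours):

```python
import numpy as np

def dist_shift_invariant(W, H):
    """Eq. (3): the minimum Eq. (2) distance over all circular shifts of
    the bin axis (axis 1) of the (alpha, k) coefficient array H."""
    k = H.shape[1]
    return min(float(np.sum((W - np.roll(H, n, axis=1)) ** 2))
               for n in range(k))
```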

Therefore, the distances Dist_X(w, h) and Dist_Y(w, h) are considered together in the computation of the total distance Dist_total(w, h) between video sequences w and h. The total distance Dist_total(w, h) is defined in Eq. (4), where WT_H is the weight of the X-histogram (WT_V of the Y-histogram), N is the number of P-frames, and MV_{i,H} and MV_{i,V} are the average motion vector magnitudes of the X-component and Y-component, respectively, of the inter-coded macroblocks in the i-th P-frame. The similarity measure of Eq. (4) is based on the fact that human perception of the similarity of video sequences is usually affected by the moving direction of objects in addition to the number of objects. That is, video sequences are still regarded as similar if their objects move in the same or a resembling direction. In general, cameras pan or tilt while objects move horizontally or vertically. The overall motion in the horizontal and vertical directions of the frames is thus computed to decide the weights of the X-histogram and Y-histogram. If the movement of most regions is in the horizontal (vertical) orientation, the global motion or the motion of large objects is mainly in the horizontal (vertical) direction, so the X-histogram is weighted more than the Y-histogram. On the contrary, if most regions move in the vertical direction, the Y-histogram is assigned more weight than the X-histogram. The proposed similarity measure is thus very effective in differentiating video sequences whose global motion is in distinct orientations; for example, most players in a baseball game run in the vertical direction and the camera tilts to follow the players or to track the baseball, while players in a football game primarily run horizontally and the camera pans to focus on the significant events.

Dist_{total}(w, h) = WT_H \cdot Dist_X(w, h) + WT_V \cdot Dist_Y(w, h), \quad WT_H = \frac{1}{N} \sum_{i=1}^{N} \frac{MV_{i,H}}{MV_{i,H} + MV_{i,V}}, \quad WT_V = 1 - WT_H        (4)
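Combining the pieces, a sketch of Eq. (4); it reuses dist_shift_invariant from the previous sketch, and assumes mv is an (N, 2) array holding the average horizontal and vertical motion-vector magnitudes of the inter-coded macroblocks in each P-frame (the names and the small epsilon guard against all-zero motion are ours):

```python
import numpy as np

def total_distance(Wx, Hx, Wy, Hy, mv, eps=1e-9):
    """Eq. (4): motion-weighted sum of the shift-invariant X- and
    Y-histogram distances between two video sequences.

    Wx, Hx, Wy, Hy: (alpha, k) truncated DCT coefficient arrays.
    mv: (N, 2) array of per-P-frame average |MV_H| and |MV_V|.
    """
    wt_h = float(np.mean(mv[:, 0] / (mv[:, 0] + mv[:, 1] + eps)))  # WT_H
    wt_v = 1.0 - wt_h                                              # WT_V
    return wt_h * dist_shift_invariant(Wx, Hx) + wt_v * dist_shift_invariant(Wy, Hy)
```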

3 Experimental Results and Discussions

The testing data of the experiments are the Spanish news videos from the MPEG-7 test data set, segmented into 357 video shots. The content of the Spanish news mainly consists of shots of anchor persons, walking persons, football games, bicycle racing and interviews. The motion intensity of these shots ranges from low through medium to high, and the size of the moving objects varies from small, such as the players of a football game in the full-court view, to large, such as the players in the close-up view. The goal of the experiments is to evaluate 1) the effect of the number of bins of the 2D-histogram on the retrieval accuracy, 2) the effectiveness of exploiting the individual X-histogram and Y-histogram, and of combining them together, 3) the retrieval performance of the DCT-based feature space transformation and dimensionality reduction, and 4) the retrieval performance of the proposed object (moving region)-based motion activity descriptor. The performance metrics used in the experiments are precision and recall, which are collectively used to measure the effectiveness of a retrieval system. Eq. (5) shows the definition of precision and recall, where Retrieve(q) denotes the retrieved video sequences corresponding to a query sequence q, Relevant(q) denotes all the video sequences in the database that are relevant to q, and |·| indicates the cardinality of a set. Recall is defined as the ratio between the number of retrieved relevant video sequences and the total number of relevant video sequences in the video database, and precision is defined as the ratio between the number of retrieved relevant video sequences and the total number of retrieved video sequences.

Recall = \frac{|Retrieve(q) \cap Relevant(q)|}{|Relevant(q)|}, \quad Precision = \frac{|Retrieve(q) \cap Relevant(q)|}{|Retrieve(q)|}        (5)
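As a minimal illustration of Eq. (5), with the set semantics made explicit (the function and argument names are ours):

```python
def recall_precision(retrieved, relevant):
    """Eq. (5): recall and precision for one query, given the ids of the
    retrieved shots and of the ground-truth relevant shots."""
    hits = len(set(retrieved) & set(relevant))
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision
```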

Details of the experimental results are described in the following subsections. Subsection 3.1 shows the retrieval performance for different numbers of selected DCT coefficients. Subsection 3.2 compares the four motion activity descriptors. Subsection 3.3 examines the influence of the number of histogram bins on the retrieval accuracy, and Subsection 3.4 demonstrates the overall retrieval performance of the object-based motion activity descriptor on distinct video clips.

3.1 Decision of the Number of DCT Coefficients

Table 1. Performance comparison of different α settings using four feature descriptors (β = 4)

Descriptor              Setting   CU   BR   WP   AP
X-Histogram             Rank #1    2    2    1    5
                        Rank #2    3    3    2    3
Y-Histogram             Rank #1    2    2    1    5
                        Rank #2    1    3    2    3
2D Histogram            Rank #1    2    2    1    5
                        Rank #2    3    3    2    2
Weighted 2D Histogram   Rank #1    2    2    1    5
                        Rank #2    3    3    2    2

(Entries are the α values achieving the first- and second-ranked retrieval accuracy. CU: Close-Up, BR: Bicycle Racing, WP: Walking Person, AP: Anchor Person.)

Four representative video shots are selected for testing, in which the motion intensity ranges over low, medium and high and the object size varies from small through medium to large. The Close-Up (CU) shot is of high motion intensity, the Bicycle Racing (BR) shot is of medium motion intensity, the Walking Person (WP) shot is of high motion intensity and the Anchor Person (AP) shot is of low motion intensity. The numbers of frames of these four shots are 203, 596, 187 and 631, respectively. To evaluate the effect of the number of DCT coefficients on the retrieval performance, the number of DCT coefficients α is varied while the number of histogram bins β is fixed, and the descriptors X-histogram, Y-histogram, 2D-histogram and weighted 2D-histogram are utilized in turn. A value of α means that α DCT coefficients, including the DC coefficient and (α−1) AC coefficients, are used for similarity measurement.

3.2 Decision of the Motion Activity Descriptor

Table 2. Performance comparison among four motion activity descriptors using different parameter settings of β (α = 2)

β        Setting   CU     BR     WP     AP
β = 4    Rank #1   X      X      W-2D   W-2D
         Rank #2   W-2D   W-2D   X      2D
β = 6    Rank #1   W-2D   Y      X      W-2D
         Rank #2   X      W-2D   W-2D   2D
β = 8    Rank #1   X      W-2D   W-2D   W-2D
         Rank #2   W-2D   2D     2D     2D
β = 10   Rank #1   W-2D   W-2D   W-2D   X
         Rank #2   2D     2D     2D     W-2D

(X: X-Histogram, Y: Y-Histogram, 2D: 2D-Histogram, W-2D: Weighted 2D-Histogram. Shot abbreviations as in Table 1.)

The retrieval performance of the four types of shots, CU, BR, WP and AP, exploiting the four descriptors over different numbers of DCT coefficients (α = 1, α = 2, α = 3 and α = 5) is shown in Table 1. We can observe that the parameter setting α = 2 achieves the best retrieval accuracy. Hence, we infer from the experimental results that two DCT coefficients are adequate for similarity matching of video clips, and thus the DC coefficient and one AC coefficient are selected for the further experiments.

To evaluate the retrieval performance of the four motion activity descriptors, X-histogram (X), Y-histogram (Y), 2D-histogram (2D) and weighted 2D-histogram (W-2D), the four representative shots of Subsection 3.1 are used, the value of β is varied over 4, 6, 8 and 10, and the corresponding recall-precision pairs are measured. The overall performance of these four descriptors over different numbers of histogram bins is illustrated in Table 2. We can observe that in most cases the descriptor of the weighted 2D-histogram performs better than the other descriptors, or at least its retrieval ranking is 2. Therefore, the weighted 2D-histogram is selected as the motion activity descriptor for the further experiments.

3.3 Decision of the Number of Histogram Bins

From the experimental results of Subsections 3.1 and 3.2, two DCT coefficients, one DC and one AC, are used and the motion activity descriptor of the weighted 2D-histogram is exploited for the decision of the number of histogram bins. To assess the effect of the number of histogram bins β, the parameter β is varied over 4, 6, 8 and 10. The rank of the retrieval performance for each video shot is illustrated in Table 3. We can observe that the retrieval performance of the parameter setting β = 8 is better than the others and the worst case is the parameter setting β = 4. The experimental results reveal that the number of histogram bins should be moderate: the smaller the number of histogram bins, the less precisely the descriptor captures the variation of the spatial distribution. On the contrary, when the number of histogram bins is too large, the descriptor becomes very sensitive to slight changes in either the horizontal or the vertical direction.

Table 3. Performance comparison of different numbers of histogram bins (β)

Performance   CU   BR   WP   AP
Rank #1        6    8    8    8
Rank #2       10   10   10   10
Rank #3        8    6    6    6
Rank #4        4    4    4    4

(Entries are the β values ranked by retrieval performance; shot abbreviations as in Table 1.)

3.4 Evaluation of the Retrieval Performance

Table 4. Retrieval performance of the descriptor of the weighted 2D-histogram

Performance   CU    BR    WP    AP
Recall        79%   87%   93%   86%
Precision     81%   84%   90%   77%

Average recall: 86%    Average precision: 83%

The retrieval performance of the motion activity descriptor of the weighted 2D-histogram is illustrated in Table 4. In the experiment, 30 relevant shots out of the 347 shots are selected manually for each shot type, i.e., the number of similar video shots of each shot type is set to 30. Therefore, the number of returned video shots is also set to 30 to evaluate the performance measurements, recall and precision. In Table 4, we can observe that the recall of the four shot types is higher than 79%, and the recall of the BR, WP and AP shots is 86% or higher. The worst case is the AP shots, whose precision is 77%. Because the object size in the AP shots is quite large and the motion intensity is low, some medium-size objects in WP shots that move closely together and are caught by the camera in the center of the frame are detected as a single large object, and the corresponding shots are misclassified as AP. However, although the precision of the AP shots is lower than 80%, the precision of the CU, BR and WP shots is higher than 80%. From Table 4, the overall average recall and average precision reach 86% and 83%, respectively.

4 Conclusions and Future Work

In this paper, a novel method of similarity retrieval between video sequences, considering the spatio-temporal variation through consecutive frames, is proposed. For computational efficiency, all tasks process the videos in the compressed domain. Furthermore, to bridge the semantic gap between low-level features and the rich meaning that users desire to capture, video shots are analyzed and characterized by the high-level feature of motion activity. The extracted features of motion activity are further described by the object-based 2D-histogram. In order to reduce the dimensionality of the feature vector space in video sequence matching, the Discrete Cosine Transform (DCT) is exploited to map the semantic features of consecutive frames to the frequency domain while retaining the discriminatory information and preserving the distance between feature vectors. The energy of the DCT-transformed sequences is highly concentrated at low indices, and the experimental results reveal that two DCT coefficients are adequate for achieving good retrieval performance. In addition, the experimental results of sequence matching show that the retrieval performance of the proposed weighted 2D-histogram is better than that of the individual X-histogram, Y-histogram and 2D-histogram. The number of histogram bins should be moderate, since the object information becomes too noisy if the number of histogram bins is too large; on the contrary, if the number of histogram bins is too small, the object-based descriptor cannot reflect the variation of the spatial distribution and the temporal variation of moving objects throughout the video shots. The experimental results demonstrate good retrieval performance and reveal the effectiveness of the proposed mechanism of similarity retrieval. In the future, we will exploit other features to improve the retrieval accuracy, such as the color information (the luminance and chrominance of moving objects), the orientation of moving objects, and the global motion of camera operations.

References

1. R. Wang, M. R. Naphade, and T. S. Huang: Video Retrieval and Relevance Feedback in the Context of a Post-Integration Model. Proc. IEEE 4th Workshop on Multimedia Signal Processing, pp. 33-38, Oct. 2001.
2. T. Lin, C. W. Ngo, H. J. Zhang and Q. Y. Shi: Integrating Color and Spatial Features for Content-Based Video Retrieval. Proc. IEEE Intl. Conf. on Image Processing, Vol. 2, pp. 592-595, Oct. 2001.
3. S. S. Cheung and A. Zakhor: Video Similarity Detection with Video Signature Clustering. Proc. IEEE Intl. Conf. on Image Processing, Vol. 2, pp. 649-652, Sep. 2001.
4. L. Agnihotri and N. Dimitrova: Video Clustering Using SuperHistograms in Large Archives. Proc. 4th Intl. Conf. on Visual Information Systems, pp. 62-73, Lyon, France, Nov. 2000.
5. M. Roach, J. S. Mason and M. Pawlewski: Motion-Based Classification of Cartoons. Proc. Intl. Symposium on Intelligent Multimedia, Video and Speech Processing, pp. 146-149, Hong Kong, May 2001.
6. L. Zhao, W. Qi, S. Z. Li, S. Q. Yang and H. J. Zhang: Content-Based Retrieval of Video Shot Using the Improved Nearest Feature Line Method. Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, Vol. 3, pp. 1625-1628, 2001.
7. B. S. Manjunath, J. R. Ohm, V. V. Vasudevan and A. Yamada: Color and Texture Descriptors. IEEE Transactions on Circuits and Systems for Video Technology, Vol. 11, No. 6, pp. 703-715, June 2001.
8. R. Mohan: Video Sequence Matching. Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, Vol. 6, pp. 3697-3700, May 1998.
9. M. M. Yeung and B. Liu: Efficient Matching and Clustering of Video Shots. Proc. IEEE Intl. Conf. on Image Processing, Vol. 1, pp. 338-341, Oct. 1995.
10. D. Y. Chen, S. J. Lin and S. Y. Lee: Motion Activity Based Shot Identification. Proc. 5th Intl. Conf. on Visual Information Systems, pp. 288-301, Hsinchu, Taiwan, Mar. 2002.
