
Pattern Recognition Letters 25 (2004) 1451–1457 www.elsevier.com/locate/patrec

Shot reconstruction degree: a novel criterion for key frame selection

Tieyan Liu a, Xudong Zhang a,*, Jian Feng b, Kwok-Tung Lo c

a Multimedia Signal Processing Laboratory, Department of Electronic Engineering, Tsinghua University, Beijing, PR China
b Department of Computer Engineering and Information Technology, City University of Hong Kong, Hong Kong, PR China
c Center for Multimedia Signal Processing, Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong, PR China

Received 11 June 2003; received in revised form 15 March 2004
Available online 20 July 2004

Abstract

In this paper, the authors treat the key frame extraction problem from the viewpoint of shot reconstruction. As a result, a novel criterion called shot reconstruction degree (SRD) is proposed for key frame selection, based on the degree to which the motion dynamics of a video shot are retained. Compared with the widely used fidelity criterion, the key frame set produced by SRD can better capture the detailed dynamics of the shot. Using the new SRD criterion, a novel inflexion-based key frame selection algorithm is developed. Simulation results show that the new algorithm achieves good performance in terms of both fidelity and shot reconstruction degree. © 2004 Elsevier B.V. All rights reserved.

Keywords: Video content analysis; Video summarization; Key frame selection

* Corresponding author. Tel.: +86-10-6278-1450; fax: +86-10-6277-0317. E-mail addresses: [email protected] (T. Liu), [email protected] (X. Zhang), [email protected] (J. Feng), [email protected] (K.-T. Lo).

0167-8655/$ - see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2004.05.020

1. Introduction

Along with the fast development of the Internet and multimedia signal processing techniques, video content analysis and retrieval has become a very active research topic. In the past decades, people from different communities have been involved in the corresponding research work. Key frame selection is one of the important techniques for video summarization that can support video browsing and query by example. Among the many key frame selection algorithms developed so far, a critical problem remains: what type of frames should be selected as key frames in order to capture the most important content of a video? In fact, contradictory approaches have been proposed in the literature. For example, some researchers suggested that the frame with minimal motion energy should be selected, because cameras often focus on important objects or places for a relatively long period. Others regarded the frames with maximal motion dynamics as key frames, reasoning that if these frames are not captured, some unrecoverable information will be lost because of the dramatic content changes around them.

Because video summarization is a subjective task, a user study may be the most convincing way to compare algorithms such as those mentioned above and to reflect true perceptual judgments. However, subjective evaluation is poorly suited to analyzing video sequences automatically. Instead, researchers have developed objective criteria that measure the similarity between the selected key frames and the whole video. Such work includes Aigrain et al. (1996), Yahiaoui et al. (2002, 2003) and Chang et al. (1999), among which the most commonly used criterion is fidelity (Chang et al., 1999). Fidelity is usually computed as the maximum of the minimal distances between the key frame set and the video shot, that is, the semi-Hausdorff distance. Let

$$S = \{F(t + n\,\Delta t) \mid n = 1, \ldots, N\} \tag{1}$$

be a video shot containing $N$ frames, and let the key frame set selected from it be

$$KF = \{F(t + n_k\,\Delta t) \mid k = 1, \ldots, K\} \tag{2}$$

Define a distance function between any two image frames by Diff(·). The distance between $KF$ and an image frame $F(t + n\,\Delta t)$ in $S$ can then be calculated as follows.

$$d(t + n\,\Delta t) = \min_{k} \{\mathrm{Diff}(F(t + n\,\Delta t), F(t + n_k\,\Delta t))\}, \quad k = 1, \ldots, K \tag{3}$$

Then the semi-Hausdorff distance between $KF$ and $S$ is defined as

$$d_{sh}(S, KF) = \max_{n} \{d(t + n\,\Delta t)\}, \quad n = 1, \ldots, N \tag{4}$$
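The semi-Hausdorff distance of Eqs. (3) and (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: frames are represented as flat lists of pixel intensities, and `diff` is a mean absolute pixel difference chosen here for simplicity, since the text leaves Diff(·) unspecified. The function and variable names are hypothetical.

```python
def diff(frame_a, frame_b):
    """Assumed Diff(.): mean absolute pixel difference between two frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def semi_hausdorff(shot, key_indices):
    """Eqs. (3)-(4): largest distance from any frame to its nearest key frame."""
    # Eq. (3): distance of each frame to the closest key frame.
    nearest = [min(diff(frame, shot[k]) for k in key_indices) for frame in shot]
    # Eq. (4): worst case over all frames of the shot.
    return max(nearest)


# Toy shot: five 4-pixel frames whose brightness ramps from 0 to 40.
shot = [[v] * 4 for v in (0, 10, 20, 30, 40)]
print(semi_hausdorff(shot, [0, 4]))  # key frames at both ends of the shot
print(semi_hausdorff(shot, [1, 3]))  # key frames spread inside the shot
```

In this toy case the second key frame set yields the smaller distance, because no frame is far from its nearest key frame.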

Although fidelity is widely used in key frame selection, our concern is whether it is the most suitable criterion for this task. For video browsing and summarization, high fidelity can guarantee that the selected key frames are representative of a video shot. However, fidelity does not focus on local details, so it cannot capture the dynamics of the shot well. An example of this ineffectiveness is illustrated in Fig. 1.

Fig. 1. An example on ineffectiveness of the fidelity.

According to (4), the rectangular points and the circular points shown in Fig. 1 have the same fidelity: their semi-Hausdorff distances to the curve are both 0. Under a purely global constraint, the solution is not unique. However, the summarization capabilities of the two point sets are clearly quite different. The circular points do not capture the evolution trend of the curve, while the rectangular points do. In other words, the summarization capability of the circular points is worse than that of the rectangular points.

Based on the above discussion, in this paper we propose a new key frame selection criterion that focuses more on the local details and evolution trend of a video shot. The basic idea is that if the shot reconstructed by interpolating the key frames approximates the original shot well, then the key frame set captures the detailed dynamics of the shot well. In other words, the motion dynamics of the shot are well maintained. From this viewpoint, the rectangular points are certainly better than the circular ones in the above example. The new criterion is called shot reconstruction degree (SRD). In general, if all local features are approximated well, the global features will also be approximated well. Hence key frames that reconstruct the shot well will also lead to high fidelity.

In the following parts of this paper, the detailed definition of SRD is given and its effectiveness is examined. In Section 2, we formulate the SRD criterion in detail and discuss the relationship between SRD and fidelity. Section 3 introduces a new key frame selection algorithm following the idea of the SRD criterion. Experimental results are presented and discussed in Section 4. Concluding remarks and prospects for future work are given in Section 5.

2. Shot reconstruction degree

In this section, we propose a new criterion called shot reconstruction degree (SRD) for key frame selection, which focuses more on the details of the video sequence. The basic idea of SRD is to treat the video summarization process as a shot reconstruction problem. The better the shot reconstructed by interpolating the selected key frames approximates the original shot, the better the summarization capability of the corresponding key frame selection algorithm. The shot reconstruction viewpoint guarantees maximal preservation of information during key frame selection for a given key frame number. Equivalently, it retains the most motion dynamics of the original video for a given key frame set.

With a certain frame interpolation algorithm (denoted by FIP), the optimal key frame set under the SRD criterion is selected as follows. Let $\theta$ be a certain key frame selection algorithm. For a video shot containing $N$ frames,

$$S = \{F(t + n\,\Delta t) \mid n = 1, \ldots, N\} \tag{5}$$

$K$ key frames are selected by $\theta$,

$$KF = \{F(t + n_k(\theta)\,\Delta t) \mid k = 1, \ldots, K\} \tag{6}$$

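For $n_k(\theta) \le n < n_{k+1}(\theta)$, calculate the reconstructed version of $F(t + n\,\Delta t)$ based on $F(t + n_k(\theta)\,\Delta t)$ and $F(t + n_{k+1}(\theta)\,\Delta t)$ by

$$\hat{F}(t + n\,\Delta t; \theta) = \mathrm{FIP}\big(F(t + n_k(\theta)\,\Delta t),\, F(t + n_{k+1}(\theta)\,\Delta t),\, n_k(\theta),\, n,\, n_{k+1}(\theta)\big) \tag{7}$$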
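The reconstruction step of Eq. (7) can be illustrated with a minimal Python sketch. It does not implement the interpolation algorithm used in this paper. As a loudly labeled assumption, FIP(·) is replaced by a pixel-wise linear blend of the two bracketing key frames, and frames before the first or after the last key frame simply repeat that key frame. Frames are flat lists of pixel values, and all names are hypothetical.

```python
def fip_linear(frame_k, frame_k1, n_k, n, n_k1):
    """Stand-in for FIP(.): pixel-wise linear blend of two key frames."""
    w = (n - n_k) / (n_k1 - n_k)  # 0 at the left key frame, 1 at the right one
    return [(1.0 - w) * a + w * b for a, b in zip(frame_k, frame_k1)]


def reconstruct_shot(shot, key_indices):
    """Rebuild every frame of the shot from the key frames, as in Eq. (7)."""
    keys = sorted(key_indices)
    recon = []
    for n in range(len(shot)):
        if n <= keys[0]:
            # Assumption: hold the first key frame before the first key position.
            recon.append([float(p) for p in shot[keys[0]]])
        elif n >= keys[-1]:
            # Assumption: hold the last key frame after the last key position.
            recon.append([float(p) for p in shot[keys[-1]]])
        else:
            # Find the bracketing pair with n_k <= n < n_{k+1}.
            k = max(i for i, idx in enumerate(keys) if idx <= n)
            recon.append(
                fip_linear(shot[keys[k]], shot[keys[k + 1]], keys[k], n, keys[k + 1])
            )
    return recon


# Toy shot of one-pixel frames; key frames at positions 0, 2 and 4.
toy_shot = [[0], [10], [40], [30], [20]]
toy_recon = reconstruct_shot(toy_shot, [0, 2, 4])
print(toy_recon)
```

Key frames are reproduced exactly, while the frames between them are only approximated, and that approximation error is what a reconstruction-based criterion measures.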

Under SRD, the optimal key frame selection algorithm is

$$\theta_0 = \arg\max_{\theta} \{\mathrm{SRD}(\theta)\} \tag{8}$$

where $\mathrm{SRD}(\theta) = \sum_{n=0}^{N-1} \mathrm{Sim}\big(F(t + n\,\Delta t), \hat{F}(t + n\,\Delta t; \theta)\big)$, and Sim(·) is the similarity between two video frames.

From the above description, two items must be specified to make SRD computable. The first is the concrete definition of the similarity function Sim(·). Many choices are possible. Here, one feasible definition is as follows.

$$\mathrm{Sim}\big(F(t), \hat{F}(t)\big) = 10 \log \frac{255^2}{\dfrac{1}{N_I} \sum_{(x,y) \in I} \big| f(x, y, t) - \hat{f}(x, y, t) \big|^2} \tag{9}$$
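As a hedged illustration of Eq. (9) and of the SRD sum defined after Eq. (8), the following Python sketch computes a PSNR-like similarity and accumulates it over a shot. Several points are assumptions rather than part of the paper: the logarithm is taken to base 10 (as is usual for values in dB), frames are flat lists of 8-bit pixel values, and a perfectly reconstructed frame, for which Eq. (9) is unbounded, is assigned a hypothetical cap of 100 dB.

```python
import math


def sim(frame, recon, peak=255.0, cap=100.0):
    """Eq. (9): PSNR-like similarity in dB between a frame and its reconstruction."""
    mse = sum((a - b) ** 2 for a, b in zip(frame, recon)) / len(frame)
    if mse == 0.0:
        # Assumption: cap the unbounded value for an exact reconstruction.
        return cap
    return min(cap, 10.0 * math.log10(peak ** 2 / mse))


def srd(shot, recon_shot):
    """SRD(theta): sum of Sim over all frames and their reconstructions."""
    return sum(sim(f, r) for f, r in zip(shot, recon_shot))


# Toy example: one frame reconstructed with maximal error, one exactly.
original = [[0, 0, 0, 0], [0, 0, 0, 0]]
rebuilt = [[255, 255, 255, 255], [0, 0, 0, 0]]
print(srd(original, rebuilt))
```

Because the key frames themselves are reconstructed exactly, the choice of cap influences the absolute SRD value. Comparisons between key frame sets of the same size are affected less.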

Here, $N_I$ is the number of pixels in the video frame. We choose this definition because it is a PSNR-like formulation, which helps to map the relationship between key frame number and SRD to that between rate and distortion.

The second item is the implementation of the frame interpolation function FIP(·). In this paper, we adopt an inertia-based frame interpolation algorithm (IMCI) (Liu et al., 2003). We use this method because it takes great advantage of the redundancy within video frames (in the sense of both spatial and motion inertia) and therefore offers a very good interpolation result. However, we should point out that the interpolation algorithm is used only for calculating the distance between the key frames and the whole video shot. Theoretically, any interpolation algorithm will lead to a similar analytical result. A better interpolation algorithm is adopted only so that the differences caused by various key frame selection results are not masked by the errors introduced by the interpolation algorithm. Beyond this, no other assumptions are embedded.

Having formulated the SRD criterion, we now discuss the relationship between SRD and fidelity. As stated above, SRD is better than the fidelity criterion in terms of retaining the motion dynamics of the sequence. One may argue that, although SRD is better than fidelity in the sense of motion dynamics, this does not mean that the SRD criterion is better in other aspects. In response, we point out that SRD is in fact a stricter criterion than fidelity. If every local property is optimized, the global performance will therefore be optimized. That is to say, an algorithm with higher SRD will most likely also have higher fidelity. This deduction is verified by the experiments in Section 4.

After discussing the advantage of SRD, we should also look into the problems of this new criterion. First, it is an objective criterion, which may not always agree with human perception. However, as with the fidelity criterion, an objective criterion may be more useful for automatic video processing tasks. Second, the computational complexity of the corresponding evaluation protocol is quite high because of the computation-intensive frame interpolation process. This problem may not affect the role of SRD as a criterion, but applying SRD to algorithm optimization calls for considerable simplification.

3. An SRD-oriented key frame selection algorithm

In Section 2, we described the definition of SRD. In this section, we design a new key frame selection algorithm that is optimized according to the SRD criterion. For this purpose, we should first find out which features heavily affect the SRD of a key frame set. Clearly, the key frame position is one such critical feature. Fig. 2 shows an example of curve reconstruction, where different skeleton positions lead to quite different reconstruction results. This figure suggests that selecting key frames at inflexions may result in good reconstruction performance.

Fig. 2. Key frame positions affect reconstruction result. Selecting key frames: (a) at inflexions; (b) randomly.

However, a video sequence is more complex than the example shown in Fig. 2: it is a stochastic process, or a high-dimensional curve. Proceeding as in Fig. 2 is not an easy task. How should inflexions be defined in a high-dimensional space? How can the continuity and regularity of the curve be guaranteed? Some researchers have proposed solutions in this direction. In (Latecki et al., 2001), a 37-dimensional feature covering the Y, U, V color components and the frame number is used to represent a video frame. A video sequence is then mapped to a continuous and regular trajectory in the $\mathbb{R}^{37}$ space. A polygon simplification technique is developed to select the inflexions in the trajectory for representation of the video. Based on the idea of Latecki et al. (2001), our method uses an even simpler approach to tackle this problem. We calculate the motion energy of each frame as shown in (10) and connect these energy values using a continuous curve in 2D space.

$$ME(F_n) = \sum_{j=0}^{[H/M]-1} \; \sum_{i=0}^{[W/M]-1} \big( \Delta x_{n-1,n}^2(i, j) + \Delta y_{n-1,n}^2(i, j) \big) \tag{10}$$

where $W$ and $H$ denote the width and height of the frame, respectively, $M \times M$ is the block size, and $\Delta x_{n-1,n}(i, j)$ and $\Delta y_{n-1,n}(i, j)$ are the displacements of the $(i, j)$th block from frame $n-1$ to frame $n$ in the horizontal and vertical directions, respectively. After that, the same polygon simplification steps as in (Latecki et al., 2001) are carried out to search for inflexions on the motion energy curve. The reason for using the motion energy curve is that motion energy is a unique feature of a video in


the temporal domain. An inflexion point on this curve corresponds to a frame with a sudden change in motion activity. We pick those frames as key frames because they are hard to predict from other frames and would have large reconstruction errors if interpolated from them. Even so, we should acknowledge that the true inflexions of the video sequence may not coincide exactly with these pseudo inflexions of the motion energy curve. In other words, we aim at a sub-optimal solution that trades off complexity against performance.

Because the number of inflexions may not equal the target key frame number, we use the following refinement process to bridge the gap. Suppose the motion energy curve has $K_0$ inflexions, and the target key frame number is $K_T$. We proceed as follows to select the final key frames.

(1) Select the $K_0$ frames at inflexions into the key frame set.
(2) If $K_0 < K_T$, reconstruct the whole video shot based on the $K_0$ key frames using the IMCI technique. Add the $(K_T - K_0)$ frames with the largest reconstruction errors to the final key frame set.
(3) Else if $K_0 > K_T$, reconstruct the even frames in the key frame set based on the odd frames. Remove the $(K_0 - K_T)$ frames with minimal reconstruction errors from the key frame set.

After the refinement steps, we finally obtain $K_T$ key frames.

Now let us discuss the computational complexity of the above algorithm. Besides the computations of polygon simplification, if there are $N$ frames in a shot with $K_T$ key frames selected, approximately $O(N + K_T/2)$ MCI operations are needed. This is not a light load. Fortunately, in some cases the key frames need not be selected in real time, because video database maintenance is usually an offline task. Temporal sub-sampling can also be used to reduce the computational load to some extent.
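The motion energy of Eq. (10) and the refinement steps (1) to (3) can be sketched in Python as follows. This is a rough illustration, not the authors' implementation. Block displacements are assumed to be given as 2-D lists, and the reconstruction error is supplied by the caller through the hypothetical callback `recon_error(frame_index, key_indices)`, which stands in for IMCI-based reconstruction. As a further assumption, the even/odd pruning of step (3) is repeated if one pass cannot remove enough frames.

```python
def motion_energy(dx, dy):
    """Eq. (10): sum of squared block displacements between frames n-1 and n."""
    return sum(x * x + y * y
               for row_x, row_y in zip(dx, dy)
               for x, y in zip(row_x, row_y))


def refine_key_frames(n_frames, inflexions, k_target, recon_error):
    """Grow or shrink the inflexion set until it holds k_target key frames."""
    if k_target < 1:
        raise ValueError("k_target must be at least 1")
    keys = sorted(set(inflexions))  # step (1): start from the inflexions
    if len(keys) < k_target:
        # Step (2): add the non-key frames with the largest reconstruction errors.
        others = [n for n in range(n_frames) if n not in keys]
        others.sort(key=lambda n: recon_error(n, keys), reverse=True)
        keys = sorted(keys + others[:k_target - len(keys)])
    while len(keys) > k_target:
        # Step (3): rebuild the even-positioned key frames (2nd, 4th, ...) from
        # the odd-positioned ones and drop those with the smallest errors.
        odd, even = keys[0::2], keys[1::2]
        even.sort(key=lambda n: recon_error(n, odd))
        drop = set(even[:len(keys) - k_target])
        keys = [n for n in keys if n not in drop]
    return keys


# Toy 1-D "shot" and a toy error: distance to the closest key frame value.
signal = [0, 1, 2, 10, 2, 1, 0]


def toy_error(n, key_indices):
    return min(abs(signal[n] - signal[k]) for k in key_indices)


print(motion_energy([[1, 0], [0, 2]], [[0, 1], [0, 0]]))
print(refine_key_frames(7, [0, 6], 3, toy_error))
print(refine_key_frames(7, [0, 2, 3, 4, 6], 3, toy_error))
```

Passing the error measure as a callback keeps the sketch independent of any particular interpolation method, so a real interpolator could be substituted without changing the selection logic.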

4. Simulation results

In this section, we carry out experiments to show the effectiveness of the SRD-oriented key frame selection algorithm as well as the relationship between SRD and fidelity. For this purpose, we used an MPEG-7 test sequence, Basketball, with 18035 frames and 88 shots. The motion dynamics of this sequence are very large, which helps make the differences among the key frame selection algorithms observable. Windows XP running on a 1.6 GHz Pentium IV PC was used as the simulation platform.

In the first experiment, we evaluated the performance of different key frame selection algorithms in the sense of SRD. Besides the proposed inflexion-based algorithm (denoted as "I"), six typical reference algorithms were tested: the shot boundary based algorithm (Zhang et al., 1995) (denoted as "S"), the motion metric based algorithm (Wolf, 1996) (denoted as "M"), the in-shot clustering algorithm (Zhuang et al., 1998) (denoted as "C"), the rate-driven algorithm (Lee and Kim, 2002) (denoted as "R"), the leaky bucket like algorithm (Zhang et al., 2003) (denoted as "L") and the motion entropy maximization algorithm (Liu and Zhang, 2000) (denoted as "E"). Given different target key frame numbers, we calculated the SRD of each selected key frame set, which forms the SRD curves shown in Fig. 3. The experimental results show that the proposed algorithm achieves the highest SRD among the tested algorithms. For example, when the key frame number is 8 or 16, the SRD of the new algorithm is higher than that of the leaky bucket like algorithm, the motion entropy maximization algorithm and the rate-driven algorithm by about 1 dB, and higher than that of the other reference algorithms by about 3 dB. From another viewpoint, when the SRD is fixed at 15 dB, the proposed algorithm needs only four key frames, while the leaky bucket like algorithm, the motion entropy maximization algorithm and the rate-driven algorithm need eight key frames, and for the other reference algorithms even 16 key frames are not enough.

We also found that there is still a 1 dB gap between the proposed algorithm and the limit calculated by exhaustive enumeration (labeled as "Limit" in Fig. 3). The reason is that the inflexions of the motion energy curve are not the same as the inflexions of the video sequence.

Fig. 3. SRD performances of different key frame selection algorithms.

In the second experiment, we examined the fidelity of each key frame selection algorithm mentioned above. The results are shown in Fig. 4. Comparing them with Fig. 3, we come to the following conclusions. Algorithms with high fidelity, for example the in-shot clustering algorithm, may not have high SRD. In contrast, if an algorithm is optimized for SRD, its fidelity will also be very good, as with the proposed algorithm. Even for an algorithm not specifically optimized for SRD, if the SRD performance is good, the fidelity is also good, for example, the rate-driven algorithm. These conclusions are in accordance with the discussion in Section 2.

Fig. 4. Fidelity performances of different key frame selection algorithms.

5. Conclusion

In this paper, the authors proposed a novel criterion, called shot reconstruction degree, for key frame selection. This criterion examines the representativeness of a key frame set from the viewpoint of its capability to reconstruct the original video shot through frame interpolation. Discussions show that this new criterion is better and stricter than the widely used fidelity criterion in the sense of capturing the detailed dynamics of a video shot. Oriented to this new criterion, a novel inflexion-based key frame selection algorithm is developed. Simulation results show that the key frame set produced by the new algorithm has very good performance in terms of both fidelity and shot reconstruction degree. In future work, more human perception will be introduced into the evaluation of video summarization techniques.

Acknowledgements

The work of Jian Feng was supported by the City University of Hong Kong under Grant A/C 7100250.

References

Aigrain, P., Zhang, H.J., Petkovic, D., 1996. Content-based representation and retrieval of visual media: A state-of-the-art review. Multimed. Tools Appl. 3 (3), 179–202.
Chang, H.S., Sull, S., Lee, S.U., 1999. Efficient video indexing scheme for content-based retrieval. IEEE Trans. Circ. Syst. Vid. Technol. 9 (8), 1269–1279.
Latecki, L.J., DeMenthon, D., Rosenfeld, A., 2001. Extraction of key frames from videos by polygon simplification. Proc. 6th Internat. Symp. on Signal Processing and its Applications 2, 643–646.
Lee, H.C., Kim, S.D., 2002. Rate-driven key frame selection using temporal variation of visual content. Electron. Lett. 38 (5), 217–218.
Liu, T.Y., Zhang, X.D., 2000. Inertia-based cut detection technique: a step to the integration of video coding and content-based retrieval. In: Proc. 2000 Internat. Conf. on Signal Processing (ICSP'2000), vol. 2. pp. 1018–1025.
Liu, T.Y., Lo, K.T., Zhang, X.D., Feng, J., 2003. Frame interpolation scheme using inertia motion prediction. Signal Process.: Image Comm. 18 (3), 221–229.
Wolf, W., 1996. Key frame selection by motion analysis. In: Proc. 1996 IEEE Internat. Conf. on Acoustic, Speech and Signal Processing. pp. 1228–1231.
Yahiaoui, I., Merialdo, B., Huet, B., 2002. Image similarity for automatic video summarization. In: Proc. EUSIPCO'2002.
Yahiaoui, I., Merialdo, B., Huet, B., 2003. Comparison of multi-episode video summarization algorithms. EURASIP J. Appl. Signal Proces. 1, 48–55.
Zhang, H., Low, C.Y., Smoliar, S.W., 1995. Video parsing and browsing using compressed data. Multimed. Tools Appl. 1, 89–111.
Zhang, X.D., Liu, T.Y., Lo, K.T., Feng, J., 2003. Dynamic selection and effective compression of key frames for video abstraction. Pattern Recognition Lett. 24 (9–10), 1533–1542.
Zhuang, Y., Rui, Y., Huang, T.S., Mehrotra, S., 1998. Adaptive key frame selection using unsupervised clustering. In: Proc. 1998 IEEE Internat. Conf. on Image Processing, vol. 1. pp. 866–870.