REAL-TIME VIDEO SUMMARIZATION ON MOBILE

Smit Marvaniya, Mogilipaka Damoder, Viswanath Gopalakrishnan, Kiran Nanjunda Iyer, Kapil Soni

Samsung R&D Institute India, Bangalore

ABSTRACT

This paper proposes a novel approach for real-time video summarization on mobile using Dictionary Learning, Global Camera Motion analysis and Colorfulness. A dictionary is represented as a distinct set of events described by spatio-temporal features. A uniqueness measure is predicted from the correlation scores of the dictionary elements, whereas a quality measure is estimated using Global Camera Motion analysis and Colorfulness. Our proposed technique combines the uniqueness measure and the quality measure to predict interestingness. Experiments indicate that our method outperforms state-of-the-art methods in terms of computation speed while retaining similar subjective quality.

Index Terms— Video Summarization, Dictionary Learning, Global Camera Motion, Colorfulness, Interestingness

1. INTRODUCTION

Video data has become ubiquitous owing to the widespread use of mobile camera phones, surveillance cameras and other capture devices. Considering the large volumes of visual media, it is desirable to summarize video content for further analysis. Creating a good video summary on mobile in real time is challenging because of the computational overhead involved in maintaining subjective quality. Existing methods for video summarization can be classified into two groups: supervised and unsupervised learning methods. Supervised methods [1, 2, 3, 4, 5] use prior information such as category-specific videos [6], pre-trained objects of interest [7], or web images [1, 2] to summarize the video. In [4], a probabilistic model is proposed to capture sequential and diverse subset selection. Unsupervised learning methods [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19] have also been applied to the problem of video summarization. There have been numerous research efforts [20, 21, 22, 23, 24] to extract salient key frames using saliency measures such as motion analysis [20, 22], shot boundary detection [25], color histograms [21], interestingness [18], non-redundancy [24, 12], automatic storyboarding [23] or objects of interest [7] in an unsupervised setting. Ejaz et al. [18] combine static and dynamic visual saliencies using a non-linear fusion method to

predict interestingness. In [26], a video summary is generated by modeling motion attention and human perception using a temporal graph representation. Lu and Grauman [27] proposed a story-driven video summarization technique that selects a set of video subshots by modeling story, importance and diversity factors between subshots. Recently, Gygli et al. [19] developed a method based on super-frame segmentation using motion analysis, predicting interestingness by combining low-level and high-level features from each super-frame. This method is computationally very expensive because of the simultaneous optimization of all super-frame boundaries and the computation of high-level features. The method proposed by Zhao and Xing [28] for online dictionary learning uses sparse coding; it is fast but still does not achieve real-time processing when generating the video summary. Furthermore, this method does not consider the quality of the dictionary while updating it. In this work, we overcome these limitations [28] by devising a fast dictionary learning and update algorithm combined with global camera motion analysis and colorfulness for real-time video summarization.

In this work, we propose a real-time video summarization method for mobile devices. First, we segment the video into fixed-length super-frames to enable real-time processing. Spatio-temporal features such as color histograms are extracted for each super-frame. A dictionary element is represented as a set of low-level features. We propose a very fast online dictionary learning/update method that predicts interestingness based on distinctiveness. At the same time, the perceptual quality of a super-frame is also considered by combining global camera motion analysis and colorfulness with dictionary learning. Figure 1 shows the overall diagram of our proposed method. The main contributions of this paper are: 1) a method to predict interestingness using global camera motion analysis and colorfulness; 2) a fast way of predicting interestingness using a Dictionary Learning approach.

The rest of the paper is organized as follows: Section 2 presents the method to predict interestingness per super-frame by combining global camera motion analysis, dictionary learning and colorfulness. Section 3 illustrates the experimental results, and Section 4 concludes the paper.

Fig. 1. The overall diagram of the proposed Real-time Video Summarization method.

2. VIDEO SUMMARY BASED ON USER-SPECIFIED LENGTH

The problem of video summarization is defined in terms of an energy function that selects a summary containing the super-frames with the top interestingness scores, such that the length of the summary is less than or equal to the user-specified length L_{sum}. Such an energy function corresponds to a 0/1-knapsack problem:

\max_{x} \sum_{i=1}^{n} x_i \cdot I(S_i) \quad \text{s.t.} \quad \sum_{i=1}^{n} x_i \cdot |L_{S_i}| \le L_{sum} \quad (1)

where x_i \in \{0, 1\} and x_i = 1 indicates that a super-frame is included in the summary. n is the number of super-frames in an input video, L_{S_i} is the length of super-frame S_i, and I(S_i) is the interestingness score assigned to super-frame S_i.
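As a concrete illustration, the selection in Eq. (1) can be solved with a standard 0/1-knapsack dynamic program over integer super-frame lengths. The sketch below is a minimal illustration under that assumption; the function name, the frame-count budget and the example values are hypothetical and not taken from the paper.

```python
# Hypothetical sketch: exact 0/1-knapsack selection of super-frames (Eq. 1).
# Inputs: scores I(S_i), integer lengths |L_{S_i}| in frames, budget L_sum in frames.
def select_superframes(scores, lengths, budget):
    n = len(scores)
    dp = [0.0] * (budget + 1)                     # dp[l]: best total score within length l
    keep = [[False] * (budget + 1) for _ in range(n)]
    for i in range(n):
        for l in range(budget, lengths[i] - 1, -1):   # backwards for 0/1 semantics
            cand = dp[l - lengths[i]] + scores[i]
            if cand > dp[l]:
                dp[l] = cand
                keep[i][l] = True
    # Backtrack to recover the x_i of Eq. (1)
    selected, l = [], budget
    for i in range(n - 1, -1, -1):
        if keep[i][l]:
            selected.append(i)
            l -= lengths[i]
    return sorted(selected)

# Example usage with made-up values
if __name__ == "__main__":
    print(select_superframes([0.9, 0.2, 0.7], [30, 30, 60], 90))  # -> [0, 2]
```

In practice a simple greedy pass over the scores would also respect the budget, trading optimality for speed.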

The procedure for predicting the interestingness score per super-frame is explained next.

2.1. INTERESTINGNESS PREDICTION PER SUPER-FRAME

The interestingness of a super-frame indicates the importance of the event it contains and can be computed using features such as attention, aesthetics, faces and other cues. In this work, we combine Global Camera Motion, Colorfulness and Dictionary Learning to predict the interestingness of each super-frame. The interestingness score of a super-frame S_i is calculated as:

I(S_i) = w_{gcm} \cdot S_{gcm}(S_i) + w_{dl} \cdot S_{dl}(S_i) + w_{clr} \cdot S_{clr}(S_i) \quad (2)

where S_gcm(S_i) is the interestingness score computed using Global Camera Motion estimation, S_dl(S_i) is the score computed using the dictionary learning approach, and S_clr(S_i) is the score computed using the Colorfulness measure for the i-th super-frame. The scores predicted using Global Camera Motion, Dictionary Learning and Colorfulness are normalized for each super-frame. The values of w_gcm, w_dl and w_clr used in our experiments are 0.15, 0.4 and 0.45, respectively.
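The weighted combination of Eq. (2) is straightforward; the sketch below assumes min-max normalization of each score across super-frames, since the paper states that the scores are normalized but not how. All names and example values are illustrative.

```python
import numpy as np

# Hypothetical sketch of Eq. (2); min-max normalization is an assumption.
W_GCM, W_DL, W_CLR = 0.15, 0.4, 0.45   # weights reported in the paper

def minmax(v):
    v = np.asarray(v, dtype=np.float64)
    rng = v.max() - v.min()
    return np.zeros_like(v) if rng == 0 else (v - v.min()) / rng

def interestingness(s_gcm, s_dl, s_clr):
    """Combine per-super-frame scores into I(S_i) for all super-frames."""
    return W_GCM * minmax(s_gcm) + W_DL * minmax(s_dl) + W_CLR * minmax(s_clr)

# Example with made-up scores for three super-frames
print(interestingness([0.2, 0.9, 0.5], [0.1, 0.4, 0.8], [0.7, 0.3, 0.6]))
```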

2.1.1. INTERESTINGNESS PREDICTION USING GLOBAL CAMERA MOTION ESTIMATION

Global Camera Motion analysis gives an indication of the quality of a super-frame. In this work, we estimate large motion such as camera shake and fast camera movement. We estimate the homography matrices H_1 and H_2 by extracting motion vectors between the first and second frames and between the k-th (middle) and (k+1)-th frames of the super-frame using optical flow [29], as shown in Figure 2.

Fig. 2. Homography computation within a super-frame.

The Global Camera Motion score consists of three terms: an Orientation Consistency score (S_a), a Scale Consistency score (S_s) and a Translation Consistency score (S_t). It is calculated for super-frame S_i as:

S_{gcm}(S_i) = S_a(S_i) \cdot S_s(S_i) \cdot S_t(S_i) \quad (3)
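The homography estimation step of Figure 2 could be realized with sparse optical flow as in the hedged sketch below; the use of Shi-Tomasi corners, pyramidal Lucas-Kanade and RANSAC, and all parameter values, are assumptions rather than details given in the paper.

```python
import cv2
import numpy as np

# Hypothetical sketch: estimate a homography between two frames from sparse optical
# flow, as a stand-in for the H1/H2 computation of Figure 2.
def homography_between(frame_a, frame_b):
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
    pts_a = cv2.goodFeaturesToTrack(gray_a, maxCorners=300, qualityLevel=0.01, minDistance=8)
    if pts_a is None or len(pts_a) < 8:
        return None
    pts_b, status, _ = cv2.calcOpticalFlowPyrLK(gray_a, gray_b, pts_a, None)
    good = status.ravel() == 1
    if good.sum() < 8:
        return None
    H, _ = cv2.findHomography(pts_a[good], pts_b[good], cv2.RANSAC, 3.0)
    return H

# H1 relates the first two frames of the super-frame, H2 the middle pair, e.g.:
# H1 = homography_between(frames[0], frames[1])
# H2 = homography_between(frames[k - 1], frames[k])
```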

The Orientation Consistency score estimates the camera orientation change within a super-frame. It is computed from the camera orientations \theta_1^i and \theta_2^i of a super-frame S_i as:

S_a(S_i) = e^{-w_a \cdot \Delta\theta^i} \quad (4)

where \Delta\theta^i = \theta_1^i - \theta_2^i is the relative change in orientation within super-frame S_i. Such an exponential cost function penalizes relative orientation change rather than absolute orientation change within a super-frame. The value of w_a used in our experiments is 15. Similarly, the Scale Consistency score for a super-frame is calculated as:

S_s(S_i) = 0.5 \cdot \left(e^{-w_s \cdot \Delta\rho_1^i} + e^{-w_s \cdot \Delta\rho_2^i}\right) \quad (5)

where \Delta\rho_k^i = \max(1 - \rho_k^i, \rho_k^i - 1), and \Delta\rho_1^i and \Delta\rho_2^i are the differences in scale change for super-frame S_i. This cost function is useful for dealing with zoom-in and zoom-out events in the video. The value of w_s used in our experiments is 10. We also add a Translation Consistency score for each super-frame to penalize camera shake, defined as:

S_t(S_i) = e^{-w_t \cdot \Delta\tau^i} \quad (6)

where

\Delta\tau^i = \frac{\sqrt{(\Delta x_1^i - \Delta x_2^i)^2 + (\Delta y_1^i - \Delta y_2^i)^2}}{d} \quad (7)

and d is the diagonal length of the frame. \Delta\tau^i is the relative translation change rather than the absolute translation change within a super-frame. The value of w_t used in our experiments is 10. The Global Camera Motion score provides information about the quality of a super-frame.
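Before turning to the dictionary-learning score, the following sketch puts Eqs. (3) to (7) together. How the orientation, scale and translation parameters are extracted from H_1 and H_2 is not specified in the paper, so a similarity-transform approximation of each homography is assumed here.

```python
import numpy as np

W_A, W_S, W_T = 15.0, 10.0, 10.0   # weights reported in the paper

# Hypothetical sketch of Eqs. (3)-(7), assuming a similarity-style decomposition
# of the upper-left 2x2 block of each homography.
def decompose(H):
    theta = np.arctan2(H[1, 0], H[0, 0])     # rotation angle (rad)
    rho = np.hypot(H[0, 0], H[1, 0])         # isotropic scale estimate
    dx, dy = H[0, 2], H[1, 2]                # translation components
    return theta, rho, dx, dy

def gcm_score(H1, H2, frame_w, frame_h):
    t1, r1, dx1, dy1 = decompose(H1)
    t2, r2, dx2, dy2 = decompose(H2)
    s_a = np.exp(-W_A * abs(t1 - t2))                           # Eq. (4)
    d_r1, d_r2 = max(1 - r1, r1 - 1), max(1 - r2, r2 - 1)
    s_s = 0.5 * (np.exp(-W_S * d_r1) + np.exp(-W_S * d_r2))     # Eq. (5)
    d = np.hypot(frame_w, frame_h)                              # frame diagonal
    d_tau = np.hypot(dx1 - dx2, dy1 - dy2) / d                  # Eq. (7)
    s_t = np.exp(-W_T * d_tau)                                  # Eq. (6)
    return s_a * s_s * s_t                                      # Eq. (3)
```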

Next, we introduce a Dictionary Learning score that helps in choosing super-frames that are relatively distinct.

2.1.2. INTERESTINGNESS PREDICTION USING DICTIONARY LEARNING

We predict the interestingness score per super-frame using a Dictionary Learning approach to add a diversity factor. A super-frame S_i is represented by m key frames. Spatio-temporal features in the form of color histograms are extracted from each key frame and are denoted {h_1^i, h_2^i, ..., h_m^i}. To capture distinct events, we compute a distance measure between super-frames. Given two super-frames S_i and S_j, the super-frame distance is defined as:

SfDist(S_i, S_j) = \min_{h_p^i \in S_i,\, h_q^j \in S_j} BC(h_p^i, h_q^j) \quad (8)

where BC(h_p^i, h_q^j) measures the distance between two histograms h_p^i and h_q^j of super-frames S_i and S_j using the Bhattacharyya distance. This distance measure is more robust to camera movements such as translation, rotation and scale changes, since it defines a matching distance in super-frame space rather than image space.

Fig. 3. Interestingness prediction per super-frame using the Dictionary-based approach.

We predict the interestingness of each super-frame using the Dictionary Learning approach in order to retain past events. A dictionary D is represented as a set of distinct events in the space of histograms. The initial dictionary D is built from the first k super-frames. The interestingness score under the dictionary-based representation is computed by comparing the new, (k+1)-th super-frame against all super-frames present in the existing dictionary D. An example of interestingness prediction for the (k+1)-th super-frame and the corresponding dictionary update is shown in Figure 3. The interestingness score for super-frame S_{k+1} is defined as:

S_{dl}(S_{k+1}) = \min_{S_j \in D} SfDist(S_{k+1}, S_j) \quad (9)

Whether a dictionary update occurs is decided based on the correlation score of the new super-frame S_{k+1} as well as the correlation scores of the elements of dictionary D. The correlation score of super-frame S_{k+1} is defined as:

Correlation(S_{k+1}) = \sum_{S_j \in D,\, S_j \neq S_{k+1}} SfDist(S_{k+1}, S_j) \quad (10)

The correlation score of super-frame S_{k+1} conveys its dissimilarity with the existing dictionary D. When the correlation score of super-frame S_{k+1} is less than that of the most correlated super-frame in dictionary D, the most correlated super-frame is removed and the current super-frame S_{k+1} is added to the dictionary. After every dictionary update, the correlation scores of dictionary D are re-computed. Let n and k be the number of super-frames and the size of the dictionary, respectively. The time complexity of predicting interestingness using dictionary learning is then O(n × k), and the memory required to store the dictionary is O(k).
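A minimal sketch of Eqs. (8) to (10) and the update rule is given below. It assumes L1-normalized key-frame histograms, a Bhattacharyya distance of the form sqrt(1 - sum(sqrt(p*q))), and reads the "most correlated" dictionary element as the one with the lowest correlation score; these are interpretations, not details confirmed by the paper.

```python
import numpy as np

# Hypothetical sketch of Eqs. (8)-(10) and the dictionary update rule.
# A super-frame is represented as a list of L1-normalized histograms (numpy arrays).
def bhattacharyya(p, q):
    return np.sqrt(max(0.0, 1.0 - np.sum(np.sqrt(p * q))))

def sf_dist(sf_a, sf_b):                              # Eq. (8)
    return min(bhattacharyya(h_p, h_q) for h_p in sf_a for h_q in sf_b)

def dl_score(new_sf, dictionary):                     # Eq. (9)
    return min(sf_dist(new_sf, d) for d in dictionary)

def correlation(sf, others):                          # Eq. (10)
    return sum(sf_dist(sf, d) for d in others)

def maybe_update(dictionary, new_sf):
    # Paper's rule: if the correlation score of the new super-frame is lower than that
    # of the most correlated dictionary element, that element is replaced by the new one.
    corr_new = correlation(new_sf, dictionary)
    # Correlation scores of the dictionary are recomputed on each call, mirroring the
    # re-computation after every update described in the text.
    corrs = [correlation(d, [e for e in dictionary if e is not d]) for d in dictionary]
    idx = int(np.argmin(corrs))     # "most correlated" read as lowest correlation score
    if corr_new < corrs[idx]:
        dictionary[idx] = new_sf    # dictionary size, and hence O(k) storage, is preserved
    return dictionary
```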

2.1.3. INTERESTINGNESS PREDICTION USING COLORFULNESS

In order to make the video summary visually appealing, we add a colorfulness measure based on [30]. The super-frame S_i is represented by k key frames. Each key frame is described by a 64-bin color histogram h_c^i. A hypothetical average color histogram h_{avg} is precomputed. The Colorfulness measure of a super-frame S_i is defined as:

S_{clr}(S_i) = \max_{h_c^i \in S_i} \left(1 - EMD(h_c^i, h_{avg})\right) \quad (11)

where EMD(h_c^i, h_{avg}) measures the distance between h_c^i and h_{avg} using the Earth Mover's Distance.

3. EXPERIMENTAL RESULTS

We used the SumMe dataset [19] to conduct various experiments demonstrating the advantage of the proposed method over the state-of-the-art method described in [28]. Because the dataset used in [28] is not available, we used full-HD videos from the SumMe dataset and compared the average processing time in seconds. Table 1 compares our approach with [28] for the task of video summarization using Dictionary Learning. The proposed method significantly outperforms [28], especially in predicting the interestingness and learning/updating the dictionary. The I_ratio of [28] is 1.06, whereas the I_ratio of the proposed method is 0.0386, which shows that our method is well suited for real-time video summarization. The processing time of the proposed method was measured on a SAMSUNG Galaxy Note 4 (2.7 GHz Quad Core, 3 GB RAM), whereas the processing speed of [28] was measured on a PC.

Table 1. Comparison of I_ratio = T_total/T_video and total processing time. T_video is the average length of the original input videos. T_1 is the time taken to generate the feature descriptors, and T_2 is the combined time taken to predict the interestingness and learn/update the dictionary. T_1/T_video and T_2/T_video indicate the distribution of processing time between T_1 and T_2 with respect to the input video. T_total is the total processing time to generate the summary.

Ref    T_video    T_1 (T_1/T_video)    T_2 (T_2/T_video)    T_total    I_ratio
[28]   2276s      1879s (0.83)         529s (0.23)          2408s      1.06
Ours   544s       4s (0.008)           17s (0.0313)         21s        0.0386

Additionally, we quantitatively evaluate our method on the SumMe dataset. Table 2 compares the mean F-measures of our method with those of Ejaz et al. [18], Gygli et al. [19], Uniform Sampling and Clustering at 15% summary length. As can be seen, our method outperforms [18].

Table 2. Comparative mean F-measures at 15% summary length on the SumMe dataset [19].

Video Name                Uniform   Cluster   Ejaz et al. [18]   Gygli et al. [19]   Ours
Base Jumping              0.168     0.109     0.194              0.121               0.126
Bike Polo                 0.058     0.130     0.076              0.356               0.166
Scuba                     0.162     0.135     0.200              0.184               0.200
Valparaiso Downhill       0.154     0.154     0.231              0.242               0.251
Bearpark climbing         0.152     0.158     0.227              0.118               0.259
Bus in Rock Tunnel        0.124     0.102     0.112              0.135               0.184
Car rail crossing         0.146     0.146     0.064              0.362               0.210
Cockpit Landing           0.129     0.156     0.116              0.172               0.171
Cooking                   0.171     0.139     0.118              0.321               0.374
Eiffel Tower              0.166     0.179     0.136              0.295               0.231
Excavators river cross    0.131     0.163     0.041              0.189               0.109
Jumps                     0.052     0.298     0.243              0.427               0.264
Kids playing in leaves    0.209     0.165     0.084              0.089               0.123
Playing on water slide    0.186     0.141     0.124              0.200               0.135
Saving dolphines          0.165     0.214     0.154              0.145               0.160
St Maarten Landing        0.092     0.096     0.419              0.313               0.375
Statue of Liberty         0.143     0.125     0.083              0.192               0.167
Uncut Evening Flight      0.122     0.098     0.299              0.271               0.160
Paluma Jump               0.132     0.072     0.028              0.181               0.268
Playing Ball              0.179     0.176     0.140              0.174               0.172
Notre Dame                0.124     0.141     0.138              0.235               0.142
Air Force One             0.161     0.143     0.215              0.318               0.344
Fire Domino               0.233     0.349     0.252              0.130               0.313
Car over Camera           0.099     0.296     0.201              0.372               0.280
Paintball                 0.109     0.198     0.281              0.320               0.296
Mean                      0.143     0.163     0.167              0.234               0.219

The proposed method achieves an average mean F-measure of 0.219, which is 70.6% of the average human mean F-measure (0.311) on the SumMe dataset [19]. The method proposed in [19] is computationally very expensive because it computes low-level features such as colorfulness and contrast as well as high-level features such as motion classification and person and landmark detection, whereas our method is computationally very efficient.

4. CONCLUSIONS

We have introduced a novel method to predict interestingness for the task of real-time video summarization. The method combines quality and uniqueness measures obtained from Global Camera Motion, Dictionary Learning and Colorfulness. Efficient dictionary updates are achieved by measuring the correlation scores of the dictionary with a novel super-frame distance measure. Experiments indicate that our method outperforms existing methods in terms of processing speed while retaining similar subjective quality. CNN-based feature extraction for dictionary learning is a possible direction for future work.

5. REFERENCES

[1] A. Khosla, R. Hamid, C.-J. Lin, and N. Sundaresan, “Large-scale video summarization using web-image priors,” in CVPR, pp. 2698–2705, 2013.
[2] G. Kim, L. Sigal, and E. P. Xing, “Joint summarization of large-scale collections of web images and videos for storyline reconstruction,” in CVPR, 2014.
[3] M. Sun, A. Farhadi, and S. Seitz, “Ranking domain-specific highlights by analyzing edited videos,” in ECCV, pp. 787–802, 2014.
[4] B. Gong, W.-L. Chao, K. Grauman, and F. Sha, “Diverse sequential subset selection for supervised video summarization,” in NIPS, pp. 2069–2077, 2014.
[5] M. Gygli, H. Grabner, and L. Van Gool, “Video summarization by learning submodular mixtures of objectives,” in CVPR, 2015.
[6] D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid, “Category-specific video summarization,” in ECCV, 2014.
[7] J. Ghosh, Y. J. Lee, and K. Grauman, “Discovering important people and objects for egocentric video summarization,” in CVPR, 2012.
[8] T. Yao, T. Mei, and Y. Rui, “Highlight detection with pairwise deep ranking for first-person video summarization,” in CVPR, 2016.
[9] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes, “TVSum: Summarizing web videos using titles,” in CVPR, pp. 5179–5187, 2015.
[10] W.-S. Chu, Y. Song, and A. Jaimes, “Video co-summarization: Video summarization by visual co-occurrence,” in CVPR, 2015.
[11] S. Lu, Z. Wang, T. Mei, G. Guan, and D. D. Feng, “A bag-of-importance model with locality-constrained coding based feature learning for video summarization,” IEEE Transactions on Multimedia, 2014.
[12] Y. Cong, J. Yuan, and J. Luo, “Towards scalable summarization of consumer videos via sparse dictionary selection,” IEEE Transactions on Multimedia, 2012.
[13] S. Avila, A. Lopes, A. Luz, and A. Araujo, “VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method,” Pattern Recognition Letters, 2011.
[14] K. Li, S. Oh, A. A. Perera, and Y. Fu, “A videography analysis framework for video retrieval and summarization,” in BMVC, 2012.
[15] Y. Liu, F. Zhou, W. Liu, F. De la Torre, and Y. Liu, “Unsupervised summarization of rushes videos,” in ACM MM, 2010.
[16] B. T. Truong and S. Venkatesh, “Video abstraction: A systematic review and classification,” ACM Transactions on Multimedia Computing, Communications, and Applications, 2007.
[17] M. Wang, R. Hong, G. Li, Z.-J. Zha, S. Yan, and T.-S. Chua, “Event driven web video summarization by tag localization and key-shot identification,” IEEE Transactions on Multimedia, pp. 975–985, 2012.
[18] N. Ejaz, I. Mehmood, and S. W. Baik, “Efficient visual attention based framework for extracting key frames from videos,” Signal Processing: Image Communication, 2013.
[19] M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool, “Creating summaries from user videos,” in ECCV, pp. 505–520, 2014.
[20] W. Wolf, “Key frame selection by motion analysis,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 1228–1231, 1996.
[21] H. J. Zhang, J. Wu, D. Zhong, and S. W. Smoliar, “An integrated system for content-based video retrieval and browsing,” Pattern Recognition, pp. 643–658, 1997.
[22] T. Liu, H.-J. Zhang, and F. Qi, “A novel video key-frame-extraction algorithm based on perceived motion energy model,” IEEE Transactions on CSVT, 2003.
[23] D. B. Goldman, B. Curless, D. Salesin, and S. M. Seitz, “Schematic storyboarding for video visualization and editing,” ACM Transactions on Graphics, 2006.
[24] D. Liu, G. Hua, and T. Chen, “A hierarchical visual model for video object summarization,” IEEE TPAMI, 2010.
[25] F. Dufaux, “Key frame selection to represent a video,” in ICIP, pp. 275–278, 2000.
[26] C. Ngo, Y. Ma, and H. Zhang, “Automatic video summarization by graph modeling,” in ICCV, 2003.
[27] Z. Lu and K. Grauman, “Story-driven summarization for egocentric video,” in CVPR, pp. 2714–2721, 2013.
[28] B. Zhao and E. P. Xing, “Quasi real-time summarization for consumer videos,” in CVPR, pp. 2513–2520, 2014.
[29] J. Shi and C. Tomasi, “Good features to track,” in CVPR, pp. 593–600, 1994.
[30] R. Datta, D. Joshi, J. Li, and J. Z. Wang, “Studying aesthetics in photographic images using a computational approach,” in ECCV, 2006.
