Multimed Tools Appl
DOI 10.1007/s11042-014-2126-8

An iteratively reweighting algorithm for dynamic video summarization

Pei Dong, Yong Xia, Shanshan Wang, Li Zhuo, David Dagan Feng

Received: 3 January 2014 / Revised: 7 May 2014 / Accepted: 26 May 2014
© Springer Science+Business Media New York 2014
Abstract Information explosion has imposed unprecedented challenges on the conventional ways of video data consumption. Hence, providing condensed and meaningful video summaries to viewers has been recognized as a beneficial and attractive research topic in the multimedia community in recent years. Analyzing both the visual and textual modalities proves essential for an automatic video summarizer to pick up important content from a video. However, most established studies in this direction either use heuristic rules or rely on simple forms of text analysis. This paper proposes an iteratively reweighting dynamic video summarization (IRDVS) algorithm based on the joint and adaptive use of the visual modality and the accompanying subtitles. The proposed algorithm takes advantage of our SEmantic inDicator of videO seGment (SEDOG) feature for exploring the most representative concepts for describing the video. Meanwhile, the iteratively reweighting scheme effectively updates the dynamic surrogate of the original video by combining the high-level features in an adaptive manner. The proposed algorithm has been compared to four state-of-the-art video summarization approaches, namely the speech transcript-based (STVS) algorithm, attention model-based (AMVS) algorithm, sparse dictionary selection-based (DSVS) algorithm and heterogeneity image patch index-based (HIPVS) algorithm, on different video genres, including documentary, movie and TV news. Our results show that the proposed IRDVS algorithm can produce summarized videos with better quality.

P. Dong · Y. Xia · S. Wang · D. D. Feng
Biomedical and Multimedia Information Technology (BMIT) Research Group, School of Information Technologies, The University of Sydney, Sydney NSW 2006, Australia

P. Dong (*) · L. Zhuo
Signal and Information Processing Laboratory, Beijing University of Technology, Beijing 100124, China
e-mail: [email protected]

Y. Xia (*)
Shaanxi Key Lab of Speech & Image Information Processing, School of Computer Science, Northwestern Polytechnical University, Xi’an 710072, China
e-mail: [email protected]

S. Wang
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China

S. Wang
School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
Keywords Video summarization . Semantic indicator of video segment (SEDOG) . Iterative weight estimation . Multimodal features . Saliency ranking
1 Introduction

With the advent of the big data era, humans have been increasingly exposed to an exponentially growing amount of video data. Take YouTube for example: about 100 hours of video are uploaded every minute [1]. As a result, traditional manual browsing often turns out to be inefficient and unhelpful when we need to locate relevant and essential video content. Fast and effective approaches are therefore in great demand to help users digest and manipulate huge video archives.

Video summarization aims to produce a concise surrogate of a full-length video and thus offers video users a less time-consuming way to grasp the video’s essence. The video surrogate can be either dynamic or static. A dynamic video summary, also known as a video skim, is a playable yet shorter video clip that consists of a number of segments extracted from the original video. A static video summary is often termed an image storyboard, which consists of a series of selected video key frames.

Recently, video summarization has drawn increasing research attention in the multimedia community, with a number of algorithms having been proposed in the literature [44,50,61]. A specialized group of methods focuses on raw (i.e., unedited) rushes videos [48,49]. They aim to identify the large proportion of junk and repetitive material among the retakes using algorithms such as clustering [56] and maximal matching on bipartite graphs [6]. Only the most representative video clips are retained to compose a summary much shorter than the original video [64].

Most other video summarization approaches are proposed for edited, content-intensive videos. Many researchers employed visual (e.g. color, shape, texture and edge) or audio descriptors to identify the essential video content [4,5,14,54]. Pritch et al. [54] condensed the activities in a video into a synopsis by using color information and either pixel-based or object-based graph optimization. Almeida et al. [4,5] adopted the zero-mean normalized cross correlation metric to measure frame similarity on the color histograms of the compressed-domain DC coefficients. De Avila et al. [14] proposed the VSUMM algorithm for static video summarization; they adopted the k-means algorithm to cluster pre-sampled video frames according to the extracted color histogram features and used the information from shot detection to estimate the number of clusters. Matos and Pereira [43] addressed the creation of MPEG-7 compliant summary descriptions via a low-level arousal model, which works primarily for high-action video content such as sports and action movies. A heterogeneity image patch index [13] was proposed to measure the entropy of a video frame based on pixel information only; both static and dynamic video summarization schemes were introduced using this index and were evaluated on consumer videos. These low-level feature based algorithms, however, are often inadequate for video summarization, since they do not adequately consider the underlying semantics that lead to a high-level understanding of the video.

1.1 Related work

The mismatch between low-level features and high-level semantics, known as the semantic gap, prevents such video summarizers from fulfilling their mission effectively. Therefore, many researchers have explored semantics for video summarization [9,10,12,15,16,18,20,21,26,41,45,59,60]. Domain-specific rules were designed to detect semantic events in soccer videos [18,60] and
broadcast baseball videos [26]. In contrast to such methods, more generic approaches were developed to extract semantics for less domain-specific applications, including: (1) employing text analysis techniques from natural language processing [59], (2) estimating the amount of attention that would be drawn from viewers [16,41], (3) integrating linguistic word categories to complement audiovisual content analysis [20,21], (4) utilizing the concept entities of transcript terms [9] or high-level semantic concepts [15] to assist visual content importance estimation, (5) exploring the physiological responses of viewers [10,45], and (6) regularizing the sparsity in video representation to semantically grasp the video [12].

Taskiran et al. [59] proposed a video summarizer exploiting the text information recognized from the speech transcripts of video programs. This algorithm is rooted in natural language processing. Unlike most video summarization approaches, it segments the original video based on automatic pause detection in the audio track rather than on the analysis of visual changes. The term frequency-inverse document frequency is used to derive a score for each video segment, and a video skim is then generated by selecting the segments with high scores while fulfilling a summary duration requirement. Although well suited for mining the essential information in audio, this algorithm might be less competent when the viewer is interested in both the semantics and the visual content.

Exploiting human attention, a notion that originated in psychological research [28], provides another angle from which to summarize videos. Extending the pioneering computer vision studies on visual attention [2,27], Ma et al. [41] introduced a user attention model framework that can incorporate information from visual, audio and linguistic cues. This method is free from heuristic rule-based understanding of the video’s high-level semantics. In its application to video summarization, the video content importance is captured by exploiting the attention mechanisms, namely the visual and aural attention models. However, this work did not actually employ the linguistic module, which could benefit the video summary quality. More recently, Ejaz et al. [16] proposed a visual attention clue-based approach to key frame extraction, in which improved computational efficiency over traditional optical flow-based methods is achieved by using temporal gradient-based dynamic visual saliency detection.

Extending the idea of movie summarization by a visual and audio-based curve [19], Evangelopoulos et al. [21] proposed to further incorporate a textual cue so that a saliency curve derived from three modalities could be formed to detect the salient events that lead to video summaries. In Ref. [20], the hierarchical fusion of multimodal saliency was further explored. Although the authors considered multimodal information, the weights assigned to the movie transcript terms were determined by a simple part-of-speech (POS) based method. Chen et al. [9] proposed to generalize the transcript terms of video shots into four concept entity categories (namely “who”, “what”, “where” and “when”), and the shots were correspondingly grouped based on this textual classification. A graph entropy-based method is then used to pick up a subset of significant shots for flexible browsing with their relations emphasized. In Ref.
[15], video summarization with high-level concept detection was studied, where the semantic coherence between the video’s accompanying subtitles and trained high-level concepts was explored. It has been pointed out that thousands of basic concept detectors might be enough to achieve a decent concept detection accuracy [25]. However, training concept detectors requires a considerable amount of manual work, which is why relatively few concept detectors are currently available. Nevertheless, incorporating semantic information with other features substantially benefits video summarization [15,22,40].

Physiological data obtained from viewers, such as signals of the brain [65], heart and skin [10,45], have been exploited as external sources of information to overcome the semantic gap in video analysis. Money and Agius [45] considered five types of
physiological response measurements, including the electro-dermal response, heart rate, blood volume pulse, respiration rate and respiration amplitude, to identify the parts of movies that match the emotional experience of a viewer. In contrast to Ref. [45], which targets single-user, personalized summarization, Chênes et al. [10] integrated the physiological signals of multiple participants to produce a general highlight of the original video.

Sparse representation has also been explored for video summarization. Cong et al. [12] employed both the CENTRIST descriptor [67], for scene understanding, and color moments to represent video frames. The summarization of consumer videos is then mathematically formulated as an iterative dictionary selection problem: key frames and essential video segments are identified by selecting a sparse dictionary that addresses both the sparseness of the selected visual data and a low reconstruction error in representing the original video.

Although the aforementioned approaches have recognized the importance of high-level semantics in video summarization, they still suffer from several limitations. Some of these methods fail to exploit the valuable text information, whereas others rely on the text alone. Some conduct their text analysis via a simple differentiation of linguistic term categories or a rough classification of text terms into a very limited number of entity types. Some rely on heuristic settings for feature extraction and fusion. In our previous work [15], we also explored high-level semantic information and trained concepts for video summarization; unfortunately, the semantic coherence proposed in that paper requires manual settings for concept selection as well, and it simply averages the contributions of different concepts.

1.2 Outline of our work

Building on the above observations, this paper proposes an iteratively reweighting dynamic video summarization (IRDVS) algorithm based on the joint and adaptive use of visual and textual information. The proposed algorithm takes advantage of our SEmantic inDicator of videO seGment (SEDOG) feature to explore the most representative concepts for describing the video. Like our previously introduced semantic coherence [15], SEDOG leverages detectors trained on a set of semantic concepts [31] and an external linguistic knowledge base [51]. Differently, SEDOG avoids manually tuned parameters in the concept selection step, and more reasonable feature value derivation schemes are developed. Furthermore, IRDVS employs an iterative reweighting scheme to balance the contributions of different features and therefore generates the dynamic video summary in an adaptive manner. To evaluate the proposed algorithm, it has been compared with four state-of-the-art algorithms, namely a speech transcript-based video summarization (STVS) algorithm [59], an attention model-based video summarization (AMVS) algorithm [41], a sparse dictionary selection-based video summarization (DSVS) algorithm [12] and a heterogeneity image patch index-based video summarization (HIPVS) algorithm [13], on different video genres, including documentary, movie and TV news.
2 Proposed method

The proposed IRDVS algorithm consists of four major steps: (1) extracting four groups of low-level visual features from the video frames, (2) deriving two types of high-level visual features and a semantic feature named SEDOG for each video segment based on the low-level visual features and the accompanying subtitles, (3) combining the three high-level features in an iteratively estimated linear model to obtain saliency scores, and (4) producing a dynamic surrogate of the original video using the saliency scores. The diagram of this algorithm is illustrated in Fig. 1.

2.1 Low-level visual representation

To enable a semantically meaningful representation of the video content, four groups of basic visual features are extracted from the video frames. This step serves as a preparation for the high-level visual and semantic representation, which will not only connect the signal-level information and high-level semantics [3,25,57] but also lead to the identification of the salient parts of the original video. Among the various types of low-level descriptors, we employ the color moment, wavelet texture and local keypoint features [31], which are used to derive the concept detection-based [38,66] semantic feature SEDOG, and the motion feature, which serves the high-level motion attention-based representation.

2.1.1 Color moment features

To capture the low-level fundamental characteristics of a video frame, we divide the frame into 5×5 non-overlapping blocks [30] and, on each block, calculate the first-order moment and the second- and third-order central moments of each color component in the Lab color space [31]. The color moments on all 25 blocks in the i th frame compose the color moment feature vector f_cm(i).

2.1.2 Wavelet texture features

To describe the texture information in each frame, we divide the frame into 3×3 non-overlapping blocks and apply a three-level Haar wavelet decomposition to the intensity values of each block [31]. The variances of the obtained wavelet coefficients in the horizontal, vertical and diagonal directions at each level are computed. The assembly of the variances of the wavelet coefficients on all 9 blocks in the i th frame is defined as the wavelet texture feature vector f_wt(i).

2.1.3 Motion features

It is widely recognized that the human eye is sensitive to changes in visual content. To characterize the visual changes, we divide each frame into M×N non-overlapping blocks, each containing 16×16 pixels, and calculate block-based motion vectors via full-search motion estimation [33].
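As a concrete illustration of the block-based color moment descriptor of Section 2.1.1, the following is a minimal Python sketch. The function name, the use of scikit-image's rgb2lab for the Lab conversion, and the choice of the standard deviation and the cube-rooted third central moment as the second- and third-order statistics are our assumptions for illustration, not the reference implementation.

```python
import numpy as np
from skimage.color import rgb2lab  # RGB -> Lab conversion


def color_moment_features(frame_rgb, grid=(5, 5)):
    """Block-based color moments (Sec. 2.1.1): for each of the 5x5 blocks,
    the first-order moment and the second- and third-order central moments
    of every Lab channel, concatenated into one vector f_cm(i)."""
    lab = rgb2lab(frame_rgb)                       # H x W x 3, Lab color space
    h, w, _ = lab.shape
    rows = np.array_split(np.arange(h), grid[0])
    cols = np.array_split(np.arange(w), grid[1])
    feats = []
    for r in rows:
        for c in cols:
            block = lab[np.ix_(r, c)].reshape(-1, 3)
            mean = block.mean(axis=0)                           # 1st-order moment
            std = block.std(axis=0)                             # 2nd-order central moment
            skew = np.cbrt(((block - mean) ** 3).mean(axis=0))  # 3rd-order central moment
            feats.extend([mean, std, skew])
    return np.concatenate(feats)                   # length 5*5*3*3 = 225
```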
Fig. 1 Diagram of the proposed IRDVS algorithm: the video sequence provides video frames and subtitles; low-level visual features (color f_cm, texture f_wt, motion f_mv, keypoint f_kp) are extracted from the frames and, together with the subtitles, the semantic concept detectors and WordNet::Similarity, yield the high-level features (motion attention f_M, face attention f_F, SEDOG f_E); these are combined by the iteratively estimated linear model into saliency scores that produce the video summary
In the i th frame, let the motion vector on the (m,n) th block be denoted by v(i,m,n). The motion feature for this frame is the assembly of the M×N motion vectors, f_mv(i).

2.1.4 Local keypoint features

Since local keypoint-based bag-of-features (BoF) descriptors can serve as an effective complement to globally computed features in semantic video analysis [37], we employ the soft-weighted local keypoint features [30], which are based on the significance of keypoints over a vocabulary of 500 visual words, to characterize salient regions. Given the i th frame, keypoints are detected by the difference-of-Gaussians (DoG) detector, represented by the scale-invariant feature transform (SIFT) descriptor, and clustered into the 500 visual words. The keypoint feature vector f_kp(i) is calculated as the weighted similarities between the keypoints and the visual words under a four-nearest-neighbor principle [30].

2.2 High-level visual and semantic representation

The grasp of video information towards content selection can be more natural and reliable when a high-level representation is available [18,44,68]. Based on the fundamental visual features produced in the previous subsection and the video’s accompanying text, we consider both the clues from user attention and multimodal semantics, and derive three groups of high-level visual and semantic representations for any given video segment χ_s that ranges from the i_1(s) th frame to the i_2(s) th frame.

2.2.1 Motion attention features

The early work by James [28] and other researchers in psychology on human attention has set a crucial foundation for the use of attention modeling in computer vision [27,41,55]. The cognitive mechanism of attention is essential in the analysis and understanding of human thinking and activities [29,35,53] and hence useful in selecting relatively salient content for video summaries [7,8,16]. We employ the motion attention model, which is a component of the widely used user attention model [41], to derive high-level motion attention features that are more suitable for semantic analysis.

Suppose we have a spatial window of 5×5 blocks in the same frame, and a temporal window of 7 blocks at the same spatial location in adjacent frames. Let both windows be centered on the (m,n) th block in the i th frame, and consider the motion vectors within them. We evenly divide the phase interval [0,2π) into eight bins and count the spatial phase histogram H^(s)_{i,m,n}(ζ), 1≤ζ≤8, in the spatial window, and the temporal phase histogram H^(t)_{i,m,n}(ζ), 1≤ζ≤8, in the temporal window, respectively. Thus, the spatial coherence inductor and temporal coherence inductor [41] are calculated as
C_s(i,m,n) = -\sum_{ζ} p_s(ζ) \log p_s(ζ),    (1)

C_t(i,m,n) = -\sum_{ζ} p_t(ζ) \log p_t(ζ),    (2)

where p_s(ζ) = H^{(s)}_{i,m,n}(ζ) / \sum_{ζ} H^{(s)}_{i,m,n}(ζ) and p_t(ζ) = H^{(t)}_{i,m,n}(ζ) / \sum_{ζ} H^{(t)}_{i,m,n}(ζ) are the phase distributions in the spatial window and temporal window, respectively. Then, the motion attention feature for the i th frame can be defined by combining the magnitudes of the motion vectors, the spatial coherence inductors and
the temporal coherence inductors as follows:

MOT(i) = \frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} |v(i,m,n)| \, C_t(i,m,n) \left[ 1 - |v(i,m,n)| \, C_s(i,m,n) \right].    (3)
Finally, we apply a median filter with nine input elements to the obtained motion attention feature sequence to suppress noise and abrupt changes between adjacent frames. The motion attention feature of the video segment χ_s is the average of the smoothed features {MOT(i) | i = i_1(s),…,i_2(s)} over the related frames:

f_M(s) = \frac{1}{i_2(s) - i_1(s) + 1} \sum_{i=i_1(s)}^{i_2(s)} MOT(i).    (4)
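To make Eqs. (1)-(4) concrete, the following is a minimal Python sketch of the motion attention computation from pre-computed block motion vectors. The magnitude normalization, the window truncation at the frame borders and the use of SciPy's median filter are illustrative assumptions rather than the exact reference implementation.

```python
import numpy as np
from scipy.signal import medfilt


def motion_attention(mv, eps=1e-8):
    """Frame-level motion attention MOT(i) of Eq. (3) from block motion
    vectors mv with shape (T, M, N, 2); 5x5 spatial and 7-frame temporal windows."""
    T, M, N, _ = mv.shape
    mag = np.linalg.norm(mv, axis=-1)
    mag = mag / (mag.max() + eps)                      # normalized magnitudes (assumed)
    phase = np.mod(np.arctan2(mv[..., 1], mv[..., 0]), 2 * np.pi)
    bins = np.linspace(0.0, 2 * np.pi, 9)              # eight phase bins over [0, 2*pi)

    def entropy(ph):                                   # Eqs. (1)-(2): phase-histogram entropy
        hist, _ = np.histogram(ph, bins=bins)
        p = hist / max(hist.sum(), 1)
        p = p[p > 0]
        return float(-(p * np.log(p)).sum())

    mot = np.zeros(T)
    for i in range(T):
        acc = 0.0
        for m in range(M):
            for n in range(N):
                sp = phase[i, max(m - 2, 0):m + 3, max(n - 2, 0):n + 3]   # spatial window
                tp = phase[max(i - 3, 0):i + 4, m, n]                     # temporal window
                cs, ct = entropy(sp.ravel()), entropy(tp.ravel())
                v = mag[i, m, n]
                acc += v * ct * (1.0 - v * cs)         # summand of Eq. (3)
        mot[i] = acc / (M * N)
    return mot


def segment_motion_feature(mot, i1, i2):
    """f_M(s) of Eq. (4): mean of the median-filtered MOT values over the segment."""
    smoothed = medfilt(mot, kernel_size=9)             # nine-element median filter
    return float(smoothed[i1:i2 + 1].mean())
```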
2.2.2 Face attention features

The presence of human faces usually indicates the semantic importance of the video content. We adopt the robust real-time face detection algorithm [62] to detect the human faces in each frame. For the j th detected face, besides its area A_F(j), a position-related weight w_fp(j) is also assigned to express the attention it draws from viewers; the distribution of this weight inside each frame is illustrated in Fig. 2. Thus, the face attention feature for the i th frame is calculated as [41]

FAC(i) = \frac{\sum_j w_{fp}(j) A_F(j)}{A_{frm}},    (5)

where A_frm is the area of the whole video frame. To alleviate the impact of imperfect face detection, the obtained face attention feature sequence is smoothed by a median filter with five input elements. The face attention feature of the video segment χ_s is defined as the average of the smoothed features {FAC(i) | i = i_1(s),…,i_2(s)} over the related frames:

f_F(s) = \frac{1}{i_2(s) - i_1(s) + 1} \sum_{i=i_1(s)}^{i_2(s)} FAC(i).    (6)
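A corresponding sketch for Eqs. (5)-(6) is given below. The face boxes are assumed to come from an external detector, and weight_fn is a hypothetical helper that returns the position weight of Fig. 2 for a face center; both are assumptions made for illustration.

```python
import numpy as np
from scipy.signal import medfilt


def face_attention(faces_per_frame, frame_area, weight_fn):
    """Frame-level face attention FAC(i) of Eq. (5): each detected face
    contributes its area A_F(j), scaled by the position weight w_fp(j) of Fig. 2
    and normalized by the frame area A_frm."""
    fac = np.zeros(len(faces_per_frame))
    for i, faces in enumerate(faces_per_frame):        # faces: list of (x, y, w, h) boxes
        fac[i] = sum(weight_fn(x + w / 2, y + h / 2) * (w * h) / frame_area
                     for (x, y, w, h) in faces)
    return medfilt(fac, kernel_size=5)                 # five-element median filter


def segment_face_feature(fac, i1, i2):
    """f_F(s) of Eq. (6): average of the smoothed FAC values over segment frames."""
    return float(fac[i1:i2 + 1].mean())
```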
2.2.3 SEDOG Mining the concepts relevant to the visual information is beneficial to identifying the semantic meaning conveyed in a video sequence [24,46,58,69,70]. To exploit the semantics, we introduce a high-level feature, namely the SEmantic inDicator of videO seGment (SEDOG).
Fig. 2 Position-related face weights in each frame (left) and a sample image with weights assigned to three faces (right). The frame is divided into a 3×3 grid (column widths of 1/3 each; row heights of 3/12, 4/12 and 5/12 from top to bottom), with weights 1/24, 2/24, 1/24 in the top row, 4/24, 8/24, 4/24 in the middle row and 1/24, 2/24, 1/24 in the bottom row, so faces near the center of the frame receive the largest weight
Fig. 3 Diagram for computing the SEDOG for each video segment: the color moment (f_cm), wavelet texture (f_wt) and local keypoint (f_kp) features of the keyframe are fed to the VIREO-374 concept detectors (cm, wt and kp) to obtain the concept membership u, while the subtitles and WordNet::Similarity provide the similarities between subtitles and concepts, i.e. the textual relatedness ρ; the two are combined into the SEDOG score f_E
SEDOG is computed by using VIREO-374 [31], which consists of 374 concepts and three types of support vector machines (SVM) for each concept. The SVMs have been trained on the color moment features, wavelet texture features and local keypoint features to estimate the membership of a video frame belonging to the corresponding concept. Some concepts in VIREO-374 are listed in Table 1.

For the s th video segment, we calculate its SEDOG feature in three major steps, as summarized in Fig. 3. First, for the middle frame [9] i_m(s) of χ_s, we extract the color moment features f_cm(i_m(s)), wavelet texture features f_wt(i_m(s)) and local keypoint features f_kp(i_m(s)), and apply each group of features to the SVM-based concept detectors in VIREO-374 to generate the semantic memberships {u_cm(s,j), u_wt(s,j), u_kp(s,j) | j=1,2,…,374} of the frame belonging to each concept. The concept membership of the segment χ_s and the j th concept is defined as

u(s,j) = \frac{u_{cm}(s,j) + u_{wt}(s,j) + u_{kp}(s,j)}{3},    (7)
Table 1 Examples of single-word concepts and multi-word concepts in VIREO-374. Parts-of-speech (e.g. #n) and proper senses (e.g. #1) are manually assigned based on concept definitions [32] and WordNet::Similarity [51]

Single-word concepts                                   Multi-word concepts
Original concept   Term with POS and sense index       Original concept     Constituent terms with POSs and sense indices
Airplane           airplane#n#1                        Corporate_Leader     company#n#1, leader#n#1
Animal             animal#n#1                          Explosion_Fire       explosion#n#1, fire#n#1
Car                car#n#1                             Industrial_Setting   industry#n#1, setting#n#2
Desert             desert#n#1                          Male_News_Subject    male#n#2, news#n#1, subject#n#6
Forest             forest#n#1                          Pedestrian_Zone      pedestrian#n#1, zone#n#1
First_Lady         first_lady#n#2                      People_Marching      person#n#1, marching#n#1
Military           military#n#1                        Police_Security      police#n#1, security#n#1
Pipes              pipe#n#2                            Us_Flags             america#n#1, flag#n#1
Shopping_Mall      shopping_mall#n#1                   Television_Tower     television#n#1, tower#n#1
which represents the confidence, shared by the three types of detectors, that the j th concept relates to the frame i_m(s) of video segment χ_s.

Second, the semantic similarity is measured in terms of the textual information in the subtitles. Since subtitles in neighboring segments often relate to each other and contribute to the same semantics, a temporal window consisting of W segments is centered on the current segment. Let all subtitle terms in the W segments (except the stop words, which are removed using the list at http://www.tomdiethe.com/teaching/remove_stopwords.m) be denoted by Γ_st(s), and let the set of constituent words of the j th concept be denoted by Γ_cp(j). The textual semantic similarity between the segment χ_s and the j th concept is calculated as

κ(s,j) = \max_{γ \in Γ_{st}(s)} \frac{1}{|Γ_{cp}(j)|} \sum_{ω \in Γ_{cp}(j)} η(γ, ω),    (8)

where |·| denotes the cardinality of a set, and η(·,·) is the semantic similarity between a pair of linguistic terms obtained by using the WordNet::Similarity package [51]. To accurately retrieve the term-pair similarity, we have manually picked the parts-of-speech and proper senses of the constituent terms of the concepts according to their definitions [32]. Note that concepts like “First_Lady” and “Shopping_Mall” shown in Table 1 are still considered single-word concepts, since each of them is defined as a whole in WordNet::Similarity.

To reduce the impact of less relevant concepts, the textual relatedness is defined by thresholding the textual semantic similarity according to the corresponding concept membership:

ρ(s,j) = \begin{cases} \frac{1}{Q}\,κ(s,j), & u(s,j) \in (0.5, 1], \\ 0, & u(s,j) \in [0, 0.5], \end{cases}    (9)

where Q is a normalization factor that ensures \sum_{j=1}^{374} ρ(s,j) = 1. Since the outputs of the SVMs are probabilities of binary classification problems, a threshold of 0.5 is naturally used in (9). Finally, the SEDOG score of the segment χ_s, denoted by f_E(s), is defined as the sum of the concept memberships weighted by the corresponding textual relatednesses:

f_E(s) = \sum_{j=1}^{374} ρ(s,j) u(s,j).    (10)

In this formulation, the textual relatedness ρ(s,j) adjusts the contribution of the j th concept and even prunes the concept when ρ(s,j)=0. An example of computing the SEDOG score is illustrated in Fig. 4.

2.3 Dynamic video summarization

In this study, video summarization aims to select a subset of video segments based on their saliency scores. For each video segment χ_s, we define its saliency score as a linear combination [17,34] of its motion attention feature f_M(s), face attention feature f_F(s) and SEDOG feature f_E(s):

f_SAL(s) = w_M(s) f_M(s) + w_F(s) f_F(s) + w_E(s) f_E(s),    (11)
where w_M(s), w_F(s) and w_E(s) are the feature weighting parameters. Note that each type of feature is linearly normalized to the interval [0,1] prior to the linear fusion.
Fig. 4 An example showing the computation of SEDOG
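To complement Fig. 4, the following is a minimal sketch of the SEDOG computation in Eqs. (7)-(10) for one segment. The detector outputs, the stop-word-filtered subtitle terms and the term-pair similarity function (term_sim, standing in for a call into WordNet::Similarity) are assumed to be available; the names are ours, not part of the reference implementation.

```python
import numpy as np


def sedog(u_cm, u_wt, u_kp, subtitle_terms, concept_terms, term_sim):
    """SEDOG score f_E(s) of Eqs. (7)-(10) for one segment.
    u_cm, u_wt, u_kp: detector outputs over the 374 concepts for the middle frame;
    subtitle_terms: stop-word-filtered subtitle terms in the W-segment window;
    concept_terms: list of constituent-term lists, one per concept;
    term_sim(a, b): WordNet-based term-pair similarity (external, assumed given)."""
    u = (np.asarray(u_cm) + np.asarray(u_wt) + np.asarray(u_kp)) / 3.0     # Eq. (7)

    # Eq. (8): best subtitle term, averaged over the concept's constituent terms
    kappa = np.array([
        max((np.mean([term_sim(g, w) for w in words]) for g in subtitle_terms),
            default=0.0)
        for words in concept_terms
    ])

    rho = np.where(u > 0.5, kappa, 0.0)                # Eq. (9): keep confident concepts only
    if rho.sum() > 0:
        rho = rho / rho.sum()                          # normalization factor Q
    return float(np.dot(rho, u))                       # Eq. (10)
```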
The video summarization problem is now cast as the estimation of the weighting parameters, which can be achieved in the following iterative process. Let the iterations be indexed by k. At the k th iteration, we first calculate each weighting parameter w_#(s) (# ∈ {M,F,E}), which is determined jointly by a macro factor α_#(s) and a micro factor β_#(s):

w_#(s) = α_#(s) · β_#(s).    (12)

The macro factor α_#(s) measures the relative significance of f_#(s) globally in the entire video sequence and is defined as

α_#(s) = 1 - \frac{r_#(s)}{N_S},    (13)

where r_#(s) is the rank of f_#(s) after {f_#(s) | s=1,2,…,N_S} is sorted in descending order, and N_S is the total number of segments in the video. The micro factor β_#(s) captures the importance of the current video segment χ_s as compared with its temporally closest segment χ_{s'} in the previous video summary and is calculated as

β_#^{(k)}(s) = 1 + \frac{f_#(s^{(k)}) - f_#(s'^{(k-1)})}{f_#(s^{(k)}) + f_#(s'^{(k-1)})}.    (14)

With the high-level visual and semantic features and the currently estimated weighting parameters w_M^{(k)}(s), w_F^{(k)}(s) and w_E^{(k)}(s), we can calculate the saliency score f_SAL^{(k)}(s) and sort the saliency scores of all video segments in descending order. Then, the first N_VS^{(k)} segments with the largest saliency scores are selected from the sorted sequence to form a new video summary. The number of chosen segments N_VS^{(k)} should also satisfy the
constraint that the length of the video summary does not exceed a pre-specified target duration limit. The weighting parameters are equally initialized and the iterative estimation terminates when a maximum iteration number Kmax is reached. In our experiments, Kmax is set to be 15. The video summarization process is summarized in Algorithm 1.
Algorithm 1. Video summarization based on iteratively estimated linear model
1: Input: f_M(s), f_F(s), f_E(s), s=1,2,…,N_S; target duration limit.
2: Initialization: w_M^(0)(s) = w_F^(0)(s) = w_E^(0)(s) = 1/3, s=1,2,…,N_S; k=0.
3: Compute f_SAL^(k)(s) according to Eq. (11). Sort f_SAL^(k)(s); obtain the initial summary considering saliency ranks and target duration limit.
4: Calculate α_M(s), α_F(s) and α_E(s) using Eq. (13).
5: while k ≤ K_max do
6:   k ← k+1.
7:   Update β_M^(k)(s), β_F^(k)(s) and β_E^(k)(s) via Eq. (14).
8:   Update w_M^(k)(s), w_F^(k)(s) and w_E^(k)(s) via Eq. (12).
9:   Update f_SAL^(k)(s) via Eq. (11).
10:  Sort f_SAL^(k)(s); obtain a new summary for the k th iteration considering saliency ranks and target duration limit.
11: end while
12: Output: final video skim (i.e. the summary of the K_max th iteration).
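A minimal Python sketch of Algorithm 1 is given below. It assumes the three per-segment features are already normalized to [0,1]; the greedy duration-constrained selection, the approximation of "temporally closest segment" by segment index, the small epsilon guarding the denominator of Eq. (14) and the fixed K_max-pass loop are our simplifications, not the authors' reference implementation.

```python
import numpy as np


def summarize(f, durations, target_duration, k_max=15):
    """Iteratively reweighted summarization (Algorithm 1, Eqs. (11)-(14)).
    f: dict with keys 'M', 'F', 'E', each an array of per-segment features in [0, 1];
    durations: per-segment lengths in seconds."""
    n = len(durations)
    # macro factor, Eq. (13): alpha = 1 - rank / N_S (rank in descending order)
    alpha = {}
    for key, vals in f.items():
        ranks = np.empty(n)
        ranks[np.argsort(-vals)] = np.arange(1, n + 1)
        alpha[key] = 1.0 - ranks / n

    def pick(scores):
        """Greedy selection by saliency rank under the target duration limit."""
        chosen, total = [], 0.0
        for s in np.argsort(-scores):
            if total + durations[s] <= target_duration:
                chosen.append(int(s))
                total += durations[s]
        return sorted(chosen)                              # keep temporal order

    w = {key: np.full(n, 1.0 / 3.0) for key in f}          # equal initial weights
    scores = sum(w[key] * f[key] for key in f)             # Eq. (11)
    summary = pick(scores)

    for _ in range(k_max):
        prev = summary
        # index of the temporally closest segment in the previous summary
        closest = np.array([prev[np.argmin([abs(s - p) for p in prev])]
                            for s in range(n)])
        for key in f:
            num = f[key] - f[key][closest]
            den = f[key] + f[key][closest] + 1e-8
            beta = 1.0 + num / den                         # micro factor, Eq. (14)
            w[key] = alpha[key] * beta                     # Eq. (12)
        scores = sum(w[key] * f[key] for key in f)         # Eq. (11)
        summary = pick(scores)
    return summary
```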
3 Experiments and results

Although numerous videos are accessible nowadays from either shared repositories (e.g. YouTube, http://www.youtube.com/, and the Open Video Project, http://www.open-video.org/) or released datasets (e.g. Kodak’s consumer video benchmark dataset [39] and the VSUMM dataset [14]), the evaluation of the proposed algorithm cannot be directly carried out on these available resources. The main reason is that no suitable ground truths are provided for dynamic video summarization. For the Open Video Project and the VSUMM dataset, although storyboards of key frames are available, they are produced specifically for evaluating static video summarization rather than dynamic summarization.

In our experiments, for a comprehensive performance evaluation, the proposed IRDVS algorithm and the competing algorithms were applied to a dataset of 14 test videos. This dataset has a total duration of about 7.6 h and spans different genres, namely documentary, movie and news, as listed in Table 2. Documentaries were included because their subtitles largely match the video content and semantics [9]; movies and news have also been widely used in video summarization. The four documentaries were obtained from YouTube; the movie videos, each lasting about half an hour, were extracted as continuous clips from three well-known movies; and the news videos are from NBC News, USA and ABC News, Australia. For the NBC News videos, commercials were manually removed and the remaining segments were concatenated to form our test videos.
Table 2 Test videos

Genre         Video                                             Abbreviation   Duration (mins)
Documentary   Astrobiology                                      ASTR           44.5
              Constellations                                    CSTL           44.5
              Cosmic Holes                                      CSMH           44.5
              Mars                                              MARS           44.1
Movie         A Beautiful Mind (part 1)                         BM-I           23.4
              A Beautiful Mind (part 2)                         BM-II          39.0
              Harry Potter and the Sorcerer’s Stone (part 1)    HP1-I          37.5
              Harry Potter and the Sorcerer’s Stone (part 2)    HP1-II         23.2
              The Legend of 1900 (part 1)                       LGD-I          26.7
              The Legend of 1900 (part 2)                       LGD-II         28.1
TV news       ABC News, Australia (part 1)                      ABC-I          30.0
              ABC News, Australia (part 2)                      ABC-II         30.0
              NBC News, USA (part 1)                            NBC-I          21.8
              NBC News, USA (part 2)                            NBC-II         18.7
In the experiments, the errors in the video subtitles were preserved so that the algorithms could be tested in a practical setting.

3.1 Performance evaluation

The evaluation of video summarization algorithms remains an open problem and is quite subjective. Although quantitative scores can be given by users to measure the informativeness, enjoyability [42,47], satisfaction [52], experience [63], interrelation [9] and so on, such subjective evaluation merely provides an overall and rough assessment of the entire summary. As to objective evaluation, the metrics used are often designed for specific video types. For instance, the duration of the summary and the difference between the sizes of the target and actual summaries are adopted for rushes video summarization [56,64], while the content missing rate, calculated as the percentage of content elements defined in the original video but missing from the summary, is employed for instructional videos [11].

In this paper, unlike methods that obtain a single score from each user for the whole summary, we used a divide-and-conquer strategy based on the video segment scores given by users. Firstly, the original video was manually partitioned into segments, each of which maintains semantically independent content. Next, an importance score was assigned to each video segment by the invited user. Then, the algorithm-made video skims were compared with their ground-truth counterparts generated from the segment scores. This method was inspired by the whole-summary rating approaches mentioned above; the major difference is that it makes the evaluation process more manageable, since the user only has to deal with a number of smaller and simpler subtasks.

For performance evaluation, the metrics of precision (P), recall (R) and F-measure (F) were utilized. In our scenario, the precision is the fraction of video segments in the algorithm-made skim that are correct according to the ground-truth summary, while the recall is defined as the fraction of correct video segments that are picked up by the algorithm. The F-measure is the harmonic mean of the precision and recall and can be used as a more comprehensive
metric. To obtain these metrics, the video segments in the algorithm-made video skim are categorized into true positives (TP) and false positives (FP). For a given video segment, if at least p (50 % in our experiments) of its frames also appear in the ground-truth summary, it is considered a true positive; otherwise, it is a false positive. In the ground-truth summary, any video segment that is not sufficiently matched by any segment in the true positive set falls into the false negative (FN) category. Therefore, the precision, recall and F-measure of an algorithm-made summary can be calculated as follows:

P = \frac{n_{TP}}{n_{TP} + n_{FP}}, \quad R = \frac{n_{TP}}{n_{TP} + n_{FN}}, \quad F = \frac{2PR}{P + R},    (15)
where n_TP, n_FP and n_FN are the numbers of true positives, false positives and false negatives, respectively.

3.2 Ground-truth summary

A ground-truth video skim was produced based on the scores given by an invited user and the target summary length. Firstly, the user viewed the original full-length video and grasped its structure and idea. Then, considering the relative content importance of the video segments, the user provided quantitative scores in the interval [0,100] for all segments. No time limit was set for completing the whole task, so the video could be played as many times as necessary. With the user scores, all video segments were ranked in descending order. A ground-truth summary that maximally uses but does not exceed the summary length budget was formed by concatenating the highly ranked segments while preserving their order of appearance in the original video. Three users participated in the ground-truth making task independently; hence each test video has three different ground-truth summaries at each specific summary length.

3.3 Results

The proposed IRDVS algorithm was compared with four state-of-the-art video summarization methods, namely the STVS algorithm [59], AMVS algorithm [41], DSVS algorithm [12] and HIPVS algorithm [13]. The STVS algorithm temporally partitions a video by using audio pause detection and then computes a score for each video segment based on the automatically recognized speech transcripts to summarize the video. In our implementation, we directly provided the STVS algorithm with the video subtitles and the manually obtained video partitioning results used in making the ground-truth summaries; these two groups of information are near-perfect substitutes for the outputs of automatic speech recognition and audio pause detection, since they are free from the performance limitations of those two modules. In our implementation of the AMVS algorithm, the motion attention model and face attention model were integrated via a linear fusion in which the weights, without prior knowledge, were set to be equal. Since Ref. [41] suggested using the information of pauses and silence to decide video segment boundaries when making video skims, we also provided the AMVS algorithm with the same manually produced video partitioning results as we did for STVS. We tested the DSVS algorithm with the parameter values suggested by its authors and terminated its dictionary selection process after it had sufficiently converged. The HIPVS algorithm was implemented with the suggested spatial downsampling operation on all original video frames.
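For reference, the segment-matching rule of Section 3.1 and the metrics of Eq. (15) can be computed along the following lines. The treatment of partially overlapping segments (summing the overlaps with all ground-truth segments) and the reuse of the same p-fraction criterion for false negatives are our reading of the protocol, not a specification taken from the paper's implementation.

```python
def evaluate_summary(auto_segments, gt_segments, p=0.5):
    """Precision, recall and F-measure of Eq. (15).
    Segments are (start_frame, end_frame) intervals; an automatic segment is a
    true positive if at least a fraction p of its frames lies in the ground truth."""
    def overlap(a, b):
        return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

    def covered(seg, others):
        return sum(overlap(seg, o) for o in others)

    tp_set = [s for s in auto_segments
              if covered(s, gt_segments) >= p * (s[1] - s[0] + 1)]
    n_tp = len(tp_set)
    n_fp = len(auto_segments) - n_tp
    # ground-truth segments not sufficiently matched by any true positive
    n_fn = sum(1 for g in gt_segments
               if covered(g, tp_set) < p * (g[1] - g[0] + 1))

    precision = n_tp / (n_tp + n_fp) if auto_segments else 0.0
    recall = n_tp / (n_tp + n_fn) if (n_tp + n_fn) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return precision, recall, f_measure
```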
Table 3 Performance comparison of the five video summarization algorithms based on the precision (P), recall (R) and F-measure (F). Each 3×2 cell presents the results of the STVS (top left), AMVS (top right), DSVS (middle left), HIPVS (middle right) and proposed IRDVS (bottom left) algorithms as well as the rank of the IRDVS algorithm (bottom right). The best and second best ranks in the comparisons are highlighted in boldface and italic boldface respectively if produced by IRDVS
Summary length
Metric
Documentary
Movie
20 %
P
0.107
0.184
0.145
0.360
0.134
0.334
0.129
0.172
0.171
0.184
0.171
0.231
0.149
0.196
0.164
0.235
1
0.304
2
0.436
1
0.325
1
0.042
0.120
0.088
0.204
0.041
0.200
0.057
0.175
0.135 0.415
0.164 1
0.187 0.644
0.244 1
0.151 0.623
0.155 1
0.158 0.561
0.188 1
0.060
0.145
0.108
0.251
0.063
0.248
0.077
0.215
0.151
0.167
0.182
0.199
0.177
0.146
0.170
0.171
0.297
1
0.410
1
0.496
1
0.401
1
0.219
0.253
0.350
0.431
0.209
0.368
0.259
0.351
0.296
0.295
0.258
0.231
0.322
0.205
0.292
0.244
0.339
1
0.376
2
0.483
1
0.399
1
R
0.100 0.228
0.146 0.288
0.176 0.211
0.266 0.340
0.075 0.229
0.249 0.234
0.117 0.223
0.220 0.288
0.505
1
0.726
1
0.678
1
0.636
1
F
0.136
0.185
0.230
0.327
0.109
0.295
0.158
0.269
R
F
30 %
40 %
P
P
R
F
News
Overall 0.293
0.257
0.291
0.229
0.273
0.266
0.209
0.251
0.258
0.402
1
0.494
1
0.553
1
0.483
1
0.356
0.364
0.479
0.536
0.406
0.424
0.414
0.441
0.383
0.403
0.329
0.360
0.362
0.317
0.358
0.360
0.415 0.162
1 0.213
0.475 0.262
3 0.325
0.512 0.150
1 0.285
0.468 0.191
1 0.274
0.295
0.405
0.252
0.479
0.267
0.382
0.271
0.422
0.563
1
0.779
1
0.707
1
0.683
1
0.222
0.269
0.336
0.401
0.218
0.339
0.259
0.336
0.333
0.403
0.283
0.476
1
0.586
0.410 1
0.305 0.590
0.335 1
0.307 0.551
0.383 1
The video summary quality evaluated by using the precision, recall and F-measure is presented in Table 3. For a given video genre, the performances were calculated as follows. Firstly, the video summaries were generated by applying all five algorithms at the summary lengths of 20 %, 30 % and 40 %. Next, each summary was compared with the three user ground-truth summaries respectively to yield the metric values P, R and F. Finally, for each algorithm, its results over all test videos in the given genre and against all ground truths were averaged to give the performance under a specific summary length and metric.

From the results on the documentary, movie and news genres, it can be seen that the five algorithms generate better summaries as the summary length increases. Our proposed IRDVS algorithm attained the best results in most cases, with the exception that, for movies and under the precision metric, it is the second best on the 20 % and 30 % summaries and marginally underperformed the second best on the 40 %
summaries. Although the HIPVS algorithm is often the second best on documentaries and the AMVS algorithm yielded comparatively good results on the movie and news videos, both algorithms are text-free in nature and thus less capable of balancing the multi-modality information across all three video genres. A further comparison of the recall values shows that the performance of the STVS algorithm is comparatively low, especially at a summary length of 20 %, which is probably due to its text-only analysis. Accordingly, the higher recalls of the proposed IRDVS algorithm can probably be attributed to its exploitation of information from both the visual and textual modalities. In the last column of Table 3, the overall performances, calculated as the mean of the per-genre results, indicate that our IRDVS algorithm is the best in all cases. Furthermore, the five algorithms were ranked based on the metric values averaged over all summary lengths, as listed in Table 4; the overall rank of the IRDVS algorithm is 1.00.

Since multiple user ground-truth summaries were used in the evaluation, we analyzed the fluctuation of the precision, recall and F-measure. For each test video, the metric values corresponding to the three user ground truths were used to compute an individual standard deviation, and these individual values were then averaged over all test videos and illustrated in Fig. 5. These statistics demonstrate that the results of IRDVS against different ground truths are quite stable. However, the IRDVS, DSVS and HIPVS algorithms are generally advantageous in terms of standard deviation on the 20 %, 30 % and 40 % summaries respectively, each winning two out of three cases. Since the value ranges of the five algorithms’ performances are quite different (see Table 3), the relative standard deviation (RSD) [23,36], defined as the ratio of the standard deviation to the mean, was further employed. According to the RSDs of all metrics shown in Fig. 6, the proposed IRDVS algorithm outperformed the other four competitors in all cases.
4 Discussions

4.1 Impact of window size parameter W

The window size parameter W is a crucial factor for SEDOG. When W varies, a different amount of context information from the textual modality is considered. Therefore, it is desirable to find a value of W that balances the accuracy and the computational load of the IRDVS algorithm.

Table 4 Performances and ranks of the STVS, AMVS, DSVS, HIPVS and proposed IRDVS algorithms based on the precision (P), recall (R) and F-measure (F). The best results are highlighted in boldface. The values in parentheses indicate the ranks among the five algorithms. The overall ranks are averaged over all the metric ranks

Algorithm       STVS        AMVS        DSVS        HIPVS       IRDVS
P               0.267 (4)   0.362 (2)   0.282 (3)   0.256 (5)   0.397 (1)
R               0.122 (5)   0.223 (3)   0.217 (4)   0.299 (2)   0.627 (1)
F               0.165 (5)   0.273 (2)   0.242 (4)   0.270 (3)   0.478 (1)
Overall rank    4.67        2.33        3.67        3.33        1.00
Fig. 5 Stability measured by the standard deviation (smaller is better) of the algorithm-generated video summaries when evaluated against the ground-truth summaries made by using the scores from different users. Each value is an average of the standard deviations over all test videos. The best results (the smallest values) among the five algorithms are highlighted in boldface; the second bests, if produced by the IRDVS algorithm, are highlighted in italic boldface. The plotted values (P, R, F for the 20 %, 30 % and 40 % summaries) are:
STVS    0.119 0.052 0.070 | 0.096 0.049 0.063 | 0.063 0.038 0.047
AMVS    0.098 0.057 0.072 | 0.073 0.045 0.053 | 0.074 0.054 0.063
DSVS    0.056 0.051 0.049 | 0.063 0.040 0.046 | 0.061 0.039 0.046
HIPVS   0.064 0.071 0.067 | 0.049 0.053 0.052 | 0.040 0.043 0.041
IRDVS   0.045 0.053 0.044 | 0.057 0.058 0.054 | 0.044 0.047 0.041
The precision, recall and F-measure of the IRDVS algorithm, computed over all video genres and user ground truths, are illustrated in Fig. 7 for five different settings of W. Additionally, the performances averaged over the three summary lengths are reported in Table 5, together with the metric ranks and overall ranks. Since the scheme with a window size of 3 obtained the best overall rank of 1.33, we set W to 3 in all experiments.
Fig. 6 The relative standard deviations (smaller is better) of the performances of the five algorithms (STVS, AMVS, DSVS, HIPVS, IRDVS) at the 20 %, 30 % and 40 % summary lengths. Each value is computed over the results against the ground-truth summaries from different users
Fig. 7 Performance comparison of the IRDVS algorithm when the window size W for SEDOG takes different values. The best result (the highest value) in each case is highlighted in boldface. The plotted values (P, R, F) for W = 1, 3, 5, 7 and 9 are:
20 %   P: 0.342 0.325 0.311 0.298 0.302   R: 0.536 0.561 0.561 0.561 0.563   F: 0.396 0.401 0.393 0.382 0.387
30 %   P: 0.398 0.399 0.393 0.387 0.392   R: 0.620 0.636 0.635 0.626 0.636   F: 0.479 0.483 0.478 0.470 0.476
40 %   P: 0.481 0.468 0.468 0.464 0.461   R: 0.674 0.683 0.683 0.682 0.678   F: 0.558 0.551 0.550 0.547 0.544
4.2 Role and contribution of the iteratively estimated linear model

4.2.1 Convergence property

In the iterative weight estimation process, since the feature sequences remain the same across iterations, the feature weighting parameters determine the segment saliency scores and thus the video summaries. Therefore, we examined the average change of the feature weights with the maximum number of iterations K_max set to 15. The curves for the documentary videos “CSTL” and “CSMH”, the movies “HP1-I” and “LGD-II” and the news videos “ABC-II” and “NBC-II” are illustrated in Fig. 8. It can be observed that the feature weights changed rapidly within the first few iterations and that the proposed algorithm gradually reaches a quite stable set of weights after about five iterations in most cases.

4.2.2 Final summary versus initial summary

To demonstrate the improvement of the final summary obtained by the iterative weight estimation process over the initial summary based on equal-weight fusion, Fig. 9 illustrates several example frames of video segments, together with their accompanying subtitles, from the 20 %-long summaries. All these examples are in the final summaries but failed to be picked up by the initial summaries. This shows that the iterative reweighting process effectively incorporates highly relevant and semantically essential content into the final summaries.

Table 5 Average performances of the IRDVS algorithm with different window size settings for SEDOG. The values in parentheses are metric ranks. The best results and ranks are highlighted in boldface

W               1           3           5           7           9
P               0.407 (1)   0.397 (2)   0.391 (3)   0.383 (5)   0.385 (4)
R               0.610 (5)   0.627 (1)   0.626 (2)   0.623 (4)   0.626 (3)
F               0.478 (2)   0.478 (1)   0.474 (3)   0.466 (5)   0.469 (4)
Overall rank    2.67        1.33        2.67        4.67        3.67
Fig. 8 Curves of the average change of the feature weighting parameters against the iteration index k, shown in panels (a)-(f) for six test videos. The summary lengths tested include 20 %, 30 % and 40 %
4.3 Computational cost

The proposed IRDVS algorithm provides an efficient framework for generating dynamic video summaries by integrating multimodal features. The iterative weight estimation process is easily implemented and runs at high speed: our current implementation in 64-bit MATLAB 7.11, tested on a workstation with an Intel Core 2 Duo 3.00 GHz CPU and 4 GB RAM, consumed only 5.2 seconds to complete the 15 iterations for producing the summaries of our 7.6-hour dataset. The most time-consuming part of the IRDVS algorithm is instead feature extraction, mainly because it involves motion estimation, face detection and obtaining term-pair similarities from WordNet::Similarity.
Fig. 9 Example frames that are in the final video summaries but missing from the summaries generated at the initialization stage by equal-weight fusion, together with the subtitles of the corresponding video segments. The target summary length is 20 %.
(a) ASTR: “Could life have existed even earlier on our planet? The oldest evidence of life is not evidence of the oldest life.” / “The reason I’m interested in Lassen Volcanic Park is because of these hydrothermal hot spot areas.”
(b) CSMH: “SETI uses radio technology to listen for radio leaks from alien civilizations.” / “In 1930, Einstein and his colleague, Nathan Rosen calculated the mathematics of one of these intergalactic pipelines.” / “Backward time travel has ignited a myriad of science fiction scenarios.” / “… The other is a supermassive black hole that is millions to billions of times the mass of our sun.”
(c) BM-I: “Now, who among you will be the next Morse? The next Einstein? Who among you will be the vanguard ...” / “I'll take another. Excuse me? A thousand pardons. I simply assumed you were the waiter.” / “And my compliments to you, sir. Thank you very much.”
(d) ABC-I: “They were quick to the scene, they were quick to take control, and from my understanding that they worked within the operational capacity and policy ...” / “I have learnt patience, which is so essential, tolerance. We live daily, just do our basic things day by day. A new report from Alzheimer’s Australia suggests many more families will face a similar dilemma in the future.” / “... it’s expected to employ more than 6,000 workers. The Government is appointing an employment coordinator to help with the recruitment ...”
Several strategies can be leveraged to reduce the overhead introduced by these steps. Firstly, the motion vectors can instead be parsed from the compressed videos, so that explicit motion estimation is avoided. Secondly, a medium-sized lookup table containing the frequently used term-pair similarities for the semantic concepts can be built, so that only the remaining “unknown” similarities have to be obtained from WordNet::Similarity. Thirdly, the multi-core capability of modern CPUs can be exploited to run the independent processes, e.g. the visual part and the textual part, in parallel.
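The second strategy amounts to memoizing the expensive term-pair queries. A minimal sketch is shown below; query_wordnet_similarity is a hypothetical callable that wraps the actual call into the external WordNet::Similarity tool.

```python
from functools import lru_cache


def make_cached_similarity(query_wordnet_similarity):
    """Wrap an expensive term-pair similarity query (e.g. a call into the external
    WordNet::Similarity package) with an in-memory lookup table, so each pair of
    terms is computed only once across all segments and concepts."""
    @lru_cache(maxsize=None)
    def term_similarity(term_a, term_b):
        a, b = sorted((term_a, term_b))   # canonical order: (a, b) and (b, a) share one entry
        return query_wordnet_similarity(a, b)
    return term_similarity
```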
5 Conclusions and outlook

As a major endeavor towards assisted consumption and manipulation of the fast-growing digital video archives, a variety of video summarization approaches have emerged in the multimedia community. Distinguished from most of the existing multimodal and semantic video summarization work, this paper proposes an iteratively reweighting dynamic video summarization (IRDVS) algorithm which adaptively integrates two high-level visual features with the proposed semantic indicator of video segment (SEDOG) derived from the visual and textual information. The summarization process rates each video segment according to its saliency score, calculated as a linear combination of the three types of high-level features, and the weighting parameter of each feature is efficiently updated via an iterative estimation process. On a test dataset consisting of documentaries, movies and TV news, the proposed IRDVS algorithm compared favorably with the STVS algorithm [59], AMVS algorithm [41], DSVS algorithm [12] and HIPVS algorithm [13] in terms of precision, recall and F-measure.

Our future work may include long-term video summarization and recommendation with high-level semantics. Incorporating more modalities, such as the audio track, would be another future endeavor. Furthermore, we may also consider expanding our current dataset and providing the ground truths for the evaluation and comparison of dynamic video summarization methods.

Acknowledgments This work was supported in part by the Australian Research Council grants, in part by the China Scholarship Council under Grant 2011623084, in part by the National Natural Science Foundation of China (No. 61372149, No. 61370189, No. 61100212), in part by the Program for New Century Excellent Talents in University (No. NCET-11-0892), in part by the Specialized Research Fund for the Doctoral Program of Higher Education (No. 20121103110017), in part by the Natural Science Foundation of Beijing (No. 4142009), in part by the Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions (No. CIT&TCD201304036, No. CIT&TCD201404043), and in part by the Science and Technology Development Program of Beijing Education Committee (No. KM201410005002). We appreciate the anonymous reviewers for their constructive comments. Copyrights of images, videos and subtitles used in this work are the property of their respective owners.
References 1. (2013) Here’s to eight great years. YouTube Blog. http://youtube-global.blogspot.com/2013/05/heres-toeight-great-years.html. 2. Ahmad S (1991) VISIT: A neural model of covert visual attention. In: Advances in Neural Information Processing Systems (NIPS), vol 4. pp 420–427.
Multimed Tools Appl 3. Alatan AA, Akansu A, Wolf W (2001) Multi-modal dialog scene detection using hidden Markov models for content-based multimedia indexing. Multimed Tools and Appl 14(2):137–151 4. Almeida J, Leite NJ, Torres RS (2012) VISON: video summarization for online applications. Pattern Recogn Lett 33(4):397–409 5. Almeida J, Leite NJ, Torres RS (2013) Online video summarization on compressed domain. J Vis Commun Image Represent 24(6):729–738 6. Bai L, Hu Y, Lao S, Smeaton AF, O’Connor NE (2010) Automatic summarization of rushes video using bipartite graphs. Multimed Tools and Appl 49(1):63–80 7. Borji A, Itti L (2013) State-of-the-art in visual attention modeling. IEEE Trans on Pattern Anal and Mach Intell 35(1):185–207 8. Chen BW, Bharanitharan K, Wang JC, Fu Z, Wang JF (2014) Novel mutual information analysis of attentive motion entropy algorithm for sports video summarization. In: Huang YM, Chao HC, Deng DJ, Park JJ (eds) Advanced Technologies, Embedded and Multimedia for Human-centric Computing, vol 260. Lecture Notes in Electrical Engineering. Springer, Netherlands, pp 1031–1042 9. Chen B-W, Wang J-C, Wang J-F (2009) A novel video summarization based on mining the story-structure and semantic relations among concept entities. IEEE Trans on Multimed 11(2):295–312 10. Chênes C, Chanel G, Soleymani M, Pun T (2013) Highlight detection in movie scenes through inter-users, physiological linkage. In: Ramzan N, Zwol R, Lee J-S, Clüver K, Hua X-S (eds) Social Media Retrieval. Computer Communications and Networks, Springer London, pp 217–237 11. Choudary C, Liu T (2007) Summarization of visual content in instructional videos. IEEE Trans on Multimed 9(7):1443–1455 12. Cong Y, Yuan J, Luo J (2012) Towards scalable summarization of consumer videos via sparse dictionary selection. IEEE Trans on Multimed 14(1):66–75 13. Dang CT, Radha H (2014) Heterogeneity image patch index and its application to consumer video summarization. IEEE Trans on Image Process 23(6):2704–2718 14. de Avila SEF, Lopes APB, da Luz JA, de Albuquerque AA (2011) VSUMM: a mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recogn Lett 32(1):56–68 15. Dong P, Wang Z, Zhuo L, Feng DD (2010) Video summarization with visual and semantic features. In: Qiu G, Lam K-M, Kiya H, Xue X, Kuo CCJ, Lew MS (eds) Advances in Multimedia Information Processing Pacific Rim Conference on Multimedia 2010, Part I. Lecture Notes in Computer Science, vol 6297. Springer, Berlin, pp 203–214 16. Ejaz N, Mehmood I, Wook Baik S (2013) Efficient visual attention based framework for extracting key frames from videos. Signal Process Image Commun 28(1):34–44 17. Ejaz N, Tariq TB, Baik SW (2012) Adaptive key frame extraction for video summarization using an aggregation mechanism. J Vis Commun Image Represent 23(7):1031–1040 18. Ekin A, Tekalp AM, Mehrotra R (2003) Automatic soccer video analysis and summarization. IEEE Trans Image Process 12(7):796–807 19. Evangelopoulos G, Rapantzikos K, Potamianos A, Maragos P, Zlatintsi A, Avrithis Y (2008) Movie summarization based on audiovisual saliency detection. In: Proceedings of the 15th IEEE International Conference on Image Processing (ICIP), 12–15 Oct. 2008. pp 2528–2531. 20. Evangelopoulos G, Zlatintsi A, Potamianos A, Maragos P, Rapantzikos K, Skoumas G, Avrithis Y (2013) Multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention. IEEE Trans on Multimed 15(7):1553–1568 21. 
Evangelopoulos G, Zlatintsi A, Skoumas G, Rapantzikos K, Potamianos A, Maragos P, Avrithis Y (2009) Video event detection and summarization using audio, visual and text saliency. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp 3553–3556. 22. Fersini E, Sartori F (2012) Semantic storyboard of judicial debates: a novel multimedia summarization environment. Program: Elec Libr Inf Syst 46(2):119–219 23. Garestier F, Le Toan T (2010) Estimation of the backscatter vertical profile of a pine forest using single baseline P-band (Pol-)InSAR data. IEEE Trans Geosci Remote Sens 48(9):3340–3348 24. Hauptmann A, Yan R, Lin W-H, Christel M, Wactlar H (2007) Can high-level concepts fill the semantic gap in video retrieval? A case study with broadcast news. IEEE Trans on Multimed 9(5): 958–966 25. Hauptmann A, Yan R, Lin W-H (2007) How many high-level concepts will fill the semantic gap in news video retrieval? In: Proceedings of the 6th ACM International Conference on Image and Video Retrieval (CIVR), Amsterdam, The Netherlands. ACM, pp 627–634. 26. Hung M-H, Hsieh C-H (2008) Event detection of broadcast baseball videos. IEEE Trans on Circ and Syst for Video Technol 18(12):1713–1726 27. Itti L, Koch C, Niebur E (1998) A model of saliency-based visual attention for rapid scene analysis. IEEE Trans on Pattern Anal and MachIntell 20(11):1254–1259
28. James W (1890) The Principles of Psychology. Harvard University Press
29. Jiang Y-G, Bhattacharya S, Chang S-F, Shah M (2013) High-level event recognition in unconstrained videos. Int J Multimed Inf Retrieval 2(2):73–101
30. Jiang Y-G, Ngo C-W, Yang J (2007) Towards optimal bag-of-features for object categorization and semantic video retrieval. In: Proceedings of the 6th ACM International Conference on Image and Video Retrieval (CIVR), Amsterdam, The Netherlands. ACM, pp 494–501
31. Jiang YG, Yang J, Ngo CW, Hauptmann AG (2010) Representations of keypoint-based semantic concept detection: a comprehensive study. IEEE Trans Multimed 12(1):42–53
32. Kennedy L, Hauptmann A (2006) LSCOM lexicon definitions and annotations (version 1.0). DTO Challenge Workshop on Large Scale Concept Ontology for Multimedia. Columbia University ADVENT technical report
33. Kim J-N, Choi T-S (2000) A fast full-search motion-estimation algorithm using representative pixels and adaptive matching scan. IEEE Trans Circuits Syst Video Technol 10(7):1040–1048
34. Kleban J, Sarkar A, Moxley E, Mangiat S, Joshi S, Kuo T, Manjunath BS (2007) Feature fusion and redundancy pruning for rush video summarization. In: Proceedings of the International Workshop on TRECVID Video Summarization (TVS), Augsburg, Bavaria, Germany. ACM, pp 84–88
35. Knudsen EI (2007) Fundamental components of attention. Annu Rev Neurosci 30:57–78
36. Koral KF, Yendiki A, Lin Q, Dewaraja YK, Fessler JA (2004) Determining total I-131 activity within a VoI using SPECT, a UHE collimator, OSEM, and a constant conversion factor. IEEE Trans Nucl Sci 51(3):611–618
37. Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp 2169–2178
38. Lin L, Chen C, Shyu M-L, Chen S-C (2011) Weighted subspace filtering and ranking algorithms for video concept retrieval. IEEE Multimed 18(3):32–43
39. Loui A, Luo J, Chang S-F, Ellis D, Jiang W, Kennedy L, Lee K, Yanagawa A (2007) Kodak’s consumer video benchmark data set: concept definition and annotation. In: Proceedings of the 9th ACM SIGMM International Workshop on Multimedia Information Retrieval (MIR), Augsburg, Bavaria, Germany. ACM, pp 245–254
40. Luo JB, Papin C, Costello K (2009) Towards extracting semantically meaningful key frames from personal video clips: from humans to computers. IEEE Trans Circuits Syst Video Technol 19(2):289–301
41. Ma Y-F, Hua X-S, Lu L, Zhang H-J (2005) A generic framework of user attention model and its application in video summarization. IEEE Trans Multimed 7(5):907–919
42. Ma Y-F, Lu L, Zhang H-J, Li M (2002) A user attention model for video summarization. In: Proceedings of the Tenth ACM International Conference on Multimedia, Juan-les-Pins, France. ACM, pp 533–542
43. Matos N, Pereira F (2008) Automatic creation and evaluation of MPEG-7 compliant summary descriptions for generic audiovisual content. Signal Process Image Commun 23(8):581–598
44. Money AG, Agius H (2008) Video summarisation: a conceptual framework and survey of the state of the art. J Vis Commun Image Represent 19(2):121–143
45. Money AG, Agius H (2010) ELVIS: entertainment-led video summaries. ACM Trans Multimed Comput Commun Appl 6(3):1–30
46. Mylonas P, Spyrou E, Avrithis Y, Kollias S (2009) Using visual context and region semantics for high-level concept detection. IEEE Trans Multimed 11(2):229–243
47. Ngo C-W, Ma Y-F, Zhang H-J (2005) Video summarization and scene detection by graph modeling. IEEE Trans Circuits Syst Video Technol 15(2):296–305
48. Over P, Smeaton AF, Awad G (2008) The TRECVID 2008 BBC rushes summarization evaluation. In: Proceedings of the 2nd ACM TRECVID Video Summarization Workshop, Vancouver, British Columbia, Canada. ACM, pp 1–20
49. Over P, Smeaton AF, Kelly P (2007) The TRECVID 2007 BBC rushes summarization evaluation pilot. In: Proceedings of the International Workshop on TRECVID Video Summarization, Augsburg, Bavaria, Germany. ACM, pp 1–15
50. Pal R, Ghosh A, Pal SK (2012) Video summarization and significance of content: a review. In: Handbook on Soft Computing for Video Surveillance. Chapman & Hall/CRC Cryptography and Network Security Series. Chapman and Hall/CRC, pp 79–102
51. Pedersen T, Patwardhan S, Michelizzi J (2004) WordNet::Similarity – measuring the relatedness of concepts. In: Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI), pp 1024–1025
52. Peng W-T, Chu W-T, Chang C-H, Chou C-N, Huang W-J, Chang W-Y, Hung Y-P (2011) Editing by viewing: automatic home video summarization by viewing behavior analysis. IEEE Trans Multimed 13(3):539–550
53. Posner MI, Petersen SE (1990) The attention system of the human brain. Annu Rev Neurosci 13:25–42
54. Pritch Y, Rav-Acha A, Peleg S (2008) Nonchronological video synopsis and indexing. IEEE Trans Pattern Anal Mach Intell 30(11):1971–1984
55. Rapantzikos K, Avrithis Y, Kollias S (2011) Spatiotemporal features for action recognition and salient event detection. Cogn Comput 3(1):167–184
56. Ren J, Jiang J (2009) Hierarchical modeling and adaptive clustering for real-time summarization of rush videos. IEEE Trans Multimed 11(5):906–917
57. Tamrakar A, Ali S, Yu Q, Liu J, Javed O, Divakaran A, Cheng H, Sawhney H (2012) Evaluation of low-level features and their combinations for complex event detection in open source videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 16–21 June 2012, pp 3681–3688
58. Tang S, Zheng Y-T, Wang Y, Chua TS (2012) Sparse ensemble learning for concept detection. IEEE Trans Multimed 14(1):43–54
59. Taskiran CM, Pizlo Z, Amir A, Ponceleon D, Delp EJ (2006) Automated video program summarization using speech transcripts. IEEE Trans Multimed 8(4):775–791
60. Tavassolipour M, Karimian M, Kasaei S (2014) Event detection and summarization in soccer videos using Bayesian network and copula. IEEE Trans Circuits Syst Video Technol 24(2):291–304
61. Truong BT, Venkatesh S (2007) Video abstraction: a systematic review and classification. ACM Trans Multimed Comput Commun Appl 3(1):1–37
62. Viola PA, Jones MJ (2004) Robust real-time face detection. Int J Comput Vis 57(2):137–154
63. Wang M, Hong R, Li G, Zha Z-J, Yan S, Chua T-S (2012) Event driven web video summarization by tag localization and key-shot identification. IEEE Trans Multimed 14(4):975–985
64. Wang F, Ngo C-W (2012) Summarizing rushes videos by motion, object, and event understanding. IEEE Trans Multimed 14(1):76–87
65. Wang S, Zhu Y, Wu G, Ji Q (2013) Hybrid video emotional tagging using users’ EEG and video content. Multimed Tools Appl. doi:10.1007/s11042-013-1450-8
66. Wei X-Y, Jiang Y-G, Ngo C-W (2011) Concept-driven multi-modality fusion for video search. IEEE Trans Circuits Syst Video Technol 21(1):62–73
67. Wu J, Rehg JM (2011) CENTRIST: a visual descriptor for scene categorization. IEEE Trans Pattern Anal Mach Intell 33(8):1489–1501
68. Xu G, Ma Y-F, Zhang H-J, Yang S-Q (2005) An HMM-based framework for video semantic analysis. IEEE Trans Circuits Syst Video Technol 15(11):1422–1433
69. Yuan Z, Lu T, Wu D, Huang Y, Yu H (2011) Video summarization with semantic concept preservation. In: Proceedings of the 10th International Conference on Mobile and Ubiquitous Multimedia (ACM MUM), Beijing, China. ACM, pp 109–112
70. Zhu S, Ngo C-W, Jiang Y-G (2012) Sampling and ontologically pooling web images for visual concept learning. IEEE Trans Multimed 14(4):1068–1078
Pei Dong received the bachelor’s degree in electronic information engineering and the master’s degree in signal and information processing from Beijing University of Technology, China, in 2005 and 2008, respectively. He is currently pursuing the Ph.D. degree in the School of Information Technologies, The University of Sydney, Australia. His current research interests include video and image processing, pattern recognition, machine learning, and computer vision.
Yong Xia received the B.E., M.E., and Ph.D. degrees in computer science and technology from Northwestern Polytechnical University, Xi’an, China, in 2001, 2004, and 2007, respectively. He was a Postdoctoral Research Fellow in the Biomedical and Multimedia Information Technology Research Group, School of Information Technologies, University of Sydney, Sydney, Australia. He is currently a full professor in the School of Computer Science, Northwestern Polytechnical University, and also an Associate Medical Physics Specialist in the Department of PET and Nuclear Medicine, Royal Prince Alfred Hospital, Sydney. His research interests include medical imaging, image processing, computer-aided diagnosis, pattern recognition, and machine learning.
Shanshan Wang received her bachelor’s degree in biomedical engineering from Central South University, China, in 2009. She is now pursuing a double Ph.D. degree as a cotutelle student, in biomedical engineering at Shanghai Jiao Tong University, China, and in computer science at The University of Sydney, Australia. Her research interests include inverse problems in medical imaging and image processing, such as MR/PET image reconstruction, image denoising and dictionary learning.
Li Zhuo graduated from the University of Electronic Science and Technology, Chengdu, in 1992, and received the master’s degree in signal and information processing from Southeast University in 1998 and the Ph.D. degree in pattern recognition and intelligent systems from Beijing University of Technology in 2004. She has been a full professor since 2007 and a supervisor of Ph.D. students since 2009. She has published over 100 research papers and authored three books. Her research interests include image/video coding and transmission, multimedia content analysis, and wireless video sensor networks.
David Dagan Feng received the M.E. degree in electrical engineering & computer science from Shanghai Jiao Tong University, Shanghai, China, in 1982, and the M.Sc. degree in biocybernetics and the Ph.D. degree in computer science from the University of California, Los Angeles, CA, USA, in 1985 and 1988, respectively, where he received the Crump Prize for Excellence in Medical Engineering. He is currently the Head of the School of Information Technologies and the Director of the Institute of Biomedical Engineering and Technology, University of Sydney, Sydney, Australia, a Guest Professor of a number of universities, and a Chair Professor of Hong Kong Polytechnic University, Hong Kong. He is a fellow of IEEE, ACS, HKIE, IET, and the Australian Academy of Technological Sciences and Engineering.