Multimedia Event Recounting with Concept-based Representation

Qian Yu, Jingen Liu, Hui Cheng, Ajay Divakaran, Harpreet Sawhney
SRI International Sarnoff, 201 Washington Rd, Princeton, NJ, 08648, USA

[email protected], [email protected], [email protected]

ABSTRACT

Multimedia event detection has drawn a lot of attention in recent years. Given a recognized event, in this paper we conduct a pilot study of the multimedia event recounting problem, which answers the question of why a video is recognized as a given event, i.e., what evidence the decision is based on. In order to provide a semantic recounting of the multimedia event, we adopt a concept-based event representation for learning a discriminative event model. We then present a recounting approach that exactly recovers the contribution of semantic evidence to the event classification decision. The approach can be applied to any additive discriminative classifier. Promising results are shown on the MED11 dataset [1], which contains 15 events in thousands of YouTube-like videos.

Figure 1: One example recounting result. Given a video clip that has been retrieved by our discriminative classifier, we first list the concepts in order of their contribution to the classification decision. We then generate a short essay describing the sorted evidence in a dialog manner. The underlined text marks detected concepts, with links to where the concepts were detected. Color indicates the category of each concept, i.e., action, scene, object, audio, etc.

Categories and Subject Descriptors

I.2.10 [Artificial Intelligence]: Vision and Scene Understanding - Video analysis

General Terms Algorithms

Keywords Multimedia Event Representation, Multimedia Event Recounting, Textual descriptions of video content

1. INTRODUCTION

Multimedia event detection (MED) has been receiving increasing attention in recent years [7, 11]. Unlike atomic action or concept recognition [2, 8], which focuses on retrieving simple primitives such as "walking", "placing an object", and "face or person", MED aims to recognize more complex events such as "wedding ceremony", "woodworking" or "birthday party", which are better suited to video retrieval. Users usually look for semantically meaningful events and rarely search the web for videos of a simple action such as "person walking" or "person bending".

Beyond recognizing atomic or semantic events, in many situations just recognizing an event is not enough. The video events that we address are complex activities occurring at specific times, and such videos may also contain a lot of irrelevant information. We want to answer the question of why a video is recognized as a particular event. In other words, for each retrieved video clip, we aim to provide the observed cues that lead to our recognition decision, ideally in textual form and in an ordered manner. This process is called event recounting. If event recognition answers the question "is this the desired event?", event recounting answers the next question, "why does this video contain the desired event?". Figure 1 shows one example recounting output, where a video is recognized as a Parkour event. Using our concept-based event representation and recounting approach, we are able to enumerate the important semantic evidence behind this classification decision. Moreover, we are able to generate a short essay-like summary for the user that describes the semantic evidence recounted for this video. Although a textual summary is returned to the user, the setting of the recounting problem is fundamentally different from that of the "video to text" transcription problem. Recounting only cares about the important evidence that contributes to the retrieval decision instead of transcribing the entire video content. From this point of view, recounting transcribes only the part of a video relevant to a specific event.

The relationship between event classification and event recounting is shown in Figure 2. Event classification only retrieves a positively classified video with a single confidence score. Our event recounting approach provides a breakdown of the evidence explaining why the classification decision was made. On the other hand, recounting relies on event classification, as it is only meaningful for positively classified videos. Our recounting approach can be applied to any additive event classifier.

In order to recount a multimedia event in a semantic way, we characterize an event as a juxtaposition of various semantic concepts, such as actions, scenes and objects, which are more descriptive and meaningful. Although low-level-feature-based event representations, e.g., the bag-of-visual-words model, have been widely used in action recognition, numeric low-level features are not suitable for high-level event analysis and understanding. In our approach, we represent the complex multimedia event in the semantic concept space, while we represent the concepts in the low-level feature space. The concepts, including actions, scenes, objects and audio, e.g., the ones shown in Figure 1, are trained and detected with low-level features.

Although there exists a large body of video retrieval and action recognition work in the literature, work on semantic event recounting is quite limited. In [9], a rule-based recounting approach is used to collect the evidence. This approach works on a concept set tailored to each event and thus relies heavily on human knowledge of the global event. Moreover, it cannot reveal the actual contributions of evidence from a specific video, and thus cannot sort the evidence by importance. In [10], a generative And-Or graph model is used to represent the causal relationship between action concepts only. For multimedia events, e.g., a birthday party, such causal relationships are not always meaningful.

The rest of the paper is organized as follows. We present our multimedia event recounting approach in Section 2. We show recounting results and analysis on the NIST MED11 dataset [1] in Section 3, followed by Section 4, which concludes the paper and discusses future work.


Figure 2: The relationship between event classification and event recounting.

2. MULTIMEDIA EVENT RECOUNTING

2.1 Semantic Concept-based Event Representation and Classification

We define our concept collection C = {C1, C2, ..., CK}. In this paper we use 121 concepts, including 81 action concepts, 20 audio concepts, and 17 scene concepts such as "kitchen", "lake/pond", "wheel close-up", and so on; Figure 3 lists 30 of them. In addition, the collection contains three common object concepts: "face", "car" and "person". For each concept, we acquire training examples (video segments for actions and keyframes for scenes and objects) from our development dataset. We employ well-established techniques for action, scene and object detection to build our concept detectors. In particular, static features (i.e., SIFT [5]), dynamic features (i.e., STIP [8] and dense-trajectory-based features [6]), and bag-of-words representations [6, 4] defined over codebooks of these features are used to represent action, scene and object concepts. Binary SVM classifiers with the histogram intersection kernel are used for concept classification, while the detectors for "face", "car" and "person" are adopted from publicly available detectors such as [3].

We define a concept space CK as a K-dimensional semantic space in which each dimension encodes the value of a semantic property. To embed a video x into this space, we define a set of pooling functions Φ = {φ1, ..., φK}, where φi assigns a value ci ∈ [0, 1] to the video indicating the confidence that the ith concept is present in it. If the concept detector ϕi takes the whole video as a single input, then φi and ϕi are the same. However, if the detector is applied to the video in a sliding-window manner (i.e., the video is split into W input windows, producing W outputs) or on a set of keyframes, then φi (e.g., a max-pooling function in our experiments) converts the W outputs of ϕi into a single confidence value ci. As a result, the function set Φ(x) embeds a video x in the K-dimensional semantic space as a vector (c1, ..., cK). Semantically similar videos form clusters in this space, so we can perform event recognition by training a classifier in it. In short, event classification is decomposed into two phases: (1) embedding a given video in the concept space; and (2) classifying the event with features derived from the embedding.

There are many ways to encode the occurrence of concepts in a video event. We exploit two types of features derived from C to model and classify events. These features span the spectrum from counting concept occurrences to concept co-occurrences.

Bag of Concepts (BoC): Akin to the bag-of-words descriptors used for visual-word-like features, a bag-of-concepts feature measures the frequency of occurrence of each concept over the whole video clip. To compute this histogram feature, the SVM output of each concept detector is binarized to represent the presence or absence of each concept in each window.

Co-occurrence Matrix (CoMat): A histogram of pairwise co-occurrences is used to represent the pairwise presence of concepts independent of their temporal distance. The sign (before/after) and magnitude of the time span between two concepts are considered in different CoMat features.
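To make the embedding and feature construction concrete, the following NumPy sketch (not the authors' code; the `window_scores` input format, the 0.5 binarization threshold, and the simplified first-detection CoMat variant are our own assumptions) pools per-window concept-detector confidences into the vector (c1, ..., cK) and derives BoC and CoMat-style histograms from it:

```python
import numpy as np

def embed_video(window_scores: np.ndarray) -> np.ndarray:
    """Max-pool W per-window detector outputs into a single K-dim vector (c_1, ..., c_K).

    window_scores: (W, K) array of concept-detector confidences in [0, 1],
    one row per sliding window or keyframe (a hypothetical input format)."""
    return window_scores.max(axis=0)

def bag_of_concepts(window_scores: np.ndarray, thresh: float = 0.5) -> np.ndarray:
    """BoC: frequency of each concept over the clip, after binarizing each window."""
    present = (window_scores >= thresh).astype(float)   # (W, K) presence/absence
    return present.sum(axis=0) / len(window_scores)     # normalized histogram

def cooccurrence_matrix(window_scores: np.ndarray, thresh: float = 0.5) -> np.ndarray:
    """A simplified CoMat variant: counts ordered concept pairs (i detected no later
    than j), based on each concept's first confident window. The paper's full CoMat
    family also encodes the magnitude of the time span between detections."""
    present = window_scores >= thresh                   # (W, K) boolean
    W, K = present.shape
    first = np.full(K, -1)
    for k in range(K):
        hits = np.flatnonzero(present[:, k])
        if hits.size:
            first[k] = hits[0]
    comat = np.zeros((K, K))
    for i in range(K):
        for j in range(K):
            if i != j and first[i] >= 0 and first[j] >= 0 and first[i] <= first[j]:
                comat[i, j] += 1
    return comat.ravel()                                # flattened pairwise histogram

# Example: 10 windows over a 121-concept collection.
scores = np.random.rand(10, 121)
c = embed_video(scores)               # concept-space embedding, shape (121,)
boc = bag_of_concepts(scores)         # BoC feature, shape (121,)
comat = cooccurrence_matrix(scores)   # CoMat feature, shape (121*121,)
```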

Figure 3: Some action, scene, and object concepts sampled from our 121-concept collection.


Since all of our event-level features are histograms, we adopt SVMs with the intersection kernel as our event classifiers. For each feature, we train an individual SVM classifier and fuse all classifier confidences with an arithmetic mean to obtain the final event confidence.
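One possible realization of the per-feature event classifiers and the arithmetic-mean fusion is sketched below with scikit-learn's precomputed-kernel SVM; the paper does not specify an implementation, so the library choice and function names here are assumptions:

```python
import numpy as np
from sklearn.svm import SVC

def intersection_gram(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Histogram intersection kernel: K(x, z) = sum_i min(x_i, z_i)."""
    return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)

def train_event_classifiers(feature_sets, labels):
    """Train one intersection-kernel SVM per event-level feature (e.g., BoC, CoMat)."""
    models = []
    for X in feature_sets:                         # each X: (n_videos, feature_dim)
        clf = SVC(kernel="precomputed")
        clf.fit(intersection_gram(X, X), labels)
        models.append(clf)
    return models

def fused_confidence(models, train_feature_sets, test_feature_sets):
    """Fuse the per-feature SVM confidences by an arithmetic mean."""
    scores = [clf.decision_function(intersection_gram(Xte, Xtr))
              for clf, Xtr, Xte in zip(models, train_feature_sets, test_feature_sets)]
    return np.mean(scores, axis=0)                 # final event confidence per test video
```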

2.2 Recounting

As our event classification is based on Support Vector Machines (SVMs), we present the recounting approach in the context of SVMs. Given a feature vector x ∈ R^n, where n is the feature dimension, the SVM decision function h(x) can be written as

h(x) = \sum_{l=1}^{m} \alpha_l K(x, x_l) + b,    (1)

where x_l (l = 1, ..., m) are the m support vectors, K(x, x_l) is the kernel value between x and x_l, α_l is the signed weight of x_l, and b is the bias. Here we use a single classifier trained on one feature for simplicity; the fused classifier can also be written in the form of Eq. (1). If the kernel function has the form

K(x, z) = \sum_{i=1}^{n} f(x_i, z_i),    (2)

where f can be any function and x_i, z_i are the ith feature values of x and z, then the global classifier defined in Eq. (1) is additive with respect to the feature space. The intersection kernel satisfies this form, with f_INT(x_i, z_i) = min(x_i, z_i); the linear kernel follows it as well, with f_LIN(x_i, z_i) = x_i z_i. For an additive classifier h(x), the decision function can be rewritten as

h(x) = \sum_{i=1}^{n} \sum_{l=1}^{m} \alpha_l f(x_i, z_{il}) + b,    (3)

where z_{il} denotes the ith feature value of the support vector x_l. Defining

h_i(x) = \sum_{l=1}^{m} \alpha_l f(x_i, z_{il}),    (4)

we can decompose the decision value h(x) as

h(x) = \sum_{i=1}^{n} h_i(x) + b.    (5)

We can now decompose the global classifier of Eq. (1) in terms of the response of each feature: h_i(x) encodes how much the ith feature contributes to the final decision value. For event recounting, since each feature carries semantic information from the concept detection, we can recount how much each piece of semantic evidence has contributed to the final decision. Moreover, semantic relationships between concepts are carried over as well; for example, the co-occurrence feature captures the temporal relationship between concepts, e.g., one concept followed by another. By sorting h_i(x), we can also list the evidence according to its importance. We have presented our recounting approach in the context of SVMs, but it applies to any additive classifier of the form of Eq. (5), which covers a wide spectrum of classification approaches.

Given the recounted evidence, we use a set of natural-language templates that combine the objects, actions and scenes into complete sentences. This translation process also converts the importance of each concept into a phrase, e.g., likely, probably, or certainly. The occurrence with the maximum detection confidence for each important piece of evidence is also returned with its time stamp for viewing or verification, as shown in Figure 1.
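A minimal sketch of the recounting step, directly following Eqs. (4)-(5) for an intersection-kernel SVM, is given below. The support vectors, weights, concept names and the confidence-to-phrase thresholds are illustrative assumptions, and we assume the BoC feature so that dimension i corresponds to concept i:

```python
import numpy as np

def per_concept_contributions(x, sv, alpha, bias):
    """Compute h_i(x) = sum_l alpha_l * min(x_i, z_il) and h(x) = sum_i h_i(x) + b.

    x: (n,) event-level feature of the test video (a BoC histogram here),
    sv: (m, n) support vectors, alpha: (m,) signed weights, bias: scalar b.
    These quantities can be read off any trained intersection-kernel SVM."""
    contrib = (alpha[:, None] * np.minimum(x[None, :], sv)).sum(axis=0)  # h_i(x), shape (n,)
    return contrib, contrib.sum() + bias                                 # (h_i, h(x))

def recount(x, sv, alpha, bias, concept_names, top_k=6):
    """Sort concepts by their contribution and attach a hedging phrase for the summary."""
    contrib, decision = per_concept_contributions(x, sv, alpha, bias)
    order = np.argsort(-contrib)[:top_k]
    evidence = []
    for i in order:
        # Hypothetical thresholds mapping contribution magnitude to a phrase.
        phrase = "certainly" if contrib[i] > 0.5 else ("probably" if contrib[i] > 0.2 else "likely")
        evidence.append(f"{concept_names[i]}: {phrase} present (contribution {contrib[i]:+.3f})")
    return decision, evidence

# Toy usage with made-up values (purely illustrative):
names = ["jump", "climb", "urban_scene", "person", "crowd_cheering", "kitchen"]
rng = np.random.default_rng(0)
sv, alpha, bias = rng.random((4, 6)), np.array([1.2, -0.8, 0.5, -0.3]), -0.1
decision, evidence = recount(rng.random(6), sv, alpha, bias, names)
```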

3. EXPERIMENTS

3.1 Experiment Settings

We evaluate concept-based event recognition on the TRECVID MED11 open-source dataset [1], which includes over 45,000 YouTube-like videos with an average length of about 18 minutes per video, i.e., approximately 1,400 hours of video data. The dataset contains 15 named event categories, such as "making a sandwich", "parkour", and "changing a vehicle tire", as listed in Figure 4, plus unnamed negative events outside these 15 categories. All the videos are unconstrained. We select 3,500 videos as the development dataset (DEV), which includes about 2,062 videos from Events 01-15 plus 1,438 other videos; the number of videos per event ranges from around 110 to 170. The rest of the MED11 videos are used as our test data, including 1,751 videos from Events 01-15, with 80 to 170 videos per event. From the DEV data, we annotated about 4,000 short video segments to develop the 81 action concept detectors, and 5,000 keyframes to develop the scene and object detectors, as discussed in Section 2.1.

Figure 4: The fifteen events defined in the MED11 dataset.

Figure 5: The Average Precision of event recognition using various semantic features. The last column lists the mean Average Precision (mAP) over all events.

3.2 Results and Analysis

To show the overall event recognition accuracy, we use the mean average precision as the metric, computed on the top 2,000 returned videos since we only care about the retrieved results; the results are shown in Figure 5. Recounting is applied only to videos that have been classified positively. We show recounting examples for two events, i.e., Event 1 and Event 4, in Figure 6. The concept histogram is used as the feature for training the event classifier (an SVM with the intersection kernel). For a video clip that is correctly classified as an event, we show the top recounted concepts in Figure 6 (a)(b)(c). The concepts are shown together with the confidence they contribute to the final event decision. The center frame of the sliding window with the maximum detected action-concept confidence is shown as the exemplar.


Figure 6: Recounting examples. Given a video clip that is classified positively (Event 01, Event 03 and Event 04), the top six recounted concepts are shown in (a)(b)(c). The center frame (shown with a yellow bounding box) of the sliding window with the maximum action-concept confidence is used as the action-concept exemplar. The red bounding box shows the detected object concept. In (d)(e)(f), we show the most frequently recounted concepts among all positive videos of Event 01, Event 03 and Event 04; the x axis shows the concept indices and the y axis shows the number of positive videos in which each concept is recounted. The concepts are listed in the same order in which the arrow line intersects the histogram bars.

The red bounding box shows the detected object concept. It is worth noting that, although our concept detection contains a lot of noise in terms of both false alarms and missed detections, the top recounted concepts are all relevant to the event. To show that the results in Figure 6 (a)(b)(c) are not by chance, we examine the recounted concepts across these events: we collect the top recounted concepts from all positive video clips and create the histograms shown in Figure 6 (d)(e)(f). Note that the concepts with significant hits are all relevant to the event to some extent.

Figure 7: The most positively/negatively recounted concepts for positive and negative samples of Event 01.

Having observed that all the recounted concepts are relevant to the event, we then used only these "relevant" concepts for event detection. It turns out that performance becomes worse when we use only concepts that are relevant according to common knowledge. One explanation is that some concepts, although not positively related to the event, are negatively related to it. We exhibit this in Figure 7, which shows both positively and negatively recounted concepts for positive and negative samples. A concept's contribution to event recognition can be negative, and the more negatively contributing concepts there are, the further the final decision is pushed toward the non-event side.

4. SUMMARY AND FUTURE WORK

Under a concept-based event representation, we have shown a recounting approach that exactly recovers the contribution of each piece of semantic evidence to the event classification. The approach is general enough to apply to any additive classifier. Based on the recounted evidence, we also provide an essay-like summary for the user. In the future, we will investigate metrics for evaluating the event recounting process, to provide a better understanding of the multimedia event.

5. ACKNOWLEDGMENTS

This work has been supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20066. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/NBC, or the U.S. Government.

6. REFERENCES

[1] www-nlpir.nist.gov/projects/tv2011/tv2011.html.
[2] J. Liu, J. Luo, and M. Shah. Recognizing realistic actions from videos "in the wild". In CVPR, 2009.
[3] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In CVPR, 2008.
[4] C. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009.
[5] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91-110, 2004.
[6] H. Wang, A. Klaser, C. Schmid, and C. Liu. Action recognition by dense trajectories. In CVPR, 2011.
[7] L. Bao, J. Cao, Y. Zhang, J. Li, M. Chen, and A. Hauptmann. Explicit and implicit concept-based video retrieval with bipartite graph propagation model. In ACM Multimedia, 2010.
[8] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
[9] C. Tan, Y. Jiang, and C. Ngo. Towards textually describing complex video contents with audio-visual concept classifiers. In ACM Multimedia, 2011.
[10] P. Srinivasan, J. Shi, and L. Davis. Understanding videos, constructing plots: learning a visually grounded storyline model from annotated videos. In CVPR, 2009.
[11] Y. Jiang, X. Zeng, et al. Columbia-UCF TRECVID 2010 multimedia event detection: combining multiple modalities, contextual concepts, and temporal matching. In TRECVID, 2010.
