VIDEO EVENT DETECTION USING TEMPORAL PYRAMIDS OF VISUAL SEMANTICS WITH KERNEL OPTIMIZATION AND MODEL SUBSPACE BOOSTING

Noel C. F. Codella*, Apostol Natsev*, Gang Hua†, Matthew Hill*, Liangliang Cao*, Leiguang Gong*, John R. Smith*

*Multimedia Research Group, IBM T. J. Watson Research Center, Hawthorne, NY 10532
†Department of Computer Science, Stevens Institute of Technology, Hoboken, NJ 07030

Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20070. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.
ABSTRACT
In this study, we present a system for video event classification that generates a temporal pyramid of static visual semantics using minimum-value, maximum-value, and average-value aggregation techniques. Kernel optimization and model subspace boosting are then applied to customize the pyramid for each event. SVM models are independently trained for each level in the pyramid, with kernels selected by 3-fold cross-validation. Kernels that enforce static temporal order and kernels that permit temporal alignment are both evaluated. Model subspace boosting is used to select the best combination of pyramid levels and aggregation techniques for each event. The NIST TRECVID Multimedia Event Detection (MED) 2011 dataset was used for evaluation. Results demonstrate that kernel optimization over both temporally static and temporally dynamic kernels achieves better performance than either kernel class alone. In addition, model subspace boosting reduces the size of the model by 80% while maintaining 96% of the performance gain.

Index Terms— Video, Event, Classification, Modeling, SVM, Semantic, Visual, Pyramid, Bipartite, Model, Selection, Kernel, Optimization, Temporal, TRECVID, MED

1. INTRODUCTION

Over 30 hours of video are uploaded to YouTube every minute, and this rate is increasing. A growing portion of the population carries mobile devices that can record and upload video content instantly. Most uploaded videos contain manually generated tags or textual descriptions of the video content, but such descriptors are subject to human error and may not constitute a thorough and complete description of the content of the video. For example, each video is typically intended for a particular purpose and audience, and consequently so are its textual descriptions. However, content within the video may be of interest to other audiences that the individual who generated the content did not anticipate. Textual descriptions may also be purposely misleading, written to avoid recall in search results. There are also situations in which a video-similarity query is more appropriate than a text query. For example, a user might see a street performer engage in a dance that they cannot identify but would like to learn more about. In this circumstance, it is clearly desirable to be able to record a short video clip of the dance in order to identify it.

In an effort to achieve these capabilities, a number of studies have assessed a variety of techniques. Merler et al. [1] developed an event modeling system that utilizes a range of features, including high-level semantic, low-level static visual, and low-level dynamic motion features. Evaluation was performed on the TRECVID MED 2010 dataset. Good results were observed, and the best performance was obtained when all features were fused together using a sum of scores. However, this work did not study the effect of temporal order: features were aggregated over entire videos using either maximum-value or averaging methods. There is therefore potential for further performance gains from methods that quantify temporal information.

Other studies have attempted to model more complex temporal dynamics. Ballan et al. [2] describe a system in which videos are represented as strings of SIFT [3] features quantized into a bag-of-words (BoW) vocabulary generated by k-means clustering with Euclidean distance. A new SVM kernel is defined that uses the Needleman-Wunsch edit distance [4] as a similarity measure between two video feature strings. The technique was applied to the TRECVID 2005 data and compared to a traditional BoW approach [5]. While mean average precision (MAP) increased from 0.32 to 0.35 using the edit distance kernel, there were
a number of events for which the traditional BoW approach still outperformed the edit distance technique. Bailer [6] presents a framework similar to that of Ballan. A new SVM kernel based on the "all subsequences kernel" for strings is implemented, using MPEG-7 features as string elements and MPEG-7 kernels between elements to determine matches and non-matches. Evaluation was performed on TRECVID 2007 data against static MPEG-7 kernels and RBF kernels. The MPEG-7 sequence kernel showed significant improvements on many event categories; however, there were also categories in which the static MPEG-7 and RBF kernels clearly outperformed the MPEG-7 sequence kernel.

Xu and Chang [7] designed a system for event detection that clusters video frame features (both high-level semantic and low-level visual) into a 3-level temporal pyramid. The earth mover's distance (EMD) is used to align and compare individual frames between subclips in the temporal pyramid, and the Simplex method is applied to explicitly align subclips. The authors refer to the technique as Temporally Aligned Pyramid Matching (TAPM). Performance was evaluated on the TRECVID 2005 corpus. Fusion of all 3 temporal pyramid levels with temporal alignment yielded the best overall performance; however, fusion of only the first two levels yielded better performance on 2 events than fusion of all 3 levels.

In this work, we present an event modeling system that uses kernel and model selection to optimize the use of temporal pyramids. Static visual semantic features are extracted from video frames and organized into a linear temporal pyramid, where the number of segments within each level increases incrementally from 1 to 10. Maximum-value, minimum-value, and average aggregation methods are used within each pyramid segment, for a total of 30 unique feature vectors. SVM models are independently trained for each pyramid level and aggregation method, with kernels and parameters optimized by grid search with 3-fold cross-validation. Bipartite matching for temporal alignment is included in the set of possible kernels, alongside kernels that enforce a fixed temporal sequence. Lastly, a non-shared subspace boosting (NSBoost) technique, as described by Yan et al. [8], is used to select the best performing combination of pyramid levels and aggregation types for each event. The system is implemented in the MapReduce architecture for scalability, and evaluated on the NIST TRECVID Multimedia Event Detection (MED) 2011 video dataset.
Fig. 1. Visual description of a Linear Temporal Pyramid. Videos are temporally normalized to n temporal segments, where n increases incrementally from 1 to 10. Within each temporal segment, the frame-level features contained therein are aggregated using the maximum-value, minimum-value, or average-value operation. Video-level feature vectors are then the concatenation of the aggregated feature vectors of each temporal segment.

Table 1. MED 2011 Video Events

ID     Event
E006   Birthday Party
E007   Changing a vehicle tire
E008   Flash mob gathering
E009   Getting a vehicle unstuck
E010   Grooming an animal
E011   Making a sandwich
E012   Parade
E013   Parkour
E014   Repairing an appliance
E015   Working on a sewing project
2. EVENT MODELING DATA

The TRECVID Multimedia Event Detection (MED) 2011 datasets were used for event modeling and performance evaluation [9]. MED'11 data comes prepackaged into Development and Test sets, for model building and evaluation, respectively. The Test set covers 10 events, shown in Table 1. The MED Development dataset has been further split into two partitions: Learning and Validation. The Learning partition is used to directly build SVM event models, whereas the Validation partition is used to evaluate the performance of each model generated. The Validation partition was constructed by sampling 40 random positive examples for each event, together with half of the background negative examples in the Development dataset. The Learning partition contains all remaining data.

2.1. Visual Semantic Features

High-level static visual semantics were extracted from frames sampled at a rate of 0.5 frames per second (1 frame every 2 seconds) and normalized using a sigmoid method. We utilized a taxonomy of 780 visual concepts/categories based on the IBM Multimedia Analysis and Retrieval System (IMARS) taxonomy [10]. For each of these categories, static visual semantic models were trained from 20 low-level features across various image granularities and partitions, using the robust subspace bagging approach [11, 8]. In total, generation of the visual semantic classifiers involved training 1,326,000 SVMs.
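The sigmoid normalization is not specified beyond its name; the following is a minimal sketch of one plausible reading, squashing raw per-frame concept detector outputs (e.g., SVM margins) into [0, 1] with a logistic function. The function name and the parameter-free logistic are our assumptions.

```python
import numpy as np

def sigmoid_normalize(raw_scores: np.ndarray) -> np.ndarray:
    """Map raw per-frame concept detector outputs into [0, 1] with a
    logistic function. A sketch: the paper does not specify the exact
    parameters of its sigmoid normalization."""
    return 1.0 / (1.0 + np.exp(-raw_scores))

# Example: one frame's scores over a hypothetical 780-concept vocabulary.
frame_scores = np.random.randn(780)
semantic_vector = sigmoid_normalize(frame_scores)
```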
2.2. Temporal Aggregations

Temporal dynamics are modeled as fixed-dimensional feature vectors that represent t temporally normalized segments of video features, where t is varied from 1 to 10 (Fig. 1). We refer to this structure as a "linear pyramid." For each value of t, we employed the minimum-value, maximum-value, and average aggregation techniques to condense the features of the individual frames within each normalized temporal segment down to a single feature vector representing that segment. As an example, if a video is composed of 100 frames and we choose t = 5, then features are aggregated in groups of 20 frames, taking the minimum, maximum, or average value, leaving a single vector to represent each temporal segment. A sketch of this construction is given below.
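As a concrete illustration of the construction described above and in Fig. 1, the following sketch builds all 30 video-level vectors from a frame-level feature matrix. Function names are ours, and np.array_split is used to approximate the temporal normalization when the frame count is not divisible by t.

```python
import numpy as np

AGGREGATORS = {"min": np.min, "max": np.max, "avg": np.mean}

def aggregate_segments(frames: np.ndarray, t: int, how: str) -> np.ndarray:
    """Split the (num_frames x dim) frame-feature matrix into t contiguous,
    temporally normalized segments and reduce each to a single vector,
    concatenating the results into one video-level vector."""
    segments = np.array_split(frames, t, axis=0)
    return np.concatenate([AGGREGATORS[how](seg, axis=0) for seg in segments])

def linear_temporal_pyramid(frames: np.ndarray) -> dict:
    """All 30 video-level vectors: levels t = 1..10 x {min, max, avg}."""
    return {(t, how): aggregate_segments(frames, t, how)
            for t in range(1, 11) for how in AGGREGATORS}

# Example: a 100-frame video with 780-dim semantic vectors; level t = 5
# yields five 20-frame segments and a 5 * 780 = 3900-dim vector.
video = np.random.rand(100, 780)
vec = aggregate_segments(video, t=5, how="avg")
assert vec.shape == (5 * 780,)
```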
3. MODELING ARCHITECTURE

The IBM Multimedia Analysis and Retrieval System (IMARS) was used for model training and event classification of videos [11, 12, 13]. IMARS is configured on a dual-rack Apache Hadoop 0.20.2 system with 224 CPUs, 448 GB of total RAM, and 14 TB of HDFS storage. This translates to 28 blades, where each blade contains 8 cores, 16 GB of RAM, and 500 GB of hard disk space. The non-shared subspace boosting (NSBoost) event modeling MapReduce implementation [11, 8] is applied, separated into 2 phases corresponding to a Map step and a Reduce step, respectively (Fig. 2).
3.1. Map Step: Unit Model Training with Kernel Optimization

In the first phase (also known as the "Map" step), an SVM model is trained for each event, pyramid level, and aggregation type (450 in total for 10 events). These models are referred to as "Unit Models." During each Unit Model training, up to 47 SVM model configurations are evaluated, which translates to 211,500 SVMs generated in total for event detection. The 47 configurations span RBF, RBF with bipartite matching via the cycle-canceling algorithm, intersection, and Chi2 kernel types. The best performing kernels and parameters, as determined by grid search with 3-fold cross-validation, are selected. The resultant model is then applied to the Validation data partition. The scores and performance metrics on the Validation partition for that model are passed to the next phase, or "Reduce" step, as a key-value pair, with the key being the event label concatenated with the feature type. A sketch of the kernel optimization step follows.
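The exact 47 kernel/parameter configurations are not enumerated in the paper; the sketch below illustrates the general recipe for one family, an RBF kernel with bipartite temporal matching, selected by 3-fold cross-validated grid search. We substitute the Hungarian algorithm (scipy's linear_sum_assignment) for the cycle-canceling matcher named above; both return an optimal one-to-one segment assignment. All function names and parameter grids are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def rbf_bipartite_kernel(x, y, t, gamma=1.0):
    """RBF kernel after optimally re-aligning the t temporal segments of
    two pyramid-level vectors (dynamic temporal matching)."""
    xs, ys = x.reshape(t, -1), y.reshape(t, -1)
    # Pairwise squared distances between segments of the two videos.
    cost = ((xs[:, None, :] - ys[None, :, :]) ** 2).sum(axis=2)
    rows, cols = linear_sum_assignment(cost)   # optimal segment alignment
    return np.exp(-gamma * cost[rows, cols].sum())

def train_unit_model(X, y, t, gammas=(0.01, 0.1, 1.0)):
    """Grid search with 3-fold CV over one illustrative kernel family.
    gamma lives inside the precomputed kernel, so it is swept outside
    GridSearchCV; C is swept inside."""
    best, best_score = None, -np.inf
    for gamma in gammas:
        gram = np.array([[rbf_bipartite_kernel(a, b, t, gamma) for b in X]
                         for a in X])
        clf = GridSearchCV(SVC(kernel="precomputed"),
                           {"C": [0.1, 1, 10]}, cv=3)
        clf.fit(gram, y)
        if clf.best_score_ > best_score:
            best, best_score = clf, clf.best_score_
    return best, best_score
```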
Fig. 2. Event modeling MapReduce flow diagram. Mappers (blue) train SVM models for each event, pyramid level, and aggregation type on the Learning data partition. At this stage, the kernel choice is optimized for each event, pyramid level, and aggregation type, from a selection of temporally fixed and temporally dynamic methods, according to 3-fold cross-validation. The resultant models are then scored against the Validation dataset partition. Reducers (purple) fuse together all models trained for each event using a subspace boosting technique in order to determine the optimal combination of pyramid levels and aggregation models according to their performance on the Validation data partition.
3.2. Reduce Step: Unit Model Fusion with Boosting

In the second phase, or "Reduce" step, the Unit Models for each event are fused according to their performance on the Validation partition into what are referred to as "Fusion Models." Each Reducer is tasked with finding an optimal combination of Unit Models to classify each event. A previously described method called non-shared subspace boosting (NSBoost) [8] is used to perform forward model selection, a process that determines which combination of Unit Models generates the best performance on the internal Validation partition. A simplified sketch of forward model selection is given below.
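NSBoost itself is specified in [8]; as a simplified stand-in, the following sketch performs greedy forward selection of unit models by the average precision of a plain mean-score fusion on the Validation partition. The equal-weight fusion and the stopping rule are our simplifications of the boosted weighting used by NSBoost.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def forward_select(unit_scores: dict, y_val: np.ndarray, max_models: int = 9):
    """Greedily add the unit model whose inclusion most improves the
    average precision of the fused (mean) score on the Validation
    partition. unit_scores maps a (level, aggregation) key to that unit
    model's score vector on Validation; y_val holds binary labels.
    (Table 2 reports 90 models over 10 events after boosting, i.e.,
    roughly 9 per event, hence the illustrative default.)"""
    selected, best_ap = [], -np.inf
    remaining = dict(unit_scores)
    while remaining and len(selected) < max_models:
        gains = {}
        for name, scores in remaining.items():
            fused = np.mean([unit_scores[m] for m in selected] + [scores],
                            axis=0)
            gains[name] = average_precision_score(y_val, fused)
        name, ap = max(gains.items(), key=lambda kv: kv[1])
        if ap <= best_ap:
            break                      # no remaining model improves the fusion
        selected.append(name)
        best_ap = ap
        del remaining[name]
    return selected, best_ap
```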
4. EVENT MODELING EXPERIMENTS

The combination of all implemented pyramid levels, aggregations, kernels, and model subspace boosting, referred to as Linear Temporal Pyramid Alignment (LTPA), was compared to several reference modeling systems, each removing certain components of LTPA in order to assess each component's relative contribution to overall system performance:
• Baseline Single-Level Features: A reference modeling system was constructed using only the first temporal pyramid level, which contains a single segment. Temporal information is thus ignored by this modeling scheme, so that the effect of the Linear Temporal Pyramid can be measured.
• LTPA Simple Fusion: To assess the influence of NSBoost on the performance of the event classification system, we examined a simple late-fusion sum of all pyramid levels and aggregation types. Kernel selection across all kernel types is still enabled in this experiment, in order to isolate the effect of the NSBoost technique.

• Traditional Pyramid Alignment (TPA): The next experiment compared the linear temporal pyramid to a traditional pyramid structure in which the number of temporal segments increases exponentially at each level; 4 levels of lengths 1, 2, 4, and 8 were used. Kernel selection across all kernel types and NSBoost are both enabled in this experiment, in order to isolate the effect of the additional pyramid levels.
• TPA Bipartite Matching Only: Using the Traditional Temporal Pyramid, we examined the effect of limiting kernel optimization to bipartite matching kernels only; all kernels that enforce temporal order were omitted. In this manner we can better understand how the combination of temporally aligning and fixed-order kernels can be complementary in the overall event detection system. NSBoost is still enabled in this experiment, to isolate the effect of removing the static kernels.

• TPA Sans Bipartite Matching: Using the Traditional Temporal Pyramid, we built event models that excluded the bipartite kernels from the kernel parameter optimization step. In this manner, no temporal alignment can be performed, and the effect of such alignment can be quantitatively measured. NSBoost is still included in this experiment, to isolate the effect of removing the bipartite kernels.
5. EVENT MODELING RESULTS

Table 2. MED 11 Test Mean Average Precision (MAP)

System               MAP      % Boost   # Models
Single-Level         0.1214   0%        33
LTPA Simple          0.1521   25.4%     450
LTPA NSBoost         0.1506   24.1%     90
TPA                  0.1444   19.0%     68
TPA Bipartite Only   0.1417   16.7%     58
TPA Sans Bipartite   0.1373   13.2%     35

Table 2 shows the mean average precision (MAP) for the 5 systems evaluated, together with the relative change in performance compared to the single-level feature system and the number of unit models used. The NSBoost technique for selecting the most discriminative pyramid levels and aggregation types decreased performance by only 1% relative to a simple late-fusion sum of all pyramid levels and aggregation methods (Fig. 3). However, NSBoost selected the most relevant models and trimmed the total number of unit models by 80%, a significant reduction in model size and subsequent classification runtime. A sketch of the MAP computation is given below.
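For reference, the MAP figures in Table 2 are means of per-event average precision over the ten events; a minimal sketch, assuming scikit-learn's AP definition (which may differ slightly from the official NIST scoring):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(event_scores: dict, event_labels: dict) -> float:
    """MAP over events: the mean of per-event average precision, where
    event_scores[e] holds fused detector scores and event_labels[e]
    holds binary ground-truth labels for event e on the Test set."""
    aps = [average_precision_score(event_labels[e], event_scores[e])
           for e in event_scores]
    return float(np.mean(aps))
```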
Fig. 3. Relative performance change, in terms of average precision, from comparing single aggregations to LTPA using both simple unit model fusion and model subspace boosting.
Each component of the presented modeling system provided an incremental performance boost when layered with the other components. Linear Temporal Pyramid Alignment (LTPA) with NSBoost demonstrated a performance improvement of an additional 5% over the traditional pyramid alignment structure, in which each level creates an exponentially increasing number of temporal segments. Static temporal order kernels, such as the traditional RBF, Chi2, and histogram intersection kernels, boosted performance by an additional 2.3% compared to the system that used bipartite kernels only. Adding bipartite matching kernels to the grid search optimization boosted performance by an additional 5.8% over the system that excluded them. This demonstrates that the two kernel classes are complementary.

Histograms of kernel type and temporal pyramid level usage across event models, before and after model subspace boosting, are shown in Fig. 4. Kernel selection results before model subspace boosting are shown in Fig. 4a. Bipartite kernels represented the majority of selected kernels across most events, except "Birthday Party" (E006), "Flash mob gathering" (E008), "Repairing an appliance" (E014), and "Working on a sewing project" (E015), where static-order kernels were favored. This may suggest that these events exhibit a more consistent temporal order of global semantics, whereas for other events, temporal order is not as vital for discrimination. Kernel usage after model subspace boosting is shown in Fig. 4b. Bipartite kernels still dominate for most events; however, relative contributions have changed significantly for some. For example, "Making a sandwich" (E011) favored bipartite kernels before model selection, but most of its bipartite models were eliminated after selection while maintaining 80% of the performance.
Fig. 4. (a) Kernel usage histogram per event model for LTPA, across all pyramid levels and aggregation types, before model subspace boosting. Key: Histogram = Histogram Intersection, RBF = Radial Basis Function, BPM = Radial Basis Function with Bipartite Matching. (b) Kernel usage histogram after model subspace boosting. (c) Pyramid level usage per event, across all aggregation types, before model subspace boosting. (d) Pyramid level usage per event, across all aggregation types, after model subspace boosting.
Pyramid level usage before model subspace selection is shown in Fig. 4c. This graph simply shows that all pyramid levels are used when no model selection is employed. The effects of model subspace boosting are shown in Fig. 4d. These results demonstrate that some subcomponents are more discriminative than others for certain events. While the performance of most event models decreased or remained steady, two events (E006 and E010) saw a performance improvement after removing components with lower discriminative power (see Fig. 3). By comparing Fig. 4 to the per-event performance gains of LTPA in Fig. 3, it becomes clear that although LTPA improves performance over single temporal aggregations alone, it does so in different ways across events: kernel types are mixed across the various levels of the temporal pyramid, and not all temporal pyramid levels are selected for inclusion in every model. The presented system thus allows the temporal pyramid to be dynamic, efficient, and tailored to best discriminate each event individually.
6. DISCUSSION & CONCLUSION

This work presents an event modeling architecture that customizes a linear temporal pyramid for each individual event using a combination of kernel optimization and model subspace boosting. Our experiments on the NIST TRECVID 2011 dataset have demonstrated significant performance gains compared to a method that aggregates features across the entire video. Surprisingly, one event in our dataset saw its AP score almost double. The optimized combination of both static temporal order and temporal alignment kernels presented here outperformed either method alone. In addition, the model subspace boosting method maintains system performance while significantly reducing the number of unit models, in comparison to simpler fusion methods.

We have introduced the idea of a linear temporal pyramid, whose only difference from a traditional pyramid is that the number of segments in each level increases incrementally. Our data shows an improvement in performance using this model as opposed to a traditional pyramid structure, whose number of segments per level increases exponentially. In a traditional pyramid structure, boundaries at level n-1 remain in level n: each segment is simply subdivided again. In a linear pyramid structure, however, boundaries at level n-1 are generally no longer present in level n, leading to overlap between levels. We conjecture that this overlap is primarily responsible for the increase in performance, as the divisions of the object being studied are no longer fixed, allowing more meaningful boundaries to be learned from the data. This is supported by Fig. 4d, which shows that all levels of the temporal pyramid appear to contribute discriminative information, as they have been selected by NSBoost. The sketch below makes this boundary-overlap property concrete.
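The boundary-overlap argument can be checked mechanically. In the following sketch (names ours), traditional-pyramid boundaries nest across levels, while consecutive linear-pyramid levels share almost no boundaries:

```python
from fractions import Fraction

def boundaries(num_segments: int) -> set:
    """Interior segment boundaries of a video normalized to [0, 1]."""
    return {Fraction(k, num_segments) for k in range(1, num_segments)}

linear = {n: boundaries(n) for n in range(1, 11)}        # LTPA levels
traditional = {n: boundaries(n) for n in (1, 2, 4, 8)}   # TPA levels

# Traditional pyramid: every boundary at one level survives at the next,
# since each segment is simply subdivided, e.g. {1/2} is a subset of
# {1/4, 1/2, 3/4}.
assert traditional[2] <= traditional[4] <= traditional[8]

# Linear pyramid: consecutive levels share almost no boundaries, so the
# segments of different levels overlap rather than nest.
print(linear[4] & linear[5])   # set(): no common boundary between t=4, t=5
```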
In this study we examined two general classes of kernels: one that maintains a fixed temporal order, and another that allows dynamic matching of temporal segments. Our experiments show that while each class on its own performs well, an additional boost in performance is seen when both are used together. This makes intuitive sense, as both the unordered content of a video and its sequence may yield discriminative information that is captured by the fusion of these techniques. In addition, Figs. 4a and 4c show that some events are better modeled by different distributions of techniques. This is evident when one compares event E008, which tends to prefer static order kernels, to others, such as E010, which prefer dynamic matching kernels. Since the applied system has been built in the MapReduce framework, it permits arbitrary scaling to handle any number of new kernels, features, or other event modeling approaches. Future work should study the incorporation of other feature types, such as multi-modal semantics and low-level features, as well as more complex temporal modeling structures, such as the Needleman-Wunsch edit distance [14] or temporal co-occurrence methods [15]. Other techniques that apply well-known spatial features to the temporal domain may also have the potential to boost performance, such as temporal edge histograms, temporal Fourier or Gabor response functions, or local binary patterns over temporal windows.

7. REFERENCES

[1] Michele Merler, Bert Huang, Lexing Xie, Gang Hua, and Apostol Natsev, "Semantic model vectors for complex video event recognition," IEEE Transactions on Multimedia, 2012, Special issue on Object and Event Classification in Large-Scale Video Collections.

[2] Lamberto Ballan, Marco Bertini, Alberto Del Bimbo, and Giuseppe Serra, "Video event classification using string kernels," Multimedia Tools and Applications, vol. 48, pp. 69-87, 2010.

[3] Krystian Mikolajczyk and Cordelia Schmid, "Scale and affine invariant interest point detectors," International Journal of Computer Vision, vol. 60, no. 1, pp. 63-86, 2004.

[4] Saul B. Needleman and Christian D. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins," Journal of Molecular Biology, vol. 48, no. 3, pp. 443-453, 1970.

[5] Feng Wang, Yu-Gang Jiang, and Chong-Wah Ngo, "Video event detection using motion relativity and visual relatedness," in Proceedings of the 16th ACM International Conference on Multimedia (MM '08), New York, NY, USA, 2008, pp. 239-248, ACM.

[6] Werner Bailer, "A feature sequence kernel for video concept classification," vol. 6523, pp. 359-369, 2011.

[7] Dong Xu and Shih-Fu Chang, "Video event recognition using kernel methods with multilevel temporal alignment," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, pp. 1985-1997, 2008.

[8] Rong Yan, Jelena Tesic, and John R. Smith, "Model-shared subspace boosting for multi-label classification," in Proc. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2007, pp. 834-843.

[9] Liangliang Cao, Shih-Fu Chang, Noel Codella, Courtenay Cotton, Dan Ellis, Leiguang Gong, Matthew Hill, Gang Hua, John Kender, Michele Merler, Yadong Mu, Apostol Natsev, and John R. Smith, "IBM Research and Columbia University TRECVID-2011 Multimedia Event Detection (MED) system," TRECVID Multimedia Event Detection Task (MED), 2011.

[10] A. Haubold and A. Natsev, "Web-based information content and its application to concept-based video retrieval," in ACM International Conference on Image and Video Retrieval (ACM CIVR), 2008.

[11] Rong Yan, Marc-Olivier Fleury, Michele Merler, Apostol Natsev, and John R. Smith, "Large-scale multimedia semantic concept modeling using robust subspace bagging and MapReduce," pp. 35-42, 2009.

[12] Murray Campbell, Alexander Haubold, Ming Liu, Apostol Natsev, John R. Smith, Jelena Tesic, Lexing Xie, Rong Yan, and Jun Yang, "IBM Research TRECVID-2007 video retrieval system," in Proc. NIST TRECVID Workshop, 2007.

[13] Apostol Natsev, Matthew Hill, John R. Smith, Lexing Xie, Rong Yan, Shenghua Bao, Michele Merler, and Yi Zhang, "IBM Research TRECVID-2009 video retrieval system," in Proc. NIST TRECVID Workshop, 2009.

[14] Saul B. Needleman and Christian D. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins," Journal of Molecular Biology, vol. 48, no. 3, pp. 443-453, 1970.

[15] K. Prabhakar, Sangmin Oh, Ping Wang, G. D. Abowd, and J. M. Rehg, "Temporal causality for the analysis of visual events," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2010, pp. 1967-1974.