Recognizing Complex Behaviors in Aerial Video

Anthony Hoogs, Michael T. Chan, Rahul Bhotika, and John Schmiederer
GE Global Research, One Research Circle, Niskayuna, NY 12309
[email protected]
Keywords: Imagery and Video Analysis, Search and Retrieval, Temporal Analysis, IMINT, MASINT, Terrorism, Science and Technology
Abstract

Event or activity recognition is a critical element of reconnaissance video analysis for intelligence. While object tracks may be of interest on their own, in many cases it is the specific behavior of objects, under certain semantic and scene conditions, that distinguishes interesting video from the mundane. The automatic detection of salient activities in video, with suitably low false alarm rates, will enable dramatic improvements in the efficiency and feasibility of video filtering, indexing and exploitation. However, in busy scenes with many moving objects, detecting events involving multiple interacting objects presents a significant challenge. In this paper, we present a novel approach for recognizing such complex events in aerial video, using a unique combination of high-level semantic knowledge and probabilistic methods. Our algorithm requires little or no training data, and can handle very large differences in scene conditions and appearance for the same event because we directly model the underlying semantics of the event. Results are shown on aerial video of complex transshipment events across a wide range of scenes.
1. Introduction

Research on automating the analysis of aerial video has focused primarily on tracking moving objects and registration. While these are important challenges that must be solved, for many intelligence problems it is insufficient to simply detect and track moving vehicles. In busy areas there may be dozens, hundreds or even thousands of moving objects. Under such conditions it is inadequate to alert an analyst whenever an object moves, or to present an analyst with a display of all object tracks.

In this paper, we present a method for recognizing complex behaviors under difficult scene conditions. We define complex to mean that multiple objects are involved in an event of interest, and that their relative spatiotemporal dynamics and semantic context distinguish the event from other activity. There is a complementary class of simple events that can be recognized through a straightforward combination of object tracks, geospatial information, and temporal constraints. An example of a simple event might be "two vehicles passing through this checkpoint within any 5-minute interval," whereas a complex event would be "two vehicles meeting in a parking lot." The challenge in the latter case is to translate "meeting" into a set of scene-invariant observables that capture the essence of the intended behavior.

Figure 1. Different scenes of cargo transshipment. A good algorithm should be able to recognize all of them with the same model.

The automatic recognition of activities in video has been the focus of a number of research efforts in recent years [1][2][3][5][6][9][11]. The major shortcoming of most of these methods is that they do not capture the underlying semantics of the modeled events. Instead, they rely on data-driven learning using a corpus of training video for each event. As a result, an event model learned on one scene, from a fixed viewpoint, cannot be used to recognize the same event from a different viewpoint or in a different scene (see Figure 1). While this may be a moderate drawback for fixed-camera surveillance applications, it is detrimental for aerial video. Training data from a consistent viewpoint is simply not available, and the variance of scene conditions is huge. Furthermore, the most significant events happen infrequently, over a short period of time comprising a small fraction of the video collected. These rare events
suffer from limited training data, because by their nature there are few or sometimes no exemplars.

This paper is focused on detecting complex, rare events, using as few as one training example per event type, across a wide range of scene and imaging conditions. The rarity constraint excludes data-driven learning approaches, as we cannot assume that the training data represents the range of variability present in operational situations. Instead, events are modeled using semantic primitives that enable generalization well beyond the training data. The models are constructed manually using domain knowledge, from a pre-defined set of generic semantic spatial and temporal primitives. The primitives induce a set of thresholds on continuous-valued object track measurements, separating significant observations from incidental ones. The resulting binary feature vectors are sufficient to train a Hidden Markov Model (HMM) for each event with one exemplar. The models incorporate multiple objects and their spatiotemporal dynamics through multiple HMM states. Without requiring explicit models of normal activities, we distinguish an interesting event from incidental activities by comparing likelihood scores to a set threshold.
Figure 2. An example event in aerial video. The forklift is unloading the cargo container from the airplane (source) to the truck (destination).

An example of the elements of a transshipment event model is shown in Figure 2. The event involves two moving objects, the forklift and the cargo container. They are tracked independently. The beginning and end of the cargo track are labeled as the Source and Destination, respectively. The goal of event detection here is to find video subsequences in which cargo is moved from one location to another, while ignoring segments showing various other object movements. We use the semantic spatial primitives "close" and "adjacent" to capture the important information about the relative positions of objects. In the event type shown in Figure 2, one can threshold the distance between the two moving objects to capture adjacency. Various research efforts have attempted to formally and quantitatively define spatial relations
[13][14]; we have found that simple thresholds on distance are effective in our scenarios. The observation densities in the HMM are used to capture the uncertainty in the observables, introduced by thresholding for example.

Typical HMM formulations use continuous observables such as object positions, distances and velocities as features. This requires much training data to distinguish the invariant relationships of objects from incidental elements, such as the path of the objects between the source and destination. Reducing continuous distance measurements among objects to a binary measure enables one or very few training examples to capture the essential information; similar event instances with different low-level trajectories can still be recognized.

Our general approach contrasts with approaches that depend on a statistical characterization of normalcy [3], which can be very effective in problem domains such as fixed-camera surveillance, but not in domains such as aerial video where data is not readily available. Furthermore, by modeling the events of interest directly, we can detect and distinguish multiple types of rare events, which is difficult when they are detected only as statistical outliers. In closely related work, semantic primitives are defined as Bayesian nets learned from training data [1]; we instead attempt to define primitives universally, to avoid such a dependence on training data. The modeling of the dynamics of the primitives using an HMM-like formulation in that work is analogous to our approach. Other related works employ HMM variants such as the semi-HMM [6], time-variant HMM [10], stochastic grammars [5], or other temporal models [1][9]. Our approach uses discrete semantic primitives to model observables in a standard HMM framework.
2. Event representation

A user creates the model for an event of interest by specifying the objects involved in the event, their roles, and their semantic spatial and temporal relations. The spatial relations are encoded in a binarized feature vector representation, whereas the temporal constraints in events are expressed using the HMM framework.
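To make this concrete, an event model of this kind could be written down as a small declarative structure. The sketch below is purely illustrative; the paper does not prescribe a model format, and all field names are our own.

```python
# Hypothetical, illustrative event-model specification (field names are ours).
transshipment_model = {
    # Objects involved in the event and their roles.
    "objects": {
        "cargo":       "transported object",
        "forklift":    "instrument of conveyance",
        "source":      "start of the cargo track (e.g., airplane)",
        "destination": "end of the cargo track (e.g., truck)",
    },
    # Semantic spatial relations, binarized by thresholding distances.
    "spatial_relations": [
        ("close", "cargo", "source"),
        ("close", "cargo", "forklift"),
        ("close", "cargo", "destination"),
    ],
    # Temporal structure: left-right HMM states for a full unload/load cycle.
    "states": ["at_source", "unload", "transit", "load", "at_destination"],
}
```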
2.1. Hidden Markov models

A Hidden Markov Model is a doubly stochastic process consisting of a state transition model $\{a_{ij}\}$, $1 \leq i, j \leq N$, where $N$ is the number of states, and a set of observation density functions. In recognition problems, the objective is to recover the most likely sequence of (hidden) states, given a sequence of feature observations $\{o_t : 1 \leq t \leq T\}$. The observation densities $b_j(o)$, which depend on the state $j$ the process is in at time $t$, can be continuous or discrete. A major advantage of this representation is the decoupling of the underlying states of interest from the observation models, allowing uncertainty and variation to be incorporated. We use a left-right HMM to represent the temporal constraints inherent in time-series data such as video.

Figure 4. An HMM constructed from 5 states that model a complete unload and load cycle. The observation densities are discrete in our case. Other HMM state compositions are also created to model partial cycles, such as isolated loading or unloading events.

Typical applications of HMMs for recognition involve modeling the trajectories of some observables, often using Gaussian distributions or mixtures of Gaussians. A strength of the approach is that, given enough examples of each category to be recognized, the parameters of the HMM can be learned quite effectively; very detailed distributions of temporal trajectories can be learned [4]. However, this advantage is also a drawback: without adequate training data, it is difficult for the model to generalize to unseen data. Furthermore, the optimal number of states is typically determined experimentally, and it is not easy to attach semantic meanings to the states after learning.
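As a concrete illustration of the recognition step, the following is a minimal log-space Viterbi decoder for a discrete-observation HMM. This is a generic sketch of the standard algorithm, not the authors' implementation.

```python
import numpy as np

def viterbi(log_A, log_pi, log_B, obs):
    """Most likely state path for a discrete-observation HMM, in log space.

    log_A: (N, N) log transition matrix a_ij; log_pi: (N,) log initial probs;
    log_B: (N, M) log observation pmf b_j(o) per state; obs: symbol ids, length T.
    """
    N, T = log_A.shape[0], len(obs)
    delta = np.full((T, N), -np.inf)   # best log score ending in each state
    psi = np.zeros((T, N), dtype=int)  # backpointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A       # scores[i, j]: i -> j
        psi[t] = np.argmax(scores, axis=0)           # best predecessor per state
        delta[t] = scores[psi[t], np.arange(N)] + log_B[:, obs[t]]
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 2, -1, -1):                   # backtrack
        path[t] = psi[t + 1, path[t + 1]]
    return path, float(delta[-1].max())
```

In a left-right HMM, entries of log_A below the diagonal are simply set to negative infinity, which enforces the temporal ordering of the event stages.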
2.2. Semantic primitives

The objects and spatio-temporal dynamics of the event model are naturally dependent on the domain, to a certain degree. For the recognition of loading/unloading activities, relevant concepts include an object to be transported, an instrument of conveyance, a source location and a destination location. We obtain generality beyond the training data, and some degree of domain independence, by defining the spatial relations using semantic primitives.

A specific instantiation of the above is illustrated in Figure 3. The event of interest is the transfer of cargo; in this instance, from an airplane to a nearby truck, via a forklift. On the left is the observed, continuous feature vector, which captures the distances from the designated objects to the cargo container. On the right, the distances have been binarized by thresholding, to represent the semantic information that, beyond a certain distance, the cargo is no longer "close" to the other object. Once the distance threshold is exceeded, the exact value of the distance becomes irrelevant.

Figure 3. The mapping from continuous distances (distances from the cargo container) to discrete, semantic features. The $S_i$'s denote the primary vector primitives that represent the semantic spatial relations used in the paper.

The primary advantage is that very little training data is required, so rare events can be recognized despite significant variation in appearance and dynamics. A secondary advantage is that event models can be created by users, using intuitive, human-level semantic primitives. Figure 4 shows an HMM composed of 5 states that model a complete unload and load cycle, which is appropriate for the video example illustrated in Figure 2. Other state compositions can be created to model partial activity cycles as well. By using these primitives in an HMM, we can leverage the natural ability of left-right HMMs to enforce temporal constraints and the Viterbi algorithm [7] to perform recognition.
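A minimal sketch of this mapping is shown below, assuming one shared distance threshold and per-frame distances measured from the cargo container; packing the binary relations into a single HMM observation symbol is our own illustrative choice.

```python
import numpy as np

def binarize_distances(dists, threshold):
    """Convert per-frame distances (T, K) from the cargo to K reference objects
    into binary "close" indicators, then pack each frame's K bits into one
    discrete observation symbol in [0, 2**K) for the HMM."""
    close = (np.asarray(dists, dtype=float) < threshold).astype(int)  # (T, K)
    weights = 2 ** np.arange(close.shape[1])
    return close @ weights

# Example: distances from the cargo to (plane, forklift, truck), one row per frame.
dists = [[2.0, 1.5, 40.0],   # at the source: close to plane and forklift
         [15.0, 1.2, 30.0],  # in transit: close to the forklift only
         [35.0, 1.4, 2.1]]   # at the destination: close to forklift and truck
symbols = binarize_distances(dists, threshold=5.0)  # threshold in image units
```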
2.3. Model parameters

In the generation of binarized features, thresholds need to be set for the closeness relation. In principle, they can be set universally based on the physical sizes of objects and other calibration information. In practice, there is some uncertainty in the thresholds we use because of image noise. We overcome this by including, in the discrete observation pdf for each state, a finite probability of noisy observations at the threshold boundary. We can estimate these pdfs using multiple binarized observation series generated from the original training data at different thresholds, or using supplemental training data generated to simulate the effects of thresholding in the feature space. We use the Baum-Welch algorithm [7] to concurrently estimate the transition probabilities and observation pdfs of the HMMs.
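A sketch of this estimation procedure follows, assuming the third-party hmmlearn library (CategoricalHMM in recent releases) for Baum-Welch and reusing the binarize_distances helper from the earlier sketch; the perturbation parameters are our own assumptions.

```python
import numpy as np
from hmmlearn.hmm import CategoricalHMM  # assumption: hmmlearn >= 0.2.8

def perturbed_sequences(dists, nominal_threshold, spread=0.2, n=5):
    """Re-binarize the same continuous distance series at several thresholds
    around the nominal one, so the discrete observation pdfs acquire a finite
    probability of noisy symbols at the threshold boundary."""
    thresholds = np.linspace(nominal_threshold * (1 - spread),
                             nominal_threshold * (1 + spread), n)
    return [binarize_distances(dists, t) for t in thresholds]

# dists: the full per-frame distance series of a training clip.
seqs = perturbed_sequences(dists, nominal_threshold=5.0)
X = np.concatenate(seqs).reshape(-1, 1)      # stacked symbols, one column
lengths = [len(s) for s in seqs]             # sequence boundaries for hmmlearn
model = CategoricalHMM(n_components=5, n_iter=50, random_state=0)
model.fit(X, lengths)  # Baum-Welch estimates transitions and observation pdfs
# A left-right structure can be imposed by initializing model.transmat_ with
# an upper-triangular matrix and removing 't' from model.init_params.
```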
3. Experiments

We validated our approach on a challenging set of video clips, shown in Figure 5. Three clips consisted of unloading and loading of cargo containers by a forklift, and two consisted of only unloading events. Some include random vehicle traffic at those sites.
Figure 5. Examples of object tracks used in event recognition experiments. (a) to (d) show actual motion-compensated object tracks overlaid on the generated mosaics. In (a)-(c) the forklift track is colored red and the cargo container green. In (d) the crane is red and the cargo blue. Trained on the tracks in (a), our method was able to recognize the same activity in (b), despite a change in viewpoint of the same scene, and in (c) and (d) despite scene changes. (e) shows some of the simulated track combinations in our experiments; our model was effective in distinguishing them from the true events using a decision threshold on the likelihood scores.
3.1. Tracking and sensor motion compensation

We employed the mean-shift tracker [8], based on color histogram matching, to track the moving objects, after manual initialization using ellipses as the tracking templates. Figure 6 shows snapshots of the tracking results on one of the loading/unloading video sequences. To address the added complexity created by a moving visual sensor in our data, we employed an approach based on homography estimation [12] to compensate for the camera motion. Optionally, we improved visual feature extraction on the ground plane of a 3D scene by applying a color mask as a feature list filter. Figure 5(a)-(c) shows examples of tracking results overlaid on the constructed mosaics. The resulting mosaic images contain some blurred areas, indicating poor registration due to height deviation from the homography plane and illustrating the extra difficulty induced by sensor motion. In our experiments, we assumed that the beginning and end of each instantiated cargo track were the source and destination locations, respectively. In the more general case, the source and destination could be determined by some domain-specific process.
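A minimal OpenCV sketch of this compensation step is given below, under our own assumptions (KLT optical-flow matches feeding a RANSAC homography fit; the paper cites [12] for the underlying geometry, not this particular pipeline).

```python
import cv2
import numpy as np

def compensate_motion(prev_gray, curr_gray, points):
    """Estimate the inter-frame homography and map tracked points from the
    current frame into the previous (reference) frame's coordinates."""
    # Detect corners in the reference frame and follow them with LK optical flow.
    p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                 qualityLevel=0.01, minDistance=8)
    p1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, p0, None)
    good0 = p0[status.ravel() == 1].reshape(-1, 2)
    good1 = p1[status.ravel() == 1].reshape(-1, 2)
    # Robustly fit the plane-induced homography (current -> previous) with RANSAC.
    H, _ = cv2.findHomography(good1, good0, cv2.RANSAC, 3.0)
    pts = np.asarray(points, dtype=np.float32).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, H).reshape(-1, 2)
```

Chaining such inter-frame homographies warps every frame into a common mosaic coordinate system, so object tracks can be compared independently of the sensor's motion.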
3.2. Event recognition experiments

We estimated the parameters of the HMM models using one of the five video clips, which consisted of one complete unloading and loading cycle as described above. The model was then tested on other unloading cycles in the same scene, and on unloading events in completely different scenes with significantly different resolutions and object configurations, as shown in Figure 5.

Figure 6. Snapshots showing the forklift and cargo container being tracked at different stages of a complete loading/unloading operation cycle: (a) approach the source, (b) unload cargo at the source, (c) transport from source to destination, (d) load cargo at the destination, and (e) depart from the destination.

Figure 6 shows the different stages of the event as represented by our primitives. The measured object distances from the cargo container are shown in Figure 7. The indicated threshold value defines the exact meaning of "close," a semantic concept that indicates a binary spatial relationship between two objects.

Figure 7. The observed distance from each object (plane: red; forklift: green; truck: blue) to the cargo container for one of the video clips. The horizontal black line indicates the threshold we used to define the semantic concept of "close."

Because of the limited size of our data set, we realistically simulated "normal" tracks (representing incidental tracks in the scene) to test our system against potential false positive events. We generated a core set of 17 tracked objects: 10 were the actual forklifts and cargo containers of interest across the 5 different scenes, 2 were unrelated objects in the scenes such as cars and trucks driving around, and 5 were hand-generated tracks. To expand this set, we also transformed all of the generated tracks with a random noise function. This noise function subsampled the tracks every 90 frames (or 3 sec), displaced the tracked object's x-y location by a randomly generated number of pixels uniformly distributed between -20 and 20, and interpolated the resulting tracks using B-splines. Each pair of tracks was considered an incidental, non-modeled event; 327 pairs of object tracks were used as counterexamples. Figure 5(e) shows some representative simulated track pairs.

Figure 8 shows histograms of log-likelihood scores and ROC curves for the transshipment event, based on the 4 true unloading/loading events and the 327 simulated events. The scores were normalized by the number of video frames so that longer sequences would not be unfairly penalized. Figure 8(a) shows that our model provided good separability between the true and false events; only two false events had scores that fell within the range of scores for the true events. In other words, a simple likelihood threshold allowed our model to disregard most of the irrelevant events without explicitly modeling the distributions of uninteresting events.

For comparative purposes, the plot in Figure 8(b) shows the results for the corresponding HMMs built with Gaussian observation pdfs over continuous features. Poor separability between the event classes can be observed. This demonstrates that the HMM using semantic primitives was capable of generalizing to similar events observed from different viewpoints of the same scene or in different scenes, whereas the counterpart HMM using continuous observables was not, showing the advantage of our approach when training data is scarce.
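Returning to the simulated incidental tracks described above, the following is a sketch of that noise function, using SciPy B-splines and the parameters stated in the text; the exact spline degree and resampling scheme are our assumptions.

```python
import numpy as np
from scipy.interpolate import splev, splprep

def perturb_track(track, step=90, max_disp=20, rng=None):
    """Simulate an incidental track: subsample every `step` frames (3 sec at
    30 fps), jitter x-y by uniform +/- `max_disp` pixels, and re-interpolate
    to the original length with a B-spline. Assumes the track spans several
    subsampling steps."""
    rng = np.random.default_rng() if rng is None else rng
    track = np.asarray(track, dtype=float)        # (T, 2) x-y per frame
    knots = track[::step].copy()
    knots += rng.uniform(-max_disp, max_disp, knots.shape)
    # Fit a parametric B-spline through the jittered knots and resample.
    tck, _ = splprep([knots[:, 0], knots[:, 1]], s=0,
                     k=min(3, len(knots) - 1))
    u = np.linspace(0, 1, len(track))
    x, y = splev(u, tck)
    return np.stack([x, y], axis=1)
```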
[Figure 8(c): ROC curves, True Positive Rate (Sensitivity) vs. False Positive Rate (1-Specificity), comparing the 2D Binarized, 3D Binarized, 2D Continuous, and 3D Continuous models.]
Figure 8. Histograms of normalized log-likelihood scores for the true and simulated events, evaluated using two HMM models constructed from (a) binary semantic features and (b) continuous features. True events (marked by red arrows) are well separated from the false events using semantic features, but not using continuous features modeled by Gaussians. (c) shows the ROC curves for the 2D and 3D models.

Finally, we also projected the 2D object coordinates onto the ground plane and transformed the model into 3D. This improves performance further, as shown in Figure 8(c).
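To make the decision rule explicit, here is a sketch of the frame-normalized likelihood test, assuming an hmmlearn model as in the earlier training sketch; the threshold itself is chosen empirically from the score histograms.

```python
import numpy as np

def normalized_score(model, symbols):
    """Log-likelihood of an observation sequence under the event HMM,
    divided by its length so long and short sequences are comparable."""
    X = np.asarray(symbols).reshape(-1, 1)
    return model.score(X) / len(symbols)  # hmmlearn: score() = log P(O | model)

def is_transshipment(model, symbols, threshold):
    # Flag the candidate track pair as an event when the normalized
    # log-likelihood exceeds a threshold set from the histograms in Figure 8.
    return normalized_score(model, symbols) > threshold
```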
4. Conclusion

We have proposed an approach that uses HMMs to model the spatiotemporal relations of objects participating in a complex event of interest, where the observables are a sequence of semantic primitives derived from binarized distance relations. We showed that semantic observables outperform direct continuous observables in generalizing to unseen data with little training. This enables the detection of rare events, for which little or no training data exists. Plans are underway to evaluate the effectiveness of our approach on a larger data set. We also plan to include other semantic primitives as features (e.g., motion-based features) to further improve the discriminative power of the model.
5. References

[1] S. Hongeng, F. Brémond, and R. Nevatia, "Representation and Optimal Recognition of Human Activities," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Vol. I, pp. 818-825, 2000.
[2] S. Intille and A. Bobick, "A Framework for Recognizing Multi-Agent Action from Visual Evidence," in Proc. National Conference on Artificial Intelligence, pp. 518-525, April 1999.
[3] C. Stauffer and E. Grimson, "Learning Patterns of Activity Using Real-Time Tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):747-757, 2000.
[4] N. Johnson and D. Hogg, "Learning the Distribution of Object Trajectories for Event Recognition," Image and Vision Computing, 14:609-615, 1996.
[5] Y. A. Ivanov and A. F. Bobick, "Recognition of Visual Activities and Interactions by Stochastic Parsing," IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):852-872, 2000.
[6] S. Hongeng and R. Nevatia, "Large-Scale Event Detection Using Semi-Hidden Markov Models," in Proc. International Conference on Computer Vision, pp. 1455-1462, 2003.
[7] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, New Jersey, 1993.
[8] D. Comaniciu, V. Ramesh, and P. Meer, "Real-Time Tracking of Non-Rigid Objects Using Mean Shift," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Vol. 2, pp. 142-149, 2000.
[9] S. Gong and T. Xiang, "Recognition of Group Activities Using Dynamic Probabilistic Networks," in Proc. International Conference on Computer Vision, pp. 742-749, 2003.
[10] V. Kettnaker, "Time-Dependent HMMs for Visual Intrusion Detection," in Proc. IEEE Workshop on Event Mining: Detection and Recognition of Events in Video, June 2003.
[11] N. Vaswani, A. R. Chowdhury, and R. Chellappa, "Activity Recognition Using the Dynamics of the Configuration of Interacting Objects," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 633-640, 2003.
[12] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 1998.
[13] T. Dar, L. Joskowicz, and E. Rivlin, "Understanding Mechanical Motion: From Images to Behaviors," Artificial Intelligence, 112(1), 1999.
[14] A. Fern, R. Givan, and J. Siskind, "Specific-to-General Learning for Temporal Events with Application to Learning Event Definitions from Video," Journal of Artificial Intelligence Research, 17:379-449, 2002.
Much of this paper was previously published in the IAPR International Conference on Pattern Recognition, 2004.