Technion - Computer Science Department - Technical Report CIS-2008-08 - 2008
Video Event Modeling and Recognition in Generalized Stochastic Petri Nets

Gal Lavee, Michael Rudzsky, Ehud Rivlin and Artyom Borzin
Computer Science Department, Technion-Israel Institute of Technology, 32000, Haifa, Israel
(gallavee,rudzsky,ehudr)@cs.technion.ac.il, [email protected]

December 15, 2008
Abstract

In this paper we propose the Surveillance Event Recognition Framework using Petri Nets (SERF-PN) for recognition of event occurrences in video, based on the Petri Net (PN) formalism. This formalism gives us a robust way to express semantic knowledge about the event domain, as well as efficient algorithms for recognizing events as they occur in a particular video sequence. The major novelties of this paper are extensions to both the modeling and the recognition capacities of the Object PN paradigm. The first contribution of this work is the extension of the PN representational capacities by introducing stochastic timed transitions to allow modeling of events which have some variance in duration. These stochastic timed transitions sample the duration of the condition from a parametric distribution. The parameters of this distribution can be specified manually or learned from available video data. A second representational novelty of the paper is the use of a single PN to represent the entire event domain, as opposed to previous approaches which have utilized several networks, one for each event of interest. A third contribution of this work is the capacity to probabilistically predict future events by constructing a discrete time Markov chain model of transitions between states. The experiments section of the paper thoroughly evaluates the application of the SERF-PN framework in the event domains of surveillance and traffic monitoring and provides a comparison to other approaches using the CAVIAR dataset, a standard dataset for video analysis applications.
Contents

1 Introduction
2 Related Work
3 The Petri Net Formalism
  3.1 Graphical Representation
  3.2 Marking Analysis
  3.3 Relation Modeling with GSPN
4 Video Event Representation in SERF-PN
5 Learning the System Parameters
  5.1 Timing Parameters
  5.2 Marking Transition Probabilities
6 Surveillance using SERF-PN
  6.1 Generating Video Annotation
  6.2 Event Modeling
  6.3 Video Event Analysis
7 Constructing a PN model: Illustrative examples
  7.1 Security Check Scenario
  7.2 Traffic Intersection Scenario
8 Complexity Analysis
9 Experiments
  9.1 Security Check
  9.2 Traffic Intersection
  9.3 CAVIAR dataset
10 Discussion
  10.1 Comparison to Other Models for Video Event Understanding
  10.2 Temporal Segmentation
  10.3 Limitations of the PN formalism and future work
11 Conclusion
1 Introduction
The ability of humans to recognize events in video sequences greatly exceeds that of the most successful automatic video event recognition approaches. One reason for this may be that humans are equipped with a great deal of semantic knowledge about what constitutes a particular event within an event domain. While it is a difficult problem to encode all human knowledge, we contend that, with a powerful representational formalism, it is possible to distill the knowledge needed to correctly recognize and classify events in a particular event domain, particularly simple domains such as those found in surveillance applications. The Petri Net (PN) formalism is such a representational mechanism, which allows us to express knowledge about an event domain in terms of semantic object properties and relationships. We can use the PN description of the event domain to devise efficient event recognition algorithms with good accuracy that are not susceptible to human failings such as slowness, subjectivity, and fatigue (classifying hours of surveillance video can be tedious).

In this paper we propose the SERF-PN framework, Surveillance Event Recognition Framework using Petri Nets, for semantic description of the events of interest in a particular domain and for recognition of these events in unlabeled video sequences. Within SERF-PN we utilize a representation based on Petri Nets (PN) that captures the semantics of a given domain. Provided such a model of the events in a particular domain, SERF-PN is able to generate
an event description of an unlabeled video sequence. Additionally, using statistics gathered from observing a set of training data, SERF-PN is able to make non-deterministic assertions on the likelihood of future events.

To enable the process described in this paper, video input must be mapped to our representation using low-level techniques such as background segmentation, object detection, and tracking. Much work has been done in these areas of computer vision. However, this aspect of the system is beyond the scope of this paper. In our experiments we make use of existing solutions to these problems.

The major novelties of this paper are extensions to both the modeling and the recognition capacities of the Object PN paradigm (which models objects as PN tokens). We first extend the representational capacities by introducing stochastic timed transitions to allow modeling of events which have some variance in duration. These stochastic timed transitions sample the duration of the condition from a parametric distribution. The parameters of this distribution can be specified manually or learned from available video data. A second representational novelty of the paper is the use of a single PN to represent the entire event domain, as opposed to previous approaches which have utilized several networks, one for each event of interest. A third contribution of this work is the capacity to probabilistically predict future events by constructing a discrete time Markov chain model of transitions between states. This model is constructed using dynamic marking analysis from a set of training data.

Of course, it is important that any algorithmic approach to recognition of events in video be
computationally tractable; to this end we provide a formalization and complexity analysis of our proposed algorithm.

Previous Petri Net approaches to video analysis have required a separate PN model instance for each event of interest in the scene. Furthermore, they have lacked the representational capacity of stochastic timed transitions, the ability to learn the parameters of these transitions, and the predictive ability afforded by a marking analysis of training data. Our approach represents the entire event domain as a single PN, and keeps a single instance of this PN throughout the recognition process, augmenting only the number of tokens as objects enter the scene.

In summary, this paper proposes a framework for constructing event models that capture expert knowledge of the domain as well as allowing parameter tuning from training data. The constructed model can be viewed as a static entity whose parameter tuning captures the dynamic aspects of specific applications. This representation can then be used as input to an efficient recognition algorithm which allows recognition of events in an unlabeled video sequence.

This paper is organized as follows. Section 2 presents related work in video event representation and recognition. Section 3 introduces the Petri Net formalism and the concept of marking analysis. Section 4 discusses how the PN formalism is used in our SERF-PN model. Section 5 discusses the training of the SERF-PN system. Section 6 gives the components of the SERF-PN process. Section 7 provides detailed examples of PN event model construction. Section 8 formalizes our algorithm and provides a complexity analysis.
Experiments are discussed in Section 9. Finally, we provide some discussion in Section 10 and conclude in Section 11.
2 Related Work
The specification of a semantic description of events has been an active research topic. R. Nevatia et al. presented the Event Recognition Language (ERL) [1], which can describe hierarchical representations of complex spatiotemporal and logical events. The proposed event structure consists of such units as primitive, single-thread and multi-thread events. More recently, R. Nevatia et al. developed the Video Event Representation Language (VERL) for describing an ontology of events and the complementary Video Event Markup Language (VEML) to annotate instances of the events described in VERL [2]. Another event representation ontology, called CASEE, is based on natural language representation and was proposed by A. Hakeem et al. in [3] and then extended in [4].

Event recognition has also been a much studied research topic. Most current approaches involve defining models for specific event types that suit the goal of a particular domain and developing methods for recognition that correspond to these models. The complexity that exists in the domain of video sequences requires models that can take into account a number of factors including spatial, temporal and logical relations. This has led to the proposal of a number of model formalisms including: Nearest Neighbor [5, 6], Support Vector Machines [7], Neural Networks [8, 9], Bayesian Networks (BN) [10, 11, 12, 13], Hidden
Markov Models (HMMs) [14, 15, 16, 17, 18, 19, 20], Context-Free Grammars [21, 22, 23], Petri Nets [24, 25, 26] and Temporal Constraint Satisfaction Problem (TCSP) solvers [27]. These models have been successfully adopted to deal with many problems in the domain of event recognition.

Pattern recognition models attempt to define the event recognition problem as the problem of finding patterns in spatiotemporal data. These approaches usually offer a creative representation of the data along with a distance measure for comparing one representation with another. This allows classic pattern recognition techniques such as nearest neighbor (NN) [5], support vector machines (SVM) [7] and neural networks [8, 9] to be applied to recognizing an event from an unlabeled sequence. L. Zelnik-Manor et al. propose [6] an event representation based on histograms of gradients at different temporal scales. This representation may be used to compare two events to one another or to cluster a video sequence into event clusters in an unsupervised fashion.

Bayesian networks (BN) have been used in many works to recognize simple events [18, 10, 11, 12]. The structure of the BN model is usually tailored to the application domain. Bayesian networks are able to encode spatial and logical relations; however, the BN formalism does not inherently model time-dependent events.

Hidden Markov models (HMMs) are an extension of BNs over time, constrained structurally to allow computational tractability. This model type is effective for recognizing sequential events with different temporal durations. L. Davis et al. [14] proposed a method to recognize simple periodic events by constructing
dynamic HMM models of the periodic pattern of human movements. J. Yamato et al. [15] used discrete HMMs to model events corresponding to tennis strokes. A feature vector of a snapshot of a tennis stroke was defined directly from the pixel values of an image. A tennis stroke is recognized by computing the probability that the HMM model produces the sequence of feature vectors observed during the action. Parameterized HMMs extend HMMs with parameters that represent intrinsic variance in gesture events. Y. Ivanov and A. Bobick [17] proposed such a model for recognition of events corresponding to size and pointing gestures.

For complex event structures, such as those involving multiple actors, the state space of the HMM becomes very large, which makes this model infeasible. For this reason, extensions to the HMM have arisen in the domain of video event recognition. Coupled HMMs [19] were introduced to model events with inherent simultaneous structure, such as an interaction of two mobile objects. Hierarchical HMMs [28] are special structures in which every state of the top-level HMM is in itself an HMM, which may have further hierarchical structure. This model captures the compositional hierarchy inherent to most video events.

While most of the recent work utilizes a single modeling approach, S. Hongeng and R. Nevatia [18] used a combined approach. In this work, lower-level (termed "primitive") events were recognized using a Bayesian Network model. These "primitive" events, in turn, compose higher-level ("composite") events which have a temporal extent and are recognized using a semi-Markov finite state model. This system structure results in more reliable event recognition in noisy video sequences compared to standard HMMs.
Dynamic Bayesian Networks (DBN) are a generalization of the HMM with no structural constraints. These models better capture the event structure over time, but are often infeasible for event modeling and recognition purposes because of their general structure. N. Oliver et al. [29] provide a comparison of HMMs and more general DBNs for detection of events in an office setting. Propagation Networks [30] are a class of DBNs introduced to model parallel streams of action where each state has a duration associated with it. An approximate inference algorithm is used to determine the most likely state sequence (and corresponding durations) of an unknown video sequence (i.e. to recognize an event).

The Context Free Grammar (CFG) is another powerful event model explored in recent research. Grammar models naturally encode sequence and hierarchy, which are inherent aspects of many video events. M. Brand [23] adopted a deterministic grammar model to enforce temporal constraints in high-level activities. Y. Ivanov and A. Bobick [21] proposed a method for feeding the output of a low-level HMM component into a stochastic context free grammar model which outputs the most likely legal parse of the input. Stochastic parsing can be used to select the most likely parse in an ambiguous grammar. This parse corresponds to an event interpretation. The SCFG provides a convenient means for encoding the external (semantic) knowledge about the problem domain and expressing the expected structure of the event. More recently, D. Moore and I. Essa [22] proposed an extension to SCFG models for error detection and recovery, as well as strategies for
quantifying participant behavior over time. A limitation of the SCFG approach is that representing relations other than sequence or hierarchy (e.g. non-sequential temporal relations) between events is not straightforward.

Another method for modeling video events is based on the Temporal Constraint Satisfaction Problem (TCSP). C. Pinhanez and A. Bobick [27] addressed the problem of exponential run time in TCSP methods and proposed a simpler 3-valued (past, now, future) domain network (PNF). The PNF network transforms Allen's temporal relationships [31] into a three-valued network to express parallelism and mutual exclusion between different sub-events in a simpler way. Though the PNF algorithm is linear in the number of temporal constraints, its overall computation is still NP-hard. Vu et al. [32] offer a similar approach to recognizing video events called temporal scenarios. In this work the efficiency of the recognition algorithm is emphasized, and the temporal constraint network topology describing the event is reduced to optimize recognition performance.

Petri Nets (PN) are a well studied formalism widely used in many research areas. PNs have recently been suggested as a video event model which defines the semantics of state transitions within the domain. Two main classes of approaches using this formalism have emerged. The plan-based PN has been presented in work by C. Castel et al. [24]. A series of semantically meaningful states related by temporal and logical relations are represented by the PN model. A successful traversal of the conditions encoded by the model indicates an event has been recognized.
M. Albanese et al. [26] propose a plan-based PN approach for the recognition of human activities. In this work a probabilistic extension to the PN framework is used to allow the framework to cope with the uncertainty and errors inherent to the low-level observation layers in many video event recognition systems.

Object-based Petri Nets represent the evolution of object states within the video sequence. Events are defined as known transitions between these object states. N. Ghanem et al. [25] proposed using such PN models for mining of surveillance video. This formalism enables specification of the semantic structure of events using domain knowledge. Hierarchical structure as well as spatial, temporal and logical relations can be specified in a straightforward manner. G. Lavee et al. [33] discuss how PN event models can be constructed from ontology descriptions of the event domain.
3 The Petri Net Formalism

3.1 Graphical Representation
The Petri Net model is represented as a bipartite graph that is comprised of two node types (places and transitions) and two arc types (regular and inhibit arcs). In a graphical representation, the places are drawn as circles and the transitions are drawn as squares or rectangles. Arcs are drawn connecting place nodes to transition nodes (input arcs) or
transition nodes to place nodes (output arcs). Regular arcs are drawn with arrow heads. Inhibit arcs are drawn with dot heads. Arcs are associated with a weight, also called the arc's multiplicity. Arc multiplicity is taken to be one if not specified. Places connected to a transition by input arcs are called that transition's input places (or input set). Similarly, places connected to a transition by output arcs are called that transition's output places (or output set). A place node may contain a number of tokens (another graph component). Tokens are visualized as black dots within the place node which contains them. Figure 1 illustrates a simple Petri Net.

Figure 1: Petri net components

At any given time, the state of the PN model is defined by the number of tokens in each of the model's place nodes. This configuration is called a marking. The marking can easily be represented as a vector in which the ith entry corresponds to the number of tokens in the ith place. Transition from one marking to another occurs when one or more transition nodes "fire". Transition nodes can only fire once they become "enabled". A transition's "enabling rule" is considered satisfied when 1: all the input places connected
to the transition by regular arcs contain at least as many tokens as the multiplicity of the arc connecting them to the transition, and 2: all the input places connected to the transition by inhibit arcs contain a number of tokens strictly less than the multiplicity of the arc connecting them to the transition. When the enabling rule is satisfied a transition is considered "enabled". In extensions to the PN formalism, an enabling rule can be modified to contain conditions on the properties of tokens in the input places. Once a transition node becomes enabled it may fire. The modification to the state of the PN model resulting from the firing of a transition is defined by a "firing rule". A transition's firing rule is defined as 1: removing a number of tokens from each of the transition's input places equal to the multiplicity of the arc connecting that place to the transition, and 2: creating a number of new tokens in each of the output places equal to the multiplicity of the arc connecting that place to the transition.

Occasionally, more than one transition may become enabled at the same time. Depending on the PN structure, this situation can create a conflict, as firing of one transition may immediately disable another transition. Conflicts may be resolved in a controllable way by adding a priority parameter to each transition node. In case of conflict, the higher priority transition will always fire before the lower priority one. The dynamic behavior of the PN model is defined by the enabling and firing rules associated with each transition node in the model. For a formalized comprehensive treatment of the PN model interested readers are referred to [34] (Chapter 2).

Timed Petri Nets extend the PN model by introducing timed transitions which have an
associated duration parameter describing the interval of time needed to "carry out" the state change described by the transition. A timed transition may only fire if the period of time since it became enabled is greater than this duration parameter. The duration parameter is sometimes known as a "firing delay" [34] (Chapter 3). In some cases, a PN model with strictly defined timed transition delays does not accurately describe the behavior of the underlying system. Another extension to the PN model, which adopts timed transitions with stochastic distributions over the duration parameter, is known as a Stochastic Petri Net (SPN). Each timed transition duration is modeled using the negative exponential probability density function. A parameter indicating the rate of the distribution (the inverse of the mean) is associated with each individual transition. In modeling real dynamic systems it is desirable to use immediate transitions together with timed transitions. This formalism is called a Generalized Stochastic Petri Net (GSPN) [34] (Chapter 4).
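To make the distinction concrete, the following minimal Python sketch (our own illustration, not part of the original system) shows how immediate and stochastic timed transitions might sample their firing delays; all class and parameter names are hypothetical.

import random

class Transition:
    """An immediate transition: fires as soon as its enabling rule holds."""
    def __init__(self, name, priority=0):
        self.name = name
        self.priority = priority  # resolves conflicts among simultaneously enabled transitions

    def firing_delay(self):
        return 0.0  # immediate transitions fire with zero delay

class TimedTransition(Transition):
    """An SPN/GSPN timed transition with a negative exponential firing delay."""
    def __init__(self, name, mean_delay, priority=0):
        super().__init__(name, priority)
        self.rate = 1.0 / mean_delay  # the rate parameter is the inverse of the mean delay

    def firing_delay(self):
        # random.expovariate expects the rate (1/mean) and returns an exponential sample
        return random.expovariate(self.rate)

In a GSPN both kinds of transitions coexist in one net: an enabled immediate transition fires at once, while an enabled timed transition must remain enabled for its sampled delay before it may fire.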
3.2 Marking Analysis
Before invoking any PN model dynamics, it is possible to compute the set of all markings reachable from the initial marking and all the paths that the system may follow to move from state to state. The set of all reachable markings defines the reachability set of the PN graph. The reachability graph consists of nodes representing reachable markings and arcs representing possible transitions from one marking to another. An example of a reachability
graph is illustrated in Figure 2.

Figure 2: Reachability graph example.

The reachability set represents all possible model states and the reachability graph represents all possible state transitions. However, the reachability graph is only finite if the model and the number of involved tokens are finite. In practice, this means that it is possible to calculate the full reachability graph only if we know the maximum number of expected tokens in the system. This information is not always available, which limits the extent to which a reachability graph of this sort can be used. For this reason, in our approach we apply a dynamic marking analysis that only considers markings which appear in the training set (see Section 5).
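For a bounded net, the reachability set and graph can be enumerated by a simple breadth-first search. The sketch below is our own illustration, under the assumption that markings are tuples of per-place token counts and each transition is a pair of consume/produce vectors (inhibit arcs are omitted for brevity).

from collections import deque

def reachability(initial_marking, transitions):
    """Enumerate all markings reachable from initial_marking.
    transitions: list of (consume, produce) per-place token-count vectors.
    Returns the reachability set and the arcs of the reachability graph.
    Terminates only when the reachable set is finite (bounded net)."""
    seen = {initial_marking}
    frontier = deque([initial_marking])
    arcs = []  # (marking, transition index, successor marking)
    while frontier:
        m = frontier.popleft()
        for i, (consume, produce) in enumerate(transitions):
            # enabling rule: each input place holds at least the arc multiplicity
            if all(m[p] >= c for p, c in enumerate(consume)):
                # firing rule: remove input tokens, then add output tokens
                m2 = tuple(m[p] - consume[p] + produce[p] for p in range(len(m)))
                arcs.append((m, i, m2))
                if m2 not in seen:
                    seen.add(m2)
                    frontier.append(m2)
    return seen, arcs

If the token count is unbounded, the loop above never terminates, which is precisely the limitation that motivates the dynamic marking analysis used here.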
3.3 Relation Modeling with GSPN
A major strength of the GSPN formalism is its ability to model many of the semantic relations that are used by humans to define video events. These include logical, temporal and spatial relations. Due to space restrictions we will not discuss the details of representing
each of these relations in this paper. Section 7 offers an illustrative example of a GSPN model construction which uses many of these relations. For more information regarding modeling semantic relations in the GSPN framework, interested readers are referred to [34].
4 Video Event Representation in SERF-PN
Our representation model follows the Object-Based PN paradigm. Within this framework, PN components have the following roles:

• Tokens represent actors or static objects detected in the video sequence. Each token has a set of properties associated with it corresponding to object properties.

• Places represent the possible object states. Each place containing more than one token indicates a group of objects in the same state.

• Transitions represent video events that define the dynamics of the event model. Transition node firing can be equivalent to an object state change in the real scene or can be the result of a satisfied relation constraint.

In SERF-PN, tokens are associated with properties such as tracking and classification information provided by an intermediate video processing layer. Enabling rules may require specific values for token properties (e.g. movement, position, appearance, etc.) or enforce
relations between tokens participating in the transitions. More exactly, a transition is enabled if the following is satisfied:

• Each input place connected to the transition by a regular arc contains token(s) that satisfy the enabling rule conditions, and the number of such tokens is greater than or equal to the multiplicity of the arc connecting that place to the transition.

• Each input place connected to the transition by an inhibit arc contains a number of tokens that is strictly less than the multiplicity of the arc connecting that place to the transition.

SERF-PN does not modify the firing rule definition. That is, firing transitions delete from each input place as many tokens as the input arc multiplicity and then add to each output place as many tokens as the output arc multiplicity. State changes in SERF-PN are usually the result of the initiation or completion of some event or the verification of a condition. SERF-PN utilizes a GSPN model which includes immediate transitions as well as stochastic timed transitions whose firing delays are distributed according to the negative exponential distribution. Generally, time-consuming events corresponding to real world occurrences will be modeled as timed transitions, while condition or relation verifications will be modeled as immediate transitions. Examples of these modeling practices are given in Section 7, and a sketch of the enabling check appears below.
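The following Python sketch (our own, with hypothetical field names) illustrates this enabling rule for conditions of arity one; relational conditions, such as a distance constraint between two tokens, would test tuples of tokens instead.

def enabled(transition, marking):
    """SERF-PN style enabling check.
    marking: dict mapping a place to the list of tokens it contains;
    a token is a dict of object properties (loc, appearance, ...).
    transition.regular_inputs / inhibit_inputs: lists of (place, multiplicity)."""
    for place, weight in transition.regular_inputs:
        # count tokens in the input place that satisfy the transition's condition
        satisfying = [tok for tok in marking[place] if transition.condition(tok)]
        if len(satisfying) < weight:
            return False
    for place, weight in transition.inhibit_inputs:
        # inhibit arc: strictly fewer tokens than the arc multiplicity
        if len(marking[place]) >= weight:
            return False
    return True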
5 Learning the System Parameters
In SERF-PN, the structure of the model is static and is designed manually by a knowledge engineer using knowledge of the domain and of the PN formalism. Initial work towards automating this process, in order to remove subjective interpretation of the domain, has been done in [33]. Parameters of the system, however, are dynamic and may be learned to assist us in predicting events based on the current system state. One set of these parameters includes the duration parameters (i.e. exponential distribution rates) for each of the timed transitions in the PN model. Another set of parameters is the transition likelihoods between directly reachable markings. To learn these parameters we require a set of examples which cover the typical occurrences in the domain of interest, as well as a specification of our PN event model. In most surveillance tasks the existence of such a set of examples is a realistic assumption due to the repetitive nature of many events in the surveillance domain. Note also that the knowledge used by the engineer in constructing the static structure of the model is also based on experience with typical events.
5.1 Timing Parameters
The probability that a timed transition fires within its enabling period is given by the cumulative distribution function of the negative exponential distribution:

$$D_n = 1 - e^{-t_n/\mu_n} \qquad (1)$$

where $t_n$ is the enabling period of timed transition $n$ and $\mu_n$ is the average firing delay of timed transition $n$. During the training process we wish to estimate the parameter $\mu_n$ for each transition $n$. This is achieved by not allowing timed transitions to fire and calculating the average time each of these transitions is enabled over the course of the training sequence. Once these parameters are learned, the system can model the occurrence of a timed event relative to the training data.
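A minimal sketch of this estimation step (our own illustration; the data structure is an assumption, not taken from the paper):

def estimate_mean_delays(enabled_spans):
    """Estimate mu_n for each timed transition from training data.
    enabled_spans: dict mapping a transition name to the list of time spans
    (in frames or seconds) during which that transition was enabled; timed
    transitions are prevented from firing while these spans are recorded."""
    return {name: sum(spans) / len(spans)
            for name, spans in enabled_spans.items() if spans}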
5.2 Marking Transition Probabilities
The combination of the PN model and the initial marking allows the construction of a reachability graph for the domain. Each marking node in this reachability graph represents a legal state of the system. An outgoing link from a marking node $M_1$ to another marking node $M_2$ indicates that marking $M_2$ is directly reachable from marking $M_1$. The reachability graph defines the space of possible states within the model. The probability mass in this space, however, is not uniformly distributed. In fact, the majority of the probability mass is concentrated in just a few markings relative to the large number of possible markings. In other words, while the theoretical number of possible markings is large (infinite in the case of an open model), in practice only a small subset of those
markings are observed. More specifically, the number of possible markings for a PN model with $n$ place nodes and $t$ tokens is given by the recursion formula:

$$\Pi(t, n) = 1 + \sum_{i=1}^{t} \Pi(i, n-1) \qquad (2)$$

where the base case is:

$$\Pi(t, 1) = 1 \qquad (3)$$
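The recursion of Eqs. (2) and (3) is easy to evaluate directly; the following memoized Python sketch (our own) makes the growth of the marking space tangible.

from functools import lru_cache

@lru_cache(maxsize=None)
def num_markings(t, n):
    """Number of possible markings for a PN with n place nodes and t tokens,
    following the recursion of Eqs. (2) and (3)."""
    if n == 1:
        return 1  # base case: Pi(t, 1) = 1
    return 1 + sum(num_markings(i, n - 1) for i in range(1, t + 1))

For example, num_markings(7, 10) evaluates to 11440, so even a 10-place net with at most 7 tokens admits over ten thousand markings, which supports the observation above that only a small fraction of markings is ever visited.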
The probability distribution over the marking set can be estimated using observed data, which may be viewed as a sample from this distribution. This is contingent upon the sample being representative of the underlying distribution, which we have claimed is a reasonable assumption in surveillance applications. Another simplifying assumption made at this stage is the Markov assumption, which states that at each discrete time slice the current marking depends only on the previous marking. This allows us to construct a discrete time Markov chain (DTMC) representing the joint probability of a sequence of particular markings. This formulation may be used to predict future states or answer queries on the probabilistic distance from a particular marking. Our proposed approach is to construct this DTMC dynamically from the training data. Again, we assume that the training sequence is a representative sample of the underlying probability distribution over event patterns. Thus the training sequence covers the significant reachable markings in this domain. Each encountered marking in the training
sequence corresponds to a state that the Markov chain may take on. The probabilities for state transitions in the Markov chain are then estimated directly from the training data according to the formula:

$$\lambda_{n,k} = \frac{N_{n,k}}{N_n} \qquad (4)$$

where $\lambda_{n,k}$ is the probability of moving from marking $M_n$ to marking $M_k$, $N_{n,k}$ is the number of detected transitions from marking $M_n$ to marking $M_k$, and $N_n$ is the number of occurrences of marking $M_n$.
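A compact sketch of this estimator (our own illustration; markings are assumed hashable, e.g. tuples of per-place token counts):

from collections import Counter, defaultdict

def estimate_dtmc(marking_sequence):
    """Estimate lambda_{n,k} of Eq. (4) from an observed sequence of markings."""
    pair_counts = Counter(zip(marking_sequence, marking_sequence[1:]))
    occ_counts = Counter(marking_sequence[:-1])
    probs = defaultdict(dict)
    for (m_n, m_k), count in pair_counts.items():
        probs[m_n][m_k] = count / occ_counts[m_n]  # N_{n,k} / N_n
    return probs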
6 Surveillance using SERF-PN
Our system was designed to aid in the human task of monitoring surveillance video. As such, given a description of the events of a surveillance domain (in a PN format) and an unlabeled (annotated) video sequence, the system should be able to notify when these events have occurred in the video sequence, as well as predict the probable next event (or, alternately, provide the probability of an undesirable event). For this process to occur, our system must:

1. Accept video sequences annotated with mid-level semantic information by an intermediate video processing layer.

2. Allow the user to input an event description in a PN format.

3. Train the specific parameters of the representation to apply to a particular scenario from training data (Section 5).

4. Run on unlabeled (annotated) video data and give descriptions of events as they occur, as well as predict future events.

These processes are described below.
A schematic of the SERF-PN components is shown in Figure 3.

Figure 3: Surveillance system components
6.1 Generating Video Annotation
The event analysis of the SERF-PN system is predicated upon video which has been annotated with mid-level semantic concepts such as object locations, trajectories and classes. Many works on object detection, tracking, and classification exist that focus on achieving this annotation from raw video data. In this work we have put the focus on event analysis and representation and have made use of existing low-level methods to create this annotation from real video sequences. This is referred to as the intermediate video processing unit. To make our system modular with respect to the process that creates the video annotation, we have made the video event recognition component of SERF-PN fully compliant with the CAVIAR [35] standard for annotation of video. Thus, the CAVIAR dataset (which has been manually annotated) and any other dataset in this format can be easily evaluated by our system. The standardization of the video annotation format also allows the use of a synthetic video
tool called Video Scene Animator. This module takes object parameters as input and generates a CAVIAR compatible annotated video. This tool has been used to efficiently simulate real video sequences captured by a video camera, allowing us to extend the available dataset of real video clips. Furthermore, to train the parameters of the event model, a large volume of "normal" video sequences is needed. Video Scene Animator, running in scenario mode, allows efficient generation of a number of similar scenes whose variance in object location and other properties is defined by a set of parameters. This feature may be used to generate the data required to train the system parameters from a small set of representative sequences (along with real videos). A large body of data also allows us to better evaluate the system on combined sets of real and synthetic annotated video sequences. The synthetic videos are created by fully specifying all the properties of each object (including height, width, coordinates of the bounding box, appearance, movement, and the remaining categorical properties) at each frame in the animations. The Video Scene Animator tool provides an interface for making this process more efficient and for adding a degree of randomization to this process. For example, one may specify a random (uniformly distributed) value within a certain range for the duration of time that an object takes to travel from one point to another. This is somewhat different, however, from the approach to generating synthetic data taken in [19], which creates object agents and allows them to "choose" their own behavior by reacting to other objects in the scene. Despite the presented advantages, improper synthetic sequence usage can be misleading
and produce unreasonable conclusions. It is the user's responsibility to ensure that synthetic scenes are modeled after real scenes before proceeding with the model training and event analysis. The developed Video Scene Animator is a stand-alone application. A visualization of an animated scene created using Video Scene Animator is seen in Figures 7 and 10.
6.2 Event Modeling
The event modeling component allows the knowledge engineer to construct the PN event model using a graphical interface. All place and transition nodes are specified according to knowledge of the domain, along with enabling rules corresponding to each transition. Stochastic timed transition parameters may be specified or left to be learned from the data using the training mode of the video event recognition component. A visualization of a model created using the event modeling component of SERF-PN is seen in Figures 4 and 5. The place and transition nodes and their connections can be seen in the figures. Associated enabling rule conditions are listed in the corresponding Tables 1 and 2, respectively. These conditions specify the properties of the input tokens necessary for that transition to become enabled. This information, along with the model structure, defines the full event model. Detailed examples of constructing event models are offered in Section 7.
6.3 Video Event Analysis
The video event recognition component of the SERF-PN system is used to aid a human analyst in monitoring video sequences online, or to annotate video sequences offline for content based retrieval. The input to this system is the PN event model, along with the CAVIAR format annotated video sequence. All objects appearing in the video annotation are entered as tokens into the PN event model in a specially marked "root" place node. From there, transitions which become enabled according to their enabling rules are allowed to fire. As the video sequence processing progresses, the properties of tokens corresponding to objects are updated to reflect the properties of the objects. Transitions which represent events of interest (indicated as such in the event model) are output to the video analyst (or stored as annotation for the video sequence in an offline application). This annotation includes information on the event that occurred (transition), the objects involved (tokens) and the frame number in the sequence. If available, marking analysis data can be used to predict the next likely state of the system, or the likelihood of an abnormal or suspicious event given the current configuration. Currently the most probable next system state is output along with the event annotation. The results of the interpretation are presented in the log window of the graphical user interface and may then be stored in a text file.
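Given the DTMC estimated in Section 5.2 (see the sketch there), reporting the most probable next state reduces to an argmax over the learned transition probabilities; the helper below is our own hypothetical illustration.

def most_probable_next_marking(dtmc, current_marking):
    """Return the most likely successor of current_marking under the learned
    DTMC, or None if the marking was never observed during training."""
    successors = dtmc.get(current_marking)
    if not successors:
        return None
    return max(successors, key=successors.get)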
To learn the parameters of the models, the video event recognition module may be run in training mode. In training mode this module disables timed transitions and learns their parameters by observing their average enabling time. It also constructs the dynamic reachability graph and collects statistics on transitions between markings (see Section 5).
7 Constructing a PN model: Illustrative examples
In this section we provide two comprehensive examples of using the qualities of the GSPN formalism described in previous sections to construct a model describing the semantics of a particular event domain. Our construction remains firmly within the Object PN paradigm. That is, each token corresponds to a scene object, each place node corresponds to an object state, and each transition corresponds to a change in object state or a relation verification. It is important to note that a PN model structure, along with the conditions defined on each of the transitions within the model, is a reduction of a human decision process in a particular event domain. As such, it may be possible to define several different models that provide a reasonable semantic summary of an event domain. Determining whether a particular model is sufficiently accurate can be done empirically by comparing the interpretation the model produces for a particular video to the interpretation provided by a human observer.
Transition | Name | Condition
T1 | Person Appeared | Token.Appearance = 'appear'
T2 | Person Disappeared | Token.Appearance = 'disappear'
T3 | Guard Post Manned | Token.loc overlaps "Guard Post" Zone
T4 | Guard Post Unmanned | Token.loc not overlaps "Guard Post" Zone
T5 | Visitor Not Checked | Token.loc overlaps "Not Checked" Zone
T6 | Visitor Entered | Token.loc not overlaps "Guard Post" Zone
T7 | Guard Met Visitor | distance(Token(p_5).loc, Token(p_6).loc) < T_meeting_start
T8 | Meeting Is Over | distance(Token1(p_0).loc, Token2(p_0).loc) > T_meeting_end
T9 (timed) | Security Check Too Long | distance(Token1(p_0).loc, Token2(p_0).loc) < T_meeting_end for a period exceeding a value sampled from D_9
T10 | Meeting Is Over (After Delay) | distance(Token1(p_0).loc, Token2(p_0).loc) > T_meeting_end
T11 | Guard Post Manned (After Check) | Token.loc overlaps "Guard Post" Zone
T12 | Visitor Continues (After Check) | Token.loc not overlaps "Guard Post" Zone
T13 | Person Disappeared (After Check) | Token.Appearance = 'disappear'
Table 1: Elaboration on the model structure pictured in Figure 4. Each transition is given a meaningful semantic name and associated with a condition on the properties of its input tokens. Token(p_x) denotes a token whose origin is in place p_x. In situations where a transition has only one input place node, this notation is dropped and we simply write Token. In the case of multiple tokens from the same place node, we denote these Token1(p_x), Token2(p_x). Token.attribute_x denotes the value of attribute_x in the particular input token. In this example we use the attributes "loc" (token location) and "Appearance" (which can take on the values {"appear", "visible", "disappear"}). See text for details of this construction.
Figure 4: Security Check Model (see Table 1 for more details)
7.1 Security Check Scenario
In this section we consider a scenario where a single person guards the entrance to a particular location and is tasked with checking the bags of all those who enter the area in a time-efficient manner. We are particularly interested in two main events:

• A visitor has entered the area without being checked.

• The duration of the security check is excessively long.

To describe this scenario using the PN formalism we define two semantic areas within our camera view: the "Guard Post" zone and the "Not Checked" zone. The former zone is rather self-explanatory: it is the area in the camera view taken up by the guard. The latter zone refers to an area in the camera view to which a visitor should only gain access after being checked.
All objects detected by the system are mapped to tokens, and each of the object's attributes is associated with the corresponding token. In the remaining part of this discussion we will refer to objects and tokens interchangeably. Initially, tokens are placed in the 'Root' place node. The transition 'Person Appeared' becomes enabled when any tokens that meet its enabling condition are present in the 'Root' place node. Specifically, the 'Person Appeared' transition is conditioned on the "Appearance" attribute of its input tokens and will be allowed to fire if this attribute takes on the value "appear". Upon firing of this transition the corresponding input tokens will be removed from the 'Root' place node and placed in the 'P3' place node (see Figure 4). The "Guard Post Manned" transition is similarly conditioned on the "loc" (object location) attribute of its input tokens. If an input token located in the "Guard Post" Zone exists, the transition will be allowed to fire. The transitions "Visitor Entered" and "Visitor Not Checked" are similarly defined by conditions tied to the location attribute. The "Guard Met Visitor" transition requires input tokens in two separate place nodes (representing the guard and the visitor). It also has a condition on the distance between these two tokens' locations. If this distance is below a certain threshold, denoted T_meeting_start, this transition is allowed to fire. In firing, it moves both of its input tokens to its output place node, 'P0'. Transitions "Meeting Is Over" and "Meeting Is Over (After Delay)" are defined similarly, with a maximum distance threshold denoted T_meeting_end.
Transition "Security Check Too Long" is a stochastic timed transition with parameter µ_9, which indicates the average firing delay of the transition. This transition becomes enabled when two tokens are in input place 'P0' (the two-token requirement is captured by assigning a multiplicity of two to the input arc of this transition). Once the enabling condition is met, the transition may only fire once a period of time, sampled from the distribution D_9 (which is determined by parameter µ_9), has elapsed. Note also that this transition becomes enabled at the same time as the transition "Meeting Is Over". The fact that firing one of these transitions will disable the other is called a conflict in Petri Net terminology. Our condition on the length of the meeting exploits this conflict. That is, once the security check has begun we will only allow a specific time interval (determined by parameter µ_9) to elapse before alerting that the security check has gone on too long. Of course, if the meeting ends before this interval elapses, we do not need to provide such an alert. The remaining transitions "Guard Post Manned (After Check)", "Visitor Continues (After Check)" and "Person Disappeared (After Check)" are immediate conditional transitions conditioned on the "loc" and "Appearance" attributes of their input tokens. The PN graph corresponding to this model is illustrated in Figure 4. The transition names and conditions are enumerated in Table 1.
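To show what such an enabling condition looks like in code, here is a minimal sketch of the "Guard Met Visitor" distance test; the threshold value, the token layout, and the function name are our own hypothetical choices.

import math

T_MEETING_START = 50.0  # hypothetical distance threshold (e.g. in pixels)

def guard_met_visitor(guard_token, visitor_token):
    """Condition of transition T7: the guard and visitor tokens must be
    closer than T_meeting_start. 'loc' is assumed to be the (x, y)
    centroid of the object's bounding box from the annotation layer."""
    return math.dist(guard_token["loc"], visitor_token["loc"]) < T_MEETING_START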
Figure 5: Traffic Intersection Model (see Table 2 for more details)
7.2 Traffic Intersection Scenario
In this section we consider a four-way intersection with no traffic light or stop sign. Cars must mind oncoming traffic before entering the intersection. More specifically, cars that wish to turn in the face of oncoming traffic must do so a reasonable time before the oncoming traffic enters the intersection. In the model discussed below we have generalized this to also include cars crossing the intersection. Of primary concern to us in this model is:

• An interaction between two cars during which one of the cars is directly in the path of the other.

We shall use the orientation attribute of the objects to determine if their trajectories are on a collision course (i.e. one car is in the path of another). Additionally, we shall model a "reasonable time" using a stochastic timed transition. Furthermore, we shall define five
Transition | Name | Condition
T1 | Car Appear | Token.Appearance = "appear"
T2{a,b,c,d} | Car Approaching Intersection (from {North, South, East, West}) | Token.loc overlaps one of the 4 "Approach" Zones
T3 | Car Entered Intersection | Token.loc overlaps "Intersection" Zone
T4 | Unsafe Interaction Occurring | Token1(p_3).orientation is perpendicular to Token2(p_3).orientation
T5 (timed) | Safe Interaction Occurred | Token(p_2).orientation is perpendicular to Token(p_3).orientation for a period exceeding a value sampled from D_5
T6 | Car Leaves Intersection | Token.loc not overlaps "Intersection" Zone
T7 | Car Continues (After Safe Interaction) | Token.loc not overlaps "Intersection" Zone
T8 | Car In Intersection (After Safe Interaction) | Token.loc overlaps "Intersection" Zone
T9 | Car Leaves Intersection (After Safe Interaction) | Token.loc not overlaps "Intersection" Zone
T10 | Car Disappears | Token.Appearance = "disappear"
Table 2: Elaboration on the model structure pictured in Figure 5. See the caption of Table 1 for an explanation of the notation. In addition to the "loc" and "Appearance" attributes used in the previous example, this example also uses the "orientation" attribute, which gives the direction of the object's velocity vector. See text for details of this construction.
zones of interest: an "Approach" zone from each direction surrounding the intersection and the "Intersection" zone. These zones describe which part of the camera view is taken up by the associated semantic area. Similarly to the first example, all detected object tokens are initially placed in the 'Root' place node. The "Car Appeared" transition will move these tokens into the 'P1' place node if their "Appearance" attribute takes on the value "appear". The "Car Approaching" set of transitions (T2{a-d} in the figure) will compare the "loc" attribute (i.e. object location) to each of the four "Approach" Zones defined, and fire if the "loc" value is found to be within one of these. The stochastic timed transition "Safe Interaction Occurred" becomes enabled when there exists a token in each of the two place nodes 'P2' and 'P3' (representing a car inside the intersection and a car approaching the intersection, respectively). If this transition remains enabled for an interval of time whose length is sampled from distribution D_5 (determined by parameter µ_5), it will be allowed to fire. Again, we place our timed transition in conflict with an immediate transition (two in this case). In this model, the firing of either the "Car Entered Intersection" transition or the "Car Leaves Intersection" transition will disable the timed transition. These transitions check if the "loc" attribute is inside/outside the "Intersection" zone. The "Unsafe Interaction Occurring" transition has an input arc of multiplicity two. This means at least two tokens must exist in its input place node, 'P3', for this transition to become enabled. Furthermore, it has a condition on the difference in "orientation" attribute
values of each of the tokens participating in the enabling. The "orientation" attribute values must be perpendicular (a difference of 90 ± 15 degrees) for this transition to fire. If a "reasonable" amount of time has elapsed since a particular car entered the intersection in the path of another oncoming car (i.e. the transition "Safe Interaction Occurred" fires), our knowledge of the event domain states that an unsafe situation is less likely to occur, and hence we can allow perpendicular trajectories within the intersection (e.g. a car is exiting to the west just as another enters from the north). The remaining transitions "Car Continues (After Safe Interaction)", "Car Leaves Intersection (After Safe Interaction)" and "Car Disappears" are conditioned on the "loc" and "Appearance" attributes. The PN graph corresponding to this model is illustrated in Figure 5. The transition names and conditions are enumerated in Table 2.
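The perpendicularity test used by the "Unsafe Interaction Occurring" condition can be written compactly; the sketch below is our own illustration, with the ±15 degree tolerance taken from the text.

def perpendicular(theta1, theta2, tol=15.0):
    """True if two orientations (in degrees) differ by 90 +/- tol degrees,
    treating directions that differ by 180 degrees as the same line of travel."""
    diff = abs(theta1 - theta2) % 180.0  # fold the difference into [0, 180)
    return abs(diff - 90.0) <= tol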
8 Complexity Analysis
In this section we aim to give a formal treatment of our algorithm for event recognition using the Petri Net modeling formalism. To this end we must define some notation and concepts. The notation described in the following is based on the notation in [26, 34]. First, we define a video sequence as the set of frames F, the function ρ which maps a particular frame to the set of objects in the video, and the function l which maps a particular set of objects to a set of features. An example of possible features would be the
CAVIAR format features, which include continuous valued variables h, w, x, y (respectively representing the height, width, and centroid coordinates of the object bounding box) as well as discrete finite-domain categorical variables such as movement, role, situation, and context. Essentially, the l function allows us to retrieve the semantic properties of each object (or set of objects). A Petri Net event model can be described using P, the set of all places, and T, the set of all transitions. Additionally, the function flow indicates the arc flow of the network. That is, if y ∈ flow(x) there is an arc from node x to node y. The δ function describes the condition associated with each transition by associating each transition t with a set of possible feature values. That is, a set of objects p whose labels l(p) are a subset of δ(t) can be said to fulfill the condition of transition t. The function r returns the arity of a particular transition. The arity is defined as the number of tokens to which the transition's condition is applied. For example, a transition which verifies a single object (token), such as (appearance = 'visible'), has an arity of one. Similarly, a transition that verifies a relationship between two objects (tokens), such as (distance(t1,t2) > thresh), has an arity of two. Finally, the variable rootnode indicates the 'root' place node in which tokens corresponding to new objects are initially placed. Table 3 summarizes the notation used in this section. The algorithm for detecting a particular event in a video sequence is presented in pseudocode in Algorithm 1. The algorithm accepts as input the video sequence and PN model parameters, as well as the variable event_transition, which indicates which transition
Definitions:

P — set of places
T — set of transitions
S — set of features
flow : P ∪ T → 2^(P ∪ T) — function indicating the direction of arc flow
·x = {y | x ∈ flow(y)} — input set
x· = {y | y ∈ flow(x)} — output set
δ : T → 2^S — associates each transition with the set of feature values needed to fire
r : T → N — function from a transition to its condition arity
rootnode — place in which tokens are initially placed

F — set of frames
O — set of objects
ρ : F → 2^O — set of objects in a frame
l : 2^O → 2^S — function from a set of objects to labels

K — set of tokens
µ : P → 2^K — mapping from a place to a set of tokens
g : K → O — function from tokens to objects
f : O → K — function from objects to tokens
G(i, X) = {x_{r_1}, x_{r_2}, ..., x_{r_i} | x_{r_1}, x_{r_2}, ..., x_{r_i} ∈ X, r_1 < r_2 < ... < r_i} — set of all subsets of X having exactly i elements

Table 3: Notation used in the formal description of the algorithm. 2^X denotes the powerset of set X.
Algorithm
Input: F, l, ρ // video and object labeling
P, T, flow, δ, r, rootnode // PN model
event_transition // transition corresponding to the particular event we are interested in
Output: true if the event occurs in the video sequence, false otherwise
01: // Initialization:
02: K ← ∅
03: for all places p in P do
04:   µ(p) ← ∅
05: end
06:
07: for all frames fr in F do // fr avoids a clash with the function f : O → K
08:   // Add tokens for newly appeared objects:
09:   for all objects o in ρ(fr) \ g(K) do // objects in the frame without tokens
10:     k ← new token
11:     K ← K ∪ {k}
12:     g(k) ← o
13:     f(o) ← k
14:     µ(rootnode) ← µ(rootnode) ∪ {k}
15:   end
16:
17:   // Determine enabled transitions:
18:   enabled_transitions ← ∅
19:   for all transitions t in T do
20:     enabled ← true
21:     for all places p in ·t do
22:       if µ(p) = ∅
23:         enabled ← false
24:       end if
25:     end
26:     if enabled = true
27:       enabled_transitions ← enabled_transitions ∪ {t}
28:     end if
29:   end

Table 4: Algorithm 1: detecting a particular event in a video sequence
Algorithm ctd.
30:   for all t in enabled_transitions do
31:     Q ← µ(·t) // all tokens in t's input set
32:     H ← ∅
33:     for all q in Q do
34:       H ← H ∪ {g(q)} // H is the set of all objects
35:       // corresponding to tokens in t's input set
36:     end
37:     y ← r(t) // arity of transition t's condition
38:     for all subsets of objects p in G(y, H) do
39:       if l(p) ⊆ δ(t) // objects meet transition condition
40:         // fire transition
41:         for all objects o in p do
42:           k ← f(o)
43:           for all places m in ·t do
44:             if k ∈ µ(m)
45:               µ(m) ← µ(m) \ {k}
46:               break
47:             end if
48:           end
49:           for all places m in t· do
50:             µ(m) ← µ(m) ∪ {k}
51:           end
52:         end
53:         if t = event_transition // fired transition corresponds to the event of interest
54:           return true
55:         end if
56:       end if
57:     end
58:   end
59: end
60: return false

Table 5: Algorithm 1 continued
After initialization of temporary variables, the algorithm proceeds to loop over all frames in the video. At each frame the algorithm assigns tokens in the network to each newly appeared object, determines which transitions are enabled, and loops over these enabled transitions to determine whether their conditions for firing have been met. In order to make this determination, the algorithm must check each possible combination of tokens of size y in the input set of each transition, where y is the arity of the transition's condition. Thus, we have a loop over all possible size-y combinations of tokens nested inside a loop over all enabled transitions, nested inside a loop over all frames in the video sequence. The number of outer-loop iterations is bounded by |F|; the number of enabled transitions is bounded by the total number of transitions |T|, which is a small constant relative to the number of frames in the average input sequence. The number of token combinations of size y is given by the binomial coefficient C(k, y), where k is the number of tokens in the transition's input set. Clearly k is bounded by |K| = |O|, the number of objects/tokens in our net, and y is bounded by Y = max_{t∈T} r(t). Thus the worst-case complexity of this algorithm is O(|F| · C(|K|, Y)). That is, the running time depends on the number of objects and the maximal arity of the transition conditions. In practice, however, the number of objects simultaneously appearing in a frame is very small compared to the number of frames in the video sequence. In the experiments reported in this paper the maximal number of objects in a single frame was 7, and on average this number was significantly smaller. Similarly, when designing our PN event models we seldom used transition conditions with arity greater than two (e.g., Guard Met Visitor). Most often,
we made use of transition conditions with an arity of one (e.g., Car Entered Intersection). Thus, under the assumption that both |K| = |O| and Y are small compared with the length of the video sequence in frames, we can claim that the algorithm's time complexity is O(|F|). That is, the algorithm is linear in the length of the video sequence. Such assumptions have been made previously for complexity analysis in other works in the domain of video event analysis [32].

Related tasks, including learning the timing parameters, learning marking transitions, detecting multiple events of interest, and counting events of interest, can be accomplished with the same time complexity. For the timing-parameter learning task we add a counter that keeps track of how long each timed transition is enabled. Instead of firing timed transitions when their conditions are met (line 41 in the algorithm) we increment this counter. The few additional lines of code needed to average these duration values would also have linear complexity in the number of frames in the worst case (i.e., a transition cannot be activated more times than there are frames in the video sequence).

The learning of marking transitions can be done by adding a few lines to our algorithm to keep track of visited markings. We would then add some lines after the firing of an enabled transition to determine whether the resulting marking is new and to update the transition probabilities from the previous marking to the current marking. This requires us to loop over all markings we have previously seen. This number is bounded by the total number of markings, which is given by |P|^|K|, where |P| is the number of places in the net
and |K| is the number of tokens. This makes the complexity of this problem exponential in the number of tokens/objects in the scene. However, again applying our assumption of a low |K| relative to |F| brings us back to linear time complexity in |F|. To check for more than one event we can generalize the event transition input variable to a set of transitions and modify line 53 accordingly. To count the number of occurrences of a particular event (or set of events) we simply add a counter (or counters) and increment it in line 54 (instead of returning true). We would also modify the final line to return this counter (or counters).
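The marking-transition learning just described can be sketched as follows; the class below is our own minimal illustration of the bookkeeping (observed markings, pairwise counts, row-wise normalization), not the exact SERF-PN implementation.

    from collections import defaultdict

    def freeze(marking):
        # Hashable snapshot of a marking: (place, frozenset of tokens) pairs.
        return tuple(sorted((p, frozenset(toks)) for p, toks in marking.items()))

    class MarkingTransitionLearner:
        def __init__(self):
            self.counts = defaultdict(lambda: defaultdict(int))
            self.prev = None

        def observe(self, marking):
            # Called after every transition firing with the resulting marking;
            # only markings actually visited are ever stored.
            cur = freeze(marking)
            if self.prev is not None:
                self.counts[self.prev][cur] += 1
            self.prev = cur

        def probabilities(self):
            # Normalize the counts row-wise into the discrete-time Markov chain
            # used for prediction (Section 5.2).
            return {src: {dst: n / sum(row.values()) for dst, n in row.items()}
                    for src, row in self.counts.items()}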
9 Experiments
In this section we evaluate the merits of our approach in real-world video event domains. This is achieved by studying how the interpretation results output by the system, based on a particular PN event model, compare to an interpretation generated by a human observer. Furthermore, we provide a comparison with other approaches used for recognition in the domain of video event understanding. For our first two experiments we captured a number of real video clips containing our events of interest. We then applied a tracker, recently developed in our lab [36], to obtain the tracking information. This tracker makes use of constancy of color and color edge features. As expected, the tracking results were good but not without error: object tracks were occasionally lost due to occlusions, lighting variation and other causes.
In parallel we also generated a set of animated clips corresponding to the video events captured on camera. This allowed us to increase the volume of data available to test our system and to introduce unusual events that are improbable in real video (but which are usually the object of interest). In generating these animated clips some randomness was added in the object sizes and locations, mimicking the variability observed in the real videos. As stated previously, the ground truth video data was obtained by allowing a human observer to determine whether a specific event is occurring in a video sequence. The system output for a particular clip is then checked for this particular event. The third experiment is intended to compare our approach to other representation and recognition approaches. In this experiment we used available ground truth information in place of the tracker to supply low-level object information such as location and bounding box properties. We used high-level ground truth information to evaluate the classification results of our model.
9.1 Security Check
For the security check experiment we had a total of 19 real video clips and 100 animations, each containing events of interest. Clip lengths ranged from 11 to 48 seconds; the mean and median lengths over all clips used in this experiment were 23 seconds. In this experiment we defined two events of interest among the possible events: "Security Check Too Long" and "Visitor Not Checked".
Figure 6: Keyframes from a typical real video sequence used in the Security Check Experiment. Ellipses denoting object locations generated by the tracker are shown.

Frame #4757 - Transition 'Person Appeared' fired on objects 0
Frame #4757 - Transition 'Guard Post Manned' fired on objects 0
Frame #4798 - Transition 'Person Appeared' fired on objects 1
Frame #4798 - Transition 'Visitor Entered' fired on objects 1
Frame #4798 - Transition 'Guard Met Visitor' fired on objects 0, 1
Frame #4913 - Transition 'Person Appeared' fired on objects 2
Frame #4913 - Transition 'Visitor Entered' fired on objects 2
Frame #4973 - Transition 'Visitor not checked' fired on objects 2
Frame #4973 - Transition 'Guard Post Manned' fired on objects 2
Frame #4976 - Transition 'Guard post Unmanned' fired on objects 2
Frame #4998 - Transition 'Person Disappeared' fired on objects 2
Frame #5298 - Transition 'Security Check Too Long' fired on objects 0, 1

Table 6: A semantic summary corresponding to the video clip in Figure 6, generated by the video event recognition component of SERF-PN. Events mentioned in the summary correspond to transitions in the model shown in Figure 4.
Figure 7: Keyframes from a typical animation clip sequence used in the Security Check Experiment. Hollow boxes are scene objects. The solid blue-gray box indicates the "Guard Post" zone.

Not all clips included one of these events of interest, and some clips included multiple events of interest. More specifically, in the real video clips the "Security Check Too Long" event and the "Visitor Not Checked" event each occurred 5 times; twice they occurred within the same clip. In the animations the "Security Check Too Long" event occurred 43 times and the "Visitor Not Checked" event occurred 14 times; the two events occurred together in the same synthetic video sequence 7 times. We built the model as described in detail in Section 7 and calculated precision and recall statistics for our events of interest. The results are reported in Table 7. This experiment was carried out on the set of real videos with tracking information, on the set of animations alone, and on the combined data set. The length of the average activation was manually set using knowledge of the event domain (µ9 ≈ 20 sec). It is important to note that treating this model as an event-occurred/event-not-occurred classifier was done for evaluation purposes, and that the output of the system is, in fact, a semantic description of the events in a particular video sequence.
            Security Check Too Long    Visitor Not Checked
            Precision   Recall         Precision   Recall
Real        .92         1.0            1.0         1.0
Animated    .98         .91            1.0         1.0
Combined    .97         .92            1.0         1.0

Table 7: The recall and precision for the Security Check Model on real video clips and corresponding animations. The events considered were "Security Check Too Long" and "Visitor Not Checked".

Such a semantic description, corresponding to the event shown in Figure 6, is shown in Table 6. Qualitative assessment of this description shows it to be quite similar to a possible human description of the same video sequence. As the table indicates, the results are quite good both on the animation set, which corresponds to perfect tracking and detection, and on the real video dataset, where the automatic tracker applied is less than perfect. Our experiments showed that minor tracking error has little effect on the high-level analysis results. While an object completely lost by the tracker cannot be reasoned upon by the high-level system, the system can still provide an analysis up to the point of object loss. That is, we can still get a meaningful semantic description from only partially correct tracks. Prediction using the mechanism of marking analysis can also be illustrated using this example. We applied a dynamic marking analysis using our dataset of real and animated clips. The marking graph generated, with meaningful names for each marking and associated transition probabilities, is shown in Figure 8. This figure reduces the full marking
graph by consolidating markings with similar semantic meanings and neglecting markings which are reachable only with very low probability. Note that in the full marking graph the outgoing arrows of each node must sum to one; hence, we renormalize all remaining outgoing arcs from the same marking node to sum to one. A marking node with no outgoing arrows indicates that the system did not observe a transition from this marking to another during the training process. Using this information to construct a Markov process, as described in Section 5.2, allows answering such queries as: what is the next most likely state given the current state, and what is the probability of a particular state given the current state? For example, note that given that a check of a visitor has started, there is only a small probability that the check will be too long and a moderate probability of an unchecked visitor. However, once the check has gone over the time limit, the probability of an unchecked visitor entering the secure area increases. It is also interesting to note that although the theoretical number of markings is exponential in the number of tokens, by using the dynamic marking analysis we can concentrate on the markings in which most of the probability mass is concentrated. In our example, if we assume three actors only, out of a possible 120 markings (obtained by plugging Π(3, 8) into Equation 2) our marking analysis observed only 18 markings in our training data, which we reduced to 6 with semantic meaning and significant probability mass.
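To illustrate these queries concretely, consider the following minimal sketch; the state names and transition probabilities here are hypothetical placeholders rather than the values learned for Figure 8.

    import numpy as np

    # Hypothetical consolidated markings and a learned transition matrix;
    # each row is a current state and sums to one.
    states = ["check_started", "check_ok", "check_too_long", "visitor_unchecked"]
    M = np.array([[0.00, 0.70, 0.20, 0.10],
                  [0.00, 1.00, 0.00, 0.00],
                  [0.00, 0.40, 0.30, 0.30],
                  [0.00, 0.00, 0.00, 1.00]])

    cur = states.index("check_started")
    # Next most likely state given the current state:
    print(states[int(np.argmax(M[cur]))])
    # Probability of a particular state n steps ahead (n-step transition matrix):
    n = 3
    print(np.linalg.matrix_power(M, n)[cur, states.index("visitor_unchecked")])

As a side note on the marking count quoted above: assuming Π(n, m) in Equation 2 counts the placements of n indistinguishable tokens into m places, Π(3, 8) = C(3 + 8 − 1, 3) = C(10, 3) = 120, which matches the figure of 120 markings.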
Figure 8: Reduced marking graph obtained by dynamic marking analysis.

Frame #61350 - Transition 'Car Appeared in int' fired on objects 0
Frame #61355 - Transition 'Car Approaching Intersection(East)' fired on objects 1
Frame #61394 - Transition 'Car Approaching Intersection(South)' fired on objects 2
Frame #61410 - Transition 'Safe Interaction Occured' fired on objects 0, 2
Frame #61414 - Transition 'Car Enters Intersection' fired on objects 2
Frame #61418 - Transition 'Safe Interaction Occured' fired on objects 1, 2
Frame #61418 - Transition 'Car Approaching Intersection(North)' fired on objects 3
Frame #61437 - Transition 'Car Leave Int After Safe Turn' fired on objects 2
Frame #61463 - Transition 'Car Enters Intersection' fired on objects 3

Table 8: A semantic summary corresponding to the video clip in Figure 9, generated by the video event recognition component of SERF-PN. Events mentioned in the summary correspond to transitions in the model shown in Figure 5.
9.2 Traffic Intersection
The Traffic Intersection scenario is semantically somewhat more complex than the previously discussed Security Check scenario. We built the model as described in detail in Section 7. The parameter of the stochastic timed transition modeling a safe interaction between two vehicles with collision-course trajectories may not be obvious even to an expert in the domain. To this end we apply our timed-parameter training approach as discussed in Section 5.1.
Figure 9: Keyframes from a typical real video sequence used in the Traffic Intersection Experiment. Ellipses denoting object locations generated by the tracker are shown.
Figure 10: Keyframes from a typical animation clip sequence used in the Traffic Intersection Experiment. Filled circles model cars in the scene. The solid black polygon marks the "Intersection" zone.
We applied this method to several subsets of the available data (animations and real video) using 4-fold cross-validation; a sketch of this protocol in code appears at the end of this subsection. That is, we divided the available labeled data (including real and synthetic data) into 4 roughly equal sets. We then trained the timing parameters using one of these sets and applied the trained event model to the remaining three sets. We repeated this process for each of the sets (each time training on one and testing on the rest) and averaged the accuracy results over all four sub-experiments to arrive at the final accuracy figure reported in the table. Once this parameter is obtained we again take the approach of viewing our summary output as a class/non-class classifier for a particular event, in this case the event "Safe Interaction Occurred". The results are reported in Table 9. As can be observed from the table, the precision and recall rates are slightly lower than those of the previous example. This can be attributed to the increased semantic complexity. On inspection of the results we found that clips marked by humans as safe are also marked as safe by the model; however, the model also has a tendency to mark human-labeled "not safe" events as safe. This can be remedied by decomposing the semantic definition into relations that fall within the vocabulary of the Petri Net formalism and adding them to the model (i.e., changing the PN model). For this experiment we made use of 48 real video clips and 77 animations. Each clip had at least one "safe" or "not safe" event. Of the real clips we captured, 44 included a "safe" event and the remaining 4 included "unsafe" events. Of the synthetic video clips, 38 included "safe" events and the remaining 39 included "unsafe" events. The maximum and minimum clip lengths used in this experiment are 11 and 3 seconds, respectively. The mean length over all clips is about 6 seconds and the median length is 7 seconds.
            Safe Interaction Occurred
            Precision   Recall
Real        .91         .75
Animated    .80         .97
Combined    .85         .85

Table 9: The recall and precision for the Traffic Intersection Model applied to real video clips and corresponding animations for the event "Safe Interaction Occurred".

The animations were particularly useful for simulating "not safe" scenarios, which were difficult to come across in the real video sequences. At this point, we would like to note that we did not take an approach (similar to the one proposed in [19]) of training on the synthetic data and testing on the real data. The reason for this is that the training data is created by specifying semantically meaningful parameters, including event timing. Were we to train the timing parameters of the model using only this synthetic data (as we have in the "animated only" part of the experiment), we would simply recover the original parameters we put into the algorithm for creating the synthetic data. This is equivalent to manually setting the timing parameters using domain knowledge. This is unlike the approach described in [19], which attempts to learn abstract (i.e., not semantically meaningful) parameters from the synthetic data. Since these parameters are abstract they cannot be easily specified using semantic knowledge and hence must be learned from synthesized example(s) of the event.
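The 4-fold protocol referenced above can be sketched as follows; train_timing_parameters and evaluate_accuracy are placeholder names standing in for the procedures of Section 5.1 and the per-clip evaluation, not actual SERF-PN function names.

    def cross_validated_accuracy(clips, train_timing_parameters,
                                 evaluate_accuracy, k=4):
        # Split the labeled clips (real and synthetic) into k roughly equal folds.
        folds = [clips[i::k] for i in range(k)]
        accuracies = []
        for i, train_fold in enumerate(folds):
            # Train the timing parameters on one fold...
            params = train_timing_parameters(train_fold)
            # ...and test the trained event model on the remaining folds.
            test_clips = [c for j, fold in enumerate(folds) if j != i
                          for c in fold]
            accuracies.append(evaluate_accuracy(params, test_clips))
        # Average over the k sub-experiments, as reported in Table 9.
        return sum(accuracies) / k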
9.3 CAVIAR dataset
In our third experiment, we evaluated our approach to modeling events on a portion of the CAVIAR dataset [35], which includes surveillance video of a shopping center in Lisbon, for the purpose of comparing our approach to others tested on the same dataset. The CAVIAR (Context Aware Vision using Image-based Active Recognition) dataset covers the potential events that can occur in a shopping mall surveillance context, and it defines a strict semantic hierarchy of events. In the CAVIAR ground truth [37], each object is assigned a movement, role, situation, and context. Each of these variables can take on a value from a discrete set of values, and each of these categories represents a more specific or more general semantic level of information. Movement is the lowest level of information and can take on the states running, walking, active, inactive or unknown. Role describes the role an object plays in a higher-level event (situation or context); it can take on such states as walker, fighter, and left object. Situation is a higher semantic level and can take on values such as moving, browsing, shop enter, shop exit and so on. Context can be described as the overriding semantic purpose of the object in question. In the CAVIAR ground truth each object is assigned only one context throughout the duration of its appearance in the scene. Context values may be shop enter, shop exit, shop re-enter, walking, windowshopping, etc. CAVIAR also defines each particular context using a number of situations. For instance, the windowshopping context may consist of the situations moving, browsing, and then moving. That is, we can define situations using the movement information, and contexts
according to the situation information. As opposed to our previous experiments, where we attempted to classify whether an event has occurred or not (i.e., a particular transition has fired), in this experiment we attempt to classify the state of each object at every frame (i.e., the place node in which the object's token resides). This is an equivalent problem, as it is the events (i.e., transitions) that cause the changes in object states (the place nodes containing tokens). We have chosen to present our results in this fashion to allow a more straightforward comparison. Thus, to evaluate our PN representation approach we built two PN models, the first of which evaluates the situations based on object location and movement information. In principle, this information can be given by a tracker, as in our previous experiments, but for comparison purposes we have chosen to use the ground truth information provided, as done in other approaches. The other PN model evaluates the context of each object using the object information along with the situation label obtained from the ground truth. The situation network built for this experiment is shown in Figure 11. In this network the transition conditions are based on all object properties available in the ground truth except for the situation and context labels. In the majority of cases, we found it sufficient to use the appearance, movement and location properties. Additionally, we defined some important zones in the scene, including the "shop exit", "shop enter", and "windowshop" zones. These zone definitions allow us to condition transitions on whether an object's location is inside or outside a particular zone. The place node names of our network correspond to possible situation values.
Transition Name — Condition
Situation: Shop Exit — Token.loc overlaps "Shop Exit" zone
Situation: Moving — Token.Appearance='visible' and Token.Movement='walking'
Situation: Browsing — Token.Appearance='visible' and Token.Movement='active' and Token.loc overlaps "windowshop" zone
Situation: Inactive — Token.Appearance='visible' and Token.Movement='inactive'
Situation: Shop Enter — Token.Appearance='visible' and Token.Movement='active' and Token.loc overlaps "Shop Enter" zone
Stops in Place — Token.Movement='active' or Token.Movement='inactive'
Disappear — Token.Appearance='disappear'

Table 10: Elaboration on the situation model structure pictured in Figure 11. Each transition is associated with a condition on the properties of its input tokens. Token(px) denotes a token whose origin is in place px; when a transition has only one input place node, this notation is dropped and we simply write Token. Token.attribute_x denotes the value of attribute_x in the particular input token. In this example we use the attributes "loc" (token location), "Movement" (which can take on the values {"walking", "running", "active", "inactive"}) and "Appearance" (which can take on the values {"appear", "visible", "disappear"}). Note that the same transition may appear multiple times, sometimes with a temporal duration constraint attached to the condition.
We made use of this in the experiment to compare the ground truth label of each object at each frame with the label of the place in which the corresponding token resides. A detailed description of the conditions on each transition is given in Table 10.

The context network is an event model, with the events being the changes in the context classification of a particular object. As in the situation network, we condition the transitions in this net on the properties of objects, including location, appearance, and movement. Because this network models a higher semantic level than the situation network, we also allow the use of the situation property in these transition conditions. Again, we have made use of the definitions of important zones in the scene, including the "Shop Enter", "Shop Exit", and "windowshop" zones.
Figure 11: CAVIAR Situation Petri Net Model. See Table 10 for the specification of transition conditions.
Transition Name — Condition
Context: Shop Exit — Token.situation="Shop Exit"
Context: Window Shopping — Token.situation='browsing' and Token.loc overlaps "windowshop" zone
Context: Walking — Token.situation='moving'
Context: Immobile — Token.situation='inactive'
Context: Shop Enter — Token.situation='shop enter'
Context: Shop Re-Enter — Token.situation='shop enter'

Table 11: Elaboration on the context model structure pictured in Figure 12. Each transition is associated with a condition on the properties of its input tokens. Token(px) denotes a token whose origin is in place px; when a transition has only one input place node, this notation is dropped and we simply write Token. Token.attribute_x denotes the value of attribute_x in the particular input token. In this example we use the attributes "loc" (token location), "Movement" (values {"walking", "running", "active", "inactive"}), "Appearance" (values {"appear", "visible", "disappear"}) and "situation" (values {"moving", "inactive", "browsing", "shop enter", "shop exit"}).
The place node names correspond to the various values the context property may take on according to the CAVIAR specification, allowing comparison with the ground truth on a frame-by-frame basis. A detailed description of the conditions on each transition is given in Table 11. The situation of an object may change throughout the scene, and so a frame-by-frame evaluation of the resulting situation against the situation label given in the ground truth is warranted. To this end we compared the situation label against the label of the place node in which the token corresponding to the object resided; each of the place node labels in this network corresponds to a possible situation value. A sketch of the kind of zone-overlap test used in the transition conditions of Tables 10 and 11 is given below.
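Here is a minimal sketch of such a zone-overlap test, assuming axis-aligned rectangular zones and the CAVIAR bounding-box features h, w, x, y mentioned earlier; the Zone class and the dictionary-style token are our own illustrative choices.

    class Zone:
        # Axis-aligned rectangle: (x1, y1) is the top-left corner,
        # (x2, y2) the bottom-right corner.
        def __init__(self, x1, y1, x2, y2):
            self.x1, self.y1, self.x2, self.y2 = x1, y1, x2, y2

        def overlaps(self, cx, cy, w, h):
            # True if the object's bounding box (centroid cx, cy; width w,
            # height h) intersects the zone.
            return not (cx + w / 2 < self.x1 or cx - w / 2 > self.x2 or
                        cy + h / 2 < self.y1 or cy - h / 2 > self.y2)

    # Example: the condition of the 'Situation: Browsing' transition of Table 10.
    def browsing_condition(token, windowshop_zone):
        return (token["Appearance"] == "visible"
                and token["Movement"] == "active"
                and windowshop_zone.overlaps(token["x"], token["y"],
                                             token["w"], token["h"]))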
Figure 12: CAVIAR Context Petri Net Model. See Table 11 for the specification of transition conditions.

Context may also be measured in this frame-by-frame manner. However, for some types of contexts it is not reasonable to expect their detection from the first appearance of the object in the scene. For example, a context such as "shop enter" can only be detected (by a human or an automatic system alike) when the object actually enters the store. Thus, when compared to the oracle-like nature of the ground-truth context on a frame-by-frame basis, even the best system would produce low classification performance on these types of context. For this reason it is beneficial to use the frame-by-frame situation classification in a more sophisticated way to determine the context for the entire duration of the object's existence within the scene. In our experiments we report results on both the frame-by-frame classification of the context and a simple but effective method of looking at the entire sequence
to determine the overall context, using the last-frame context classification. For comparison we consider two approaches whose results are described in [35]. The first is a rule-based approach, which used semantic rules on both the role and movement classifications in the ground truth to arrive at situation and context classifications. This approach is somewhat similar to the use of semantic knowledge in our approach, but it is missing the basic formalism (i.e., the Petri Net) that allows us to model these rules efficiently and to determine whether they apply in a straightforward manner. This rule base also lacks the notion of state intrinsic to Petri Net markings. As a result, the rules applied are less powerful and robust, and less likely to capture the semantics of the event domain appropriately. The second is a probabilistic model with parameters that can be tuned to the event domain, offering a probabilistic interpretation of the uncertainty in the event classification. Another strength of the probabilistic approach is its ability to accept soft probabilistic evidence, as opposed to the hard evidence expected by semantic rule-based approaches, including our own. The particular probabilistic graphical model explored for the CAVIAR context interpretation problem is the Hidden Semi-Markov Model (HSMM) [38, 18]. HSMMs extend the standard Hidden Markov Model with an explicit duration model for each state. Using the situation values as the possible values of the hidden states, and training a separate HSMM for each context model, this approach can answer questions such as which state sequence is most likely given a particular observation, and which context model is most likely. Apart from the results offered in [39], the HSMM approach for event detection is described in [38]. Essentially, the probabilities are
used to determine which is the most likely context given a sequence of situations. We ran our experiments on a total of 26 CAVIAR sequences ranging in length from 10 to 123 seconds. The mean length of these sequences was about 47 seconds and the median length about 49 seconds. Overall we considered 235 objects, 5 possible situations, and 7 possible contexts. There were several sequences in which no objects were visible for a significant portion of the sequence, as well as several sequences in which a number of objects appeared simultaneously. We built two PN event models, one to classify situations and one to classify contexts. For our first experiment we evaluated the PN model's ability to classify situations per frame. Each object's situation was set equal to the label of the place node in the PN model in which the token corresponding to that object resides. The accuracy of this classification is obtained by comparing this label to the situation label in the ground truth. The confusion matrix for this experiment is shown in Table 12. Overall, the frame-by-frame situation classification accuracy achieved is 91%. Another observation that can be made by examining the table is that some situations are better detected than others. Specifically, "shop enter" is often confused with other situations such as "moving". However, this error is subject to interpretation and, as we shall see, does not greatly affect the eventual higher-level context results that use these situation classifications. The context evaluations are shown both using a frame-by-frame approach similar to that of the situation evaluations, and using a simple evaluation over the series of situations, namely selecting the last frame in the object's life-span and using the context evaluation in that frame as the context of the object.
             moving  inactive  browsing  shop enter  shop exit  unknown   Total  Accuracy
moving       103883       327      1349          72       3085      301  109017    0.9529
inactive         77      5695      1397           0        214      226    7609    0.7485
browsing        182      1118      9173           0        211       78   10762    0.8524
shop enter     1224         0         0         438         45       62    1769    0.2476
shop exit      2123         0         0           0       1390        3    3516    0.3950

Total: 132673        Overall Acc: 0.91

Table 12: Situation frame-by-frame classification (rows: ground truth situation; columns: PN model classification).
In this evaluation we thus evaluate the context of each object against its ground truth, as opposed to the frame-by-frame approach, which evaluates each frame of each object. For this reason the total number of contexts evaluated in this "last frame" approach is significantly smaller than in the frame-by-frame approach. It can be seen, however, that the "last frame" approach yields a large improvement in context classification results over those of the frame-by-frame approach. This is attributed to the fact that many contexts (e.g., "shop enter", "shop reenter") only become apparent towards the end of the object's life in the scene. Tables 13 and 14 show the results for context evaluation, for the frame-by-frame and "last frame" approaches respectively. These results are based on the application of the context PN model to the ground truth situation information. For the frame-by-frame approach we achieve an overall accuracy of 69%; for the "last frame" classification we achieve a significantly better 86%. The next experiment is designed to evaluate the effect of errors in the situation evaluation on the final context classification. To this end we repeated the experiment, this time replacing the ground truth situations with those yielded by the situation network.
              browsing  immobile  walking  windowshop  shop enter  shop exit  shop reenter  unknown   Total  Accuracy
browsing             0         0     2323        4684           0        432             0       66    7505    0
immobile             0      3290     8865         224           0        232             0        0   12611    0.2609
walking              0         0    45993        1103           0         58             0        0   47154    0.9754
windowshop           0         0     4804       15078          79        150            36        0   20147    0.7484
shop enter           0         0    17402           0        1533          0             0       25   18960    0.0809
shop exit            0         0      371           0           0      25166             0        0   25537    0.9855
shop reenter         0         0        0           0           0        663            96        0     759    0.1265

Total: 132673        Overall Acc: 0.69

Table 13: Context frame-by-frame classification (using situation ground truth).
              browsing  immobile  walking  windowshop  shop enter  shop exit  shop reenter  unknown   Total  Accuracy
browsing             0         0        0           6           0          1             0        0       7    0
immobile             0         2       10           2           0          2             0        0      16    0.1250
walking              0         0       75           2           0          0             0        0      77    0.9740
windowshop           0         0        0          12           2          0             1        0      15    0.8000
shop enter           0         0        3           0          52          0             0        1      56    0.9286
shop exit            0         0        2           0           0         57             0        0      59    0.9661
shop reenter         0         0        0           0           0          0             5        0       5    1.0000

Total: 235        Overall Acc: 0.86

Table 14: Context "last frame" classification (using situation ground truth).
As can be seen in Tables 15 (frame by frame) and 16 ("last frame"), the context classification is downgraded by a smaller amount than the actual error in the situation classification (10%). For comparison purposes we evaluate the "rule-based" and HSMM algorithms for both situation and context. Table 17 shows the results reported in [39] on the CAVIAR dataset compared to our results on the same set. As shown in the table, our frame-by-frame situation evaluation is a significant improvement over both the "rule-based" and the HSMM situation results.
              browsing  immobile  walking  windowshop  shop enter  shop exit  shop reenter  unknown   Total  Accuracy
browsing             0        59     2642        3804           0        921             0       79    7505    0
immobile             0      3073     6872        2197           0        232             0      237   12611    0.2437
walking              0         0    42033        1103         872       3073             0       73   47154    0.8914
windowshop           0         0     6173       13746          26        178             8       16   20147    0.6823
shop enter           0         0    17372          11         694        783            52       48   18960    0.0366
shop exit            0         0        7           0           0      25529             0        1   25537    0.9997
shop reenter         0         0        0           0           0        752             7        0     759    0.0092

Total: 132673        Overall Acc: 0.64

Table 15: Context frame-by-frame classification (using PN-derived situation information).
              browsing  immobile  walking  windowshop  shop enter  shop exit  shop reenter  unknown   Total  Accuracy
browsing             0         0        1           4           0          2             0        0       7    0
immobile             0         2        7           5           0          2             0        0      16    0.1250
walking              0         0       70           2           1          4             0        0      77    0.9091
windowshop           0         0        1          11           2          0             1        0      15    0.7333
shop enter           0         0        7           0          47          0             2        0      56    0.8393
shop exit            0         0        1           0           0         58             0        0      59    0.9831
shop reenter         0         0        0           0           0          3             2        0       5    0.4000

Total: 235        Overall Acc: 0.81

Table 16: Context "last frame" classification (using PN-derived situation information).
                           "Rule-Based" Method   HSMM   Our Method (FbF Context)   Our Method (LF Context)
Situation Classification   70%                   74%    91%                        91%
Context Classification     57%                   65%    64%                        81%

Table 17: Comparison
Although a PN model can be considered a type of rule-based model based on semantic knowledge, the framework it provides allows us to keep track of object states and to model relationships, thus allowing for a more robust rule set and hence the improvement in results. With additional semantic information, specifically orientation and speed, these results may possibly be improved further. In the frame-by-frame context evaluation we show results comparable to those of the HSMM context classifier. We should note, however, that the HSMM classifier does not take a naive approach to classifying each frame; in fact it performs inference on all the situation values at each frame to determine the most likely context. If we apply a similar, although admittedly much simpler, approach of taking the last frame of each object and using its context label as the context label for the entire sequence (our "last frame" classification strategy), we achieve significantly better classification using the PN model approach. This is due to the fact that the classifications are made based on a series of rules reflecting semantic
knowledge of the domain as opposed to a set of abstract state transition parameters (as in the HSMM) with no inherent semantic value.
10 Discussion

10.1 Comparison to Other Models for Video Event Understanding
The GSPN formalism for video event understanding presented in this paper offers a way to apply semantic knowledge to the automatic interpretation of video events. It is a somewhat different approach from those that model the statistics of video abstractions, such as Hidden Markov Models, Bayesian Networks and Conditional Random Fields. In those methods it would be difficult to recognize events defined by semantic relations, such as "Guard Meets Visitor" or "Car Enters Intersection", because such events can have very large variance in their appearance. Other Petri Net approaches have been suggested. Castel et al. [24] suggest keeping track of a PN "plan" for each event; one of these plans must exist for each object (or set of objects) participating in a particular event. By adopting the Object PN paradigm we avoid the large overhead of keeping track of many coexisting plans, which might hamper even a simple model. Another, more recent framework which makes use of the plan PN paradigm is proposed
by M. Albanese et al. [26]. This work offers a probabilistic extension to the framework by assigning each token a probability and updating this probability, using the probabilities assigned to fired transitions, as the token propagates through the network towards the event's "terminal" state. At each conflict (two or more transitions sharing an input place), each of the possible transitions is assigned a probability of firing, and these probabilities sum to one. To cope with the possibility that none of the transitions will fire, a special "skip" transition is attached to each place node, having that place node as both its input and output place. The ability to cope with uncertain input is important; however, due to the inevitable pruning of low-probability tokens necessary to keep the inference in this framework tractable, this framework is, in practice, very similar to the deterministic method taken in our approach. Furthermore, the assumption taken in this approach that a transition fires at each time step (i.e., frame), each time reducing the probability, coupled with the aforementioned pruning, enforces an implicit duration model on the event. While in some cases this may be useful, in general events may have variable length, and thus this aspect of the probabilistic framework may present difficulties in recognizing certain types of events. In our approach, we use semantic rules to fire transitions. While these may include duration information (timed transitions), in general we do not assume a specific time frame for every event (transition).
The Object PN paradigm was first introduced by Ghanem et al. [25]. Although our work is based on theirs, we have made several meaningful extensions. Along with
the major extensions discussed earlier (parameter learning and marking analysis), we have embraced an approach that uses a single PN graph to capture a particular event domain, as opposed to using a separate graph for each possible event. This allows a more condensed representation of our domain and captures the shared properties of many events.
10.2 Temporal Segmentation
Many of the recently explored event recognition approaches are actually event classification approaches, in the sense that, as input, they are given a video segment in which an event occurs. The problem of event recognition is then reduced to determining which of the known kinds of events is the one that occurred. Probabilistic formalisms such as the HMM usually take this approach, though some systems adjust for this bias by adding an additional "unknown" or "nothing happened" event model and comparing its computed likelihood on the observation to that of the other known event models. The approach presented in this paper does not assume that the input video sequence includes a specific number of events, or any events at all. It utilizes a single PN model of the event domain to describe everything that can occur, and recognizes events by applying semantic knowledge to an object abstraction of the input video sequence. This empowers it to receive as input video sequences that contain any number of interesting semantic events. We have seen this in our experiments, from the security scenario, in which a visitor evades the guard at the same
time as a security check on another visitor is in progress. We have also seen this in our CAVIAR dataset experiments, where in some input video sequences several objects simultaneously interact with the environment, while in others there are no objects in the frame at all. In all these cases our approach is able to classify the multiple events of the video sequence correctly. Furthermore, we have seen that the temporal extent of a particular event does not have to be constant in each instance of the event. For example, in our security check experiment we are able to detect particular security check events even when they contain significant variance in their duration. This robustness to temporal extent and natural temporal segmentation is achieved by defining events using semantic knowledge about the nature of their composition, rather than by relying on a statistical model with abstract states whose parameters are tuned in a data-driven process that is much more sensitive to variance in the duration of the event.
10.3 Limitations of the PN formalism and future work
Dealing with uncertain observations, scaling to larger problems, and automating model learning are all challenges facing the current specification of the PN formalism. Uncertain observations, which are handled well in fully stochastic models such as HMMs, may cause problems in the deterministic setting of the PN formalism. One direction for possible future work to address this is to allow the marking to be represented as a stochastic
variable according to probabilities associated with the confidence in the observation of the tokens. Scalability to larger problems may be handled by considering modular Petri Net fragments analyzing simple situations, nested within one another to describe a more complex situation. Although a PN model is specified by semantic human knowledge, in some cases semantic relations can be inferred from data using supervised learning techniques. Another area of future work is isolating important relations among scene entities involved in an event and utilizing this information to semi-automate the human process of constructing a PN event model (which is itself based on the human's supervised learning of semantic relations).
11 Conclusion
In this paper we propose SERF-PN, a framework for the representation and recognition of video events based on the Petri Net (PN) formalism. This formalism allows straightforward specification of semantic event domain knowledge, including object properties and relationships. We extend the representational capacity to allow some variance in the duration of events by allowing the parameters of stochastic timed transitions to be learned from data. We also propose learning a predictive mechanism using the method of dynamic marking analysis. SERF-PN is aimed at aiding online monitoring of surveillance video and offline annotation
for content-based retrieval. This system is based on a Petri Net (PN) event representation. We show, using extensive examples, how the proposed formalism may be used to construct an event model in a particular domain. We then formalize the algorithm for event recognition and offer a formal complexity analysis, which shows that, under reasonable assumptions regarding the number of objects simultaneously appearing in the video sequence, our proposed approach is efficient for this task. In the experimental section, we illustrate the strengths of the proposed approach in recognizing events in unlabeled video and provide a comparison of the results achieved by our approach with those of other recognition approaches on the same dataset. The PN formalism allows the expression of complex semantic knowledge regarding an event domain in a straightforward way. It is our contention that, for this reason, it is more suitable for representing event domains where there exists great variance in the appearance and duration of events. The nature of these events can usually be specified using a few semantic rules, yet their recognition is often challenging for formalisms that make use of abstract parameters learned from data.
References

[1] R. Nevatia, T. Zhao, and S. Hongeng, "Hierarchical language-based representation of events in video streams," in Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'03), vol. 4, 2003, p. 39.

[2] R. Nevatia, J. Hobbs, and B. Bolles, "An ontology for video event representation," in Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'04), vol. 7, 2004, p. 119.

[3] A. Hakeem, Y. Sheikh, and M. Shah, "Casee: A hierarchical event representation for the analysis of videos," in Eighteenth National Conference on Artificial Intelligence (AAAI), 2004, p. 263.

[4] A. Hakeem and M. Shah, "Multiple agent event detection and representation in videos," in Nineteenth National Conference on Artificial Intelligence (AAAI), 2005, p. 89.

[5] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, "Actions as space-time shapes," in ICCV '05: Proceedings of the Tenth IEEE International Conference on Computer Vision. Washington, DC, USA: IEEE Computer Society, 2005, pp. 1395-1402.

[6] L. Zelnik-Manor and M. Irani, "Statistical analysis of dynamic actions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 9, pp. 1530-1535, 2006.

[7] M. Pittore, C. Basso, and A. Verri, "Representing and recognizing visual dynamic events with support vector machines," 1999, pp. 18-23.

[8] A. Howell and H. Buxton, "Gesture recognition for visually mediated interaction," 1999, p. 141.

[9] W. N. Lie, T. C. Lin, and S. H. Hsia, "Motion-based event detection and semantic classification for baseball sport videos," in IEEE International Conference on Multimedia and Expo, vol. 3, 2004, p. 1567.

[10] S. Intille and A. Bobick, "Representation and visual recognition of complex, multiagent actions using belief networks," in IEEE Workshop on the Interpretation of Visual Motion, 1998.

[11] H. Buxton and S. Gong, "Visual surveillance in a dynamic and uncertain world," Artificial Intelligence, vol. 78, no. 1, p. 431, 1995.

[12] J. Binder, D. Koller, S. Russell, and K. Kanazawa, "Adaptive probabilistic networks with hidden variables," Machine Learning, vol. 29, pp. 213-244, 1997.

[13] R. Higgins, "Automatic event recognition for enhanced situational awareness in UAV video," in Military Communications Conference (MILCOM 2005), vol. 3, 2005, p. 1706.

[14] L. Davis, R. Chellappa, A. Rosenfeld, D. Harwood, I. Haritaoglu, and R. Cutler, "Visual surveillance and monitoring," in DARPA Image Understanding Workshop, 1998, p. 73.
[15] J. Yamato, J. Ohya, and K. Ishii, "Recognizing human action in time-sequential images using Hidden Markov Model," in Computer Vision and Pattern Recognition (CVPR '92), 1992.

[16] T. Starner and A. Pentland, "Real-time American Sign Language recognition from video using hidden Markov models," in International Symposium on Computer Vision (ISCV'95), 1995, p. 265.

[17] A. Wilson and A. Bobick, "Recognition and interpretation of parametric gesture," in Sixth International Conference on Computer Vision (ICCV'98), 1998, p. 329.

[18] S. Hongeng and R. Nevatia, "Large-scale event detection using semi-hidden Markov models," in Ninth IEEE International Conference on Computer Vision (ICCV'03), 2003, p. 1455.

[19] N. Oliver, B. Rosario, and A. Pentland, "A Bayesian computer vision system for modeling human interactions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, p. 831, 2000.

[20] N. Oliver, E. Horvitz, and A. Garg, "Layered representations for recognizing office activity," in IEEE International Conference on Multimodal Interaction, 2002, p. 3.

[21] Y. Ivanov and A. Bobick, "Recognition of visual activities and interactions by stochastic parsing," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'98), vol. 22, 1998, p. 852.

[22] D. Moore and I. Essa, "Recognizing multitasked activities from video using stochastic context-free grammar," in National Conference on Artificial Intelligence, 2002, p. 770.

[23] M. Brand, "Understanding manipulation in video," in 2nd International Conference on Automatic Face and Gesture Recognition (FG '96), 1996.

[24] C. Castel, L. Chaudron, and C. Tessier, "What is going on? A high level interpretation of sequences of images," in Proc. of the Workshop on Conceptual Descriptions from Images (ECCV '96), Cambridge, UK, 1996, pp. 13-27.

[25] N. Ghanem, D. DeMenthon, D. Doermann, and L. Davis, "Representation and recognition of events in surveillance video using Petri nets," in CVPRW '04: Proceedings of the 2004 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'04) Volume 7. Washington, DC, USA: IEEE Computer Society, 2004, p. 112.

[26] M. Albanese, R. Chellappa, V. Moscato, A. Picariello, V. S. Subrahmanian, P. Turaga, and O. Udrea, "A constrained probabilistic Petri net framework for human activity detection in video," IEEE Transactions on Multimedia, vol. 10, no. 6, pp. 982-996, October 2008.

[27] C. Pinhanez and A. Bobick, "Human action detection using PNF propagation of temporal constraints," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'98), 1998, p. 898.

[28] N. T. Nguyen, D. Q. Phung, S. Venkatesh, and H. Bui, "Learning and detecting activities from movement trajectories using the hierarchical hidden Markov models," in CVPR '05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 2. Washington, DC, USA: IEEE Computer Society, 2005, pp. 955-960.

[29] N. Oliver and E. Horvitz, "A comparison of HMMs and dynamic Bayesian networks for recognizing office activities," in User Modeling, 2005, pp. 199-209.

[30] Y. Shi, Y. Huang, D. Minnen, A. Bobick, and I. Essa, "Propagation networks for recognition of partially ordered sequential action," in CVPR, vol. 2, 2004, pp. 862-869.

[31] J. F. Allen, "Maintaining knowledge about temporal intervals," Communications of the ACM, vol. 26, p. 832, 1983.

[32] V. T. Vu, "Temporal scenarios for automatic video interpretation," Ph.D. dissertation, Universite de Nice Sophia Antipolis, France, 2004.

[33] G. Lavee, A. Borzin, M. Rudzsky, and E. Rivlin, "Building Petri Nets from video event ontologies," in International Symposium on Visual Computing, 2007.

[34] M. Marsan, G. Balbo, G. Conte, S. Donatelli, and G. Franceschinis, Modeling With Generalized Stochastic Petri Nets. John Wiley and Sons Press, 1995.

[35] R. B. Fisher et al., "CAVIAR project," homepages.inf.ed.ac.uk/rbf/caviar/, 2004.

[36] I. Leichter, M. Lindenbaum, and E. Rivlin, "Visual tracking by affine kernel fitting using color and object boundary," 2007, pp. 1-6.

[37] R. B. Fisher, "PETS 04 surveillance ground truth data set," in Sixth IEEE Int. Work. on Performance Evaluation of Tracking and Surveillance (PETS04), 2004, pp. 1-5.

[38] D. Tweed, R. Fisher, J. Bins, and T. List, "Efficient hidden semi-Markov model inference for structured video sequences," in 2nd Joint IEEE Int. Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS), Beijing, Oct 2005, pp. 247-254.

[39] R. B. Fisher et al., "CAVIAR hidden semi-Markov model behaviour recognition," http://homepages.inf.ed.ac.uk/rbf/caviar/hsmm.htm.