IEEE TRANSACTIONS ON IMAGE PROCESSING
A Distributed Topological Camera Network Representation for Tracking Applications

Edgar Lobaton, Member, IEEE, Ramanarayan Vasudevan, Student Member, IEEE, Ruzena Bajcsy, Fellow, IEEE, and Shankar Sastry, Fellow, IEEE
Abstract—Sensor networks have been widely used for surveillance, monitoring, and tracking. Camera networks, in particular, provide a large amount of information that has traditionally been processed in a centralized manner employing a priori knowledge of camera location and of the physical layout of the environment. Unfortunately, these conventional requirements are far too demanding for ad-hoc distributed networks. In this article, we present a simplicial representation of a camera network called the Camera Network Complex (CN-Complex) that accurately captures topological information about the visual coverage of the network. This representation provides a coordinate-free calibration of the sensor network and demands no localization of the cameras or objects in the environment. A distributed, robust algorithm, validated via two experimental setups, is presented for the construction of the representation using only binary detection information. We demonstrate the utility of this representation in capturing holes in the coverage, performing tracking of agents, and identifying homotopic paths.

Index Terms—Smart camera networks, network coverage, simplicial homology, multitarget tracking, sensor networks.
Copyright (c) 2010 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

E. Lobaton is with the Department of Computer Science, University of North Carolina at Chapel Hill, NC, 27599 USA (e-mail: [email protected]).

R. Vasudevan, R. Bajcsy, and S. Sastry are with the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, 94720 USA (e-mails: {ramv, bajcsy, sastry}@eecs.berkeley.edu).

Manuscript received August 28, 2009.

I. INTRODUCTION

Ever-increasing improvements to the resolution and frame rates of cameras have been problematic for camera networks, wherein data has traditionally been processed in an entirely centralized manner. Due to the high cost of transferring data, much of the processing has taken place off-line, compromising the overall effectiveness of the network. This observation has driven the deployment of distributed camera networks. An ad-hoc network is a distributed sensor network that is set up by placing sensors at random locations. Unfortunately, most vision-based algorithms for distributed networks demand a priori knowledge of camera and static object locations in the environment, which are generally either too expensive, too error prone, or too time consuming to recover. Given the desire to maintain the inherent adaptability of ad-hoc networks, it becomes essential to develop distributed algorithms that perform tasks such as coverage verification and tracking without explicit localization information.

In this article, we consider a sensor network wherein each node is a camera capable of performing local computation to extract discrete observations corresponding to agents either entering or exiting its field of view, observations that may be transmitted to other nodes for further processing. These observations are used to build a representation of the network coverage without any prior localization information about the cameras or objects. This representation is an abstract simplicial complex referred to as the Camera Network Complex, or CN-Complex.

Fig. 1. The CN-Complex for a network of three cameras constructed using the methodology presented in this article: the views from different cameras (bottom-left), the planar projection of the coverage of the three cameras and their decomposed views (each region is given a different letter) due to occluding objects (top), and the simplicial complex built by finding the overlap between the decomposed views of the cameras (bottom-right). Edges between decomposed views denote pairwise overlap in coverage and blue triangles denote three-way overlap in coverage. The simplicial complex correctly contains a single hole (i.e., the loop with vertices 1a, 1b, 3b, 3c and 3d) that corresponds to the column which acts as an occluding object in the physical coverage.

Our contributions in this article are three-fold: first, we develop the CN-Complex to accurately capture all topological information about the visual coverage of the camera network; second, we present and provide experimental validation of an algorithm to distributedly construct the representation; third, we describe the utility of the representation by constructing
algorithms to perform tracking of multiple agents within the camera network. An example of the representation constructed by employing the algorithms proposed in this article can be found in Figure 1. Importantly, notice that the representation is able to detect the hole in coverage corresponding to the static pillar, in addition to correctly capturing the overlap in coverage between cameras.

This article is organized as follows: a taxonomy of representations used to capture network coverage is presented in Section II; the tools from algebraic topology used throughout this article are reviewed in Section III; the mathematical models and assumptions of the environment under consideration are described in Section IV; the CN-Complex and the distributed algorithm for its construction are presented in Sections V and VI, respectively; finally, an overview of how to perform homotopic path identification and tracking using our simplicial representation is presented in Section VII.

II. RELATED WORK

In this section, we provide a taxonomy of the various camera network coverage representations. These representations can be placed on a spectrum according to the amount of geometric information they provide: at one extreme are Vision Graphs, which only provide information about coverage overlap between pairs of cameras, and at the other extreme are full-metric 3D models, which explicitly capture camera and object localization. Though we employ this notion of information to distinguish between these various representations, the methods chosen to recover them (usually either an appearance or a co-occurrence algorithm) are not mutually exclusive.

Vision Graph

A Vision Graph is a graph where each node represents a camera's coverage and edges specify an overlap in coverage. The graph provides connectivity information about the network, but provides no other geometric information about the network coverage (e.g., the holes in the coverage). Cheng et al. [1] build a Vision Graph distributedly by broadcasting feature descriptors of each camera view through the network to establish correspondences between cameras. In contrast, Marinakis et al. [2], [3] build a Vision Graph by comparing reports of detections between cameras; they use a Markov model for the transition probabilities and minimize a functional using Markov Chain Monte Carlo sampling.

Simplicial Representation

Several authors have attempted to improve upon the connectivity information provided by the Vision Graph by incorporating geometric information from the environment into the representation. This work has focused on the detection and recovery of holes in the coverage due to the environment. Prior work has relied mostly on considering symmetric coverage (explicitly or implicitly) or high-density sensor coverage in the field. de Silva et al. [4] obtain the Rips complex based on the communication graph of the network and compute homologies using this representation. Their method assumes some symmetry in the coverage of each sensor node (such
as circular coverage); however, this assumption is invalid in camera networks. Muhammad et al. [5] have also worked on the distributed computation of simplicial homology for more general sensor networks using the communication graph of the network, but their work provides no experimental validation.

The CN-Complex, the focus of this manuscript, is a simplicial complex introduced by Lobaton et al. [6]. The construction relies on the decomposition of the image domain of each camera using the occluding contours corresponding to static objects. The CN-Complex is proven to capture the homotopy type of the coverage of the camera network (e.g., it captures the holes in coverage and the overlap in coverage between cameras) for 3D environments with vertical walls. However, there are no guarantees for generic 3D environments. A distributed algorithm for its construction under noisy observations and multiple targets in the environment, using reports of detections, is also considered by Lobaton et al. [7]. This latter work only considered an indoor three-camera, two-target example which took several hours to set up. In this article, we extend their example to an indoor and outdoor eight-camera, five-participant setup that took approximately fifteen minutes to set up. Moreover, we describe how this additional geometric information can be exploited to perform tracking of multiple agents within the camera network.

Activity Topology

Activity Topology refers to the model obtained after identifying specific regions within the image domains of different camera views that correspond to the same physical location. Contrast this with the Vision Graph, wherein the entire image domain is compared to establish overlap in coverage. This representation moves closer to a full metric reconstruction; however, little effort has been made to exploit this information to characterize network coverage. Makris et al. [8] construct this representation by applying an appearance model to observed data in order to determine the overlap between different portions of views. Van den Hengel et al. [9] introduce an exclusion approach to calculate the Activity Topology by starting with all possible combinations of topological connections and removing inconsistent links, again using an appearance model. Detmold et al. [10] provide algorithms for large network setups, and an evaluation of the method and datasets are made available [11]. Though their method only relies on the detection of a target and avoids the use of appearance models, it has the unfortunate shortcoming of requiring the continuous streaming of detections from each camera.

Full-Metric Model

Full-metric models capture all geometric information about camera location (i.e., positions and orientations), which then determines the overlap between cameras as long as there are no objects in their fields of view. When objects are present in the environment, it is necessary to recover their locations in order to properly characterize the coverage of the network. Unfortunately, the amount of computation and the difficulty of constructing robust algorithms to accurately localize cameras are nontrivial. Stauffer et al. [12] determine connectivity between overlapping
camera views by calculating correspondence models and extracting homographies. Lo Presti et al. [13] compute homographies by approximating tracks using piecewise linear segments and appearance models. Meingast et al. [14] utilize tracks and radio interferometry to fully localize the cameras. Rahimi et al. [15] describe a simultaneous calibration and tracking algorithm (using a network of non-overlapping sensors) that relies on velocity extrapolation for a single target. All of these algorithms work in a centralized fashion. Funiak et al. [16] introduce a distributed algorithm for simultaneous localization and tracking with a set of overlapping cameras, but their algorithm is not robust to large changes in perspective.

Though all the representations considered in this section provide useful information about the deployment of cameras in a network, we employ the CN-Complex since it provides the flexibility required in an ad-hoc network (i.e., it is robustly computable in a distributed fashion) while not sacrificing valuable geometric information about the environment.

III. MATHEMATICAL BACKGROUND

In this section, the concepts from algebraic topology used throughout this manuscript are introduced. This section contains material adapted from [4] and is not intended as a formal introduction to the topic. For a proper introduction, the reader is encouraged to read [17], [18], [19].

A. Simplicial Homology

In order to characterize network coverage, we employ the fundamental construct of algebraic topology: the simplex.

Definition 1. Given a collection of vertices V, a k-simplex is a set [v1 v2 v3 . . . vk+1] where vi ∈ V and vi ≠ vj for all i ≠ j. A (k − 1)-simplex, s1, is a face of a k-simplex, s2, denoted s1 ≺ s2, if the vertices of s1 form a subset of the vertices of s2. A finite collection of simplices, Σ, is called a simplicial complex if whenever a simplex lies in the collection then so does each of its faces.

Simplices are defined as purely combinatorial objects whose vertices are just labels requiring no coordinates in space. Constructing simplices given a collection of sets is the focus of this article, and we consider the simplest of such methods next.

Definition 2. The nerve complex of a collection of sets, S = {Si}_{i=1}^N, for some N > 0, is the simplicial complex where vertex vi corresponds to the set Si and whose k-simplices correspond to non-empty intersections of k + 1 distinct elements of S.
Fig. 2. A collection of sets corresponding to the coverage of a camera network (left) with corresponding nerve complex (top-right) and tracking graph (bottom-right).
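To make Definition 2 concrete, the following sketch computes the nerve of a small cover by brute force. It is a minimal illustration of ours, not code from the paper: the grid-cell sets standing in for camera footprints are hypothetical, and intersections are tested directly rather than estimated from detections as in Section VI.

    from itertools import combinations

    def nerve(cover, max_dim=2):
        """Nerve complex (Definition 2) of a collection of sets, up to 2-simplices.

        cover: dict mapping a vertex label to the set it covers. A (k-1)-simplex
        is kept whenever its k vertices have a common non-empty intersection."""
        labels = sorted(cover)
        simplices = []
        for k in range(1, max_dim + 2):  # a (k-1)-simplex has k vertices
            for combo in combinations(labels, k):
                if frozenset.intersection(*[frozenset(cover[v]) for v in combo]):
                    simplices.append(combo)
        return simplices

    # Toy cover: three "camera footprints" over grid cells, pairwise overlapping
    # but with an empty triple intersection, so the nerve is a hollow triangle.
    cover = {1: {(0, 0), (0, 1)}, 2: {(0, 1), (1, 1)}, 3: {(1, 1), (0, 0)}}
    print(nerve(cover))
    # [(1,), (2,), (3,), (1, 2), (1, 3), (2, 3)] -- a loop, i.e., a 1-cycle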
It is also useful to consider various modifications that improve the overall utility of the simplicial complex.

Definition 3. The tracking graph, T(Σ), of a simplicial complex, Σ, is a directed graph such that the vertices correspond to simplices in Σ and an edge from s1 to s2 is present if s1 ≺ s2.

To illustrate these concepts, consider the 2D network illustrated in Figure 2. Each camera position has a label from 1 to 7, and their corresponding fields of view, Si, are shaded in the plane. The nerve complex consists of the simplices in Table I. The nerve of the collection is depicted graphically on the top-right of Figure 2, where 0-simplices are represented by nodes, 1-simplices by edges, and 2-simplices by triangles. Although in this setup the nerve complex captures all topological information with regard to network coverage, in Section V we illustrate a setup where the nerve complex is unable to accurately characterize the network coverage. The corresponding tracking graph for the simplicial complex is shown on the bottom-right of Figure 2.

TABLE I
LIST OF SIMPLICES FOR EXAMPLE IN FIGURE 2

0-simplices: [1], [2], [3], [4], [5], [6], [7]
1-simplices: [1 2], [1 3], [1 5], [2 3], [3 4], [3 5], [4 6], [4 7], [5 6], [6 7]
2-simplices: [1 2 3], [1 3 5], [4 6 7]

Beyond providing tools to rigorously characterize network coverage, algebraic topology allows us to define meaningful algebraic structures on these simplices.

Definition 4. Let {si}_{i=1}^N be the k-simplices of a given complex, for some N > 0. Then, the group of k-chains, Ck, is the free Abelian group generated by {si}. That is, σ ∈ Ck if and only if

σ = α1 s1 + α2 s2 + · · · + αN sN

for some αi ∈ Z. If there are no k-simplices, then Ck := 0. Similarly, C−1 := 0.

This definition allows us to construct an algebraic operation which characterizes topological invariants.

Definition 5. Let the boundary operator ∂k applied to a k-simplex s = [v1 v2 · · · vk+1] be defined by

∂k s = Σ_{i=1}^{k+1} (−1)^{i+1} [v1 v2 · · · vi−1 vi+1 · · · vk+1],

and extended to any σ ∈ Ck by linearity.

A k-chain, σ ∈ Ck, is called a k-cycle if ∂k σ = 0. The set of k-cycles, denoted
by Zk, is the kernel of ∂k and forms a subgroup of Ck; that is, Zk := ker ∂k. A k-chain σ ∈ Ck is called a k-boundary if there exists ρ ∈ Ck+1 such that ∂k+1 ρ = σ. The set of k-boundaries, denoted by Bk, is the image of ∂k+1 and is also a subgroup of Ck; that is, Bk := im ∂k+1. We can check that ∂k(∂k+1 σ) = 0 for any σ ∈ Ck+1, which implies that Bk is a subgroup of Zk.

We now make several important observations. First, the boundary operator, ∂k, maps a k-simplex to its (k − 1)-simplicial faces. Second, a calculation shows that the 1-simplices that form a closed loop correspond to the group of 1-cycles (a proof of this fact can be found in any of the algebraic topology books we reference at the beginning of this section). We are interested in detecting the holes in our domain due to static objects. Unfortunately, these types of holes correspond to just a subset of Z1; namely, 1-cycles can also be obtained from the 2-boundaries of a given complex. This observation motivates the definition of homology groups, which define a type of topological invariant.

Definition 6. The k-th homology group is the quotient group Hk := Zk/Bk. The homology of a complex is the collection of all homology groups.

The rank of Hk, called the k-th Betti number, βk, gives us a coarse measure of the number of holes. In particular, β0 is the number of connected components and β1 is the number of loops that enclose different "holes" in the complex.

Considering again the example from Figure 2, we can represent the group of 0-chains, C0, with the vector space R7 by identifying the simplices {[1], [2], · · · , [7]} with the standard basis vectors {v1, v2, · · · , v7}, where v1 = (1, 0, 0, 0, 0, 0, 0)⊤ and so on. For C1, we identify the 1-simplices with the standard basis vectors {e1, e2, · · · , e10}, where e1 = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)⊤ and so on. Similarly, for C2, the 2-simplices are identified with the standard basis vectors {f1, f2, f3}, where f1 = (1, 0, 0)⊤ and so on. As mentioned earlier, ∂k is the operator that maps a simplex σ ∈ Ck to its boundary faces. For example, we have

∂2 [1 2 3] = [2 3] − [1 3] + [1 2]   or, equivalently,   ∂2 f1 = e4 − e2 + e1,
∂1 [4 6] = [6] − [4]                 or, equivalently,   ∂1 e7 = v6 − v4.
That is, ∂k can be expressed in matrix form. With the rows ordered as the 1-simplices of Table I,

\partial_1 = \begin{bmatrix}
-1 &  1 &  0 &  0 &  0 &  0 &  0 \\
-1 &  0 &  1 &  0 &  0 &  0 &  0 \\
-1 &  0 &  0 &  0 &  1 &  0 &  0 \\
 0 & -1 &  1 &  0 &  0 &  0 &  0 \\
 0 &  0 & -1 &  1 &  0 &  0 &  0 \\
 0 &  0 & -1 &  0 &  1 &  0 &  0 \\
 0 &  0 &  0 & -1 &  0 &  1 &  0 \\
 0 &  0 &  0 & -1 &  0 &  0 &  1 \\
 0 &  0 &  0 &  0 & -1 &  1 &  0 \\
 0 &  0 &  0 &  0 &  0 & -1 &  1
\end{bmatrix}^\top

and

\partial_2 = \begin{bmatrix}
1 & -1 &  0 & 1 & 0 & 0 & 0 &  0 & 0 & 0 \\
0 &  1 & -1 & 0 & 0 & 1 & 0 &  0 & 0 & 0 \\
0 &  0 &  0 & 0 & 0 & 0 & 1 & -1 & 0 & 1
\end{bmatrix}^\top.

Since C−1 = 0 and rank(∂1) = 6,
H0 = Z0/B0 = ker ∂0/im ∂1 = C0/im ∂1, and β0 = dim(H0) = dim(C0) − rank(∂1) = 1. Hence, we recover the fact that there is only a single connected component in Figure 2. Similarly, it can be verified that β1 = dim(H1) = dim(ker ∂1) − rank(∂2) = 1, which tells us that the number of holes in our coverage is 1. Observe that Hk = 0 for k > 1 (since Ck = 0).

B. Čech Theorem

Next, we introduce the Čech Theorem, which is proved by Bott et al. [20] (page 98). Before proceeding further, the following definition is required:

Definition 7. Given two spaces X and Y, a homotopy between two continuous functions f0 : X → Y and f1 : X → Y is a 1-parameter family of continuous functions ft : X → Y for t ∈ [0, 1] connecting f0 to f1. Two spaces X and Y are said to be of the same homotopy type if there exist functions f : X → Y and g : Y → X with g ◦ f homotopic to the identity map on X and f ◦ g homotopic to the identity map on Y. A set X is contractible if the identity map on X is homotopic to a constant map.

Put simply, two functions are homotopic if it is possible to continuously deform one into the other. A space is contractible if it is possible to continuously deform it into a single point. Two spaces with the same homotopy type have the same homology.

Theorem 1. (Čech Theorem) If the sets {Si}_{i=1}^N (for some N > 0) and all nonempty finite intersections are contractible, then the union ∪_{i=1}^N Si has the same homotopy type as the nerve complex.

If the required conditions are satisfied, then the topological structure dictated by the union of the sets is captured by the nerve complex. Observe that in Figure 2 all intersections are contractible. Therefore, we conclude that the extracted nerve complex has the same homology as the space formed by the union of the coverage.
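The homology computations above reduce to two matrix ranks. The following sketch of ours reproduces β0 = β1 = 1 for the complex of Figure 2 and Table I; numpy's rank is computed over the reals, which suffices for the ranks used here even though the chains themselves carry integer coefficients.

    import numpy as np

    # Complex of Fig. 2 / Table I: vertices 1..7, ten edges, three triangles.
    edges = [(1,2), (1,3), (1,5), (2,3), (3,4), (3,5), (4,6), (4,7), (5,6), (6,7)]
    triangles = [(1,2,3), (1,3,5), (4,6,7)]

    # d1 : C1 -> C0, where the column of edge [a b] is v_b - v_a.
    d1 = np.zeros((7, len(edges)))
    for col, (a, b) in enumerate(edges):
        d1[a - 1, col], d1[b - 1, col] = -1.0, 1.0

    # d2 : C2 -> C1, where the column of [a b c] is [b c] - [a c] + [a b]
    # (Definition 5 with alternating signs).
    d2 = np.zeros((len(edges), len(triangles)))
    for col, (a, b, c) in enumerate(triangles):
        for sign, face in [(1.0, (b, c)), (-1.0, (a, c)), (1.0, (a, b))]:
            d2[edges.index(face), col] = sign

    r1 = np.linalg.matrix_rank(d1)       # 6
    r2 = np.linalg.matrix_rank(d2)       # 3
    beta0 = 7 - r1                       # dim C0 - rank d1 = 1 component
    beta1 = (len(edges) - r1) - r2       # dim ker d1 - rank d2 = 1 hole
    print(beta0, beta1)                  # 1 1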
IV. THE ENVIRONMENT MODEL

In this section, the models used for the physical layout, camera sensors, and agents in the environment are made explicit.

The Environment: Consider a 3D domain and a collection of objects with the following properties:
• There is a global coordinate system, Ψ. Points in this coordinate system are denoted by (x, y, z).
• Objects and cameras reside between the planes z = 0 (the "floor") and z = hE (the "ceiling").
• Objects are sets of the form O = {(x, y, z) | (x, y) ∈ P, z ∈ [0, hE]}, where P is a connected polygon with non-empty interior, and the number of objects, No, is finite.

Agents: Agents are represented by the following properties:
• They are line segments of the form

A = { (x, y, z) | (x, y, 0) ∉ ∪_{i=1}^{No} Oi, z ∈ [0, hA] },

where hA ≤ hE.
• Agents move continuously in the environment by translating along the floor plane and changing their (x, y) location.
Cameras: Cameras are located at static, unknown locations, are capable of detecting the agents, and satisfy the following properties:
• Each camera α has a local coordinate system, Ψα, where the origin, oα, corresponds to the camera position. Points in this coordinate system are denoted by (xα, yα, zα).
• The field of view, Fα, of camera α is given by Fα = {(xα, yα, zα) | (xα, yα, zα) ∈ Q, zα > 0}, where Q is the interior of a polyhedral convex cone based at oα.
• The camera projection, Πα : Fα → R2, for camera α is given by Πα(xα, yα, zα) = (xα/zα, yα/zα). The image of this map is called the image domain, Ωα.
• The detection set, Dα,a, of agent a in camera α is defined as

Dα,a = { Πα(p) | p = (xα, yα, zα), p ∈ Aa ∩ Fα, poα ∩ (∪i Oi) = ∅ },

where pq denotes the line segment joining points p and q. An agent is said to be visible by camera α if Dα,a ≠ ∅.
• The coverage, Cα, of camera α is given by

Cα = {(x, y) | an agent at (x, y) is visible by camera α}.

The network coverage, C, is the union of the individual coverages of the cameras.

Objects are static elements in the environment, while agents are dynamic elements. We assume the existence of a global coordinate system for the sake of clarity while specifying the location of objects, agents, or cameras, but its calculation is not required. Although these assumptions may seem restrictive, most camera networks (indoor or outdoor) satisfy them. Several of the choices in our model (such as the vertical line target and polyhedral objects) are made in order to simplify analysis. We validate these assumptions in real-life scenarios through experiment.

Fig. 3. Mapping from 3D to 2D: A camera and its field of view are shown from multiple perspectives (left and middle), and its corresponding coverage as a set in 2D (right). For the 3D configuration, the planes displayed bound the space that can be occupied by an agent.

The example in Figure 3 shows an agent and a camera with its corresponding field of view. On the right plot of the figure, we illustrate the coverage of the camera as a subset of R2. There is a clear mapping from our 3D scenario to a 2D domain. Our goal, in this article, is to obtain a representation that captures the topological structure of camera network coverage while not relying on a priori knowledge of camera or object location.
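As an illustration of these definitions, the sketch below implements the projection Πα and an approximate visibility test. It is a toy of ours, not the paper's implementation: the alignment of the camera frame, the box-shaped obstacle, the sampled sight line, and the omission of the field-of-view cone test are all simplifying assumptions.

    import numpy as np

    def project(p):
        """Camera projection Pi_alpha(x, y, z) = (x/z, y/z) on the field of view."""
        return np.array([p[0] / p[2], p[1] / p[2]])

    def blocked(p, q, box, n=100):
        """Approximate test of whether segment pq crosses an axis-aligned box
        (xmin, xmax, ymin, ymax, zmin, zmax); sampled, so a sketch only."""
        for t in np.linspace(0.0, 1.0, n):
            x, y, z = p + t * (q - p)
            if box[0] <= x <= box[1] and box[2] <= y <= box[3] and box[4] <= z <= box[5]:
                return True
        return False

    def detection_set(agent_pts, cam, obstacles):
        """Detection set D_{alpha,a}: image points of agent samples whose sight
        line to the camera origin misses every occluding object. The cone
        membership test of F_alpha is omitted for brevity."""
        return [project(p - cam) for p in agent_pts
                if not any(blocked(p, cam, b) for b in obstacles)]

    # One agent (a vertical segment, sampled at a few heights y) 4 m in front of
    # the camera, and a box-shaped pillar 2 m in front of it; all coordinates
    # are given directly in the camera frame (z = optical axis, y = height).
    agent = [np.array([0.0, y, 4.0]) for y in np.linspace(-1.0, 0.8, 5)]
    cam = np.zeros(3)
    pillar = (-0.5, 0.5, -1.0, 2.0, 1.5, 2.5)
    print(detection_set(agent, cam, [pillar]))   # []: the pillar occludes the agent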
V. THE CN-COMPLEX

The goal of this section is to outline the steps required to construct a simplicial complex that captures accurate topological information about the network coverage, C. A naïve approach considers the nerve complex constructed from the collection of camera coverages, {Cα}, as illustrated in Figure 4 (left). Unfortunately, the middle diagram in this figure demonstrates that this approach fails. In this case, the hypothesis of the Čech Theorem is unsatisfied (C1 ∩ C2 is not contractible). If, on the other hand, we first decompose camera 1's coverage along lines corresponding to the object's boundaries and consider the nerve complex constructed from these four coverage sets, as done in the right diagram of Figure 4, then we capture the hole in the coverage. In fact, this construction accurately captures the topological structure of the camera network coverage.

Fig. 4. Nerve complexes (bottom row) obtained from the collection {Cα} (top row). One complex captures the correct topological information (bottom-left), but the other does not (bottom-middle) unless the coverage is properly decomposed (bottom-right).

This decomposition of a camera's coverage can be obtained via local detections from each camera without knowledge of camera position. To illustrate this approach, consider tracking the set of detections in the image domain, Ωα, of a camera α as
agents move in the physical environment. We would observe line segments that either leave Ωα through its boundary or disappear from the interior of the domain, outlining the occluding contours due to the vertical objects in the domain. If we extend these line segments, we obtain bisecting lines that decompose the image domain and thus the camera coverage. We describe the algorithm to perform this task in Section VI-A.

Definition 8. The nerve complex constructed after decomposing each camera's field of view by its corresponding bisecting lines is called the CN-Complex.

Its construction consists of two steps:
1) Identify all bisecting lines to decompose each camera's coverage.
2) Construct the nerve complex on the resulting collection of sets by determining whether there is an overlap between cameras.

This construction guarantees that all finite intersections between the resulting sets are contractible, which together with the Čech Theorem yields the following result, whose proof can be found in [6]:

Theorem 2. (Decomposition Theorem) Given an environment and camera network that satisfy the modeling assumptions from Section IV, if the coverage of each camera is connected, then the nerve complex of the collection of sets obtained after decomposing each camera's field of view by its corresponding bisecting lines has the same homotopy type as the network coverage.

Since we are only interested in recovering the homotopy type of a 2D coverage, due to our environmental model, we need only consider a nerve complex constructed with a maximum of 2-simplices.

VI. DISTRIBUTED ROBUST CN-COMPLEX CONSTRUCTION

In this section, we describe the algorithms required to distributedly construct the CN-Complex.

A. Finding Bisecting Lines

First, we address the problem of detecting the bisecting lines that decompose the image domain of a camera. To this end, we assume we have a background subtraction algorithm: thresholding the difference between a background image and the current frame is sufficient. An example of the output of this algorithm on several sample images can be found in Figure 5. Unsurprisingly, this algorithm does not perform particularly well.

Fig. 5. Sample frames from a video sequence employed to find bisecting lines and construct the CN-Complex presented in the experimental Section VI-D. Actual images (top) and corresponding detections obtained by frame differencing (bottom). Note the unreliability of the foreground segmentation results.

We then utilize an algorithm presented by Jackson et al. [21], which consists of accumulating the boundary of foreground objects wherever partial occlusions are detected. In our case, we only store the detections at times when occlusion events (i.e., an agent appears or disappears from a camera's view) occur. We are uninterested in the exact boundary of the objects, but only in the bisecting lines. Hence, we take the approach of first approximating any occluding boundary with vertical lines and then refining the fit. This step is done locally at each node.

In Figure 6, we observe a camera view with several occluding boundaries due to walls and a column (left). The accumulated boundaries of the foreground detections are shown in the middle. Initial estimates for the boundaries are chosen at the peaks of the distributions of detections along each column (middle-bottom). Finally, the estimates are refined by performing a least-squares fit on the data with respect to all the points on the boundary that are close to the vertical line estimates. The final result is shown in the right plot.

Fig. 6. Steps to find bisecting lines: for the original view (left), the boundaries of the foreground masks are accumulated whenever occlusion events are detected (top-middle). Vertical bisecting lines are estimated by aggregating observations over all rows and obtaining the indices of the columns with the highest detections (bottom-middle). Bisecting lines are further refined through a linear fit procedure using the accumulated observations (right).
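A minimal sketch of this peak-then-refine procedure follows; the window width, the number of lines, and the peak-suppression scheme are assumptions of ours rather than parameters reported above.

    import numpy as np

    def fit_bisecting_lines(boundary_pts, n_lines=3, win=10):
        """Estimate bisecting lines from boundary pixels accumulated at occlusion
        events. boundary_pts: (N, 2) array of (row, col) coordinates. Returns
        (slope, intercept) pairs for lines col = slope * row + intercept."""
        rows = boundary_pts[:, 0].astype(float)
        cols = boundary_pts[:, 1].astype(int)
        hist = np.bincount(cols)                     # detections per image column
        lines = []
        for _ in range(n_lines):
            peak = int(np.argmax(hist))              # initial vertical-line estimate
            if hist[peak] == 0:
                break
            support = np.abs(cols - peak) < win      # boundary points near the line
            # Refine with a least-squares fit over the supporting points.
            slope, intercept = np.polyfit(rows[support], cols[support].astype(float), 1)
            lines.append((slope, intercept))
            hist[max(peak - win, 0):peak + win] = 0  # suppress, then find next peak
        return lines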
B. Finding Intersect Points

After the bisections within each camera view have been calculated, we must determine the connectivity between the resulting decomposed views. Throughout this section, when referring to a camera view, we mean one of these decomposed regions. More specifically, we look for intersect points (i.e., points in the intersection of the fields of view of the cameras). No target identification is necessary, but recurrence over time is exploited. This is accomplished by approximating the probabilities of overlap using the number of times that an occlusion event occurs in a camera (hit count) and the count of concurrent detections with other cameras (match count). Each camera node is synchronized and has the following properties:
• Each camera has a unique ID.
• Agent detections are stored in detection sets, D(n), with corresponding times, t(n), where n is the frame number.
• A list of intersect points, Pts1, is maintained which contains the coordinates of each intersect point in the local image domain, the ID of the primary camera for which each point is also visible, the coordinates of each point in the image domain of the primary camera, and the corresponding hit and match counts.
• A list of intersect points, Pts2, is maintained which contains the coordinates of each intersect point in the local image domain, the IDs of the primary and secondary cameras for which each point is also visible, the coordinates of each point in the image domains of the primary and secondary cameras, and the corresponding hit and match counts.

The roles of the primary and secondary cameras and how to calculate the various lists are explained later in this section, but first we illustrate our approach by considering an example. Assume we have two cameras in a room of area 1, with region R1 in the coverage of camera 1 and region R2 in the coverage of camera 2. Let R12 be the intersection of these regions. Also assume that we have N independent agents, and that the probability of an agent's location is uniformly distributed over the room. We define Di as the event that there is a detection in Ri at a given instance, D̄i as the complementary event, and D12 as the event that there is a detection in their intersection. For simplicity, we assume just for this thought experiment that an agent is detected in Ri if and only if it is actually present (i.e., there are no errors in detection). Hence, we have

P(D1) = 1 − P(D̄1) = 1 − |R1^c|^N = 1 − (1 − |R1|)^N,   (1)

where, for a set A, A^c is the set complement over the room and |A| is its area. Then,

P(D12) = 1 − P(D̄12) = 1 − |R12^c|^N = 1 − (1 − |R12|)^N.   (2)

Therefore, the probability of detecting a target in R12 given a detection in R1 is the conditional probability

P(D12 | D1) = P(D12)/P(D1) = [1 − (1 − |R12|)^N] / [1 − (1 − |R1|)^N].   (3)

We observe that this quantity is a function of the overlap R12 and measures the amount of detections D1 that can be explained by observations D12 in the intersection. If R1 and R2 are disjoint, then P(D12 | D1) = 0. As the overlap increases, the probability increases. If R1 ⊂ R2, then P(D12 | D1) = 1. These observations are illustrated in Figure 7. Intuitively, we expect a similar behavior even if the agents are not uniformly distributed and there are detection errors. Hence, we use the following quantity as a measure of the overlap between two regions:

r12 = max(P(D12 | D1), P(D12 | D2)).   (4)

Fig. 7. Geometric depiction illustrating different overlapping configurations and corresponding detection probabilities for 3 agents in a square room of area 1. Intuitively, whenever R1 and R2 are disjoint, we obtain P(D12 | D1) = 0 (left). For a partial overlap, we expect a larger probability value (middle): if |R1| ≈ |R2| ≈ 0.01 and |R1 ∩ R2| ≈ 0.005, then |R12|/|R1| ≈ P(D12 | D1) ≈ 0.5. If we have perfect overlap, then we observe that P(D12 | D1) = 1 (right).

Note that this quantity has the following properties: r12 = 1 whenever R1 = R2, R1 ⊂ R2 or R2 ⊂ R1; and r12 = 0 whenever R12 = ∅. In our algorithm, r12 is employed as a direct measure of the confidence of overlap in coverage between cameras 1 and 2, which corresponds to a 1-simplex in our representation.

The quantity r12 seems an ideal measure to quantify overlap between camera regions; however, its direct computation requires the identification of detections in the overlap between two cameras in order to compute P(D12), which is unknown. Nevertheless, by assuming that detections in disjoint regions are independent from one another, we observe that

P(D2 ∧ D1) = P(D12) + P(D̄12) P(D1−2) P(D2−1),   (5)

where D1−2 represents a detection in the region R1 − R2 and D2−1 is similarly defined. We also have that P(D2−1) = P(D2) − P(D12) and P(D1−2) = P(D1) − P(D12). Then, we can express the previous equation as

P(D2 ∧ D1) = p + (1 − p)(P(D1) − p)(P(D2) − p),   (6)

where p := P(D12). Hence,

P(D2 | D1) = [p + (1 − p)(P(D1) − p)(P(D2) − p)] / P(D1).   (7)

We emphasize the dependence of this probability on the parameter p by using the notation Pp(D2 | D1). If we are given No samples {dk}_{k=1}^{No} from this distribution, we can solve for the maximum likelihood estimator by using the formula

p* = argmax_p ∏_{k=1}^{No} Pp((D2 | D1) = dk),   (8)

where p ∈ [0, min(P(D1), P(D2))] since R12 is a subset of R1 and R2. Remember that p* approximates P(D12), and we can use this quantity to estimate r12 using Equation (4), given that P(D1) and P(D2) are known. This gives

r12 = max(p*/P(D1), p*/P(D2)).   (9)
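A sketch of this estimation over a discretized parameter grid follows; the grid resolution and the clipping guard are implementation choices of ours, not parameters of the method above.

    import numpy as np

    def overlap_ratio(p1, p2, samples, grid=1000):
        """Estimate p* = P(D12) by maximum likelihood (Eq. 8) and return the
        overlap ratio r12 (Eq. 9). p1, p2: estimated P(D1), P(D2); samples:
        binary array, samples[k] = 1 if camera 2 also reported an event when
        camera 1 did (a draw from P(D2 | D1))."""
        ps = np.linspace(0.0, min(p1, p2), grid)
        # Conditional probability P_p(D2 | D1) from Eq. (7).
        cond = (ps + (1.0 - ps) * (p1 - ps) * (p2 - ps)) / p1
        cond = np.clip(cond, 1e-9, 1.0 - 1e-9)       # guard the log-likelihood
        k, n = samples.sum(), len(samples)
        loglik = k * np.log(cond) + (n - k) * np.log(1.0 - cond)
        p_star = ps[np.argmax(loglik)]
        return max(p_star / p1, p_star / p2)

    # Example: frequent concurrent detections suggest a strong overlap.
    rng = np.random.default_rng(0)
    print(overlap_ratio(0.3, 0.4, rng.binomial(1, 0.8, size=200)))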
In order to perform the computations to calculate r12, we require P(D1), P(D2), and samples from P(D2 | D1). In our algorithms, detections occur whenever there are agents entering or leaving the coverage of a camera. P(D1) and P(D2) can be estimated by counting the number of detections in each camera over time. The samples from P(D2 | D1) are obtained by having camera 1 broadcast its detections to the network and camera 2 keep a count of concurrent and missed detections. A similar analysis can be performed to estimate the probability of overlap between three regions, P(D123), by assuming
that P(D12) and P(D3) are known, and samples {dk}_{k=1}^{No} from P(D3 | D12) are available. Note that P(D12) is estimated from the previous argument. In this case, we get

q* = argmax_q ∏_{k=1}^{No} Pq((D3 | D12) = dk),   (10)

where q ∈ [0, min(P(D12), P(D3))] represents P(D123), and Pq(D3 | D12) is obtained from Equation (7) by replacing D1 with D12 and D2 with D3. Then, we can define the overlap ratio

r123 = max(q*/P(D1), q*/P(D2), q*/P(D3)).   (11)
Note that samples from P(D3 | D12) are not directly available. Instead, we use samples from P(D3 | D1 ∧ D2) as long as r12 is close enough to 1 (we use a value of 0.9 in our experiments), which guarantees a large overlap between R1 and R2. In our algorithm, r123 is employed as a direct measure of the confidence of overlap between three regions, which corresponds to a 2-simplex in our representation. Since we are only interested in recovering the homotopy type of a 2D coverage, we need only consider a nerve complex constructed with a maximum of 2-simplices; hence, we do not need to consider any higher "degree" overlap ratios.

It is possible to bound the estimated overlap ratios such that values above a given threshold are guaranteed to correspond to sufficient overlap between two regions. Nevertheless, such a bound would require a priori knowledge about the distribution of an agent's location, the number of agents, and the geometry of the environment; this information may be unavailable, making the calculation of such a cut-off impossible. Therefore, we employ an argument from algebraic topology called persistence to robustly analyze the observed data in order to avoid making undue assumptions when extracting topological information about the coverage. This approach is described in Section VI-C.

Next, we describe an algorithm to obtain samples from P(D2 | D1) and P(D3 | D1 ∧ D2) distributedly between two and three cameras, respectively. Locally, each camera makes observations and transmits detections after every occlusion event. The transmission of these detections serves as a trigger for other cameras to transmit their own detections. Once the detections have been shared, the estimates of the conditional probabilities are updated.

The algorithm's goal is to estimate the aforementioned conditional probabilities using the number of times that an
occlusion event occurs in a camera, the hit count, and the count of concurrent detections with other cameras, the match count. In the case of pairwise detections, we have a local and a primary camera (the camera in which the event is observed). In the case of three cameras, we have a local, a primary (the camera in which the event is observed), and a secondary camera. These counts are used to compute the overlap ratios defined in Equations (9) and (11). These ratios are computed over all intersect points, and the maximum of these quantities is used as an indicator of the likelihood of overlap between two or three cameras.

The first step in computing the hit and match counts for the intersect points is to perform occlusion event detection and transmit this information across the network. This process is outlined in Algorithm 1.

Algorithm 1 Event Detections
1: if the current frame number, n, is greater than 1 then
2:   if D(n) ≠ ∅ and D(n−1) = ∅ then
3:     Compute coordinates of detection points in D(n).
4:     Add coordinates, time t(n), and camera ID to the transmission queue.
5:   else if D(n) = ∅ and D(n−1) ≠ ∅ then
6:     Compute coordinates of detection points in D(n−1).
7:     Add coordinates, time t(n−1), and camera ID to the transmission queue.
8:   end if
9: end if

Whenever an event is detected, it serves as a trigger for information sharing between camera nodes. Once an event transmission is received by another camera node, it is used to update detection counts for the intersect points, Pts1. This process is outlined in Algorithm 2.

Algorithm 2 Update Detection Counts in Pts1
1: Receive detection coordinates D, time t, and ID from the primary camera.
2: Obtain local detections, Dloc, at time t.
3: Remove any entries in the list of intersect points, Pts1, that are unreliable.
4: for each detection coordinate from the primary camera do
5:   for each detection coordinate in Dloc do
6:     if an entry in Pts1 exists with matching coordinates (the ones we are iterating over) and the received camera ID then
7:       Increase the match count of that entry by 1.
8:     else
9:       Create an entry in Pts1 with the corresponding camera ID and coordinates, hit count set to 0, and match count set to 1.
10:    end if
11:  end for
12: end for
13: for each detection coordinate from the primary camera do
14:   if entries in Pts1 exist with the corresponding primary coordinates and primary camera ID then
15:     Increase the hit count of those entries by 1.
16:   end if
17: end for
18: Add the coordinates in D and Dloc of detection points for which there was a match, the primary and local camera IDs, and time t to the transmission queue.

The primary camera is the camera from which the event originated. Entries that are unreliable, based on low estimates of the conditional probabilities with high enough confidence, are removed as described in line 3. An entry is labeled unreliable if there are more than 10 observations and its probability of detection is below 0.1. Note that the first loop of the algorithm (starting at line 4) updates the match counts, while the second loop (starting at line 13) updates the hit counts. At the end of the algorithm (line 18), the observations of the local camera are transmitted throughout the network. These observations play the role of secondary observations while updating Pts2.

Once an event transmission and secondary observations are received by a camera node, the counts for the intersect points, Pts2, are updated using a process identical to Algorithm 2. The only differences are that Pts1 is replaced by Pts2, mentions of primary are replaced by primary and secondary, and no transmission is necessary at the end of this process.

Data storage and processing occur distributedly, and each camera maintains a local copy of the representation. At the completion of the algorithm, we have a list of simplices with an associated likelihood, computed by taking the maximum probability over all intersect points with vertices corresponding to the same simplex. This is a probabilistic version of the CN-Complex that is used in the next section to extract a deterministic CN-Complex.

C. Computing Homology

As described in Section III, homology provides topological invariants. The homology of our representation can be extracted distributedly using the algorithms proposed by Muhammad et al. [5]. According to the results of the previous section, different complexes are obtained as a function of the threshold chosen on the simplex likelihood (i.e., maintaining only those simplices that have a likelihood above a specified threshold τ). This, of course, means that the various homologies of these complexes are also a function of the threshold.
At this point, we employ a persistent homology approach [22], [23]. Namely, we do not choose a particular threshold τ on the probability, which would dictate only a single simplicial complex, but analyze the various homologies over the range of possible probabilities, τ ∈ [0, 1]. The outcome of this approach is a set of barcodes whose initial positions depict the birth of a new topological feature (a new connected component or hole) and whose final positions symbolize the disappearance of such a feature. We then choose a threshold according to the point with the most persistent topological feature (i.e., a threshold chosen amongst the set of possible thresholds corresponding to the range with the most consistent feature). This allows us to choose a threshold on the conditional probabilities considered in the previous subsection without relying on a priori knowledge about the distribution of an agent's location, the number of agents, or the geometry of the environment. Though we do not describe how to compute persistence, there exist efficient algorithms to perform this computation [22], [23]. An example is illustrated in Figure 9 for the experimental scenario considered in the next section. The x-axis in the right plot denotes a threshold over the likelihood, each blue line denotes the existence of a connected component or a hole as a function of this threshold, and the number of blue lines at a particular threshold denotes the number of connected components or holes.
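The following sketch makes the sweep explicit: it thresholds a likelihood-weighted complex at each τ and records the Betti numbers, whose constant stretches correspond to the bars of Figure 9. The closure step and the example weights are our own, and efficient persistence algorithms [22], [23] avoid recomputing ranks at every threshold.

    import numpy as np
    from itertools import combinations

    def betti(simplices):
        """(beta0, beta1) of a 2D simplicial complex via boundary-matrix ranks."""
        verts = sorted(s[0] for s in simplices if len(s) == 1)
        edges = sorted(s for s in simplices if len(s) == 2)
        tris = sorted(s for s in simplices if len(s) == 3)
        vi = {v: i for i, v in enumerate(verts)}
        ei = {e: i for i, e in enumerate(edges)}
        d1 = np.zeros((len(verts), max(len(edges), 1)))
        for c, (a, b) in enumerate(edges):
            d1[vi[a], c], d1[vi[b], c] = -1.0, 1.0
        d2 = np.zeros((max(len(edges), 1), max(len(tris), 1)))
        for c, (a, b, x) in enumerate(tris):
            for sgn, f in [(1.0, (b, x)), (-1.0, (a, x)), (1.0, (a, b))]:
                d2[ei[f], c] = sgn
        r1 = np.linalg.matrix_rank(d1) if edges else 0
        r2 = np.linalg.matrix_rank(d2) if tris else 0
        return len(verts) - r1, len(edges) - r1 - r2

    def sweep(weighted, taus):
        """Betti numbers of the thresholded complex for each tau in taus."""
        out = []
        for tau in taus:
            kept = set()
            for s, w in weighted.items():
                if w >= tau:                         # keep the simplex ...
                    for k in range(1, len(s) + 1):   # ... and all of its faces
                        kept.update(combinations(s, k))
            out.append((tau, betti(kept)))
        return out

    # Hypothetical likelihoods: a triangle whose 2-simplex is weakly supported.
    weighted = {(1,): 1.0, (2,): 1.0, (3,): 1.0,
                (1, 2): 0.9, (2, 3): 0.8, (1, 3): 0.7, (1, 2, 3): 0.2}
    print(sweep(weighted, [0.1, 0.5, 0.95]))
    # tau=0.1: filled triangle (1, 0); tau=0.5: hollow loop (1, 1); tau=0.95: (3, 0)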
Fig. 8. Experiment I setup: views from cameras 1 (top-left) through 3 (top-right) and corresponding detected bisecting lines (bottom row).
D. Experiment

In this section, we consider two experimental setups with three and eight cameras, respectively. Video was recorded simultaneously from all cameras using several computers, synchronized using the Network Time Protocol (NTP). We use no prior knowledge about camera locations, no appearance or tracking models, and no knowledge about the number of targets. The data was processed on a single computer by simulating a distributed camera network; namely, each camera was treated as an independent process. The amount of data transmitted and the processing required are small enough for the construction to occur distributedly on a sensor network platform such as CITRIC [24]. The simulation was performed in MATLAB by creating separate structures for each camera sensor, which maintained information about its current state and a list of in-queue and out-queue messages. A path connecting cameras with consecutive IDs is used as the communication graph. Every 0.1 seconds, a single message is transmitted from each camera to its communication neighbors. Messages were processed locally once they were received. We keep track of the number of messages generated by each camera sensor and the amount of data that goes through them.

The data in both experiments was captured at a resolution of 320 × 240 for all cameras, at about 10 frames per second, with continuous motion throughout the sequence. The number of people moving through the scene varied from one to five.
Fig. 9. CN-Complex found for Experiment I using a threshold value of τ = 0.5 (left). The persistence diagrams obtained from the experiment (right). The bars in the diagram correspond to topological features that are tracked over a range of threshold values τ. The start of a bar corresponds to the "birth" of a feature and its end corresponds to its "death" (e.g., a hole is introduced and then vanishes). Note that a single connected component and a single hole are the most persistent features. The hole is due to the column in the middle of the room.
Fig. 10. Plots of the number of block detections per occlusion event over time for Experiment I. We note that the events are relatively sparse (over an 8.5-minute period), and the number of blocks detected at each time step is under 15 in most cases. There were a total of 180, 42, and 270 events on the three cameras, respectively, for a total of 492 events in the network.
We computed intersect points and corresponding probabilities as described in Section VI-B. However, instead of considering every possible pixel as an intersect point, we split the image domain into blocks of size 20 × 20 and use these regions. The raw transmission of these detections using a single bit per block would need to be streamed at approximately 14 kB/min from each camera. However, in our experiments we observe less than 2 kB/min of data generated by each camera (see Tables II and III). Since our algorithm is event driven, its communication cost is unaffected by devices with higher temporal resolution.

Experiment I

In our first experiment, we utilized three cameras in an indoor environment. Each camera was connected to a different computer while recording the data. The sequence corresponds to about 8.5 minutes of recording, with the first 3.2 minutes corresponding to a single target, the next 3.3 minutes to a different single target, and the last 2 minutes to two targets moving in the environment. Images were captured at about 10 frames per second. The physical setup of our experiment is shown on the top of Figure 1. Views from the three cameras are shown in the top row of Figure 8. Importantly, note that though there is significant overlap between the three cameras, finding common features between views would be difficult due to the large change in perspective. The decomposed camera views (after finding bisecting lines) are shown at the bottom of Figure 8. There are three regions in camera 1, one region in camera 2, and five regions in camera 3 after decomposition.

Figure 9 (left) illustrates the corresponding simplicial complex after we threshold with a value of τ = 0.5. From the right plot, we observe that a single connected component and a single hole are the most persistent topological features in the coverage, as desired.

Since detections are only transmitted after an occlusion event, the transmission rate is low. Figure 10 shows a summary of the number of blocks in which a detection was observed for each camera over time. Table II shows the average number of communication packets associated with the construction of 1-simplices and 2-simplices (columns 1 and 2) transmitted over the whole experiment, the average amount of data generated by each camera (column 3), and the average amount of data transmitted from each camera (column 4). The latter quantity includes messages delivered to the entire network that did not originate at the given camera. The data size of a 1-packet is obtained by assigning four bytes to encode the time stamp, two bytes to specify the ID of the camera source, and two bytes to encode the coordinates in which there was a block detection. The data size of a 2-packet is obtained by assigning four bytes to encode the time stamp, four bytes to specify the IDs of the cameras associated with the detection, and four bytes per block for coordinates. Note that no additional compression is performed, and the amount of data generated by each camera is less than 12% of the amount of data required if we were to stream all of the detections.

Experiment II

In our second experiment, we utilized eight cameras: four indoors and four outdoors. There were a total of four computers used for data recording (one for each nearest pair in the physical layout). The sequence corresponds to about 10.5 minutes, during which the number of targets varied between 2 and 5. The physical layout of our experiment is shown in Figure 11. Setting up the network took less than 15 minutes of placing laptops at different locations and mounting cameras, which illustrates the inherent flexibility of an ad-hoc network. The decomposed camera views (after finding bisecting lines) are shown counterclockwise from the top-left to the bottom-right in the figure.

Figure 12 (left) illustrates the corresponding simplicial complex after we threshold with a value of τ = 0.5. From the right plot, we observe that a single connected component and a single hole are the most persistent topological features in the coverage, as desired. Table III shows the average number of packets associated with the construction of 1-simplices and 2-simplices (columns 1 and 2) transmitted over the whole experiment, the average amount of data generated by each camera (column 3), and the average amount of data transmitted from each camera (column 4). The sizes of the data packets are computed as before. Note that the amount of data generated by each camera is less than 7% of the amount of data required if we were to stream all of the detections.
TABLE II
SUMMARY OF DATA TRANSMISSION FOR EXPERIMENT I

Camera | 1-Packets (pkt/min) | 2-Packets (pkt/min) | Total Data (kB/min) | Total Flow (kB/min)
1      | 21.2                | 16.1                | 1.31                | 1.31
2      | 4.9                 | 0                   | 0.05                | 3.06
3      | 31.8                | 22.6                | 1.65                | 1.65

TABLE III
SUMMARY OF DATA TRANSMISSION FOR EXPERIMENT II

Camera | 1-Packets (pkt/min) | 2-Packets (pkt/min) | Total Data (kB/min) | Total Flow (kB/min)
1      | 7.1                 | 2.1                 | 0.29                | 0.29
2      | 5.4                 | 3.8                 | 0.28                | 2.97
3      | 3.8                 | 2.9                 | 0.35                | 3.05
4      | 1.9                 | 0.1                 | 0.03                | 2.71
5      | 2.8                 | 0.9                 | 0.13                | 2.83
6      | 6.1                 | 3.1                 | 0.88                | 3.56
7      | 6.9                 | 3.3                 | 0.47                | 3.16
8      | 3.1                 | 1.2                 | 0.28                | 0.28
Fig. 11. Layout of Experiment II (top-right) and corresponding camera views with decomposed image domains (counterclockwise from top-left to bottom-right).

Fig. 12. CN-Complex found for Experiment II using a threshold value of τ = 0.5 (left). The persistence diagrams obtained from the experiment (right). Note that a single connected component and a single hole are the most persistent features.

VII. WEAK TRACKING

In this section, we illustrate the utility of the CN-Complex by describing how to perform tracking on the representation and how to use the produced data to detect homotopic paths. To begin, we make several observations. First, observe that a single agent's location is mapped to a simplex in the CN-Complex by determining for which cameras the agent is visible. Second, given fast enough sampling, an agent in simplex s can only move to a face of s or to a simplex for which s is a face. This observation is captured by the tracking graph introduced in Section III-A. Figure 13 illustrates this mapping from simplicial complex to tracking graph using the example camera network considered in the background section. In the figure, we observe a pair of paths (left), Γ1 and Γ2, and their corresponding projections onto the tracking graph (right), γ1 and γ2, respectively.

Fig. 13. Coverage of a camera network (left), its corresponding tracking graph (right), and paths for two agents moving in the environment.

Detecting the location of agents in the tracking graph helps localize agents in the physical environment and allows for a potentially more efficient deployment of valuable resources. The objective of this section is to characterize the motion of agents in the camera network coverage by identifying their occupancy in the tracking graph. We refer to this process as weak tracking. A path of an agent in the CN-Complex is defined as the ordered list of simplices that it occupies over time. Since we employ no identification of agents, weak tracking results in some ambiguity when agents cross paths.

A. Tracking Multiple Targets
Our goal, in this subsection, is to specify the dynamics of moving agents in the environment via their dynamics on the tracking graph. Throughout this subsection, we assume that cameras are mounted high enough to identify the number of distinct agents in their field of view; however, the identification of the agents is not assumed.

Let x(n) be the nodal count vector containing the number of agents at the different states of the tracking graph, where n is the frame number. The transitions of agents between regions in the tracking graph are captured by a vector, α(n), which specifies the number of transitions along a particular edge of the graph. Note that its sign specifies the direction of a transition. Since the tracking graph is a directed graph, we can specify an oriented incidence matrix, L, which satisfies the relation

x(n+1) = x(n) + Lα(n).   (12)

However, this formulation is unable to capture the fact that the transitions leaving a node can never exceed the number of agents at that node at a given time. In order to quantify this property, we consider

Lα(n) = L((α(n))+ − (α(n))−) = Af(n),   (13)

where (v)+ for a vector or matrix v returns an element of the same dimension with entries equal to v except all negative entries are replaced by 0, (v)− := (−v)+, A := [L | (−L)],
and f(n) := [((α(n))+)⊤ | ((α(n))−)⊤]⊤. We count the number of transitions out of a particular node by computing

(L)− (α(n))+ + (L)+ (α(n))− = Bf(n),   (14)

where B := [(L)− | (L)+].

Each camera can only count the number of agents in its own coverage and cannot identify which agents are in the coverage of another camera. The number of agents counted by camera k, denoted yk(n), is equal to

yk(n) = Σ_{s∈Λ} xs(n),  where Λ := {s ∈ T(Σ) | k ∈ s}.   (15)

The sensor count vector, y(n), is then defined by its projections, yk(n):

y(n) = Cx(n),   (16)

where Ck,s = 1 if k ∈ s and 0 otherwise. Finally, we must acknowledge that agents can enter or exit the coverage of the network through different regions in the coverage. We capture this requirement via a vector z(n) that is added to x(n) at every frame.

Given a tracking graph, we can define a corresponding incidence matrix, L, and a projection matrix, C, and if we know the count of agents at frames n and n + 1, then the transitions between nodes must satisfy

x(n+1) = x(n) + z(n) + Af(n),
y(n)   = Cx(n),
Bf(n)  ≤ x(n),
f(n)   ≥ 0.   (17)
With these equations for the counting dynamics of agents in the graph and knowledge of the sensor counts {y (n) }, we can attempt to solve for the number of agents in the environment and their corresponding tracks. Hence, we consider these sub problems: 1) Counting Agents: Finding the number of agents in the coverage is a common surveillance problem. This can be addressed by either considering observations at a single frame or observations over multiple frames. In this scenario, we assume no boundary transitions. This problem can be posed as a question of feasibility under linear constraints by employing P (0) Equation (17) and xi = Na , where Na is the number of agents in question. 2) Recovering Tracks: Finding the internal transitions of the agents which define paths in the tracking graph allows us to localize targets as they move along the physical space. This requires knowledge of the number of agents and when they enter or exit the coverage. If in addition, we know probabilities of transitions and path probabilities, we can calculate an expected path in the environment when observations are unable to identify a unique path. As an example, consider the paths displayed in Figure 13. The physical and graph paths are displayed in the left and right plots of Figure 13, respectively. An occupancy list for the agents over time for the tracking graph is included in Table IV (second column). From the counts {y (n) }N n=1 (see third column in Table IV) and knowledge of the CN -complex, we can formulate a simple optimization problem in order to
TABLE IV
LIST OF NODES OCCUPIED BY AGENTS OVER TIME AND CORRESPONDING COUNTS

Frame     Actual              Sensor Counts y^(n)         Recovered
n = 1     {[1], [5]}          (1, 0, 0, 0, 1, 0, 0)⊤      {[1], [5]}
n = 2     {[1 3], [5]}        (1, 0, 1, 0, 1, 0, 0)⊤      {[1 3], [5]}
n = 3     {[1 3], [3 5]}      (1, 0, 2, 0, 1, 0, 0)⊤      {[1 3], [3 5]}
n = 4     {[3], [3 5]}        (0, 0, 2, 0, 1, 0, 0)⊤      {[3], [3 5]}
n = 5     {[3], [3]}          (0, 0, 2, 0, 0, 0, 0)⊤      {[3], [3]}
n = 6     {[3 4], [3]}        (0, 0, 2, 1, 0, 0, 0)⊤      {[3 4], [3]}
n = 7     {[3 4], [3 4]}      (0, 0, 2, 2, 0, 0, 0)⊤      {[3 4], [3 4]}
n = 8     {[4], [3 4]}        (0, 0, 1, 2, 0, 0, 0)⊤      {[4], [3 4]}
n = 9     {[4], [4]}          (0, 0, 0, 2, 0, 0, 0)⊤      {[4], [4]}
n = 10    {[4 6], [4]}        (0, 0, 0, 2, 0, 1, 0)⊤      {[4 6], [4]}
n = 11    {[4 6], [4 7]}      (0, 0, 0, 2, 0, 1, 1)⊤      {[4 6 7], [4]}
n = 12    {[4 6 7], [4 7]}    (0, 0, 0, 2, 0, 1, 2)⊤      {[4 6 7], [4 7]}
If we know the initial configuration $x^{(0)}$ and set $z^{(n)} \equiv 0$, then we can pose the following binary programming problem:
$$\min_{f \in \Omega} \mathbf{1}^\top f \quad \text{s.t.} \quad \left\{\begin{aligned} A f &= y - C x^{(0)}, \\ B f &\le I x^{(0)}, \\ -f &\le 0, \end{aligned}\right. \qquad (18)$$
where $\Omega$ is the space of binary vectors, $f$ and $y$ are the vectors formed by stacking $\{f^{(n)}\}$ and $\{y^{(n)}\}$, respectively, and the matrices $A$, $B$, $C$ and $I$ are obtained from the dynamics in Equation (17). In solving this problem, we look for the sparsest vector of transitions that explains the observed counts. The solution gives the recovered node occupancies shown in Table IV (last column). Note that most of the occupancies agree with the actual paths followed by the agents; however, there is a single discrepancy at frame $n = 11$, which is still consistent with the observed counts. Hence, the recovered and actual paths are indistinguishable when only the count vectors are considered.

The tracking process on the CN-Complex, as described, does not distinguish between agents locally, i.e. there is no way to separate crossing paths. Employing a local appearance model could resolve these ambiguities. In this context, it is also possible to build a distributed probabilistic process to perform tracking on a graph via the framework presented by Oh et al. [25].
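As a sketch of how Equation (18) might be solved with off-the-shelf tooling, the function below stacks the per-frame dynamics into one block system, eliminating $x^{(n)}$ via $x^{(n+1)} = x^{(0)} + A\sum_{m \le n} f^{(m)}$, and hands the result to SciPy's mixed-integer solver. The stacking convention and all names are ours, SciPy $\ge$ 1.9 is assumed for the `milp` interface, and this is not the authors' implementation:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def recover_tracks(A, B, C, x0, ys):
    """Sketch of the binary program in Eq. (18), assuming z^(n) = 0.
    ys[n] holds the sensor counts observed after the (n+1)-th batch of
    transitions f^(n); the nodal counts are eliminated using
    x^(n+1) = x0 + A (f^(0) + ... + f^(n))."""
    n_nodes, n_f = A.shape
    N = len(ys)
    Aeq = np.zeros((C.shape[0] * N, n_f * N))  # stacked count constraints
    Aub = np.zeros((n_nodes * N, n_f * N))     # stacked B f^(n) <= x^(n)
    beq = np.concatenate([y - C @ x0 for y in ys]).astype(float)
    for n in range(N):
        rows_eq = slice(n * C.shape[0], (n + 1) * C.shape[0])
        rows_ub = slice(n * n_nodes, (n + 1) * n_nodes)
        Aub[rows_ub, n * n_f:(n + 1) * n_f] = B
        for m in range(n + 1):                 # f^(m), m <= n, affects y at step n
            Aeq[rows_eq, m * n_f:(m + 1) * n_f] = C @ A
        for m in range(n):                     # move A terms of x^(n) to the left side
            Aub[rows_ub, m * n_f:(m + 1) * n_f] = -A
    res = milp(c=np.ones(n_f * N),             # minimize 1^T f: sparsest transitions
               constraints=[LinearConstraint(Aeq, beq, beq),
                            LinearConstraint(Aub, -np.inf,
                                             np.tile(x0.astype(float), N))],
               integrality=np.ones(n_f * N),   # integral variables with 0/1 bounds
               bounds=Bounds(0, 1))
    return res.x.reshape(N, n_f).round().astype(int) if res.success else None
```

On data such as that of Table IV, ambiguities like the one at frame $n = 11$ appear as multiple minimizers of equal cost, so the returned occupancies are only guaranteed to be consistent with the counts.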
B. Identifying Homotopic Tracks

Once tracks have been recovered, it is useful to identify paths with common start and end positions that are related to one another by deformation. This allows for clustering of paths when performing behavioral analysis of agents; for example, it can be used to identify (up to small deformations) the most common path followed by an individual when going from their office to a coffee shop.

Two paths are homotopic if they are continuous deformations of one another. First, let us consider paths in the CN-Complex formed by 1-chains. In particular, we are interested in identifying when two paths, $\sigma_i$ and $\sigma_j$, with common start and end positions in the CN-Complex are homotopic.
Fig. 14. Plot of paths in the physical space joining points p and q (left) and their corresponding projection onto the tracking graph for the setup considered in the background section.
Note that $\sigma_{ij} := \sigma_i - \sigma_j$ forms a loop. If the two paths are homotopic, then $\sigma_{ij} \in \operatorname{im} \partial_2$ as a result of the comments following Definition 5. Hence, if the rank of the matrix $[\partial_2 \mid \sigma_{ij}]$ does not equal the rank of $\partial_2$, then $\sigma_i$ and $\sigma_j$ are not homotopic.

When tracking agents using the tracking graph, paths are given by ordered lists of occupied simplices (not all of which need be 1-simplices). In order to determine whether two such paths $\gamma_i$ and $\gamma_j$ with common start and end locations are homotopic, we can form the loop $\gamma_{ij}$ given by an ordered list of $N$ simplices, $\{s_i\}_{i=1}^{N}$. We can generate an ordered list of vertices, $\{v_i\}_{i=1}^{N}$, by selecting the vertex with the smallest index from each simplex (i.e. if $s_i = [v_{n_1} v_{n_2} \cdots v_{n_k}]$ then $v_r$ is selected, where $r = \min\{n_1, \ldots, n_k\}$). After removing consecutively repeated vertices, a 1-chain, $\sigma_{ij}$, can be built by taking consecutive pairs of vertices as 1-simplices. Then, the rank check described above can be performed.

In order to illustrate this process, consider the CN-Complex from Figure 14 and the paths $\gamma_1$ and $\gamma_2$. Observe that $\gamma_{12}$ is given by $\{[5\,6], [5], [1\,5], [1\,3\,5], [1\,3], [1], [1\,2], [2], [2\,3], [3], [3\,4], [4], [4\,7], [4\,6\,7], [4\,6], [6], [5\,6]\}$. From this path, the list of vertices $\{5, 1, 2, 3, 4, 6, 5\}$ is extracted and the 1-chain $\sigma_{12} = [5\,1] + [1\,2] + [2\,3] + [3\,4] + [4\,6] + [6\,5]$ is constructed. Note that $\operatorname{rank}([\partial_2 \mid \sigma_{12}]) = 4$ while $\operatorname{rank}(\partial_2) = 3$, allowing us to conclude that the two paths are not homotopic.
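The test above amounts to a few lines of linear algebra. The following sketch is our own illustration (ranks are taken over the rationals rather than a finite field, and the tiny complex, one filled triangle plus two extra edges, is not one of the paper's examples); it converts a simplex path into a 1-chain exactly as described and then performs the rank comparison:

```python
import numpy as np

def path_to_one_chain(simplex_path, edge_index):
    """Convert an ordered list of occupied simplices (vertex tuples) into a
    1-chain: pick the smallest vertex of each simplex, drop consecutive
    repeats, and pair consecutive vertices as oriented 1-simplices."""
    verts = [min(s) for s in simplex_path]
    verts = [v for i, v in enumerate(verts) if i == 0 or v != verts[i - 1]]
    chain = np.zeros(len(edge_index))
    for u, v in zip(verts, verts[1:]):
        e, sign = ((u, v), 1) if (u, v) in edge_index else ((v, u), -1)
        chain[edge_index[e]] += sign
    return chain

def possibly_homotopic(boundary2, sigma_i, sigma_j):
    """Necessary condition from the text: if appending sigma_ij = sigma_i -
    sigma_j as an extra column increases the rank of the boundary matrix d2,
    the loop does not bound and the two paths cannot be homotopic."""
    sigma_ij = sigma_i - sigma_j
    return (np.linalg.matrix_rank(np.column_stack([boundary2, sigma_ij]))
            == np.linalg.matrix_rank(boundary2))

# Hypothetical complex: vertices 1..4, one filled 2-simplex [1 2 3], and a
# hole enclosed by the edges (1,3), (3,4), (1,4).
edges = {(1, 2): 0, (2, 3): 1, (1, 3): 2, (3, 4): 3, (1, 4): 4}
d2 = np.array([[1.], [1.], [-1.], [0.], [0.]])  # boundary of [1 2 3]
via_2 = path_to_one_chain([(1,), (1, 2), (2,), (2, 3), (3,)], edges)
via_4 = path_to_one_chain([(1,), (1, 4), (4,), (3, 4), (3,)], edges)
direct = path_to_one_chain([(1,), (1, 3), (3,)], edges)
print(possibly_homotopic(d2, via_2, direct))  # True: difference bounds the triangle
print(possibly_homotopic(d2, via_4, direct))  # False: the loop encircles the hole
```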
VIII. CONCLUSION

In this article, a distributed algorithm for the robust construction of a simplicial representation of a camera network's coverage, the CN-Complex, was presented. The utility of the representation in fully characterizing the topological structure of the network's coverage and in tracking agents across the network was demonstrated. The construction proceeds by first decomposing the fields of view of the camera nodes locally and then determining overlap between pairs of nodes by transmitting sparse occlusion events across the network. The strength of the representation is its ease of construction. By creating a representation that accurately captures topological information about the network's coverage without making undue assumptions about the location of cameras or objects or employing agent identification, we bridge the gap between vision graphs, which provide limited information about overlaps in network coverage, and full metric representations, which falter in the ad-hoc distributed camera network context. This work provides a framework for integrating local image observations within a global algebraic framework and opens the door to new avenues of research in this nascent field.

ACKNOWLEDGMENT

This research was partially supported by ARO MURI grant W911NF-06-1-0076, AFOSR grant FA9550-06-1-0267, DARPA DSO HR0011-07-1-0002 via the SToMP project, and the National Science Foundation under Grant #0937060 to the Computing Research Association for the CIFellows Project.

REFERENCES

[1] Z. Cheng, D. Devarajan, and R. Radke, "Determining Vision Graphs for Distributed Camera Networks Using Feature Digests," EURASIP Journal on Applied Signal Processing, vol. 2007, no. 1, 2007.
[2] D. Marinakis and G. Dudek, "Topology Inference for a Vision-Based Sensor Network," in Canadian Conf. on Computer and Robot Vision, 2005.
[3] D. Marinakis, P. Giguere, and G. Dudek, "Learning Network Topology from Simple Sensor Data," in Canadian Conf. on Artificial Intelligence, 2007.
[4] V. de Silva and R. Ghrist, "Coordinate-Free Coverage in Sensor Networks with Controlled Boundaries via Homology," The Intl. Journal of Robotics Research, vol. 25, pp. 1205–1221, 2006.
[5] A. Muhammad and A. Jadbabaie, "Decentralized Computation of Homology Groups in Networks by Gossip," in American Control Conf., 2007.
[6] E. Lobaton, A. Parvez, and S. Sastry, "Algebraic Approach to Recovering Topological Information in Distributed Camera Networks," in Intl. Conf. on Information Processing in Sensor Networks, 2009.
[7] E. Lobaton, R. Vasudevan, S. Sastry, and R. Bajcsy, "Robust Construction of the Camera Network Complex for Topology Recovery," in ACM/IEEE Intl. Conf. on Distributed Smart Cameras, 2009.
[8] D. Makris, T. Ellis, and J. Black, "Bridging the Gaps Between Cameras," in IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, 2004.
[9] A. van den Hengel, A. Dick, and R. Hill, "Activity Topology Estimation for Large Networks of Cameras," in IEEE Intl. Conf. on Video and Signal Based Surveillance, 2006.
[10] H. Detmold, A. van den Hengel, A. Dick, A. Cichowski, R. Hill, E. Kocadag, Y. Yarom, K. Falkner, and D. Munro, "Estimating Camera Overlap in Large and Growing Networks," in ACM/IEEE Intl. Conf. on Distributed Smart Cameras, 2008.
[11] R. Hill, A. van den Hengel, A. Dick, A. Cichowski, and H. Detmold, "Empirical Evaluation of the Exclusion Approach to Estimating Camera Overlap," in ACM/IEEE Intl. Conf. on Distributed Smart Cameras, 2008.
[12] C. Stauffer and K. Tieu, "Automated Multi-Camera Planar Tracking Correspondence Modeling," in IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, 2003.
[13] L. L. Presti and M. L. Cascia, "Real-Time Estimation of Geometrical Transformation Between Views in Distributed Smart-Camera Systems," in ACM/IEEE Intl. Conf. on Distributed Smart Cameras, 2008.
[14] M. Meingast, M. Kushwaha, S. Oh, X. Koutsoukos, A. Ledeczi, and S. Sastry, "Fusion-Based Localization for a Heterogeneous Camera Network," in ACM/IEEE Intl. Conf. on Distributed Smart Cameras, 2008.
[15] A. Rahimi, B. Dunagan, and T. Darrell, "Simultaneous Calibration and Tracking with a Network of Non-Overlapping Sensors," in IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, vol. 1, 2004, pp. I-187–I-194.
[16] S. Funiak, C. Guestrin, M. Paskin, and R. Sukthankar, "Distributed Localization of Networked Cameras," in Intl. Conf. on Information Processing in Sensor Networks, 2006.
[17] A. Hatcher, Algebraic Topology. Cambridge University Press, 2002.
[18] J. Munkres, Topology, 2nd ed. Prentice Hall, 2000.
[19] T. Kaczynski, K. Mischaikow, and M. Mrozek, Computational Homology. Springer, 2003.
[20] R. Bott and L. Tu, Differential Forms in Algebraic Topology. Springer, 1995.
[21] B. Jackson, R. Bodor, and N. Papanikolopoulos, "Learning Static Occlusions from Interactions with Moving Figures," in IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems, 2004.
[22] G. Carlsson, A. Zomorodian, A. Collins, and L. Guibas, "Persistence Barcodes for Shapes," in Eurographics/ACM SIGGRAPH Symposium on Geometry Processing, 2004, pp. 124–135.
[23] A. Zomorodian and G. Carlsson, "Computing Persistent Homology," Discrete and Computational Geometry, vol. 33, no. 2, pp. 249–274, 2005.
[24] P. Chen et al., "CITRIC: A Low-Bandwidth Wireless Camera Network Platform," in ACM/IEEE Intl. Conf. on Distributed Smart Cameras, 2008.
[25] S. Oh and S. Sastry, "Tracking on a Graph," in Intl. Symposium on Information Processing in Sensor Networks, 2005, pp. 195–202.
Edgar Lobaton (Member, IEEE) received B.S. degrees in Mathematics and Electrical Engineering from Seattle University in 2004, and the Ph.D. degree in Electrical Engineering and Computer Sciences from the University of California, Berkeley, in 2009. He is currently a Post-Doctoral Researcher in the Department of Computer Science at the University of North Carolina at Chapel Hill. He was previously engaged in research at Alcatel-Lucent Bell Labs in 2005 and 2009. His research interests include sensor networks, computer vision, tele-immersion, and motion planning. He is the recipient of the 2009 Computing Innovation Fellows post-doctoral fellowship award, the 2004 Bell Labs Graduate Research Fellowship, and the 2003 Barry M. Goldwater Scholarship.
Ramanarayan Vasudevan (Student Member, IEEE) received a B.S. degree in Electrical Engineering and Computer Sciences and an Honors Degree in Physics from the University of California, Berkeley, in 2006, and an M.S. degree in Electrical Engineering from the University of California, Berkeley, in 2009. His research interests include sensor networks, computer vision, hybrid systems, and optimal control. He is the recipient of the 2002 Regents' and Chancellor's Scholarship.
Ruzena Bajcsy (Fellow, IEEE) received her Master's and Ph.D. degrees in Electrical Engineering from Slovak Technical University in 1957 and 1967, respectively, and a Ph.D. in Computer Science from Stanford University in 1972. She is a Professor of Electrical Engineering and Computer Sciences at the University of California, Berkeley, and Director Emeritus of the Center for Information Technology Research in the Interest of Society (CITRIS). Prior to joining Berkeley, Dr. Bajcsy headed the Computer and Information Science and Engineering Directorate at the National Science Foundation. As a former faculty member of the University of Pennsylvania, she also served as the Director of the University's General Robotics and Active Sensory Perception Laboratory, which she founded in 1978, and chaired the Computer and Information Science Department from 1985 to 1990. Dr. Bajcsy is a member of the National Academy of Engineering and the National Academy of Sciences Institute of Medicine, as well as a Fellow of the Association for Computing Machinery (ACM), the Institute of Electrical and Electronics Engineers, and the American Association for Artificial Intelligence. In 2001, she received the ACM/Association for the Advancement of Artificial Intelligence Allen Newell Award, and was named one of the 50 most important women in science in the November 2002 issue of Discover Magazine. In 2008, she received the Benjamin Franklin Medal in Computer and Cognitive Science, and in 2010, the IEEE Robotics and Automation Pioneer Award.
S. Shankar Sastry (Fellow, IEEE) received his B.Tech. from the Indian Institute of Technology, Bombay, in 1977, and an M.S. in EECS, an M.A. in Mathematics, and a Ph.D. in EECS from UC Berkeley in 1979, 1980, and 1981, respectively. He is currently Dean of the College of Engineering at the University of California, Berkeley. He was formerly the Director of CITRIS (Center for Information Technology Research in the Interest of Society) and the Banatao Institute at CITRIS Berkeley. He served as Chair of the EECS Department from January 2001 through June 2004. In 2000, he served as Director of the Information Technology Office at DARPA. From 1996 to 1999, he was the Director of the Electronics Research Laboratory at Berkeley, an organized research unit on the Berkeley campus conducting research in computer sciences and all aspects of electrical engineering. He is the NEC Distinguished Professor of Electrical Engineering and Computer Sciences and holds faculty appointments in the Departments of Bioengineering, EECS, and Mechanical Engineering. Prior to joining the EECS faculty in 1983, he was a professor at MIT.