
OBJECT TRACKING VIA A COLLABORATIVE CAMERA NETWORK

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF ELECTRICAL ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Ali Ozer Ercan June 2007

© Copyright by Ali Ozer Ercan 2007

All Rights Reserved


I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

(Abbas El Gamal) Principal Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

(Leonidas J. Guibas)

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

(Jack Wenstrand)

Approved for the University Committee on Graduate Studies.


Abstract

There is a growing need to develop low-cost wireless networks of cameras with automated detection capabilities. The main challenge in building such networks is the high data rate of video cameras. On the one hand, sending all the data, even after performing standard compression, is very costly in transmission energy; on the other, performing sophisticated vision processing at each node to substantially reduce the transmission rate requires high processing energy. To address these challenges, a task-driven approach has been proposed, in which simple local processing is performed at each node to extract the essential information needed for the network to collaboratively perform the task. This dissertation presents such a task-driven approach for tracking a single object (e.g., a suspect) in a structured environment (e.g., an airport or a mall) in the presence of static and moving occluders using a wireless camera network. To conserve communication bandwidth and energy, each camera first performs simple local processing to reduce each frame to a scan line. This information is then sent to a cluster head to track a point object. We assume the locations of the static occluders to be known, but only prior statistics on the positions of the moving occluders are available. A noisy perspective camera measurement model is presented, where occlusions are captured through an occlusion indicator function. An auxiliary particle filter that incorporates the occluder information is used to track the object. Using simulations, we investigate (i) the dependency of the tracker performance on the accuracy of the moving occluder priors, (ii) the tradeoff between the number of cameras and the occluder prior accuracy required to achieve a prescribed tracker performance, and (iii) the importance of occluder priors to the tracker performance as the number of occluders increases. We generally find that computing moving occluder priors may not be worthwhile, unless they can be obtained cheaply and to a reasonable accuracy.

In addition to tracking, we also consider two relevant topics, namely camera node selection and placement. Communication and computation cost can be further reduced by dynamically selecting the best subset of camera nodes to collaboratively perform the task. Such selection allows for efficient sensing with little performance degradation relative to using all the cameras, and makes it possible to scale the network to a large number of nodes. The minimum mean square error of the best linear estimate of object position based on camera measurements is used as a metric for selection. A greedy selection heuristic is proposed to optimize this metric with respect to the selected subset, and it is shown that this heuristic performs close to the optimum. The same metric that is used for selection is also used for optimal placement in a simple setting. An analytical formula for the metric is presented and optimized for best camera placement, which is followed by a discussion on how to avoid static occluders.


Acknowledgments

I am grateful to many people who made this work possible. First, I would like to thank my adviser, Professor Abbas El Gamal. He always enlightened my path with his patient guidance, encouragement and careful assessment. His broad knowledge in different areas inspired me and played an influential role in my research. I also had the privilege of taking his classes, which played an important role in my dissertation work. I would also like to thank the other members of my reading committee. I am grateful to Professor Leonidas Guibas for helping me look at my research problems from different points of view and for always making insightful suggestions in our discussions. I want to thank Professor Jack Wenstrand for always finding time for our meetings despite his busy schedule, and for emphasizing the practical aspects of my work. I find myself very lucky to have had access to these approaches and ideas from different worlds. Without them, this multidisciplinary work would not have been possible. I also would like to thank Professor Olav Solgaard for chairing my oral defense committee. I am thankful to Professor John Gill for his unparalleled help in setting up the experimental lab. He was always there and very patient with my questions and problems. I want to thank Professor Brian Wandell for fruitful discussions and help in the high speed camera work during my early years at Stanford. I also would like to thank Professor Balaji Prabhakar, Professor Hamid Aghajan, Professor Persi Diaconis and Professor Michael Godfrey for technical discussions. I would like to thank my other collaborators, Dr. Danny Yang, Dr. Jaewon Shin, Dr. Feng Xiao, Dr. Xinqiao Liu and Dr. Sukhwan Lim. It was a privilege working with them and I also learned a lot from them. I also want to thank my other past and present group members, Dr. Ting Chen, Dr. Helmy Eltoukhy, Keith Fife, Kunal Ghosh, Hossein Kakavand, Dr. Sam Kavusi, Professor Stuart Kleinfelder, Dr. Olivier Lévêque, Mingjie Lin, Dr. James Mammen, Professor Khaled Salama and Dr. Sina Zahedi for their company and helpful discussions. It was a pleasure to share the same office with them. I would also like to thank Kyle Heath for his help in the experimental lab, and Kelly Yilmaz and Denise Murphy for their support and patience with me.

I had many great friends during my graduate studies, and I am thankful to all of them. Without their friendship, help and support, this work would have been impossible. There are too many names to list here, but some of them include Türev Acar, Özgür Arslan, Mehpare Aşkın, Dr. Ulrich Barnhöfer, Professor Barış Bayram, Tolga Çukur, Professor Aykutlu Dâna, Ayşegül Daştan, Professor Utkan Demirci, Onur Fidaner, Professor Cengiz Gülek, Pınar Hacıköylü, Başak Kalkancı, Özgen Karaer, Professor Murat Kaya, Onur Kılıç, Aykut Koç, Dr. Ali Kemal Okyay, Dr. Ömer Oralkan, Emre Oto, Professor Nesrin Özalp, Professor Aydoğan Özcan, Dr. Veysi Erkcan Özcan, Dr. Nevran Özgüven, Fatih Sarıoğlu, Professor Afşin Sarıtaş, Emine Ülkü Sarıtaş, Uygar Sümbül, Dr. Özgür Şahin, Dr. Çağan Şekercioğlu, Tuğsan Topçuoğlu, Mehmet Burak Tuncer, Demet Uluşen and Dr. Erhan Yenilmez.

I wish to thank the sponsors of the Programmable Digital Camera (PDC) Project, the Stanford Networking Research Center (SNRC), the Media-X Consortium and the Defense Advanced Research Projects Agency (DARPA) for funding my graduate studies.

Last but not least, I would like to express my greatest gratitude to my parents and my sister. I would not be able to achieve anything without their never-ending love and support. I feel very lucky and privileged to have them. I would also like to thank my brother-in-law, Professor Hitay Özbay, for his continuous support and guidance. This dissertation is dedicated to my parents, sister and brother-in-law.


Contents

Abstract

Acknowledgments

1 Introduction
  1.1 Camera Networks
  1.2 Object Tracking
  1.3 Camera Node Selection
  1.4 Camera Placement

2 Object Tracking
  2.1 Previous Work
  2.2 Setup, Models, and Assumptions
    2.2.1 Camera Measurement Model
  2.3 Tracker
    2.3.1 Importance Density Function
    2.3.2 Likelihood
    2.3.3 Adding Static Occluders and Limited Field of View
    2.3.4 Obtaining Occluder Priors
  2.4 Simulation Results
  2.5 Experimental Results
  2.6 Summary

3 Camera Node Selection
  3.1 Previous Work
  3.2 Setup, Models and Assumptions
    3.2.1 Camera Measurement Model
    3.2.2 Computing MSE(S)
  3.3 Selection
    3.3.1 Simulation Results
    3.3.2 Experimental Results
  3.4 Summary

4 Camera Placement
  4.1 Previous Work
  4.2 Setup, Model and Assumptions
  4.3 Optimal Camera Placement
    4.3.1 Symmetric Case
    4.3.2 General Case
  4.4 Adding Static Occluders
  4.5 Summary

5 Conclusion
  5.1 Summary
  5.2 Suggestions for Future Work
    5.2.1 Theoretical Framework for the Analysis of the Tradeoffs
    5.2.2 Use of Visual Hull for Occluder Priors
    5.2.3 Effect of Camera Resolution
    5.2.4 Combining Tracking and Selection

A List of Selected Symbols

B Derivation of Camera Noise Variance
  B.1 Perspective Camera Model
  B.2 Weak Perspective Camera Model

C Derivation of Equation 2.5

D Derivation of the Localization MSE
  D.1 Linear Model

Bibliography

List of Figures

1.1 A central monitoring station at a casino.
1.2 Cyclops platform attached to a Mica2 mote.
1.3 Collaborative task-driven approach.
1.4 Tracking a suspect in a structured environment in the presence of occluders using a camera network.
2.1 Illustration of the setup used for object tracking.
2.2 Local processing at each camera node.
2.3 The camera measurement model.
2.4 The auxiliary sampling importance resampling (ASIR) algorithm.
2.5 Computing q_{i,j}^{mv}(x).
2.6 Geometric partitioning to add static occluders and limited FOV.
2.7 Illustration of the visual hull.
2.8 The setup used in simulations.
2.9 Average tracker RMSE versus the number of cameras for M = 40 and 1 static occluder.
2.10 Dependency of the tracker average RMSE on the accuracy of the occluder prior for N = 4, M = 40 and no static occluders.
2.11 Tradeoff between the number of cameras and moving occluder prior accuracy for target tracker average RMSE = 3 units, for M = 40 and no static occluders.
2.12 Tracker average RMSE versus the number of moving occluders for the two extreme cases RMSE_occ = 0 and RMSE_occ = RMSE_max.
2.13 Average CPU time for computing the likelihoods relative to that for the case of 2 cameras and no occluder prior, i.e., RMSE_occ = RMSE_max.
2.14 Experimental setup. (a) View of lab (cameras are circled). (b) Relative locations of cameras and virtual static occluder.
2.15 Experimental results. Average tracker RMSE versus the number of cameras for M = 20 and 1 static occluder.
3.1 Illustration of the indicator function of visible points to camera i.
3.2 The greedy camera node selection algorithm.
3.3 Simulation results: localization performance for different selection heuristics.
3.4 An example camera selection for k = 3.
3.5 Experimental setup.
3.6 Experimental results for different camera selection algorithms.
4.1 The setup used for the placement problem.
4.2 Example optimal camera placements for the symmetric case.
4.3 Distributed placement using clusters.
4.4 An optimal solution to the general placement problem.
4.5 Demonstration of solving the camera placement problem by inverse kinematics.
4.6 Two optimal placements for a given object prior.
4.7 Dealing with static occluders.
B.1 Illustration of sources of camera measurement noise.
C.1 Illustration of coordinate systems and variables used in the derivation of Eq. 2.5.
C.2 Monte-Carlo simulations to test the accuracy of Eq. 2.5.

Chapter 1

Introduction

1.1 Camera Networks

The most prevalent applications for today's multi-camera installations are security and surveillance. These systems consist of expensive, analog, hard-wired cameras and are widely deployed in airports, banks, casinos, supermarkets, etc. The video streams captured by the cameras are relayed to a central monitoring station where they are observed by human operators (see Fig. 1.1). It is not hard to see that these systems cannot scale to many camera nodes, because it is difficult and expensive to set up the network due to the high cost of the camera nodes, wiring and installation. More importantly, even when such networks have been realized [1], it is extremely hard to interpret and search the data that is collected, which makes the chance of catching any security breach very slim. According to quotes from [2, 3], after 12 minutes of continuous viewing of 2 or more sequencing monitors, an operator will miss up to 45% of all scene activity. After approximately 22 minutes, an operator will miss up to 95% of scene activity. This renders the current surveillance systems only useful for collecting evidence after a security breach has occurred, rather than actively preventing such breaches. These considerations make clear that there is a growing need to develop low cost, collaborative networks of cameras with automated detection capabilities. Some recent projects focus on such networks. Most famous is the VSAM project by Carnegie Mellon University and Sarnoff Corporation [4, 5]. The ADVISOR project sponsored by the EU's fifth framework program is another such example [6].


Figure 1.1: A central monitoring station at a casino.

Surveillance on the other hand is not the only application for future camera networks. Some other possible applications include, but are not limited to, the following: • Traffic Monitoring: Traditionally, magnetic loop detectors have been used to monitor traffic flow [7]. Recently, cameras have been used for this application [8]. Cameras

can yield additional information over the traditional loop detectors, such as vehicle counts and classification. Additionally, cameras are easier to install and cheaper than loop detectors [9]. These advantages make camera networks a promising alternative for traffic monitoring applications. • Environmental Monitoring: Holman et al. [10] apply networked sensors for monitoring environmentally sensitive beaches and nearshore coastal oceans. Several compa-

nies (e.g. [11–13]) sell network camera modules for environmental monitoring. • Smart Homes: Gu et al. [14] describe a context-aware infrastructure for applications that adapt to the dynamic environments of a smart-home. The Easy Living Project

from Microsoft [15] also presents a similar effort.


Figure 1.2: Cyclops platform attached to a Mica2 mote [22].

• Sports Broadcasts: During the broadcast of Super Bowl 2001, CBS Television Company used more than 30 pan-tilt-zoom (PTZ) cameras mounted at an elevation about 80 feet, to achieve a unique view of the action during the playbacks. The cameras were controlled and the video feed was computed in an automated fashion. The resulting images made the viewers feel as if they were flying through the scenes. CBS calls this experience “Eye Vision”. The technology was developed by Takeo Kanade and his team at Carnegie Mellon University [16]. The camera networks we envision and consider in this dissertation follow the sensor network paradigm [17–21]. These systems consist of many low cost nodes combining sensing, processing and communication. The nodes are wireless and easy to deploy. The system is scalable to many nodes and robust to failures. Several researchers started building nodes for camera networks. An example is the Cyclops platform [22]. It consists of an imager (Agilent ADCM-1700), a micro-controller, a complex programmable logic device, an external SRAM and flash memory. This platform can be attached to a Mica2 mote [23], which provides the processing and communication capabilities (See Fig 1.2). The overall system consumes low power and is suitable for the envisioned camera network implementation.


Although the required hardware is readily available, many challenges still remain unsolved. The main challenge in building such networks is the high data rate of video cameras. As usually the nodes have limited computation and communication capabilities, sending all the data, even after performing standard compression, is very costly in transmission energy. On the other hand, performing sophisticated vision processing at each node to substantially reduce transmission rate requires high processing energy, which is also problematic. Another problem that is unique to camera sensors is occlusion. Cameras can sense over long distances, which is an advantage over other sensing modalities. However, occlusions make the visibility discontinuous. A point which is visible may be farther away from the camera compared to a non-visible point due to an obstacle occluding the view. This phenomenon poses severe difficulties in video processing, computer vision and computer graphics [24–26]. To overcome these challenges, we adopt a task-driven approach. In this approach, the sensors are grouped into clusters, each of which may have a cluster head, which is basically a more powerful central processor compared to the camera nodes. First, the nodes independently process their sensed data at the local processing, which is tailored toward the needs of the task at hand. Only the essential information needed by the network to perform the task is extracted by autonomous and cheap operations. Then depending on the application, nodes can talk between each other to collaboratively process the data, or this could be done at the cluster head. Then refined data can be sent over to a higher level processor (or end user) for final processing (or decision making). These ideas are illustrated in Fig. 1.3. Collaborative task-driven approach has been widely adopted by the sensor network community. To name a few examples, Yang et al. [27] use simple cameras that collaborate to count people. The counts are obtained by using the “visual hull”. Maroti et al. [28] use acoustic sensors to triangulate the position of a sniper. The triangulation is performed at the cluster head. Another example is from UCLA, where data from acoustic sensors are used to detect, locate and classify birds from their calls [29]. This dissertation presents such a collaborative task-driven approach to track a single object in the presence of occlusions using a camera network. It is shown that using this


Figure 1.3: Collaborative task-driven approach.

approach, problems such as the requirement of low computation and communication costs and handling occlusions are surmountable. In addition, we consider two other topics related to our framework, namely camera node selection and placement. In the following sections, brief introductions to each of these topics are provided.

1.2 Object Tracking

In Chapter 2, we present a task-driven approach for tracking a single object (e.g., a suspect) in a structured environment (e.g., an airport or a mall) in the presence of static and moving occluders using a wireless camera network (see Fig. 1.4). Tracking is useful in many applications such as robotics, human-computer interaction, surveillance, health care (monitoring of patients) and biology (migration patterns of species). Specifically, we focus on tracking on the 2-D ground plane, since this is the most relevant case for many real-world applications.

Most previous work on tracking with multiple cameras has focused on tracking all the


objects and does not deal directly with static occluders, which are often present in structured environments (see the brief survey in Section 2.1). Tracking all the objects clearly provides a solution to our problem, but may be infeasible to implement in a wireless camera network due to its high computational cost. Instead, our approach is to track only the target object while treating all other objects as occluders. We assume complete knowledge of static occluder (e.g., partitions, large pieces of furniture) locations and some prior statistics on the positions of the moving occluders1 (e.g., people) which are updated in time. Simple local processing whereby each image is reduced to a horizontal scan line is performed at each camera node. If the camera sees the object, it provides a measurement of its position in the scan line to the cluster head, otherwise it reports that it cannot see the object. A noisy perspective camera measurement model is presented, where occlusions are captured through an occlusion indicator function. Given the camera measurements and the occluder position priors, an auxiliary particle filter is used at the cluster head to track the object. The occluder information is incorporated into the measurement likelihood, which is used in the weighting of the particles. Even if one wishes to track only one object treating other moving objects as occluders, a certain amount of information about the positions of the occluders may be needed to achieve high tracking accuracy. Since obtaining more accurate occluder priors would require expending more processing and/or communication energy, it is important to understand the tradeoff between the accuracy of the occluder information and that of tracking. Do we need any prior occluder information? If so, how much accuracy is sufficient? This important tradeoff is also investigated in Chapter 2. We develop a measure of the moving occluder prior accuracy and use simulations to explore the dependency of the tracker performance on this measure. We also explore the tradeoff between the number of cameras used, the number of occluders present, and the amount of occluder prior information needed to achieve a prescribed tracker performance. We generally find that: • Obtaining moving occluder prior information may not be worthwhile in practice, unless it can be obtained cheaply and to a reasonable accuracy.

¹ Note that the prior statistics or probability distributions of the moving occluder positions are referred to as "moving occluder priors," or simply "priors," throughout the text.


Figure 1.4: Tracking a suspect in a structured environment in the presence of occluders using a camera network.

• There is a tradeoff between the number of cameras used and the amount of occluder prior information needed. As more cameras are used, the accuracy of the prior information needed decreases. Having more cameras, however, means incurring higher communications and processing cost. So, in the design of a tracking system, one needs to compare the cost of deploying more cameras to that of obtaining more accurate occluder priors.

• The amount of prior occluder position information needed depends on the number of occluders present. When there are very few moving occluders, prior information does not help (because the object is not occluded most of the time). When there is a moderate number of occluders, prior information becomes more useful. However, when there are too many occluders, prior information becomes less useful (because the object becomes occluded most of the time).


1.3 Camera Node Selection

In Section 1.1, we mentioned that one of the main challenges in achieving the envisioned camera networks is the cost of processing and communicating the high volume of data collected by camera sensors, relative to the limited energy budget per node. We adopt a collaborative task-driven approach to overcome this challenge, and in Chapter 2 we assess this approach for the object tracking application. Node selection is another approach that is often adopted together with collaborative processing by the sensor network community to help alleviate the limited-energy problem [19, 30, 31]. In Chapter 3, we describe a node selection algorithm that is suitable for camera networks.

Communication and computation cost can be reduced by dynamically selecting the best subset of camera nodes to collaboratively perform the task. Only the selected nodes actively sense, process and send data, while the rest are in sleep mode. This duty cycling can substantially reduce the average power consumption per node. Moreover, measurements from different nodes may be highly correlated; this is especially likely for cameras, because multiple cameras can simultaneously observe the same scene. Therefore, a clever selection of a subset of nodes results in little performance degradation relative to using all the cameras. If there are enough cameras, dynamic selection can also help avoid occlusions in the scene. Another benefit of selection is that it makes scaling the network to a large number of nodes possible, as only a fraction of the cameras is used at any given instant in time.

In Chapter 3, we investigate the problem of camera node selection in order to minimize the localization error for a single object. Similar to tracking, we focus on 2-D object localization, i.e., location on the ground plane. The setup used for selection is very similar to the one used in tracking, except that we use a weak perspective camera model instead of a full perspective camera model. The minimum mean square error (MSE) of the best linear estimator of the object location is used as the selection utility metric. The selection problem then involves minimizing the MSE over subsets S, subject to |S| = k, where k is given. This

optimization is in general combinatorial and the complexity of brute-force search grows exponentially with k. This can be too costly in a wireless camera network setting. Instead, we use a greedy selection algorithm. We show that this simple heuristic performs close to


optimal and outperforms naive heuristics such as picking the closest subset of cameras or a uniformly spaced subset.
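For concreteness, the greedy heuristic can be sketched in a few lines of Python. This is a minimal illustration only: the function mse(), which evaluates the localization MSE metric of Chapter 3 for a candidate subset, is assumed to be supplied by the caller, and all names are placeholders rather than the implementation used in this dissertation.

    def greedy_select(mse, num_cameras, k):
        """Greedily pick k cameras that (approximately) minimize mse(subset).

        mse: callable mapping a frozenset of camera indices to the localization
             MSE metric (assumed to be provided by the caller).
        """
        selected = set()
        remaining = set(range(num_cameras))
        for _ in range(k):
            # Add the single camera whose inclusion yields the smallest MSE.
            best = min(remaining, key=lambda c: mse(frozenset(selected | {c})))
            selected.add(best)
            remaining.remove(best)
        return selected

Selecting k cameras this way costs on the order of kN metric evaluations, compared with the combinatorial cost of exhaustively searching all subsets of size k.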

1.4 Camera Placement

Finally, in Chapter 4, we turn to the camera placement problem in a simple setting. We show that the same metric that we use for selection (the object localization MSE) can be written in analytical form, and we optimize this metric with respect to the camera positions. For a circularly symmetric object prior distribution and sensors with equal noise, we show that a uniform sensor arrangement is optimal. Somewhat surprisingly, we establish that the general problem is equivalent to solving the inverse kinematics of a planar robotic arm, which can be solved efficiently using gradient descent techniques.

At this point, we would also like to draw the reader's attention to Appendix A, which contains a list of selected symbols used throughout the dissertation.

Chapter 2

Object Tracking

In this chapter, a sensor network approach to tracking a single object in the presence of static and moving occluders using a network of cameras is described.¹ To conserve communication bandwidth and energy, each camera first performs simple local processing to reduce each frame to a scan line. This information is then sent to a cluster head to track a point object. The locations of the static occluders are assumed to be known, but only prior statistics on the positions of the moving occluders are available. A noisy perspective camera measurement model is presented, where occlusions are captured through an occlusion indicator function. An auxiliary particle filter that incorporates the occluder information is used to track the object. Using simulations, several tradeoffs involving the tracker performance, accuracy of the occluder prior statistics, number of cameras used and number of moving occluders are investigated. Experimental results are provided.

The rest of this chapter is organized as follows. A brief survey of previous work on tracking using multiple cameras is presented in the next section. In Section 2.2, we describe the setup of our tracking problem and introduce the camera measurement model used. The tracker is described in Section 2.3. Simulation and experimental results are presented in Sections 2.4 and 2.5, respectively.

¹ The work in this chapter was first published in [32].


2.1 Previous Work Tracking has been a popular topic in sensor network research (e.g., [33–42]). Most of this work assumes low data rate range sensors. By comparison, our work assumes cameras, which are bearing sensors and have high data rate. The most related work to ours is [41] and [42]. Pahawalatta et al. [41] use a camera network to track and classify multiple objects on the ground plane. This is done by detecting feature points on the objects and using a Kalman Filter (KF) for tracking. By comparison, we use a particle filter (PF), which is more suitable for non-linear camera measurements and track only a single object treating others as occluders. Funiak et al. [42] use a Gaussian model obtained by reparametrizing the camera coordinates together with a KF. This method is fully distributed and requires less computational power than a PF. However, because the main goal of the system is camera calibration and not tracking, occlusions are not considered. Also, this work requires minimal overlap of the camera field of views, which is not a requirement for our work. Tracking has also been a very popular topic in computer vision (e.g., [43–48]). Most of the work, however, has focused on tracking objects in a single camera video sequence [43, 44]. Tracking using multiple camera video streams has also been considered [45, 46, 48]. Individual tracking is performed for each video stream and the objects appearing in the different streams are associated. More recently, there has been work on tracking multiple objects in world coordinates using multiple cameras [49–51]. Utsumi et al. [49] extract feature points on the objects and use a KF to track the objects. They perform camera selection to avoid occlusions. By comparison, in our work occlusions are treated as part of the tracker. Otsuka et al. [50] describe a double loop filter to track multiple objects, where objects can occlude each other. One of the loops is a PF that updates the states of the objects in time using the object dynamics, the likelihood of the measurements, and the occlusion hypotheses. The other loop is responsible for generating these hypotheses and testing them using the object states generated by the first loop, the measurements, and a number of geometric constraints. Although this method also performs a single object tracking in the presence of moving occluders, the hypothesis generation and testing is computationally prohibitive for a sensor network implementation. The work also does not consider static occlusions that could be present in structured environments. Dockstader et al. [51] describe


a method for tracking multiple people using multiple cameras. Feature points are extracted from images locally and corrected using the 3-D estimates of the feature point positions that are fed back from the central processor to the local processor. These corrected features are sent to the central processor, where a Bayesian network (BN) is employed to deduce a first estimate of the 3-D positions of these features. A KF follows the BN to maintain temporal continuity. This approach requires that each object be seen by some cameras at all times, which is not required in our approach. Also, performing motion vector computation at each node is computationally costly in a wireless sensor network.

We would like to emphasize that our work is focused on tracking a single object in the presence of static and moving occluders in a wireless sensor network setting. When there are no occluders, one could adopt a less computationally intensive approach similar to [42]. When all the objects need to be tracked simultaneously, the above-mentioned methods [50, 51] or a filter with a joint state for all the objects [52] can be used.

2.2 Setup, Models, and Assumptions

We consider the setup illustrated in Fig. 2.1, in which N cameras are aimed roughly horizontally around a room. Although an overhead camera would have a less occluded view than a horizontally placed one, it generally has a more limited view of the scene and may be impractical to deploy. Additionally, targets may be easier to identify in a horizontal view. The cameras are assumed to be fixed, and their locations and orientations are known to some accuracy to the cluster head. The camera network's task is to track an object in the presence of static occluders and other moving objects. We assume the object to be tracked to be a point object. This is reasonable because the object may be distinguished from occluders by some specific point feature. We assume there are M other moving objects, each modeled as a cylinder of diameter D. The position of each object is taken to be the center of its cylinder. From now on, we shall refer to the object to be tracked as the "object" and to the other moving objects as "moving occluders."

2.2 Setup, Models, and Assumptions We consider the setup illustrated in Fig. 2.1 in which N cameras are aimed roughly horizontally around a room. Although an overhead camera would have a less occluded view than a horizontally placed one, it generally has a more limited view of the scene and may be impractical to deploy. Additionally, targets may be easier to identify in a horizontal view. The cameras are assumed to be fixed and their locations and orientations are known to some accuracy to the cluster head. The camera network’s task is to track an object in the presence of static occluders and other moving objects. We assume that the object to track to be a point object. This is reasonable because the object may be distinguished from occluders by some specific point feature. We assume there are M other moving objects, each modeled as a cylinder of diameter D. The position of each object is assumed to be the center of its cylinder. From now on, we shall refer to the object to track as the “object” and the other moving objects as “moving occluders.” We assume the positions and the shapes of the static occluders in the room to be completely known in advance. This is not unreasonable since this information can be easily


Figure 2.1: Illustration of the setup used for object tracking.

provided to the network. On the other hand, only some prior statistics of the moving occluder positions are known at each time step. In Section 2.3.4, we discuss how these priors may be obtained. We assume that simple background subtraction is performed locally at each camera node. We assume that the camera nodes can distinguish between the object and the occluders. This can be done, for example, through feature detection, e.g., [53]. Since the horizontal position of the object in each camera’s image plane is the most relevant information to 2-D tracking, the background subtracted images are vertically summed and thresholded to obtain a “scan line” (see Fig. 2.2). Only the center of the object in the scan line is sent to the cluster head, which is only a few bytes.
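As a rough illustration of this local processing step, the following NumPy sketch reduces a background-subtracted frame to a scan line and reports the center column of the detected blob. The threshold value and the assumption that a single foreground blob corresponds to the object are illustrative simplifications, not the exact procedure used in the experiments.

    import numpy as np

    def frame_to_scan_line(frame, background, threshold=10.0):
        """Reduce one frame to a 1-D scan line (one value per image column)."""
        diff = np.abs(frame.astype(float) - background.astype(float))   # background subtraction
        column_energy = diff.sum(axis=0)                                # vertical summation
        return column_energy > threshold                                # thresholding -> boolean scan line

    def object_center(scan_line):
        """Return the center column of the foreground blob, or None if empty."""
        cols = np.flatnonzero(scan_line)
        if cols.size == 0:
            return None                     # nothing detected in this frame
        return int(cols.mean())             # only a few bytes are sent to the cluster head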

2.2.1 Camera Measurement Model

If a camera "sees" the object, its measurement is described by a noisy perspective camera model. If the camera cannot see the object because of occlusions or its limited field of view (FOV), it reports a "not-a-number" (NaN, using MATLAB syntax) to the cluster head.


Mathematically, for camera i = 1, . . . , N, we define the occlusion indicator function

\[
\eta_i \triangleq
\begin{cases}
1, & \text{if camera } i \text{ sees the object} \\
0, & \text{otherwise.}
\end{cases}
\tag{2.1}
\]

Note that the η_i random variables are not in general independent from each other. The camera measurement model including occlusions is then defined as

\[
z_i =
\begin{cases}
f_i \dfrac{h_i(x)}{d_i(x)} + v_i, & \text{if } \eta_i = 1 \\
\text{NaN}, & \text{otherwise,}
\end{cases}
\tag{2.2}
\]

where x is the position of the object, f_i is the focal length of camera i, and d_i(x) and h_i(x) are defined through Figure 2.3. The random variable v_i is the additive noise to the measurements, and its variance is given by

\[
\sigma_{v_i}^2 = f_i^2 \left( 1 + \frac{h_i^2(x)}{d_i^2(x)} \right)^{\!2} \sigma_{\theta}^2
 + f_i^2\, \frac{h_i^2(x) + d_i^2(x)}{d_i^4(x)}\, \sigma_{\mathrm{pos}}^2
 + \sigma_{\mathrm{read}}^2,
\tag{2.3}
\]

where it is assumed that the camera position is known to the tracker to an accuracy of σ pos , the camera orientation θi is known to an accuracy of σθ and readout is assumed to have an accuracy of σread , and these error sources are mutually independent. See Appendix B for derivation of this formula. We further assume that given x, the noise from the different cameras v1 , v2 , . . . , vN are independent, identically distributed Gaussian random variables. Note that the camera nodes report only the observations {zi } to the cluster head, and the

cluster head derives the values of the ηi s from the zi s.
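A direct transcription of Eqs. 2.1-2.3 into code could look as follows. This is a sketch under the stated model only: the occlusion indicator (the sees_object argument) is assumed to be computed elsewhere, the camera is described by a small dictionary whose field names are invented for this example, and the in-camera geometry (d_i(x) along the optical axis, h_i(x) perpendicular to it) follows one common convention.

    import numpy as np

    def camera_measurement(x, cam, sees_object, rng=np.random.default_rng()):
        """Simulate z_i for a camera observing a point object at x (Eq. 2.2).

        cam: dict with keys 'pos', 'theta', 'f', 'sigma_pos', 'sigma_theta',
             'sigma_read' (field names are illustrative).
        sees_object: the occlusion indicator eta_i of Eq. 2.1.
        """
        if not sees_object:
            return np.nan                        # the camera reports "not-a-number"
        dx = np.asarray(x, float) - np.asarray(cam['pos'], float)
        c, s = np.cos(cam['theta']), np.sin(cam['theta'])
        d_i = c * dx[0] + s * dx[1]              # depth along the optical axis
        h_i = -s * dx[0] + c * dx[1]             # lateral offset
        # Measurement noise variance from Eq. 2.3.
        var = (cam['f']**2 * (1 + h_i**2 / d_i**2)**2 * cam['sigma_theta']**2
               + cam['f']**2 * (h_i**2 + d_i**2) / d_i**4 * cam['sigma_pos']**2
               + cam['sigma_read']**2)
        return cam['f'] * h_i / d_i + rng.normal(0.0, np.sqrt(var))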

2.3 Tracker

As the measurement model in Eq. 2.2 is nonlinear in the object position, using a linear filter, e.g., a Kalman Filter (KF), for tracking would yield poor results. As discussed in [54], using an Extended Kalman Filter (EKF) with measurements from bearing sensors, which are similar to cameras with the aforementioned local processing, is not very successful. Although the use of an Unscented Kalman Filter (UKF) is more promising, its performance



Figure 2.2: Local processing at each camera node.


Figure 2.3: The camera measurement model.


degrades quickly when the static occluders and limited FOV constraints are considered. Because of the discreteness of the occlusions and FOV and the fact that UKF uses only a few points from the prior of the object state, most of these points may get discarded. We also experimented with a Maximum A-Posteriori (MAP) estimator combined with a KF, which is similar to the approach in [39]. This approach, however, failed at the optimization stage of the MAP estimator, as the feasible set is highly disconnected due to the static occluders and limited camera FOV. Given these considerations, we decided to use a particle filter (PF) tracker [55]. We denote by u(t) the state of the object at time t, which includes its position x(t) and other relevant information. The positions of the moving occluders j ∈ {1, . . . M }, x j (t) are assumed to be Gaussian with mean µj (t) and covariance matrix Σj (t). These priors

are available to the tracker. The state of the object and the positions of the moving occluders are assumed to be mutually independent. Note that if the objects move in groups, one can still apply the following tracker formulation by defining a "super-object" for each group and assuming that the super-objects move independently.

The tracker maintains the probability density function (pdf) of the object state u(t), and updates it at each time step using the new measurements. Given the measurements up to time t − 1, {Y(t′)} for t′ = 1, . . . , t − 1, the particle filter approximates the pdf of u(t − 1) by a set of L weighted particles (e.g., see Figure 2.1) as

\[
f\!\left( u(t-1) \,\middle|\, \{Y(t')\}_{t'=1}^{t-1} \right) \approx \sum_{\ell=1}^{L} w_\ell(t-1)\, \delta\!\left( u(t-1) - u_\ell(t-1) \right),
\]

where δ(·) is the Dirac delta function, u` (t) is the state of particle ` at time t (i.e., a sample of u(t)). At each time step, given these L weighted particles, the camera measurements Z(t) = {z1 (t), . . . , zN (t)} and η(t) = {η1 (t), . . . , ηN (t)}, the moving occluder priors

{µj (t), Σj (t)}, j ∈ {1, . . . , M }, information about the static occluder positions and the

camera positions, orientations and FOVs, the tracker incorporates the new information obtained from the measurements at time t to update the particles (and their associated weights). We use the auxiliary sampling importance resampling (ASIR) filter described in [55, 56]. The outline of one step of our implementation of this filter is given in Fig. 2.4.


Algorithm: ASIR
Inputs: particle-weight tuples {u_ℓ(t−1), w_ℓ(t−1)}, ℓ = 1, . . . , L; moving occluder priors {µ_j(t), Σ_j(t)}, j = 1, . . . , M; measurements Z(t) = {z_1(t), . . . , z_N(t)} and η(t) = {η_1(t), . . . , η_N(t)}; shapes and positions of the static occluders; camera orientations {θ_i} and positions; FOVs of the cameras.
Output: particle-weight tuples {u_ℓ(t), w_ℓ(t)}, ℓ = 1, . . . , L.

01. for ℓ = 1, . . . , L
02.     κ_ℓ := E(u(t) | u_ℓ(t−1))
03.     w̃_ℓ(t) ∝ f(Z(t), η(t) | κ_ℓ) w_ℓ(t−1)
04. end for
05. {w_ℓ(t)} = Normalize({w̃_ℓ(t)})
06. {·, ·, p_ℓ} = Resample({κ_ℓ, w_ℓ(t)})
07. for ℓ = 1, . . . , L
08.     Draw u_ℓ(t) ∼ f(u(t) | u_{p_ℓ}(t−1))
09.     w̃_ℓ(t) = f(Z(t), η(t) | u_ℓ(t)) / f(Z(t), η(t) | κ_{p_ℓ})
10. end for
11. {w_ℓ(t)} = Normalize({w̃_ℓ(t)})

Figure 2.4: The auxiliary sampling importance resampling (ASIR) algorithm.

In this figure, E(·) represents the expectation operator, and the procedure {w_ℓ} = Normalize({w̃_ℓ}) normalizes the weights so that they sum to one. The procedure {u_ℓ, w_ℓ, p_ℓ} = Resample({u_ℓ, w_ℓ}) takes L particle-weight pairs and produces L equally weighted particles (w_ℓ = 1/L), preserving the original distribution. This amounts to particles with small initial weights being killed and the ones with high weights reproducing. The third output of the procedure (p_ℓ) refers to the index of particle ℓ's parent. The ASIR algorithm approximates the optimal importance density function f(u(t) | u_ℓ(t−1), Z(t), η(t)),

which is not feasible to compute in general [55].

In the following, we explain the implementation of the importance density function f (u(t)|u`(t − 1)) and the likelihood f (Z(t), η(t)|u` (t)).
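A minimal NumPy rendering of one ASIR step (Fig. 2.4) is sketched below. The functions propagate_mean (line 02 of the figure), likelihood (f(Z(t), η(t) | ·)) and draw_from_importance (Section 2.3.1) are assumed to be supplied by the caller; the names are placeholders, and the fragment illustrates the structure of the algorithm rather than the implementation used for the results in this chapter.

    import numpy as np

    def asir_step(particles, weights, Z, eta, propagate_mean, likelihood,
                  draw_from_importance, rng=np.random.default_rng()):
        L = len(particles)
        # First stage: weight by the likelihood of the predicted means (lines 01-05).
        kappa = [propagate_mean(p) for p in particles]
        w1 = np.array([likelihood(Z, eta, k) * w for k, w in zip(kappa, weights)])
        w1 /= w1.sum()
        # Resample parent indices p_l according to the first-stage weights (line 06).
        parents = rng.choice(L, size=L, p=w1)
        # Second stage: propagate the survivors and correct the weights (lines 07-11).
        new_particles, w2 = [], np.empty(L)
        for l, p in enumerate(parents):
            u = draw_from_importance(particles[p])
            new_particles.append(u)
            w2[l] = likelihood(Z, eta, u) / likelihood(Z, eta, kappa[p])
        w2 /= w2.sum()
        return new_particles, w2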



2.3.1 Importance Density Function

The particles are advanced in time by drawing a new sample u_ℓ(t) from the "importance density function" f(u(t) | u_ℓ(t−1)):

\[
u_\ell(t) \sim f(u(t) \mid u_\ell(t-1)), \qquad \ell \in \{1, \ldots, L\}.
\]

This is similar to the "time update" step in a KF. After all L new particles are drawn, the distribution of the state is forwarded one time step. Therefore, the dynamics of the system should be reflected as accurately as possible in the importance density function. In a KF, a constant-velocity model with a large variance on the velocity is assumed to account for direction changes. Although assuming that objects move at constant velocity is not realistic, the linearity constraint of the KF forces this choice. In the PF implementation, we do not have to choose linear dynamics. We use the more realistic "random waypoints model," where the objects choose a target and try to move toward it with constant speed plus noise until they reach the target. When they reach it, they choose a new target. We implemented a modified version of this model in which the state of the particle consists of its current position x_ℓ(t), target τ_ℓ(t), speed s_ℓ(t) and regime r_ℓ(t). Note that the time step here is 1, and thus s_ℓ represents the distance travelled in unit time. The model is given by u_ℓ^T(t) = [x_ℓ^T(t)  τ_ℓ^T(t)  s_ℓ(t)  r_ℓ(t)]. The regime can be one of the following:

1. Move toward target (MTT): A particle in this regime tries to move toward its target with constant speed plus noise:

\[
x_\ell(t) = x_\ell(t-1) + s_\ell(t-1)\, \frac{\tau_\ell(t-1) - x_\ell(t-1)}{\lVert \tau_\ell(t-1) - x_\ell(t-1) \rVert} + \nu(t),
\]

where ν(t) is zero mean Gaussian white noise with Σν = σν2 I, I denotes the identity matrix and σν is assumed to be known. The speed of the particle is also updated



according to

\[
s_\ell(t) = (1 - \phi)\, s_\ell(t-1) + \phi\, \lVert x_\ell(t) - x_\ell(t-1) \rVert.
\]

Updating the speed this way smooths out the variations due to the added noise. We chose φ = 0.7 for our implementation. The target is left unchanged.

2. Change Target (CT): A particle in this regime first chooses a new target randomly (uniformly) in the room and then performs an MTT step.

3. Wait (W): A particle in this regime does nothing.

Drawing a new particle from the importance density function involves the following. First, each particle chooses a regime according to its current position and its target. If a particle has reached its target, it chooses the regime according to

\[
r_\ell(t) =
\begin{cases}
\text{MTT}, & \text{w.p. } \beta_1, \\
\text{CT}, & \text{w.p. } \lambda_1, \\
\text{W}, & \text{w.p. } (1 - \beta_1 - \lambda_1).
\end{cases}
\]

The target is assumed “reached” when the distance to it is less than the particle’s speed. If a particle does not reach its target, the probabilities β1 and λ1 are replaced by β2 and λ2 , respectively. We chose β1 = 0.05, λ1 = 0.9, β2 = 0.9, λ2 = 0.05.
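The resulting importance density can be sketched as follows. The numerical values (φ = 0.7 and the β, λ probabilities) are those quoted above; the room size and motion-noise standard deviation are borrowed from the simulation settings of Section 2.4, and the particle representation and helper names are choices made only for this illustration.

    import numpy as np

    BETA1, LAMBDA1 = 0.05, 0.9     # regime probabilities when the target has been reached
    BETA2, LAMBDA2 = 0.9, 0.05     # regime probabilities otherwise
    PHI = 0.7                      # speed-smoothing factor

    def draw_particle(p, room=100.0, sigma_nu=0.33, rng=np.random.default_rng()):
        """Draw u_l(t) ~ f(u(t) | u_l(t-1)) for one particle p = {'x', 'target', 'speed'}."""
        x = np.array(p['x'], float)
        tau = np.array(p['target'], float)
        s = p['speed']
        reached = np.linalg.norm(tau - x) < s
        beta, lam = (BETA1, LAMBDA1) if reached else (BETA2, LAMBDA2)
        r = rng.random()
        if r < beta + lam:                          # MTT (r < beta) or CT (beta <= r < beta + lam)
            if r >= beta:                           # CT: choose a new target uniformly in the room
                tau = rng.uniform(0.0, room, size=2)
            direction = tau - x
            direction /= max(np.linalg.norm(direction), 1e-9)
            x_new = x + s * direction + rng.normal(0.0, sigma_nu, size=2)
            s = (1.0 - PHI) * s + PHI * np.linalg.norm(x_new - x)
            x = x_new
        return {'x': x, 'target': tau, 'speed': s}  # Wait regime: state unchanged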

2.3.2 Likelihood

Updating the weights in the ASIR algorithm requires the computation of the likelihood of the measurements, f(Z(t), η(t) | u_ℓ(t)). For brevity, we shall drop the time index from now on. We can use the chain rule for probabilities to decompose the likelihood and obtain

\[
f(Z, \eta \mid u_\ell) = p(\eta \mid u_\ell)\, f(Z \mid \eta, u_\ell).
\tag{2.4}
\]



Now, given x_ℓ, which is part of u_ℓ, and η, the measurements z_1, . . . , z_N become independent Gaussian random variables, and we have

\[
f(Z \mid \eta, u_\ell) = \prod_{i\,:\,\eta_i = 1} \mathcal{N}\!\left( z_i;\; f_i \frac{h_i(x_\ell)}{d_i(x_\ell)},\; \sigma_{v_i}^2 \right),
\]

where N(r; ξ, ρ²) denotes a univariate Gaussian function of r with mean ξ and variance ρ², σ²_{v_i} is given in Eq. 2.3, and d_i(x) and h_i(x) are defined in Fig. 2.3.

The first term in Eq. 2.4, however, cannot be expressed as a product, as the occlusions are not independent given u_ℓ. This can be explained via the following simple example: suppose two cameras are close to each other. Once we know that one of these cameras cannot see the object, it is more likely that the other one cannot see it either. Hence, the two η's are dependent given u_ℓ. Luckily, we can approximate the first term in Eq. 2.4 in a computationally feasible manner using recursion. First, we ignore the static occluders and the limited FOV, and consider only the effect of the moving occluders. The effects of static occluders and limited FOV will be added in Section 2.3.3. Define the indicator functions η_{i,j} for i = 1, . . . , N and j = 1, . . . , M such that η_{i,j} = 1 if occluder j does not occlude camera i, and 0 otherwise. Thus

\[
\{\eta_i = 1\} = \bigcap_{j=1}^{M} \{\eta_{i,j} = 1\}.
\]

The probability that occluder j occludes camera i, given u, is thus

\[
\begin{aligned}
P\{\eta_{i,j} = 0 \mid u\} &= \int f(x_j \mid u)\, P\{\eta_{i,j} = 0 \mid u, x_j\}\, dx_j \\
&\overset{(a)}{=} \int f(x_j)\, P\{\eta_{i,j} = 0 \mid x, x_j\}\, dx_j \\
&\triangleq q^{\mathrm{mv}}_{i,j}(x),
\end{aligned}
\]

CHAPTER 2. OBJECT TRACKING

21

x

PSfrag replacements θi,j (x) D

Ai (x)

Prior of object j

Camera i mv Figure 2.5: Computing qi,j (x).

camera i is placed at the origin. We assume that the moving occluder diameter D is small compared to the occluder standard deviations. Occluder j occludes point x at camera i if its center is inside the rectangle Ai (x). This means P {ηi,j = 0|x, xj } = 1 if xj ∈ Ai (x) and it is zero everywhere else: mv qi,j (x)

Z

1 1 T −1 p e− 2 (xj −µj ) Σj (xj −µj ) dxj |Σj | Ai (x) 2π  √   √   αj D αj D (b) 1 ≈ erf −ϕ + erf +ϕ 4 kg10 k 2 kg10 k 2 ! !# " µ0j T o1 kxkkg1 k2 − µ0j T o1 + erf , erf kg10 k kg10 k

=

(2.5)

where µ0j is rotated version of µj such that the major axis of occluder j’s prior is horizontal √ (see Fig. C.1), oT1 = [cos(θi,j (x)) αj sin(θi,j (x))], g1T = [cos(θi,j (x)) αj sin(θi,j (x))], √ g10 = 2σj g1 , ϕ = [− sin(θi,j (x)) cos(θi,j (x))]µ0j , and σj2 and σj2 /αj (αj ≥ 1), are the eigenvalues of the covariance matrix Σj of the prior of occluder j. Step (b) follows by the assumption of small moving occluders. See Appendix C for the derivation of this equation. To compute p(η|u), first consider the probability of all ηs of the cameras in subset S,

CHAPTER 2. OBJECT TRACKING

22

given u, to be equal to 1, P

\

i∈S

! {ηi = 1} u = P

! {ηi,j = 1} u i∈S j=1 ! M \ \ =P {ηi,j = 1} u j=1 i∈S ! M \ (c) Y = P {ηi,j = 1} u M \\

j=1

=

M Y j=1

(d)



=

M Y j=1

M Y j=1

4

i∈S

1−P 1−

X

1−

X

= pmv S (x),

i∈S

i∈S

[

i∈S

{ηi,j

!! = 0} u !

P {ηi,j = 0|u} mv qi,j (x)

! (2.6)

where (c) follows by the assumption that the occluder positions are independent, and (d) follows from the assumption of small D and the reasonable assumption that the cameras in S are not too close so that the overlap between Ai (x), i ∈ S, is negligible. Note that

cameras that satisfy this condition can still be close enough, such that their FOVs overlap and ηs are dependent.

Now we can compute pmv (η|u) using Eq. 2.6 and recursion as follows. Let S = {1,

. . . , N } (i.e., the set of all cameras). For any n ∈ S such that ηn = 0, define ηa = {η1 , . . . , ηn−1 , ηn+1 , . . . , ηN } ηb = {η1 , . . . , ηn−1 , 1, ηn+1 , . . . , ηN }. Then, pmv (η|u) = pmv (ηa |u) − pmv (ηb |u).

(2.7)

CHAPTER 2. OBJECT TRACKING

23

Both terms in the right-hand-side of Eq. 2.7 are one step closer to p mv S (u) (with different S), because one less element is zero in both ηa and ηb . This means that any pmv (η|u) can be reduced recursively to terms consisting of pmv S (x), using Eq. 2.7. Let us explain this with the following example. Assume we have N = 3 cameras and η = {1, 0, 0}. Then pmv (η|u) =P ({η1 = 1} ∩ {η2 = 0} ∩ {η3 = 0}|u) =P ({η1 = 1} ∩ {η2 = 0}|u) − P ({η1 = 1} ∩ {η2 = 0} ∩ {η3 = 1}|u) =P ({η1 = 1}|u) − P ({η1 = 1} ∩ {η2 = 1}|u) − P ({η1 = 1} ∩ {η3 = 1}|u) + P ({η1 = 1} ∩ {η2 = 1} ∩ {η3 = 1}|u)

mv mv mv =pmv {1} (x) − p{1,2} (x) − p{1,3} (x) + p{1,2,3} (x),

where we used the above trick 2 times, to obtain 4 terms of the form p mv S (x). The bad news is, the computational load of this recursion is exponential in the number of zeros in η. However, this bottleneck is greatly alleviated by the limited FOV of the cameras as will be explained in the following subsection.

2.3.3 Adding Static Occluders and Limited Field of View Adding the effects of the static occluders and limited camera FOV to the procedure described above involves a geometric partitioning of the particles into bins. Each bin is assigned a set of cameras. After this partitioning, only the ηs of the assigned cameras are considered for the particles in that bin. This is explained using the example in Fig. 2.6. In this example, we have 2 cameras and a single static occluder. As denoted by the dashed line in the figure, we have 2 partitions. Let η1 = 0 and η2 = γ2 ∈ {0, 1}. Let us consider a

particle belonging to the upper partition, namely particle `1 . If the object is at x`1 , the static occluder makes η1 = 0, independent of where the moving occluders are. On the other hand, the static occluder and limited FOV do not occlude the second camera’s view of particle ` 1 . So, only Cam2 is assigned to this partition, and the first term in the likelihood is given by P ({η1 = 0} ∩ {η2 = γ2 }|u`1 ) = pmv (η2 |u`1 ).

CHAPTER 2. OBJECT TRACKING

24

Cam2

u`1 = [x`1 , . . .]

PSfrag replacements u`2 = [x`2 , . . .]

Cam1

Figure 2.6: Geometric partitioning to add static occluders and limited FOV. If η 1 = 1, the object cannot be at x`1 . If η1 = 0, only Cam2 needs to be considered for computing p(η|u`1 ). Both cameras need to be considered for computing p(η|u`2 ). Similarly, P ({η1 = 1} ∩ {η2 = γ2 }|u`1 ) = 0

P ({η1 = γ1 } ∩ {η2 = γ2 }|u`2 ) = pmv (η1 , η2 |u`2 ), where the fist line follows because if the object is at x`1 , η1 = 0, and the second line follows because the static occluder and limited FOV do not occlude particle ` 2 . Note that the number of cameras assigned to a partition is not likely to be large. In a practical setting, the cameras are placed such that the fields of views of the cameras cover the whole room (or monitored area). This means that the number of cameras that see a given point is a fraction of the total number of cameras. This reduces the average complexity considerably. Also, because the camera placements, FOV and static occluder positions are known in advance, the room can be divided into regions beforehand, with each region assigned the cameras that can see it. The number of such regions grows at most quadratically in the number of cameras [57]. During tracking, the particles can be

CHAPTER 2. OBJECT TRACKING

25

easily divided into partitions depending on which pre-computed region each particle is. We mentioned in Section 2.2 that the camera nodes can distinguish between the object and the occluders. This may be unrealistic in some practical settings. To address this problem, one can introduce another random variable that indicates the event of detecting and recognizing the object and include its probability in the likelihood. We have not implemented this modification, however.

2.3.4 Obtaining Occluder Priors Our tracker assumes the availability of priors for the moving occluder positions. In this section we discuss how these priors may be obtained. In Section 2.4, we investigate the tradeoff between the accuracy of such priors and that of tracking. Clearly, one could run a separate PF for each object, and then fit Gaussians to the resulting particle distributions. This requires solving the data association problem, which would require substantial local and centralized processing. Instead of solving the data association problem, trackers that represent the states of all objects in a joint state have been proposed (e.g. [52]). This approach, however, is computationally prohibitive as it requires employing an exponentially increasing number of particles in the size of the state. Another approach to obtaining the priors is to use a hybrid sensor network combining, for example, acoustic sensors in addition to cameras. As these sensors use less energy than cameras, they could be used to generate the priors for the moving occluders. An example of this approach can be found in [58]. Yet another approach to obtaining the occluder priors involves reasoning about occupancy using the “visual hull” (VH) as described in [27] (see Fig. 2.7). To compute the VH, the entire scan lines from the cameras are sent to the cluster head instead of only the centers of the object blobs in the scan lines as discussed in Section 2.2. This only marginally increases the communication cost. The cluster head then computes the VH by back-projecting the blobs in the scan lines to cones in the room. The cones from the multiple cameras are intersected to compute the total VH. Since the resulting polygons are larger than the occupied areas and “phantom” polygons that do not contain any objects may be



Figure 2.7: The visual hull is computed by back-projecting the scan lines to the room and intersecting the resulting cones.

present, VH provides an upper bound on occupancy. The computation of the VH is relatively light-weight, and does not require solving the data association problem. The VH can then be used to compute occluder priors by fitting ellipses to the polygons and using them as Gaussian priors. Alternatively, the priors can be assumed to be uniform distributions over these polygons. In this case the computation of q^mv_{i,j}(x) in Eq. 2.5 would need to be

modified. Although the VH approach to computing occluder priors is quite appealing for a WSN implementation, several problems remain to be addressed. These include dealing with the object's own polygon and phantom removal [59], which is necessary because such polygons can cause many good particles to be killed.
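One simple, illustrative way to turn a VH polygon into a Gaussian occluder prior is to match the first two moments of a uniform distribution over the polygon; the sketch below does this by rejection sampling, using shapely for the geometry. This is only one option alongside the ellipse-fitting approaches mentioned above, and the polygon coordinates are hypothetical.

```python
import numpy as np
from shapely.geometry import Polygon, Point

def gaussian_prior_from_polygon(poly, n_samples=2000, rng=np.random.default_rng(0)):
    """Fit a Gaussian (mu, Sigma) to a visual-hull polygon by matching the moments
    of a uniform distribution over the polygon (estimated by rejection sampling)."""
    minx, miny, maxx, maxy = poly.bounds
    pts = []
    while len(pts) < n_samples:
        p = rng.uniform([minx, miny], [maxx, maxy])
        if poly.contains(Point(p[0], p[1])):
            pts.append(p)
    pts = np.array(pts)
    return pts.mean(axis=0), np.cov(pts, rowvar=False)

# Example: a (hypothetical) visual-hull polygon obtained from intersected camera cones.
vh_polygon = Polygon([(10, 10), (18, 11), (20, 17), (12, 19)])
mu_j, Sigma_j = gaussian_prior_from_polygon(vh_polygon)
```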

2.4 Simulation Results

In a practical tracking setting one is given the room structure (including information about the static occluders), the range of the number of moving occluders and their motion model, and the required object tracking accuracy. Based on this information, one needs to decide on the number of cameras to use in the room, the amount of prior information about the moving occluder positions needed, and how to best obtain this information. Making these


decisions involves several tradeoffs, for example, between the occluder prior accuracy and the tracker performance, between the number of cameras used and the required occluder prior accuracy, and between the number of occluders present and the tracking performance. In this section we explore these tradeoffs using simulations. In the simulations we assume a square room of size 100 × 100 units and a maximum

of 8 cameras placed around its periphery (see Fig. 2.8). The black rectangle in the figure

depicts a static occluder. Note, however, that in some of the simulations we assume no static occluders. All cameras look toward the center of the room. The camera FOV is assumed to be 90°. The standard deviation of the camera position error is σpos = 1 unit, that of the camera angle error is σθ = 0.01 radians, and the read noise standard deviation is σread = 2 pixels. The diameter of each moving occluder is assumed to be D = 3.33 units. We assume that the objects move according to a random waypoint model. This is similar to the way we draw new particles from the importance density function as discussed in Subsection 2.3.1, with the following differences:

• The objects are only in regimes MTT or CT; there is no W regime.

• The objects choose their regimes deterministically, not randomly. If an object reaches its target or is heading toward the inside of a static occluder or outside the room boundaries, it transitions to the CT regime.

• Objects go around each other instead of colliding.

The average speed of the objects is set to 1 unit per time step. The standard deviation of the noise added to the motion at each time step is 0.33 units. Fig. 2.8 also shows a snapshot of the objects for M = 40 occluders. In the PF tracker we use 1000 particles. In each simulation, the object and the occluders move according to the random waypoint model for 4000 time steps.

To investigate tradeoffs involving moving occluder prior accuracy, we need a measure for the accuracy of the occluder prior. To develop such a measure, we assume that the priors are obtained using a KF run on virtual measurements of the moving occluder positions of the form

y_j(t) = x_j(t) + ψ_j(t), \quad j = 1, 2, \ldots, M,


Figure 2.8: The setup used in simulations.

where x_j(t) is the true occluder position, ψ_j(t) is white Gaussian noise with covariance σ_ψ² I, and y_j(t) is the measurement. We then use the average root mean square error (RMSE) of the KF (RMSEocc) as a measure of the occluder prior accuracy. A lower RMSEocc means that higher accuracy sensors or more computation are used to obtain the priors, which results in higher energy consumption in the network. At the extremes, RMSEocc = 0 (when σψ = 0) corresponds to complete knowledge of the moving occluder positions, and RMSEocc = RMSEmax (when σψ = ∞) corresponds to no knowledge of the moving occluder positions.
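A minimal sketch of such a virtual-measurement KF is given below; it assumes simple random-walk occluder dynamics in place of the random waypoint model, and all numeric values (initial covariance, process noise) are illustrative.

```python
import numpy as np

def kf_occluder_prior(y, sigma_psi, sigma_q=0.33, sigma0=25.0):
    """Random-walk Kalman filter on virtual position measurements y[t] = x[t] + psi[t].
    Returns the filtered means and covariances used as moving-occluder priors."""
    T = len(y)
    mu, P = np.zeros((T, 2)), np.zeros((T, 2, 2))
    m, C = np.array([50.0, 50.0]), (sigma0 ** 2) * np.eye(2)   # start at the room center
    Q, R = (sigma_q ** 2) * np.eye(2), (sigma_psi ** 2) * np.eye(2)
    for t in range(T):
        C = C + Q                                  # predict (identity dynamics)
        K = C @ np.linalg.inv(C + R)               # Kalman gain
        m = m + K @ (y[t] - m)                     # update with measurement y[t]
        C = (np.eye(2) - K) @ C
        mu[t], P[t] = m, C
    return mu, P

# RMSE_occ for one simulated occluder track with sigma_psi = 8 (cf. Fig. 2.9).
rng = np.random.default_rng(1)
x_true = np.cumsum(rng.normal(0, 0.33, size=(4000, 2)), axis=0) + 50.0
y = x_true + rng.normal(0, 8.0, size=x_true.shape)
mu, _ = kf_occluder_prior(y, sigma_psi=8.0)
rmse_occ = np.sqrt(np.mean(np.sum((mu - x_true) ** 2, axis=1)))
```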

Note that the worst case RMSEmax is finite because when there are no measurements about the occluder positions, one can simply assume that they are located at the center of the room. This corresponds to RMSEmax = 25.0 units for the setup in Fig. 2.8. To implement the tracker for these two extreme cases, we modify the p(η|u) computation as follows. We assign 0 or 1 to p(η|u) depending on the consistency of η with our

knowledge about the occluders. For RMSEocc = 0, i.e., when we have complete information about the moving occluder positions, the moving occluders are treated as static occluders. On the other hand, for RMSEocc = RMSEmax , i.e., when there is no information



Figure 2.9: Average tracker RMSE versus the number of cameras for M = 40, and 1 static occluder. The dotted line is the worst case RMSE when no tracking is performed and the object is assumed to be at the center of the room.

about the moving occluder positions, we check the consistency with only the static occluder and the limited FOV information to assign zero probabilities to some particles. For the example in Fig. 2.6, we set P({η1 = 1} ∩ {η2 = γ2} | u_ℓ1) = 0, because if Cam1 sees the object, the object cannot be at x_ℓ1. Any other probability that is non-zero is set to 1. Note that for these two extreme cases, we no longer need the recursion discussed in Section 2.3.2 to compute the likelihood; hence, the computational complexity is considerably lighter than when using Gaussian priors.

First, in Fig. 2.9 we plot the average RMSE of the tracker (RMSEtr) over 5 simulation runs for the two extreme cases of RMSEocc = 0 and RMSEocc = RMSEmax, and for RMSEocc = 6.67 (obtained by setting σψ = 8), versus the number of cameras (the cameras constitute a roughly evenly spaced subset of the cameras in Fig. 2.8; for 2 cameras, an orthogonal placement is used for better triangulation of the object position). The dotted line represents the worst case RMSE (RMSEmax), when there are no measurements and the object is assumed to be at the center of the room.

We then investigate the dependency of the tracker accuracy on the accuracy of the moving occluder priors. Fig. 2.10 plots the average RMSE for the tracker over 5 simulation



Figure 2.10: Dependency of the tracker average RMSE on the accuracy of the occluder prior for N = 4, M = 40 and no static occluders. The dotted line is for RMSEtr = RMSEocc.

runs versus RMSEocc for N = 4 cameras. In order to isolate the effect of the moving occluder priors, we used no static occluders in these simulations. RMSEmax reduces to 21.3 units for this case. Note that there is around a 2.35× increase in RMSEtr from the case of perfect occluder information (RMSEocc = 0) to the case of no occluder information (RMSEocc = RMSEmax). Moreover, it is not realistic to assume that the occluder prior accuracy would be better than that of the tracker. With this consideration, the improvement reduces to around 1.94× (this is obtained by noting that RMSEtr = RMSEocc at around 3.72). These observations suggest that obtaining prior information may not be worthwhile in practice, unless it can be obtained cheaply and to a reasonable accuracy.

The tradeoff between RMSEocc and the number of cameras needed to achieve an average RMSEtr = 3 is plotted in Fig. 2.11. As expected, there is a tradeoff between the number of cameras and the accuracy of the moving occluder priors as measured by RMSEocc. As more cameras are used, the accuracy of the prior information needed decreases. The plot suggests that if a large enough number of cameras is used, no prior information is needed at all. Of course, having more cameras means more communication and processing



Figure 2.11: Tradeoff between the number of cameras and moving occluder prior accuracy for target tracker average RMSE = 3 units for M = 40 and no static occluders.

cost. So, in the design of a tracking system, one needs to compare the cost of deploying more cameras to that of obtaining better occluder priors.

Next we explore the question of how the needed moving occluder prior accuracy depends on the number of occluders present. To do so, in Fig. 2.12 we plot RMSEtr versus the number of moving occluders for the two extreme cases, RMSEocc = 0 and RMSEocc = RMSEmax. Note that the difference between the RMSEtr for the two cases is the potential improvement in tracking performance achieved by having occluder prior information. When there are very few moving occluders, prior information does not help (because the object is not occluded most of the time). As the number of occluders increases, prior information becomes more useful. But the difference in RMSEtr between the two extreme cases decreases when too many occluders are present (because the object becomes occluded most of the time).

In Section 2.3.3, we mentioned that the complexity of computing the likelihood given u_ℓ is exponential in the number of cameras that cannot see the object and are assigned to the region x_ℓ belongs to. We argued that in practice the average complexity is significantly lower than exponential in N, because the number of cameras assigned to a region is a



Figure 2.12: Tracker average RMSE versus the number of moving occluders for the two extreme cases RMSEocc = 0 and RMSEocc = RMSEmax . Here N = 4 and there are no static occluders.


Figure 2.13: Average CPU time for computing the likelihoods relative to that for the case of 2 cameras and no occluder prior, i.e., RMSEocc = RMSEmax . Here M = 40 and there is 1 static occluder.



Figure 2.14: Experimental setup. (a) View of lab (cameras are circled). (b) Relative locations of cameras and virtual static occluder. Solid line shows the actual path of the object to track.

fraction of N. To verify this, in Fig. 2.13 we plot the average CPU time (per time step) used to compute the likelihood, relative to that of the RMSEocc = RMSEmax case for 2 cameras, versus the total number of cameras in the room. The simulations were performed on a 3 GHz Intel Xeon processor running MATLAB R14. Note that the rate of increase of the CPU time using priors is significantly lower than 2^N, where N is the number of cameras used, and it is close to the rate of increase of the RMSEocc = RMSEmax case. In fact, the rate of increase for this particular example is close to linear in N.

2.5 Experimental Results

We tested our tracking algorithm in an experimental setup consisting of 16 web cameras placed around a 220 × 190 room. The horizontal FOV of the cameras used is 47°. A picture of the lab is shown in Fig. 2.14(a) and the relative positions and orientations of the cameras

in the room are provided in Fig. 2.14(b). Each pair of cameras is connected to a PC via an IEEE 1394 (FireWire) interface, and each camera can provide 8-bit 3-channel (RGB) raw video at 7.5 frames/s. The data from each camera is processed independently as described in Section 2.2. The scan line data is then sent to a central PC (cluster head), where further processing is performed.



Figure 2.15: Experimental results. Average tracker RMSE versus the number of cameras for M = 20, and 1 static occluder.

The object follows the pre-defined path (shown in Fig. 2.14) with no occlusions present, and 200 time steps of data are collected. The effect of static and moving occluders is simulated using 1 virtual static occluder and M = 20 virtual moving occluders: we discarded the measurements from the cameras that would have been occluded had there been real occluders. The moving occluders walk according to the model explained in Section 2.4. D is chosen to be 12 inches for the moving occluders, and the camera noise parameters were assumed to be σpos = 6 inches, σread = 2 pixels and σθ = 0.005 radians.

Figure 2.15 plots the average RMSE of the tracker over 40 simulation runs for the two extreme cases of RMSEocc = RMSEmax = 61.8 inches and RMSEocc = 0, and for RMSEocc = 14.2 inches, versus the number of cameras. There is a notable difference in performance between the three cases throughout the entire plot, but the difference is more pronounced when the number of cameras is small, agreeing with the tradeoffs discussed in Section 2.4.


2.6 Summary

We described a sensor network approach for tracking a single object in a structured environment using multiple cameras. Instead of tracking all objects in the environment, which is computationally very costly, we track only the target object and treat others as occluders. The tracker is provided with complete information about the static occluders and some prior information about the moving occluders. One of the main contributions of this work is developing a systematic way to incorporate this information into the tracker formulation. Using simulations, we explored several tradeoffs involving the occluder prior accuracy, the number of cameras used, the number of occluders present, and the accuracy of tracking, with some interesting implications.

Chapter 3 Camera Node Selection

In this chapter, a camera network node subset selection methodology for target localization in the presence of static and moving occluders is described.¹ Similar to Chapter 2, it is assumed that the locations of the static occluders are known, but that only prior statistics for the positions of the object and the moving occluders are available. A weak perspective camera model, which is a linear approximation to the perspective camera model, is adopted, and the occluder information is included in the camera measurement via the occlusion indicator function that was presented in Chapter 2. The minimum mean square error of the best linear estimate of object position based on camera measurements is then used as a metric for selection. It is shown through simulations and experiments that a greedy selection heuristic performs close to optimal and outperforms other heuristics.

The rest of this chapter is organized as follows. A brief survey of previous work on sensor selection and camera selection is presented in the next section. In Section 3.2, we introduce the setup and camera model, define the selection metric and explain how it can be efficiently computed. In Section 3.3, we compare the performance of the greedy selection heuristic to other heuristics and to the optimal solution, both in simulation and experimentally.

¹The work in this chapter was first published in [60].


3.1 Previous Work

Sensor selection has been addressed in the sensor networks, computer vision, and computer graphics literature. Selection has been studied in wireless sensor networks with the goal of decreasing energy cost and increasing scalability. Viewpoint selection, or the next best view, has been studied in computer graphics and vision for picking the most informative views of a scene. We summarize the work related to this chapter in this section.

Sensor Selection: Chu et al. [61] develop a technique referred to as IDSQ to select the next best sensor node to query in a sensor network. The technique is distributed and uses a utility measure based on the expected posterior distribution. However, the expected posterior distribution is expensive to compute because it involves integrating over all possible measurements. Ertin et al. [62] use the mutual information metric to select sensors. This is shown to be equivalent to minimizing the expected posterior uncertainty, but with significantly less computation. The work in [31] expands on [62] and shows how to select the sensor with the highest information gain; an entropy-based heuristic that approximates the mutual information and is computationally cheaper is used. Slijepcevic et al. [63] use a heuristic that makes sure the selected subset of nodes completely covers the monitored area. Other researchers consider general utility functions and use their properties for optimal selection [30, 64]. We use the minimum mean square object localization error as the selection metric. Also, we consider the occlusion phenomenon, which is unique to camera sensors compared to other sensing modalities.

Camera Selection: Sensor selection has also been studied for camera sensors. In [65–67], a metric is defined for the next best view based on the most faces seen (given a 3-D geometric model of the scene), the most voxels seen, or overall coverage. The solution requires searching through all camera positions to find the highest scoring viewpoints. Yang et al. [68] and Isler et al. [69] deal with sensing modalities where the measurements can be interpreted as polygonal subsets of the plane and use geometric quantities such as the area of these subsets as the selection metric. [68] proposes a greedy search to minimize the selection metric, and [69] proves that, for their setting, an exhaustive search over at most 6 sensors yields performance within a factor of 2 of the optimal selection. These works use numerical techniques or heuristics to compute the selection metrics. We use a noisy camera


measurement model and approximate the optimal selection that minimizes the minimum mean square target localization error using a greedy heuristic.

3.2 Setup, Models and Assumptions

We use a similar setup to the one described in Section 2.2 and Fig. 2.1. We assume there are N cameras aimed roughly horizontally around a room. The cameras are assumed to be fixed, and their locations and orientations are known to some accuracy by the cluster head. The camera network's task is to localize an object in the presence of static occlusions and other moving objects. We assume the object to localize is a point object and that there are M other moving occluders, each modeled as a cylinder of diameter D. The position of each moving occluder is taken to be the center of its cylinder. We assume the positions and shapes of the static occluders in the room to be completely known in advance. Gaussian prior statistics with mean µj and covariance matrix Σj, j = 1, . . . , M, of the moving occluder positions are known. The object position x is also assumed to be distributed normally with mean µ and covariance matrix Σ. Note that this particular assumption is different from Chapter 2: there, the distribution of x is represented by the particles and is not necessarily Gaussian. While this assumption is not required for the selection algorithm, it does facilitate our analysis; one could easily adapt any general prior, including ones represented by particles, to our framework. The object's and moving occluders' positions are assumed to be mutually independent.

Our focus is on selecting the best subset of nodes of size k < N to perform the localization. The selection is assumed to be performed at a single time step.

3.2.1 Camera Measurement Model

As in Chapter 2, we assume that the camera nodes can distinguish between the object and the moving occluders and that simple background subtraction is performed locally at each camera node to detect the object. The center of the object blob in the scan line is sent to the cluster head for localization. A noisy weak perspective camera model is adopted in this chapter, instead of the perspective model [25].


The weak perspective camera model is a linear approximation to the perspective camera model. To get to it, let us consider Eq. 2.2 for the case where camera i sees the object:

z_i' = f_i \frac{h_i(x)}{d_i(x)} + v_i'.

The primes are used because, as will be explained shortly, the measurements assumed in this chapter are scaled versions of the above. In the weak perspective camera model, it is assumed that d_i(x) \gg h_i(x), and one can approximate d_i(x) with \bar{d}_i, which is defined as \bar{d}_i \triangleq d_i(\mu), where µ is the mean of the object's prior. Note that \bar{d}_i is known to the cluster head, as µ is known. Thus one can scale the measurements from camera i by \bar{d}_i / f_i without changing their information content. For brevity, we assume these scaled measurements in this chapter:

z_i = z_i' \frac{\bar{d}_i}{f_i} = h_i(x) + \frac{\bar{d}_i}{f_i} v_i' = a_i^T x + v_i, \quad i = 1, 2, \ldots, N,    (3.1)

where a_i^T = [\sin\theta_i \;\; -\cos\theta_i] and v_i is additive noise due to the read noise and the errors in the camera position and angle θi. Note that Eq. 3.1 ignores a constant term, namely the inner product of a_i with the position vector of camera i, because this known constant does not affect the selection metric. Under the weak perspective camera model, the variance of v_i becomes (see Appendix B)

\sigma_{v_i}^2 = \zeta_i \bar{d}_i^2 + \sigma_{pos}^2, \quad \text{where} \quad \zeta_i \triangleq \sigma_\theta^2 + \frac{\sigma_{read}^2}{f_i^2}.    (3.2)

Similar to Chapter 2, the quantities σpos , σθ and σread are the standard deviations of the


errors in camera position, orientation and readout, respectively. We assume that the noise terms from different cameras are mutually independent.

The occlusions are captured through the indicator function ηi defined in Eq. 2.1. If the camera sees the object, the measurement model in Eq. 3.1 is valid. If it cannot see the object, it assumes that the object is at its mean µ, which provides no new information to the localization. The modified camera measurement model including occlusions is then given by

z_i = \eta_i \left( a_i^T (x - \mu) + v_i \right) + a_i^T \mu, \quad i = 1, 2, \ldots, N.    (3.3)

Note that Eq. 3.3 reduces to Eq. 3.1 for ηi = 1.
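As a concrete illustration of Eqs. 3.1–3.3, the following sketch simulates one scaled weak-perspective measurement with and without occlusion; the function name and the numeric values are illustrative, not part of the system described here.

```python
import numpy as np

def weak_perspective_measurement(x, mu, cam_pos, theta, eta, zeta, sigma_pos, rng):
    """Scaled weak-perspective measurement of Eq. 3.3: if occluded (eta = 0) the
    camera contributes only the prior mean; otherwise a noisy projection of x."""
    a = np.array([np.sin(theta), -np.cos(theta)])
    d_bar = np.linalg.norm(np.asarray(mu) - np.asarray(cam_pos))   # distance to prior mean
    sigma_v = np.sqrt(zeta * d_bar ** 2 + sigma_pos ** 2)          # Eq. 3.2
    v = rng.normal(0.0, sigma_v)
    return eta * (a @ (np.asarray(x) - np.asarray(mu)) + v) + a @ np.asarray(mu)

rng = np.random.default_rng(0)
z_seen = weak_perspective_measurement(x=[12.0, 3.0], mu=[10.0, 0.0], cam_pos=[0.0, -100.0],
                                      theta=np.deg2rad(90), eta=1, zeta=0.01,
                                      sigma_pos=5.0, rng=rng)
z_occl = weak_perspective_measurement(x=[12.0, 3.0], mu=[10.0, 0.0], cam_pos=[0.0, -100.0],
                                      theta=np.deg2rad(90), eta=0, zeta=0.01,
                                      sigma_pos=5.0, rng=rng)     # equals a^T mu
```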

Because ηi is a random variable and not a constant, after multiplication with ηi the measurement z_i is no longer linear in x, and it is certainly not Gaussian. However, we still formulate the problem of camera node selection for target localization in the framework of linear estimation. Note that this framework is used only for selection. After the best subset of cameras is selected and the measurements are taken, the actual localization (or tracking) can be performed by a non-linear estimator (or tracker, e.g., the one described in Chapter 2).

Given the object and moving occluder priors, the positions and shapes of the static occluders, the camera positions, orientations and FOVs, and the camera noise parameters, we use the minimum mean square error (MSE) of the best linear estimate of the object position as a metric for selection. The best camera node subset is defined as the subset that minimizes this metric. To compute the MSE, define the vector

z \triangleq [z_1, z_2, \ldots, z_N]^T,    (3.4)

and define for cameras i, s ∈ {1, 2, . . . , N}

p_{is}(x) \triangleq P\{\eta_i = 1, \eta_s = 1 \mid x\}.    (3.5)

Then it can be shown (see Appendix D) that the MSE of the linear estimator assuming the


camera model in Eq. 3.3 is given by

MSE = \mathrm{Tr}\!\left( \Sigma - \Sigma_{zx}^T \Sigma_z^{-1} \Sigma_{zx} \right),
\Sigma_{zx}(i) = a_i^T\, \mathrm{E}_x\!\left[ p_{ii}(x)\, \tilde{x}\tilde{x}^T \right],
\Sigma_z(i,s) = a_i^T\, \mathrm{E}_x\!\left[ p_{is}(x)\, \tilde{x}\tilde{x}^T \right] a_s - a_i^T \left[ \mathrm{E}_x\!\left( p_{ii}(x)\,\tilde{x} \right) \right] \left[ \mathrm{E}_x\!\left( p_{ss}(x)\,\tilde{x} \right) \right]^T a_s + \begin{cases} \mathrm{E}_x\!\left( p_{ii}(x) \right) \sigma_{v_i}^2, & i = s, \\ 0, & i \neq s, \end{cases}    (3.6)

where \Sigma_{zx} \triangleq \mathrm{E}\!\left( (z - \mathrm{E}(z))(x - \mu)^T \right), \Sigma_z \triangleq \mathrm{E}\!\left( (z - \mathrm{E}(z))(z - \mathrm{E}(z))^T \right) and \tilde{x} \triangleq x - \mu. The MSE for a subset S ⊂ {1, 2, . . . , N}, MSE(S), is defined as in Eq. 3.6 but with only the camera nodes in S included.

3.2.2 Computing MSE(S)

Similar to the methodology in Chapter 2, we first ignore the static occluders and limited FOV. Under this assumption, by comparing Eq. 3.5 and Eq. 2.6, we conclude that

p^{mv}_{is}(x) = p^{mv}_{\{i,s\}}(x) = \prod_{j=1}^{M} \left( 1 - q^{mv}_{i,j}(x) - q^{mv}_{s,j}(x) \right).    (3.7)

The superscript "mv" is also added to the left hand side to signify that only moving occluders are taken into account. The effects of static occluders and limited FOV can be readily included as follows. Let I_i(x) be the indicator function of the points visible to camera i considering only the static occluders and the limited FOV (see Fig. 3.1). Then it is easy to show that p_{ii}(x) = I_i(x) p^{mv}_{ii}(x), and similarly

p_{is}(x) = I_i(x) I_s(x) p^{mv}_{is}(x).    (3.8)

To compute the expectations in Eq. 3.6, we fit a grid of points over the 3-σ ellipse of


Figure 3.1: Illustration of the indicator function of visible points to camera i.

the Gaussian pdf of the object. The p_{is}(x) values are computed over these points as explained above, and we then perform a 2-D numerical integration over the grid. However, we are not restricted to Gaussian priors. For example, if the selection algorithm is used together with the tracking algorithm in Chapter 2, the object prior is represented by particles. In this case one can compute the p_{is}(x) values at the particle positions and take the expectations over the particle-weight tuples. For example,

\tilde{x}_\ell \triangleq x_\ell - \sum_{\ell'=1}^{L} w_{\ell'} x_{\ell'}, \qquad \mathrm{E}\!\left( p_{is}(x)\, \tilde{x}\tilde{x}^T \right) = \sum_{\ell=1}^{L} w_\ell\, p_{is}(x_\ell)\, \tilde{x}_\ell \tilde{x}_\ell^T.
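As an illustration of how these expectations feed into Eq. 3.6, the sketch below evaluates MSE(S) over a particle set. The (N, N, L) array `p` of pairwise visibility probabilities p_{is}(x_ℓ) from Eqs. 3.7–3.8 is assumed to be precomputed, and all names are illustrative.

```python
import numpy as np

def mse_of_subset(S, particles, weights, Sigma, a, p, sigma_v2):
    """Evaluate MSE(S) of Eq. 3.6 over a particle-weight representation of the object prior.
    particles: (L,2); weights: (L,), summing to 1; a: (N,2) with rows a_i^T;
    p: (N,N,L) values p_is(x_l) from Eqs. 3.7-3.8; sigma_v2: (N,) variances from Eq. 3.2."""
    S = list(S)
    mu = weights @ particles                                      # prior mean from particles
    xt = particles - mu                                           # x-tilde at each particle
    Ext2 = np.einsum('l,abl,li,lj->abij', weights, p, xt, xt)     # E[p_is xt xt^T]
    Ext1 = np.einsum('l,aal,li->ai', weights, p, xt)              # E[p_ii xt]
    Ep = np.einsum('l,aal->a', weights, p)                        # E[p_ii]
    Szx = np.array([a[i] @ Ext2[i, i] for i in S])                # rows Sigma_zx(i), (k,2)
    Sz = np.zeros((len(S), len(S)))
    for r, i in enumerate(S):
        for c, s in enumerate(S):
            Sz[r, c] = a[i] @ Ext2[i, s] @ a[s] - (a[i] @ Ext1[i]) * (a[s] @ Ext1[s])
            if i == s:
                Sz[r, c] += Ep[i] * sigma_v2[i]
    return float(np.trace(Sigma - Szx.T @ np.linalg.solve(Sz, Szx)))
```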

3.3 Selection

The selection problem is formalized as follows:

minimize MSE(S) subject to |S| = k.

A brute-force search for the optimal solution to this problem requires O(N^k) trials, which can be too costly in a wireless camera network setting. Instead, we use the greedy selection algorithm in Fig. 3.2. In general, greedy optimization is not guaranteed to achieve


the global optimum. Fortunately, as we shall demonstrate in the following subsections, it yields close to optimal results. The computational complexity of the greedy algorithm is O(k^2 MNL + k^4 N), where k is the subset size, M is the number of moving occluders, N is the number of cameras, and L is the number of grid points used to evaluate the expectations in Eq. 3.6.
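The greedy loop of Fig. 3.2 can be summarized by the following Python sketch; `mse_of_subset` is a placeholder for a routine that evaluates Eq. 3.6 (for instance, a partial application of the particle-based sketch in Section 3.2.2) and is not part of the original algorithm listing.

```python
def greedy_select(k, N, mse_of_subset):
    """Greedy camera node selection (the loop of Fig. 3.2): grow S one camera at a
    time, each time adding the camera whose inclusion gives the smallest MSE(S)."""
    S = set()
    for _ in range(k):
        best_i, best_mse = None, float('inf')
        for i in range(N):
            if i in S:
                continue
            e = mse_of_subset(S | {i})
            if e < best_mse:
                best_i, best_mse = i, e
        S.add(best_i)
    return S

# Example usage with the particle-based metric sketched above (names illustrative):
# S = greedy_select(4, 30, lambda S: mse_of_subset(S, particles, weights, Sigma, a, p, sigma_v2))
```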

3.3.1 Simulation Results

We performed Monte Carlo simulations to compare the performance of the greedy approach to the optimal brute-force enumeration as well as to the following heuristics:

• Uniform: pick uniformly placed cameras.

• Closest: pick the cameras closest to the object mean.

Fig. 3.3 compares the RMS localization error of the four selection procedures for a typical simulation run with k = 3 to 9 camera nodes out of 30 cameras uniformly placed around a circular room of radius 100 units, with 3 static occluders and M = 10 moving occluders. We first randomly chose the object and moving occluder prior parameters as follows. We chose the means randomly and independently. We chose the camera FOVs to be 60°, D = 2, σpos = 5, ζi = 0.01, σo = 10, αo = 8, σj = 10, and αj > 1 is chosen randomly. Here, σo² and σo²/αo (αo ≥ 1) are the eigenvalues of the covariance matrix Σ of the object's prior,

and σj² and σj²/αj are the eigenvalues of the covariance matrix Σj of the prior of occluder j. We also applied random rotations to all priors. We performed the selection using the four aforementioned procedures. We then placed the object and the moving occluders at random according to the selected priors 5000 times and localized the object with the selected camera nodes. This procedure was repeated 20 times. As seen in Fig. 3.3, the error for the greedy approach completely overlaps with that of brute-force enumeration and outperforms the other heuristics.

Note that, even if a selection algorithm makes bad decisions, e.g., selects cameras that are all occluded, the worst the cluster head will do is predict that the object position is at its mean. Because of this, the difference in performance between the above procedures is


Algorithm: Greedy camera node selection
Inputs: Object's prior (µ, Σ); dynamic occluders' priors {µj, Σj}, j = 1, . . . , M; shapes and positions of static occluders; camera orientations {θi}, i = 1, . . . , N, and positions; FOVs of the cameras; number of camera nodes to select (k).
Output: Best subset (S).

  Choose a grid of points x_ℓ centered at µ covering the 3-σ ellipse of Σ, ℓ ∈ {1, . . . , L}
  S := ∅
  for counter = 1 . . . k
      MSE := ∞
      for i = 1 . . . N
          if i ∉ S
              S := S ∪ {i}
              Compute p_is(x_ℓ), ∀s ∈ S (Eq. 3.8)
              e := MSE(S) (Eq. 3.6)
              if e < MSE
                  MSE := e, sel := i
              end if
              S := S \ {i}
          end if
      end for
      S := S ∪ {sel}
  end for

Figure 3.2: The greedy camera node selection algorithm.


Figure 3.3: Simulation results: localization performance for different selection heuristics. Here, M = 10 and there are 3 static occluders.


Figure 3.4: An example camera selection for k = 3. The prior with multiple contours is for the object. The priors with single contour are for the moving occluders. The black rectangles are static occluders. Closest heuristic selects cameras marked with squares. Greedy and brute-force select the cameras marked with triangles.


not too large. However, in an application such as tracking, these errors could build up over time and may in fact result in completely losing the object. Fig. 3.4 depicts an example selection for k = 3 for the setup described above. Note that even though the selection using the closest heuristic seems to be quite natural because it avoids occlusions with high probability, the greedy method, which selects the same nodes as the brute force in this case, better localizes the object along the major axis of its prior, where the uncertainty about its position is highest.

3.3.2 Experimental Results

We tested our selection algorithm in the experimental setup described in Section 2.5. After local processing, the selected nodes send their scan lines to the cluster head, where localization is performed. The object was randomly placed 100 times according to the prior shown in Fig. 3.5(a). The object to localize is the tip of the tripod shown in Fig. 3.5(b). We added 2 static and 10 moving "virtual" occluders to the experimental data. We randomly selected priors for the moving occluders as before. For each placement of the object, we randomly placed the moving occluders according to these priors. After the camera nodes were selected and queried for measurements, we discarded the measurements from the cameras that would have been occluded had there been real occluders. The other parameters are σo = 12 inches, αo = 8, D = 12 inches, ζi = 0.01 rad² and σpos = 1 inch.

The selection procedures were applied with k = 2 to 7 camera nodes, and the object was localized with the selected nodes for 100 placements using linear estimation. This procedure was repeated 100 times using different random priors for the moving occluders. The virtual static occluders have fixed locations throughout. Fig. 3.6 compares the RMS localization error of the four selection procedures averaged over 100 × 100 = 10,000

runs. As can be seen from the figure, the greedy approach again outperforms the other two heuristics and performs very close to brute-force enumeration. In fact, the greedy method performs slightly better than brute-force for k = 3 and 4 cameras, because the camera model in Eq. 3.3 is an approximation to the full perspective camera model. The experiments confirm that our selection algorithm is useful with real cameras and highly non-linear measurements.


Figure 3.5: Experimental setup. (a) The prior with multiple contours is for the object. The priors with single contour show an example run for the moving occluders. The black rectangles are static occluders. Cones show FOVs of cameras. (b) The object to be localized.


Figure 3.6: Experimental results. Here, M = 10 and there are 2 static occluders.


3.4 Summary

We developed a camera network node selection methodology for target localization in the presence of static and moving occluders. The minimum MSE of the best linear estimate of object position based on camera measurements is used as the metric for selection. It is shown through simulations and experiments that a greedy selection heuristic performs close to optimal.

Chapter 4 Camera Placement

Chapters 2 and 3 described algorithms for tracking an object and selecting the best subset of camera nodes in a wireless sensor network framework. The first step before implementing these tasks is setting up the camera network, which raises the question of where to place the cameras. Is the intuitive uniform placement optimal, or is there a better placement? The problem of placing the cameras optimally for the task of localizing a point object is explored in this chapter.¹ We use the same localization minimum MSE metric that we used for selection and optimize this metric with respect to the camera positions.

The rest of the chapter is organized as follows. A brief survey of previous work on camera and sensor placement is presented in the next section. In Section 4.2, we introduce the setup and camera model, define the selection metric and present an analytical formula for it. In Section 4.3, we explain how to optimize this metric for the best camera placement. In Section 4.4, we present a discussion of how to avoid static occluders during camera placement.

4.1 Previous Work

Camera placement has been studied in computer vision and graphics. In photogrammetry [71–73], the goal is to place the cameras so as to minimize the 3D measurement error. The error propagation is analyzed to derive an error metric that is used to rank camera

¹The work in this chapter was first published in [70].


placements. The best camera placement is then found numerically. The computational complexity of this approach only allows solutions involving a few cameras. In our approach we simplify the camera model, derive the localization error analytically as a function of the camera orientations, and minimize the error to find the best placement. This is computationally less intensive than the numerical methods above.

Zhang [74] investigates the problem of how to position (general) sensors with 2-D measurement noise to minimize the overall error. The paper also presents an algorithm to compute the optimal sensor placement. Our measurements are 1-D after local processing, and we pose the placement problem as a special case of the classical inverse kinematics problem.

4.2 Setup, Model and Assumptions

We use a similar setup to that of Chapters 2 and 3. We would like to place N cameras roughly horizontally around a room. The goal is to find the placement that minimizes the mean square localization error of a point object. We first ignore the static occluders and limited FOV; a discussion of these issues is given in Section 4.4. Also, the placement is usually performed once at the setup of the network, and thereafter the cameras remain fixed. Not only are there no moving occluders present at this point, but the placement should also be valid for all possible moving occluder priors during the lifetime of the network. Therefore, we simply assume there are no moving occluders.

For similar reasons, it is also natural to assume that the object prior is centered in the room. However, the prior does not have to be circularly symmetric, as people might tend to walk along certain directions more than others. For example, in a hallway, the major axis of the prior distribution would be aligned with the hallway. We assume the object prior is a Gaussian with covariance matrix Σ = σo² diag(1, 1/αo) with αo ≥ 1. If the prior is not of this form, the coordinate system

can be rotated to achieve it.

Cameras are usually placed on the walls of the room, so the distance of a camera to the object prior mean cannot vary by much and can be approximated by a constant. Therefore, for the placement problem specifically, we are only trying to find the best orientations of the cameras, and hence we assume a circular room for simplicity. We assume the cameras are


Figure 4.1: The setup used for placement problem.

fixed to the periphery and oriented towards the center. With these assumptions, the orientations of the cameras (the θi's) uniquely determine their placement, and the setup used for placement is as illustrated in Fig. 4.1. We assume the same camera model as in Chapter 3. As the static occluders and limited FOV are ignored and there are no moving occluders, all cameras "see" the object:

z_i = a_i^T x + v_i, \quad i = 1, 2, \ldots, N,    (4.1)

where a_i^T = [\sin\theta_i \;\; -\cos\theta_i] and v_i is additive noise due to the read noise and the errors in the camera position and angle θi. The variance of v_i is given in Eq. 3.2. We assume that the noise terms from different cameras are independent of each other. Although Chapter 3 starts with a linear weak perspective measurement model, after multiplication with the occlusion indicator function the model becomes non-linear. Here, because we do not have any such indicator functions, the model in Eq. 4.1 is linear. It can be shown that for this linear model, the MSE is given by (see Appendix D.1 for the derivation


of this formula)

MSE \triangleq \frac{ 4\left( \dfrac{\alpha_o+1}{\sigma_o^2} + \sum_{i=1}^{N} \dfrac{1}{\sigma_{v_i}^2} \right) }{ \left( \dfrac{\alpha_o+1}{\sigma_o^2} + \sum_{i=1}^{N} \dfrac{1}{\sigma_{v_i}^2} \right)^{2} - \left( \dfrac{\alpha_o-1}{\sigma_o^2} + \sum_{i=1}^{N} \dfrac{\cos 2\theta_i}{\sigma_{v_i}^2} \right)^{2} - \left( \sum_{i=1}^{N} \dfrac{\sin 2\theta_i}{\sigma_{v_i}^2} \right)^{2} }.    (4.2)
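As a numeric cross-check, the same MSE can be evaluated directly from the linear model of Eq. 4.1 as the trace of the inverse information matrix, which is the quantity that Eq. 4.2 expresses in closed form (see Appendix D.1). The sketch below does this; the parameter values are illustrative.

```python
import numpy as np

def placement_mse(thetas, sigma_v2, sigma_o2, alpha_o):
    """Localization MSE for the linear model of Eq. 4.1 with prior covariance
    Sigma = sigma_o^2 * diag(1, 1/alpha_o): Tr((Sigma^-1 + sum_i a_i a_i^T / sigma_vi^2)^-1)."""
    info = np.diag([1.0 / sigma_o2, alpha_o / sigma_o2])          # prior information
    for th, s2 in zip(thetas, sigma_v2):
        a = np.array([np.sin(th), -np.cos(th)])
        info += np.outer(a, a) / s2                                # camera i's contribution
    return float(np.trace(np.linalg.inv(info)))

# Symmetric case: 4 identical cameras, circular prior -> uniform placement is best.
uniform = placement_mse(np.arange(4) * np.pi / 4, [5.0] * 4, sigma_o2=16.0, alpha_o=1.0)
bunched = placement_mse([0.1, 0.15, 0.2, 0.25],   [5.0] * 4, sigma_o2=16.0, alpha_o=1.0)
assert uniform < bunched
```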

4.3 Optimal Camera Placement

Given the assumptions above, the camera noise variances are fixed, and the numerator and the first term of the denominator of Eq. 4.2 do not change with respect to the placement of the cameras. Only the last two terms of the denominator depend on the orientations θi. Thus minimizing Eq. 4.2 is the same as minimizing

\left( \frac{\alpha_o-1}{\sigma_o^2} + \sum_{i=1}^{N} \frac{\cos 2\theta_i}{\sigma_{v_i}^2} \right)^{2} + \left( \sum_{i=1}^{N} \frac{\sin 2\theta_i}{\sigma_{v_i}^2} \right)^{2}.    (4.3)

It is clear that Eq. 4.3 is bounded below by 0. The following subsections show when this bound can be achieved by the optimal camera orientations.

4.3.1 Symmetric Case

When the cameras have the same error variance, σ_{v_i} = σ_v for i = 1, 2, . . . , N, and the object prior is circularly symmetric (αo = 1), the problem of minimizing Eq. 4.3 reduces to minimizing

\left( \sum_{i=1}^{N} \cos 2\theta_i \right)^{2} + \left( \sum_{i=1}^{N} \sin 2\theta_i \right)^{2}.    (4.4)

This is equivalent to the norm-squared of the sum of N unit vectors with angles 2θi. Thus Eq. 4.4 is equal to zero when the θi's are chosen uniformly between 0 and π. This leads to the intuitive conclusion that when the object prior is circularly symmetric and the cameras have the same amount of noise, uniform placement of cameras is optimal. An


Figure 4.2: Example optimal camera placements for the symmetric case. (a) Uniform placement of 6 cameras that minimizes Eq. 4.4. (b) For 2 cameras, orthogonal placement is optimal.

illustration of this result for 6 cameras is depicted in Fig. 4.2(a). The angles of the vectors are twice the orientation angles of the cameras. For two cameras, an orthogonal placement of the cameras is optimal, so that the unit vectors are 180 degrees apart (see Fig. 4.2(b)). Uniform placement, however, is not the only optimal way to place the cameras. If we partition the vectors into subgroups and all subgroups of vectors sum to zero, then the combination of all the vectors also sums to zero. This means that we can cluster the cameras into (local) groups (with at least 2 cameras in each group) and solve the problem distributedly in each cluster. If each group finds a locally optimal solution, then the combined solution is globally optimal (see Fig. 4.3). This is true no matter what the relative orientations between the groups of cameras are.
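These claims are easy to verify numerically. The sketch below evaluates Eq. 4.4 as the squared norm of the sum of unit vectors at angles 2θi, for a uniform placement and for two independently balanced clusters; the particular cluster offsets are arbitrary and only illustrate that the relative orientation between clusters does not matter.

```python
import numpy as np

def eq_4_4(thetas):
    """Squared norm of the sum of unit vectors at angles 2*theta_i (Eq. 4.4)."""
    v = np.exp(2j * np.asarray(thetas))
    return abs(v.sum()) ** 2

N = 6
uniform = np.arange(N) * np.pi / N              # theta_i uniformly spaced in [0, pi)
print(eq_4_4(uniform))                          # ~0: uniform placement is optimal

# Two locally balanced clusters (a pair pi/2 apart, a triple pi/3 apart), each with an
# arbitrary offset, are also jointly optimal: every cluster's vector sum is already zero.
clusters = np.concatenate([np.array([0, np.pi / 2]) + 0.3,
                           np.array([0, np.pi / 3, 2 * np.pi / 3]) + 1.1])
print(eq_4_4(clusters))                         # also ~0, regardless of the offsets
```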

4.3.2 General Case

We now discuss the general placement problem, i.e., when αo ≠ 1 and the σ_{v_i}'s are not all equal.² The problem corresponds to minimizing Eq. 4.3. Again this is the sum of N vectors, but the vectors can have different lengths 1/σ_{v_i}². The MSE is minimized when the

²This can happen when cameras have different focal lengths, σpos, σread or σθ.


Figure 4.3: Distributed placement using clusters. Locally optimal clusters are globally optimal. Relative orientations of clusters do not matter.


Figure 4.4: An optimal solution to the general placement problem.

sum equals −(αo − 1)/σo² (offset from zero) on the abscissa. Again, the resulting angles of the

vectors are twice the optimal θi of the cameras (see Fig. 4.4).

This problem can be thought of as an inverse kinematics robotics problem. Our vectors describe a planar revolute robot arm with N linkages. The base of the robot arm is at the origin and it is trying to reach the point −(αo − 1)/σo² on the abscissa with its end effector. If the σ_{v_i}'s are ordered such that σ_{v_N} ≥ σ_{v_{N-1}} ≥ . . . ≥ σ_{v_1}, then any point in an annulus with radii

r_{out} = \sum_{i} \frac{1}{\sigma_{v_i}^2}, \qquad r_{in} = \max\!\left( 0,\; \frac{1}{\sigma_{v_1}^2} - \sum_{i \neq 1} \frac{1}{\sigma_{v_i}^2} \right)


Figure 4.5: Inverse kinematics can solve for the best θi: the case when the point to reach is (a) inside the annulus, and (b) outside the annulus. Note that r_in is 0 for this example.

is achievable. If the point the robot is trying to reach is inside the annulus, we use gradient descent algorithms to find an optimum solution that minimizes Eq. 4.3 by setting it to zero [75]. If the point is outside the annulus, we minimize the distance to the point the robot arm is trying to reach by lining up all the vectors along the abscissa such that the tip of the arm touches the outer or inner radius of the annulus. This configuration does not zero out Eq. 4.3 but gives the minimum achievable error. Figure 4.5 illustrates these two cases. In Fig. 4.5(b), note that all the vectors point in the same direction. The best placement for this scenario is putting all cameras orthogonal to the object prior's major axis (such that twice the angles are 180°). This seems counterintuitive, since we expect an orthogonal placement to be better for triangulation. However, in this case the prior uncertainty along the minor axis is small enough ((αo − 1)/σo² > Σ_i 1/σ_{v_i}²) that the optimal solution is to

place all cameras to minimize the uncertainty along the major axis.
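A sketch of this reachability test and of a simple numerical minimization of Eq. 4.3 is given below. It uses generic gradient-based optimization with random restarts rather than the inverse kinematics routine of [75], does not treat the r_in > 0 case separately, and all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def eq_4_3(thetas, w, offset):
    """Eq. 4.3 with weights w_i = 1/sigma_vi^2 and offset = (alpha_o - 1)/sigma_o^2."""
    c = np.sum(w * np.cos(2 * thetas))
    s = np.sum(w * np.sin(2 * thetas))
    return (offset + c) ** 2 + s ** 2

def place_cameras(sigma_v2, sigma_o2, alpha_o, seed=0):
    """Pick orientations theta_i in [0, pi) minimizing Eq. 4.3. Vector i has length
    1/sigma_vi^2; the target point on the abscissa is -(alpha_o - 1)/sigma_o^2."""
    w = 1.0 / np.asarray(sigma_v2, dtype=float)
    target = (alpha_o - 1) / sigma_o2
    r_out = w.sum()
    if target > r_out:                      # unreachable: line all vectors up (2*theta = pi)
        return np.full(len(w), np.pi / 2)
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(20):                     # random restarts; Eq. 4.3 is non-convex in theta
        res = minimize(eq_4_3, rng.uniform(0, np.pi, size=len(w)), args=(w, target))
        if best is None or res.fun < best.fun:
            best = res
    return np.mod(best.x, np.pi)

# The example of Fig. 4.6(a): N = 4, alpha_o = 5, sigma_o = 4, sigma_vi^2 = 5, 10, 15, 20.
thetas = place_cameras(sigma_v2=[5, 10, 15, 20], sigma_o2=16.0, alpha_o=5.0)
```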

An example placement for N = 4, αo = 5, σo = 4 and σ_{v_i}² = 5, 10, 15, and 20 is given

in Fig. 4.6(a). The room, object prior and resulting camera placements are shown. Note that the three higher noise cameras are placed close to each other, while the first camera is placed separately. The interpretation here is that the similar views from these bad cameras


Figure 4.6: Two optimal placements for the given object prior. (a) One good camera and 3 worse ones. (b) Two good cameras.

are averaged by the linear estimator to provide one good measurement. This is verified by the example in Fig. 4.6(b). Here, an optimal placement for two high quality cameras (σ_{v_i}² = 5 for both) is shown. The first camera is placed roughly at the same position as before, while the second camera is placed in the middle of the three bad camera positions.

Note that the cameras are placed in [0, π), as the corresponding vectors have angles in [0, 2π) and they are twice the angles of the cameras. However, one can flip any camera to the opposite side of the room without changing its measurement.

Note that for the general case, clustering the cameras into multiple groups can still achieve global optimality while solving the placement problem for each cluster, as long as the clusters zero out their share of the offset. Suppose the N cameras are clustered into c groups. Then one algorithm might ask each cluster's "arm" to reach −(αo − 1)/(c σo²). If this can be achieved by all the clusters, the solution is globally optimal.

PSfrag replacements

57

PSfrag replacements Flip second cam here (a)

Not allowed angles (b)

Figure 4.7: Dealing with static occluders.(a) Static occluders can be avoided if there is no other static occluder on the other side of the room. (b) We cannot place the cameras at some angles if both sides are occluded.

4.4 Adding Static Occluders We assume that the cameras are placed to the periphery of a circular room and the object prior is centered. Under this configuration, the FOV of the cameras are made sure to cover the object prior maximally by orienting them toward the center of the room and this is our only consideration about the limited FOV. On the other hand, as noted earlier, usually there exists multiple optimal placements that minimize the MSE. Under the assumptions we made, all such placements yield the same MSE. In this section, we argue that this property can be exploited to avoid static occluders. For example flipping the camera to the other side of the room does not affect the MSE. So, if a static occluder avoids a camera to cover a significant portion of the prior, one can simply flip that camera to the other side of the room provided that there is no such occluder at the flip side (see Fig. 4.7). If it is the case that for some region of angles there are static occluders at both sides of the room, then we cannot place the cameras at those angles (Fig. 4.7(b)). This places regions of angles that are not allowed (and regions that are allowed) for our inverse kinematics solution of Sect. 4.3. For the case that the number of allowed regions is one, a

CHAPTER 4. CAMERA PLACEMENT

58

solution is given in [76]. Note that the situation illustrated in Fig. 4.7(b) is an example of one region of allowed angles (not two), as two flip sides of the room are equivalent. However, for the case of more than one such allowed regions, there is no general inverse kinematics solution. For this case, we can restrict each joint angle to be in a specific region and try all possible combinations until a feasible solution is found. Although the complexity of this search is exponential in the number of allowed regions, in practice we do not expect the number of static occluders to be so high as to make the computation infeasible.

4.5 Summary This chapter visits the camera placement problem in a simple setting. We use the same metric that is used for selection. An analytical formula for this metric is presented under this simple setting. This metric is first optimized for best camera orientations, which is followed by a discussion on how to avoid static occluders.

Chapter 5 Conclusion 5.1 Summary This dissertation presents an object tracking algorithm for camera networks in a wireless sensor network framework, where the network consists of many low cost nodes combining sensing, processing and communication. The nodes are wireless and easy to deploy, and the system is scalable to many nodes and robust to failures. Although required hardware to realize such systems is readily available, many challenges still remain unsolved. The main challenge in building such networks is the high data rate of video cameras. Sending all of the acquired data, even after standard compression is very costly in transmission energy. Performing sophisticated vision algorithms locally in order to substantially reduce the data is also prohibitive for resource constrained nodes. Another big challenge with camera networks is the discontinuity of visibility due to occlusions. To overcome these challenges, we adopt a collaborative task-driven approach where the camera nodes are assumed to be grouped into clusters and each cluster is assumed to have a more powerful cluster head. The camera nodes process the images locally with simple operations, and extract the minimum amount of data that is essential for the task. This refined data is then sent to the cluster head, where the task is performed. With many camera nodes collaborating, the task can be accomplished with lightweight local processing and sending a very limited amount of data to the cluster head. An object tracking algorithm is presented under the collaborative task-driven approach. 59

CHAPTER 5. CONCLUSION

60

Instead of tracking all objects in the environment, we proposed to track only the target object and treat others as occluders. We developed a systematic way to incorporate the static and moving occluder information into the tracker formulation. Using simulations, we explored several tradeoffs and concluded that obtaining moving occluder prior information may not be worthwhile in practice, unless it can be obtained cheaply and to a reasonable accuracy. We also showed that there is a tradeoff between the number of cameras used and the accuracy of the moving occluder priors, which should be taken into account during system design. Finally, we observed that the required occluder prior accuracy also depends on the number of moving occluders. In order to further alleviate the limited bandwidth and energy problems of camera networks, we proposed a camera node selection algorithm. Only the selected nodes are active at any given time, which reduces the per node average power consumption and bandwidth usage considerably. Moreover, measurements from different cameras might be highly correlated. Therefore a clever selection of a subset of nodes results in little performance degradation relative to using all the cameras. We used the minimum MSE of the best linear estimate of object position based on camera measurements as a metric for selection. We proposed a greedy selection heuristic to optimize this metric with respect to the selected subset and showed that it performs close to brute-force optimization. Finally, we looked at the camera placement problem in a simple setting. We used the same metric that is used for selection. An analytical formula for this metric is presented under a simple setting. This metric is first optimized for best camera orientations, which is followed by a discussion on how to avoid static occluders.

5.2 Suggestions for Future Work 5.2.1 Theoretical Framework for the Analysis of the Tradeoffs In Chapter 2, we observed several tradeoffs involving the tracking accuracy, moving occluder prior accuracy, number of cameras used and number of moving occluders via tracking simulations. A theoretical framework would be useful in examining the underlying

CHAPTER 5. CONCLUSION

61

mechanism behind these tradeoffs. For example, a metric such as the mean square localization error can be used. This metric is simpler to compute compared to RMS tracking accuracy, which allows faster computation of the tradeoff curves. Moreover, if such a metric can be written as a function of the aforementioned parameters, it can be optimized for system design. The insight obtained by the theoretical framework might also be useful for improving the tracking algorithm itself. For example, an initial analysis using the Fisher Information Matrix reveals that the curvature of the log-likelihood of the occlusion indicator functions evaluated at the object state is proportional to the amount of information that the occluder priors yield. Before computing the likelihood of the measurements given a particle state (Section 2.3.2), there could be cases that the curvature of the log-likelihood is predicted to be low (e.g., using history of previous occluder priors), which means the information obtained from the occluder priors is low. Using this observation, one might choose not to use the moving occluder priors under such circumstances, which yields savings in likelihood computation and other system resources used for obtaining these priors.

5.2.2 Use of Visual Hull for Occluder Priors In Section 2.3.4, we have proposed several possible methods to obtain the occluder priors that are used in the tracking and selection algorithms. Among them, the method using the visual hull (VH), which is obtained via the scan lines from the cameras, is promising. Not only computing the VH is relatively light weight, but also it does not require deployment of additional sensors or solving the data association problem. Another attractive property of this approach is that it increases the communication cost only marginally, as the entire scan lines are needed to be sent (which are still a few bytes in size, after run-length coding) to the cluster head instead of the center of the blob corresponding to the “object”. The cluster head computes the VH by back-projecting the blobs in the scan lines to cones in the room (see Fig. 2.7). The cones from the multiple cameras are intersected to compute the total VH. Since the resulting polygons are larger than the occupied areas and “phantom” polygons that do not contain any objects may be present, VH provides an upper bound on the occupied regions in the room. These polygons can be used to compute moving

CHAPTER 5. CONCLUSION

62

occluder priors. For example by using the convex optimization techniques presented for “extremal volume ellipsoids” in Chapter 8.4 of [77] or even simple least square methods, one could fit ellipses to these polygons and use them as Gaussian priors. Another interesting idea is to use the polygons in the VH as uniform priors. In this case, mv qi,j (x)

simply becomes mv qi,j (x) =

Area(Ai (x) ∧ j th polygon) , Area(j th polygon)

where A ∧ B denotes the intersection of the shapes A and B. Note that this equation is exact, not an approximation as in Eq. 2.5. Also, using uniform priors gets rid of the

approximation at step (d) of Eq. 2.6. The probability of the union event before that step can directly be computed by intersecting ∨i∈S Ai (x) with the j th polygon.

Although this approach seems very attractive considering the properties mentioned

above, important problems still need to be resolved. Existence of “phantom” polygons is one such problem. These are polygons that contain no objects and they exist because the VH is computed by simply intersecting the cones from scan lines and no data association information is used. While the good polygons that do contain objects give useful information for tracking, the “phantom” polygons yield false information, which results in assigning wrong weights to the particles. Usually when there are many occluders present, the number of phantoms might be so large that the false information might overwhelm the true information from non-phantom polygons. In [59], several methods are proposed to remove the phantom polygons. The most straight forward ones are removing polygons that are too small to contain any objects and ones that pop out of nowhere, i.e., not consistant (in time) with a moving object. Also, color information can be used to resolve some ambiguities to prune phantom polygons. Probably a bigger problem is the tracked object’s own polygon. Because the tracker treats this polygon as an occluder prior and there are many good particles in this polygon as the object is also in it, such particles get assigned much lower weight than they should. Simply removing the polygon that contains the object (or the estimate of its position) also does not work, because there might actually be moving occluders close to the object, inside the same polygon. If these non-idealities can be handled elegantly, we think VH could be

CHAPTER 5. CONCLUSION

63

a very promising cheap way to obtain the occluder priors.

5.2.3 Effect of Camera Resolution

Another idea to overcome the high data rate problem of camera sensors while maintaining their advantages is to use multiple cameras with different resolutions. Low resolution cameras could be used for event detection, while high resolution cameras are activated only when interesting events happen [78]. This brings up the question: how does the resolution of the cameras affect the performance of the tracker? The answer lies in the camera model, specifically in the variance of the additive noise to the camera measurements (Eq. 2.3). This variance is given in pixels; for a fair comparison, let us first convert it to a variance in direction of arrival (DOA) angle (in radians):

$$\sigma^2_{DOA} = \frac{\sigma^2_{v_i}}{f_i^2} = \left(1 + \frac{h_i^2(x)}{d_i^2(x)}\right)^2 \sigma_\theta^2 + \frac{h_i^2(x) + d_i^2(x)}{d_i^4(x)}\,\sigma_{pos}^2 + \frac{\sigma_{read}^2}{f_i^2}.$$

The first two terms do not change with resolution, and we assume $\sigma_{read}^2$ is also the same for different resolutions. On the other hand, for a given FOV, the focal length of a camera (in pixels) is proportional to the resolution. This means the last term in the noise variance is larger for a lower resolution camera. The effect of using lower resolution cameras could be explored along these lines.
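As a hedged illustration of this effect, the sketch below evaluates the DOA noise standard deviation for a few focal lengths, standing in for different resolutions at a fixed FOV. All numeric values are assumptions chosen only for the example, not parameters from the experiments in this dissertation.

```python
import numpy as np

# Illustrative numbers only: geometry of the tracked point relative to camera i
h_i, d_i = 2.0, 10.0          # room units (assumed)
sigma_theta = 1e-3            # rad, camera orientation error (assumed)
sigma_pos = 0.05              # room units, camera position error (assumed)
sigma_read = 0.5              # pixels, readout/quantization error (assumed)

def doa_noise_std(f_pixels: float) -> float:
    """DOA noise standard deviation (radians) from the variance formula above."""
    var = ((1 + (h_i / d_i) ** 2) ** 2 * sigma_theta ** 2
           + (h_i ** 2 + d_i ** 2) / d_i ** 4 * sigma_pos ** 2
           + (sigma_read / f_pixels) ** 2)
    return np.sqrt(var)

# Focal length in pixels scales with horizontal resolution for a fixed FOV.
for res, f in [(160, 180.0), (320, 360.0), (640, 720.0)]:
    print(res, doa_noise_std(f))
```

Only the last term shrinks as the resolution (and hence the focal length in pixels) grows, which is the effect discussed above.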

5.2.4 Combining Tracking and Selection

We mentioned the similarity of the setups used for tracking and selection. The only difference is the use of a Gaussian prior for the object position in selection, while particles are used for tracking. However, the selection algorithm can easily be adapted to particle priors by taking the expectations in Eq. 3.6 over the particle-weight tuples. This allows selection and tracking to be performed simultaneously. Consider the following scenario. Assume we want to track a suspect in an airport and the cameras are grouped into clusters, each with a cluster head. First, one cluster head is selected. This can be done following the ideas in [61], or simply by selecting the cluster head closest to the current position estimate of the object. This cluster head is now in charge of tracking. Every T time steps, it selects the best k camera nodes within the cluster. Only these k nodes are active during this period; the rest sleep. After T time steps, the cluster head selects another k cameras. If at any time the object position is estimated to be closer to another cluster head, the tracking job is handed off to it. Under this scenario, the object can be tracked successfully using only a small fraction of the total number of cameras. Of course, determining parameters such as the cluster size, T, and k is still an open research problem, and other modifications such as dynamically changing these parameters could also be interesting.

Finally, the simulations and experiments presented in this dissertation could be repeated in more complicated real-world environments to validate our preliminary findings. In particular, performing the experiments with real occluders (instead of virtual ones) would be interesting. However, this requires an actual implementation of a system to obtain the occluder priors, either through the VH or some other method such as the ones mentioned in Section 2.3.4.
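A minimal sketch of this scenario is given below, assuming hypothetical callables for the MSE-based selection of Chapter 3 and the particle filter update of Chapter 2. It only illustrates the control flow (periodic re-selection every T steps and hand-off between cluster heads), not an actual implementation.

```python
# Sketch of the combined tracking/selection loop described above. The selection,
# particle-filter update, and position-estimate routines are passed in as callables;
# they stand in for the methods of Chapters 2 and 3 and are not spelled out here.

def track_with_selection(cluster_heads, select_best_k, pf_update, estimate,
                         T, k, n_steps, pf_state):
    """cluster_heads: objects with .position and .cameras (each camera has .measure());
    returns the final particle-filter state."""
    def nearest_head(x):
        return min(cluster_heads,
                   key=lambda h: (h.position[0] - x[0])**2 + (h.position[1] - x[1])**2)

    head = nearest_head(estimate(pf_state))
    active = select_best_k(head.cameras, pf_state, k)        # MSE-based camera subset
    for t in range(n_steps):
        if t % T == 0:                                        # re-select every T steps
            active = select_best_k(head.cameras, pf_state, k)
        z = [cam.measure() for cam in active]                 # scan-line measurements
        pf_state = pf_update(pf_state, z, active)             # particle filter step
        new_head = nearest_head(estimate(pf_state))
        if new_head is not head:                              # hand off tracking
            head = new_head
            active = select_best_k(head.cameras, pf_state, k)
    return pf_state
```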

Appendix A

List of Selected Symbols

Symbol                         Description                                                  Section
x                              Position of the object                                       2.2.1
u                              State of the object                                          2.3
t                              Time index (t ∈ Z+)                                          2.3
f(·)                           Continuous and mixed probability density function            2.3
p(·)                           Probability mass function (pmf)                              2.3.2
p (·)                          Pmf, ignoring static occluders and FOV                       2.3.2
P(·)                           Probability of (·)                                           2.3.2
D                              Diameter of moving occluders                                 2.2
M                              Total number of moving occluders                             2.2
j                              Enumerator for moving occluders                              2.3
x_j                            Position of moving occluder j                                2.3
µ_j                            Mean of occluder j's prior                                   2.3
Σ_j                            Covariance of occluder j's prior                             2.3
σ_j², σ_j²/α_j (α_j ≥ 1)       Eigenvalues of the covariance matrix Σ_j                     2.3.2
y_j                            Virtual measurement on occluder j's position                 2.4
ψ_j                            Additive noise to virtual measurements                       2.4
N                              Total number of cameras                                      2.2
i                              Enumerator for cameras                                       2.2.1
θ_i                            Orientation of camera i                                      2.2.1
η_i                            Occlusion indicator variable for camera i                    2.2.1
η_{i,j}                        = 1 if occluder j does not occlude camera i                  2.3.2
z_i                            Measurement from camera i                                    2.2.1
f_i                            Focal length of camera i                                     2.2.1
h_i(x)                         Vertical component of x in camera i's coordinates            2.2.1
d_i(x)                         Horizontal component of x in camera i's coordinates          2.2.1
v_i                            Additive Gaussian noise to camera measurement z_i            2.2.1
σ_{v_i}                        Standard deviation of v_i                                    2.2.1
σ_θ                            Standard deviation of error in camera orientation            2.2.1
σ_pos                          Standard deviation of error in camera position               2.2.1
σ_read                         Standard deviation of error in camera read-out               2.2.1
q^{mv}_{i,j}(x)                P{η_{i,j} = 0 | u}, considering moving occluders only        2.3.2
L                              Total number of particles                                    2.3
ℓ                              Enumerator for particles                                     2.3
w_ℓ                            Weight of particle ℓ                                         2.3
x_ℓ                            Position of particle ℓ                                       2.3.1
τ_ℓ                            Target of particle ℓ                                         2.3.1
s_ℓ                            Speed of particle ℓ                                          2.3.1
r_ℓ                            Regime of particle ℓ                                         2.3.1
u_ℓ                            State of particle ℓ, = [x_ℓ^T(t) τ_ℓ^T(t) s_ℓ(t) r_ℓ(t)]^T   2.3
ν                              Noise added to object/particle motion                        2.3.1
µ                              Mean of prior of x                                           3.2
Σ                              Covariance of prior of x                                     3.2
σ_o², σ_o²/α_o (α_o ≥ 1)       Eigenvalues of the covariance matrix Σ                       3.3.1
S                              A subset of (selected) nodes                                 3.2.1
k                              Number of selected cameras                                   3.3
a_i                            Projection vector for camera i                               3.2.1
d̄_i                            d_i(µ)                                                       3.2.1
p_is(x)                        P{η_i = 1, η_s = 1 | x}                                      3.2.2
I_i(x)                         Indicator function of visible points to camera i             3.2.2

Appendix B

Derivation of the Camera Measurement Noise Variance

We assume that when the object is visible, the local processing detects it and generates a corresponding blob in the scan line. Ideally, when there are no error sources, a line originating from the camera's center of projection and passing through the center of this blob should intersect the vertical line at the center of the object (see Figure B.1(a)). However, this is not the case in practice, due to errors in the knowledge of the camera position and orientation, the quantization error due to finite-sized pixels, and errors added during background subtraction. As illustrated in Figure B.1(b), we assume that the cluster head knows the position of camera i to an accuracy of σ_pos (in room units), and its angle to an accuracy of σ_θ (in radians). The quantization error and other errors from local processing and readout are assumed to have a standard deviation of σ_read (in pixels). All error sources are assumed to be mutually independent. In this appendix, the additive camera noise standard deviations in Equations 2.2, 3.3 and 4.1 due to the aforementioned error sources are derived.



Figure B.1: Illustration of sources of camera measurement noise. (a) Ideally, a line originating from the camera’s center of projection and passing through the center of the scan-line intersects the vertical line at the center of the object. (b) The camera measurements are noisy due to errors in readout, knowledge of camera position and angle.

B.1 Perspective Camera Model

First, we start with the perspective camera model that was provided in Section 2.2.1. For completeness, let us repeat it here:

$$z_i = \begin{cases} f_i \dfrac{h_i(x)}{d_i(x)} + v_i, & \text{if } \eta_i = 1 \\ \text{NaN}, & \text{otherwise.} \end{cases} \tag{B.1}$$

When ηi = 0, the object is not visible and there is no additive noise. Let us concentrate on the ηi = 1 case. Without loss of generality, assume that the camera is at the origin, and


$x \triangleq [x_1\ x_2]^T$. Then $h_i(x)$ and $d_i(x)$ in Eq. B.1 (defined in Fig. 2.3) are given by

$$h_i(x) = \sin(\theta_i)x_1 - \cos(\theta_i)x_2, \qquad d_i(x) = -\cos(\theta_i)x_1 - \sin(\theta_i)x_2.$$

We take partial derivatives of $z_i$ given in Eq. B.1 with respect to $\theta_i$, $x_1$ and $x_2$ to obtain

$$\frac{\partial z_i}{\partial \theta_i} = -f_i\left(1 + \frac{h_i^2(x)}{d_i^2(x)}\right), \qquad
\frac{\partial z_i}{\partial x_1} = f_i\,\frac{d_i(x)\sin(\theta_i) + h_i(x)\cos(\theta_i)}{d_i^2(x)}, \qquad
\frac{\partial z_i}{\partial x_2} = f_i\,\frac{-d_i(x)\cos(\theta_i) + h_i(x)\sin(\theta_i)}{d_i^2(x)}.$$

An error in the camera position translates into errors in $x_1$ and $x_2$. Assuming the inaccuracies in both directions are equal, $\sigma_{x_1} = \sigma_{x_2} = \sigma_{pos}$. Using the mutual independence assumption of the error sources, we obtain

$$\sigma_{v_i}^2 = \sigma_{z_i|x}^2 = \left(\frac{\partial z_i}{\partial \theta_i}\right)^2 \sigma_{\theta_i}^2 + \left(\frac{\partial z_i}{\partial x_1}\right)^2 \sigma_{x_1}^2 + \left(\frac{\partial z_i}{\partial x_2}\right)^2 \sigma_{x_2}^2 + \sigma_{read}^2
= f_i^2\left(1 + \frac{h_i^2(x)}{d_i^2(x)}\right)^2 \sigma_\theta^2 + f_i^2\,\frac{h_i^2(x) + d_i^2(x)}{d_i^4(x)}\,\sigma_{pos}^2 + \sigma_{read}^2.$$
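As a quick numerical sanity check of this propagation-of-error result (not part of the original derivation), the sketch below perturbs $\theta_i$, $x_1$, $x_2$ and the readout with small Gaussian errors and compares the empirical standard deviation of $z_i$ with the closed-form expression. All numeric values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
f_i, theta, x1, x2 = 500.0, 0.7, -6.0, 3.0            # arbitrary example values
s_theta, s_pos, s_read = 2e-3, 0.02, 0.5               # assumed error std devs

def z(th, a, b):
    h = np.sin(th) * a - np.cos(th) * b
    d = -np.cos(th) * a - np.sin(th) * b
    return f_i * h / d

# Monte-Carlo propagation of the error sources
n = 200_000
zs = z(theta + s_theta * rng.standard_normal(n),
       x1 + s_pos * rng.standard_normal(n),
       x2 + s_pos * rng.standard_normal(n)) + s_read * rng.standard_normal(n)

h = np.sin(theta) * x1 - np.cos(theta) * x2
d = -np.cos(theta) * x1 - np.sin(theta) * x2
var_formula = (f_i**2 * (1 + h**2 / d**2)**2 * s_theta**2
               + f_i**2 * (h**2 + d**2) / d**4 * s_pos**2 + s_read**2)
print(np.std(zs), np.sqrt(var_formula))   # should agree to first order
```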

B.2 Weak Perspective Camera Model

In Section 3.2, the weak perspective camera model is used [25]. There we assumed $d_i(x) \gg h_i(x)$ and $d_i(x) \approx \bar{d}_i$. Under these assumptions, the noise variance given above can be approximated as

$$\sigma_{v_i'}^2 \approx f_i^2 \sigma_\theta^2 + \frac{f_i^2}{\bar{d}_i^2}\,\sigma_{pos}^2 + \sigma_{read}^2.$$

As explained in Section 3.2.1, the measurements are assumed to be scaled by $\bar{d}_i / f_i$. After this scaling, the variance becomes

$$\sigma_{v_i}^2 = \frac{\bar{d}_i^2}{f_i^2}\,\sigma_{v_i'}^2 = \sigma_\theta^2\,\bar{d}_i^2 + \sigma_{pos}^2 + \frac{\sigma_{read}^2}{f_i^2}\,\bar{d}_i^2 = \left(\sigma_\theta^2 + \frac{\sigma_{read}^2}{f_i^2}\right)\bar{d}_i^2 + \sigma_{pos}^2 = \zeta_i\,\bar{d}_i^2 + \sigma_{pos}^2.$$

Appendix C

Derivation of Equation 2.5

In this appendix, the derivation of Eq. 2.5 is provided. Without loss of generality, assume camera i is at the origin and everything is rotated such that the major axis of occluder j's prior is horizontal. This orientation is shown in Fig. C.1(a). Let us assume that under this rotation, the object position is given by $x'$ and the mean and covariance matrix of the prior are given by $\mu'_j$ and $\Sigma'_j = \sigma_j^2\,\mathrm{diag}(1, 1/\alpha_j)$. There is another rotation that is useful, in which the rectangle $A_i(x)$ is horizontal (Fig. C.1(b)). Let the position of the object be $x''$ and the mean and covariance matrix be $\mu''_j$ and $\Sigma''_j$ at this orientation.

Let $\theta_{i,j}(x') = \theta_{i,j}(x'') \triangleq \theta$ for brevity. Then we have the following relations:

$$x'' = R_\theta^T x', \qquad \mu''_j = R_\theta^T \mu'_j, \qquad \Sigma''_j = R_\theta^T \Sigma'_j R_\theta, \qquad \|x''\| = \|x'\| = \|x\|, \qquad |\Sigma''_j| = |\Sigma'_j| = \frac{\sigma_j^4}{\alpha_j},$$

where $R_\theta$ is the rotation matrix by $\theta$ (such that $R_\theta^T$ is the rotation matrix by $-\theta$):

$$R_\theta = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}.$$



Figure C.1: Rotation of Fig. 2.5 such that (a) the major axis of the prior is horizontal, (b) $A_i(x)$ is horizontal. Camera i is assumed to be at the origin.

Then, using the orientation in Fig. C.1(b), $q^{mv}_{i,j}(x)$ is found by

$$q^{mv}_{i,j}(x) = \int_{-D/2}^{D/2} \int_{0}^{\|x\|} \frac{1}{2\pi\sqrt{|\Sigma''_j|}}\, \exp\left(-\frac{1}{2}(x'' - \mu''_j)^T {\Sigma''_j}^{-1} (x'' - \mu''_j)\right) dx'' \tag{C.1}$$

$$= \int_{-D/2}^{D/2} \int_{0}^{\|x\|} \frac{\sqrt{\alpha_j}}{2\pi\sigma_j^2}\, \exp\Big(\underbrace{-\tfrac{1}{2}\, x''^T {\Sigma''_j}^{-1} x''}_{A} + \underbrace{\mu''^T_j {\Sigma''_j}^{-1} x''}_{B} \underbrace{{}-\tfrac{1}{2}\, \mu''^T_j {\Sigma''_j}^{-1} \mu''_j}_{C}\Big)\, dx''. \tag{C.2}$$

Let us look at each term defined above separately. First define

$$G \triangleq \begin{bmatrix} \cos\theta & -\sin\theta \\ \sqrt{\alpha_j}\sin\theta & \sqrt{\alpha_j}\cos\theta \end{bmatrix} = [g_1\ g_2],$$


where $g_1$ and $g_2$ are the columns of $G$. Also define $x'' \triangleq [x''_1\ x''_2]^T$. Note that

$${\Sigma''_j}^{-1} = R_\theta^T {\Sigma'_j}^{-1} R_\theta = \frac{1}{\sigma_j^2}\, G^T G.$$

Then

$$A = -\frac{1}{2}\, x''^T {\Sigma''_j}^{-1} x'' = -\frac{1}{2\sigma_j^2}\left(\|g_1\|^2 x''^2_1 + \|g_2\|^2 x''^2_2 + 2\, g_1^T g_2\, x''_1 x''_2\right).$$

Now let us look at B. For this, define

$$O \triangleq \begin{bmatrix} \cos\theta & -\sin\theta \\ \alpha_j\sin\theta & \alpha_j\cos\theta \end{bmatrix} = [o_1\ o_2],$$

where $o_1$ and $o_2$ are the columns of $O$. Then

$$\begin{aligned}
B &= \mu''^T_j {\Sigma''_j}^{-1} x'' \\
&= (\mu'^T_j R_\theta)(R_\theta^T {\Sigma'_j}^{-1} R_\theta)\, x'' \\
&= \mu'^T_j {\Sigma'_j}^{-1} R_\theta\, x'' \\
&= \frac{1}{\sigma_j^2}\, \mu'^T_j O\, x'' \\
&= \frac{1}{\sigma_j^2}\left(\mu'^T_j o_1\, x''_1 + \mu'^T_j o_2\, x''_2\right).
\end{aligned}$$


Finally,

$$C = -\frac{1}{2}\,\mu''^T_j {\Sigma''_j}^{-1} \mu''_j = -\frac{1}{2}\,\mu'^T_j R_\theta R_\theta^T {\Sigma'_j}^{-1} R_\theta R_\theta^T \mu'_j = -\frac{1}{2}\,\mu'^T_j {\Sigma'_j}^{-1} \mu'_j.$$

Now, by substituting A, B and C into Eq. C.2 and using the formula

$$\int_{c_1}^{c_2} \exp(-\zeta\rho^2 + 2\xi\rho)\, d\rho = \frac{1}{2}\sqrt{\frac{\pi}{\zeta}}\,\exp\!\left(\frac{\xi^2}{\zeta}\right)\left[\operatorname{erf}\!\left(\frac{\zeta c_2 - \xi}{\sqrt{\zeta}}\right) - \operatorname{erf}\!\left(\frac{\zeta c_1 - \xi}{\sqrt{\zeta}}\right)\right], \tag{C.3}$$

we reach

$$\begin{aligned}
q^{mv}_{i,j}(x) = {}& \frac{1}{2}\sqrt{\frac{\alpha_j}{2\pi\sigma_j^2\|g_1\|^2}}\,\exp\!\left(-\frac{1}{2}\,\mu'^T_j {\Sigma'_j}^{-1}\mu'_j\right) \int_{-D/2}^{D/2} \exp\!\left(\frac{2\mu'^T_j o_2\, x''_2 - \|g_2\|^2 x''^2_2}{2\sigma_j^2} + \frac{\left(\mu'^T_j o_1 - g_1^T g_2\, x''_2\right)^2}{2\sigma_j^2\|g_1\|^2}\right) \\
& \left[\operatorname{erf}\!\left(\frac{\|g_1\|^2\|x\| - \mu'^T_j o_1 + g_1^T g_2\, x''_2}{\sqrt{2}\,\sigma_j\|g_1\|}\right) + \operatorname{erf}\!\left(\frac{\mu'^T_j o_1 - g_1^T g_2\, x''_2}{\sqrt{2}\,\sigma_j\|g_1\|}\right)\right] dx''_2.
\end{aligned} \tag{C.4}$$

Notice that there are three places where we have $g_1^T g_2\, x''_2 = x''_2 (\alpha_j - 1)\sin(2\theta)/2$. We assume that $\alpha_j$ is not too large and $D$ is small with respect to $\sigma_j$, such that $g_1^T g_2\, x''_2$ can be ignored:

$$\begin{aligned}
q^{mv}_{i,j}(x) \approx {}& \frac{1}{2}\sqrt{\frac{\alpha_j}{2\pi\sigma_j^2\|g_1\|^2}}\,\exp\!\left(-\frac{1}{2}\,\mu'^T_j {\Sigma'_j}^{-1}\mu'_j\right) \exp\!\left(\frac{\left(\mu'^T_j o_1\right)^2}{2\sigma_j^2\|g_1\|^2}\right) \left[\operatorname{erf}\!\left(\frac{\|g_1\|^2\|x\| - \mu'^T_j o_1}{\sqrt{2}\,\sigma_j\|g_1\|}\right) + \operatorname{erf}\!\left(\frac{\mu'^T_j o_1}{\sqrt{2}\,\sigma_j\|g_1\|}\right)\right] \\
& \int_{-D/2}^{D/2} \exp\!\left(\frac{2\mu'^T_j o_2\, x''_2 - \|g_2\|^2 x''^2_2}{2\sigma_j^2}\right) dx''_2.
\end{aligned} \tag{C.5}$$

The formula in Eq. C.3 is then used once more to reach Eq. 2.5.

Note that when $\alpha_j$ is very large, the prior of occluder j can be treated as a degenerate 1-D Gaussian function in 2-D, and one could still perform the integral in Eq. C.1 using Eq. C.3 once, as the prior is effectively one dimensional. However, we did not implement this modification.

To test the validity of the approximation taken in Eq. C.5, we performed several Monte-Carlo simulations. We selected random priors for the occluders, ran Monte-Carlo simulations to find $q^{mv}_{i,j}(x)$ empirically, and compared these values to the ones computed by using Eq. 2.5. For example, Fig. C.2 shows Monte-Carlo runs for 16,000 random points, plotting $q^{mv}_{i,j}(x)$ found by Monte-Carlo simulation against $q^{mv}_{i,j}(x)$ computed by using Eq. 2.5. The solid line denotes $y = x$ and the error bars represent the 3-σ tolerance for the Monte-Carlo simulation. Here M = 40, D = 3.33, $\sigma_j$ = 2, $\alpha_j$ = 4. For this example, although D > $\sigma_j$ and $\alpha_j$ is fairly large, most of the 16,000 points still lie within the 3-σ tolerance range.

Figure C.2: Monte-Carlo simulations to test the accuracy of Eq. 2.5. Here M = 40, D = 3.33, $\sigma_j$ = 2, $\alpha_j$ = 4.
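A simple way to reproduce such an empirical estimate is sketched below: occluder centers are sampled from the Gaussian prior and the fraction falling inside the rectangle $A_i(x)$ between the camera and the object is counted. The covariance here is specified directly in world coordinates for simplicity, and the numbers are arbitrary, so this is only an illustration of the Monte-Carlo procedure, not the exact setup behind Fig. C.2.

```python
import numpy as np

rng = np.random.default_rng(1)

def q_mc(x, mu, Sigma, D, n=100_000):
    """Empirical q_{i,j}^{mv}(x): probability that a Gaussian-distributed occluder
    center falls inside the D-wide rectangle between the camera (origin) and x."""
    centers = rng.multivariate_normal(mu, Sigma, size=n)
    # rotate so that the camera-to-object segment lies along the positive x1-axis
    theta = np.arctan2(x[1], x[0])
    R = np.array([[np.cos(theta), np.sin(theta)], [-np.sin(theta), np.cos(theta)]])
    c = centers @ R.T
    inside = (c[:, 0] >= 0) & (c[:, 0] <= np.hypot(*x)) & (np.abs(c[:, 1]) <= D / 2)
    return inside.mean()

# Arbitrary example: sigma_j = 2, alpha_j = 4
x = np.array([8.0, 3.0])
mu = np.array([4.0, 1.0])
Sigma = np.diag([2.0**2, 2.0**2 / 4.0])
print(q_mc(x, mu, Sigma, D=3.33))
```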

Appendix D

Derivation of the Localization MSE

It is known that the MSE of the best linear unbiased estimator is given by Eq. 3.6 [79]. For completeness, let us repeat it here:

$$\mathrm{MSE} = \mathrm{Tr}\left(\Sigma - \Sigma_{zx}^T \Sigma_z^{-1} \Sigma_{zx}\right), \tag{D.1}$$

where $z \triangleq [z_1, z_2, \ldots, z_N]^T$. We define $\tilde{x} = x - \mu$ and let $\mu_{z_i} = \mathrm{E}(z_i)$. Then the terms of $\Sigma_{zx}$ are given by

$$\begin{aligned}
\Sigma_{zx}(i) &= \mathrm{E}\left[(z_i - \mu_{z_i})\tilde{x}^T\right] \\
&= \mathrm{E}(z_i \tilde{x}^T) - \mu_{z_i}\underbrace{\mathrm{E}(\tilde{x}^T)}_{0} \\
&= \mathrm{E}\left[\big(\eta_i(a_i^T\tilde{x} + v_i) + a_i^T\mu\big)\tilde{x}^T\right] \\
&= \mathrm{E}\left(\eta_i\, a_i^T \tilde{x}\tilde{x}^T\right) + \underbrace{\mathrm{E}\left(\eta_i v_i \tilde{x}^T\right)}_{0} + a_i^T\mu\, \underbrace{\mathrm{E}(\tilde{x}^T)}_{0} \\
&= a_i^T\, \mathrm{E}_x\!\left[\mathrm{E}_{\eta_i}\!\left(\eta_i \tilde{x}\tilde{x}^T \mid x\right)\right] \\
&= a_i^T\, \mathrm{E}_x\big[\underbrace{1 \cdot \mathrm{P}(\eta_i = 1 \mid x)}_{p_{ii}(x)}\, \tilde{x}\tilde{x}^T + 0 \cdot \mathrm{P}(\eta_i = 0 \mid x)\,\tilde{x}\tilde{x}^T\big] \\
&= a_i^T\, \mathrm{E}_x\!\left[p_{ii}(x)\,\tilde{x}\tilde{x}^T\right].
\end{aligned}$$


To derive the terms of $\Sigma_z$, first we need

$$\mu_{z_i} = \mathrm{E}(z_i) = \mathrm{E}(\eta_i a_i^T\tilde{x} + \eta_i v_i + a_i^T\mu) = a_i^T\mathrm{E}(\eta_i\tilde{x}) + a_i^T\mu.$$

Then, from the definition of $\Sigma_z$ and using the formula for $\mu_{z_i}$ given above, we get

$$\begin{aligned}
\Sigma_z(i,s) &= \mathrm{E}(z_i z_s) - \mu_{z_i}\mu_{z_s} \\
&= \mathrm{E}\left[(\eta_i a_i^T\tilde{x} + \eta_i v_i + a_i^T\mu)(\eta_s a_s^T\tilde{x} + \eta_s v_s + a_s^T\mu)\right] - \mu_{z_i}\mu_{z_s} \\
&= a_i^T\mathrm{E}\left(\eta_i\eta_s\tilde{x}\tilde{x}^T\right)a_s + a_i^T\mathrm{E}(\eta_i\tilde{x})\,a_s^T\mu + \mathrm{E}(\eta_i\eta_s)\,\mathrm{E}(v_iv_s) + a_i^T\mu\,a_s^T\mathrm{E}(\eta_s\tilde{x}) + a_i^T\mu\,a_s^T\mu - \mu_{z_i}\mu_{z_s} \\
&= a_i^T\mathrm{E}\left(\eta_i\eta_s\tilde{x}\tilde{x}^T\right)a_s + \mathrm{E}(\eta_i\eta_s)\,\mathrm{E}(v_iv_s) - a_i^T\mathrm{E}(\eta_i\tilde{x})\,\mathrm{E}(\eta_s\tilde{x}^T)\,a_s.
\end{aligned} \tag{D.2}$$

From the mutual independence of the camera noise sources, we have

$$\mathrm{E}(v_iv_s) = \begin{cases} \sigma_{v_i}^2, & i = s \\ 0, & i \neq s, \end{cases}$$

and we use iterated expectation again to get

$$\mathrm{E}(\eta_i\tilde{x}) = \mathrm{E}_x\big(p_{ii}(x)\tilde{x}\big), \qquad \mathrm{E}\left(\eta_i\eta_s\tilde{x}\tilde{x}^T\right) = \mathrm{E}_x\big(p_{is}(x)\tilde{x}\tilde{x}^T\big), \qquad \mathrm{E}\left(\eta_i^2\right) = \mathrm{E}_x\big(p_{ii}(x)\big).$$

After substituting the above into Eq. D.2, we get

$$\Sigma_z(i,s) = a_i^T\mathrm{E}_x\big(p_{is}(x)\tilde{x}\tilde{x}^T\big)a_s - a_i^T\big[\mathrm{E}_x\big(p_{ii}(x)\tilde{x}\big)\big]\big[\mathrm{E}_x\big(p_{ss}(x)\tilde{x}\big)\big]^T a_s + \begin{cases} \mathrm{E}_x\big(p_{ii}(x)\big)\,\sigma_{v_i}^2, & i = s \\ 0, & i \neq s. \end{cases}$$
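For concreteness, a sketch of how this MSE can be evaluated numerically from a set of weighted particles is given below (this is our own illustration, not code from the dissertation). The joint visibility probability p_is(x) is approximated by the product p_ii(x) p_ss(x) for i ≠ s, which is an assumption of the sketch rather than something stated in the derivation.

```python
import numpy as np

def selection_mse(particles, weights, a, p_vis, sigma_v2):
    """Localization MSE of Eq. D.1 with the expectations over x replaced by sums over
    weighted particles. particles: (L,2) positions, weights: (L,) summing to one,
    a: (N,2) projection vectors a_i, p_vis: (N,L) per-particle visibility probs p_ii,
    sigma_v2: (N,) camera noise variances."""
    mu = weights @ particles
    xt = particles - mu                                      # x - mu per particle
    Sigma = (weights[:, None] * xt).T @ xt                   # prior covariance of x
    N = a.shape[0]

    E_px = (p_vis * weights) @ xt                            # E_x[p_ii(x) x~], shape (N,2)
    Szx = np.zeros((N, 2))
    Sz = np.zeros((N, N))
    for i in range(N):
        Exx_i = ((p_vis[i] * weights)[:, None] * xt).T @ xt  # E_x[p_ii x~ x~^T]
        Szx[i] = a[i] @ Exx_i
        for s in range(N):
            p_is = p_vis[i] if i == s else p_vis[i] * p_vis[s]   # assumed factorization
            Exx_is = ((p_is * weights)[:, None] * xt).T @ xt
            Sz[i, s] = a[i] @ Exx_is @ a[s] - (a[i] @ E_px[i]) * (E_px[s] @ a[s])
            if i == s:
                Sz[i, s] += (p_vis[i] @ weights) * sigma_v2[i]
    return np.trace(Sigma) - np.trace(Szx.T @ np.linalg.solve(Sz, Szx))
```

Evaluated over candidate camera subsets, a routine like this could serve as the objective of the greedy selection heuristic of Chapter 3.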


D.1 Linear Model

In Chapter 4, we use a linear model in order to derive the localization MSE. For this model, the MSE can be written in the form of Eq. 4.2. First define

$$A \triangleq \begin{bmatrix} \sin(\theta_1) & -\cos(\theta_1) \\ \sin(\theta_2) & -\cos(\theta_2) \\ \vdots & \vdots \\ \sin(\theta_N) & -\cos(\theta_N) \end{bmatrix}.$$

Then $z$ is given by

$$z = Ax + v, \tag{D.3}$$

where $v \triangleq [v_1, v_2, \ldots, v_N]^T$ is the random vector of independent camera noise random variables. The covariance matrix of $v$ is given by

$$\Sigma_v \triangleq \mathrm{diag}\left(\sigma_{v_1}^2, \sigma_{v_2}^2, \ldots, \sigma_{v_N}^2\right).$$

It is known that if $\hat{x}$ is the best linear unbiased estimator of $x$ given linear measurements of the form of Eq. D.3, the covariance matrix of the error vector is given by [79]

$$\Sigma_{\hat{x}-x} = \left(\Sigma^{-1} + A^T\Sigma_v^{-1}A\right)^{-1}.$$

Using the definitions of $A$ and $\Sigma_v$ given above, and $\Sigma = \sigma_o^2\,\mathrm{diag}(1, 1/\alpha_o)$, we get

$$\begin{aligned}
\Sigma_{\hat{x}-x} &= \begin{bmatrix} \dfrac{1}{\sigma_o^2} + \displaystyle\sum_i \frac{\sin^2\theta_i}{\sigma_{v_i}^2} & -\displaystyle\sum_i \frac{\sin\theta_i\cos\theta_i}{\sigma_{v_i}^2} \\ -\displaystyle\sum_i \frac{\sin\theta_i\cos\theta_i}{\sigma_{v_i}^2} & \dfrac{\alpha_o}{\sigma_o^2} + \displaystyle\sum_i \frac{\cos^2\theta_i}{\sigma_{v_i}^2} \end{bmatrix}^{-1} \\
&= \frac{1}{\det\left(\Sigma^{-1} + A^T\Sigma_v^{-1}A\right)} \begin{bmatrix} \dfrac{\alpha_o}{\sigma_o^2} + \displaystyle\sum_i \frac{\cos^2\theta_i}{\sigma_{v_i}^2} & \displaystyle\sum_i \frac{\sin\theta_i\cos\theta_i}{\sigma_{v_i}^2} \\ \displaystyle\sum_i \frac{\sin\theta_i\cos\theta_i}{\sigma_{v_i}^2} & \dfrac{1}{\sigma_o^2} + \displaystyle\sum_i \frac{\sin^2\theta_i}{\sigma_{v_i}^2} \end{bmatrix}.
\end{aligned}$$


The MSE is given by

$$\mathrm{MSE} = \mathrm{E}\left(\|\hat{x} - x\|^2\right) = \mathrm{Tr}\left(\Sigma_{\hat{x}-x}\right) = \frac{1}{\det\left(\Sigma^{-1} + A^T\Sigma_v^{-1}A\right)}\left(\frac{\alpha_o + 1}{\sigma_o^2} + \sum_i \frac{1}{\sigma_{v_i}^2}\right). \tag{D.4}$$

Let us solve for the determinant in the above equation:

$$\begin{aligned}
D &\triangleq \det\left(\Sigma^{-1} + A^T\Sigma_v^{-1}A\right) \\
&= \frac{\alpha_o}{\sigma_o^4} + \frac{\alpha_o}{\sigma_o^2}\sum_i \frac{\sin^2\theta_i}{\sigma_{v_i}^2} + \frac{1}{\sigma_o^2}\sum_i \frac{\cos^2\theta_i}{\sigma_{v_i}^2} + \left(\sum_i \frac{\sin^2\theta_i}{\sigma_{v_i}^2}\right)\left(\sum_i \frac{\cos^2\theta_i}{\sigma_{v_i}^2}\right) - \left(\sum_i \frac{\sin\theta_i\cos\theta_i}{\sigma_{v_i}^2}\right)^2 \\
&= \frac{\alpha_o}{\sigma_o^4} + \frac{1}{\sigma_o^2}\sum_i \frac{1}{\sigma_{v_i}^2} + \frac{\alpha_o - 1}{\sigma_o^2}\sum_i \frac{\sin^2\theta_i}{\sigma_{v_i}^2} + \left(\sum_i \frac{\sin^2\theta_i}{\sigma_{v_i}^2}\right)\left(\sum_i \frac{\cos^2\theta_i}{\sigma_{v_i}^2}\right) - \left(\sum_i \frac{\sin 2\theta_i}{2\sigma_{v_i}^2}\right)^2 \\
&= \frac{\alpha_o}{\sigma_o^4} + \frac{\alpha_o - 1}{\sigma_o^2}\sum_i \frac{1 - \cos 2\theta_i}{2\sigma_{v_i}^2} + \frac{1}{\sigma_o^2}\sum_i \frac{1}{\sigma_{v_i}^2} + \left(\sum_i \frac{1 - \cos 2\theta_i}{2\sigma_{v_i}^2}\right)\left(\sum_i \frac{1 + \cos 2\theta_i}{2\sigma_{v_i}^2}\right) - \left(\sum_i \frac{\sin 2\theta_i}{2\sigma_{v_i}^2}\right)^2.
\end{aligned}$$

Define

$$\gamma \triangleq \sum_i \frac{\cos 2\theta_i}{\sigma_{v_i}^2} \qquad \text{and} \qquad \zeta \triangleq \sum_i \frac{1}{\sigma_{v_i}^2}.$$


Then

$$\begin{aligned}
D &= \frac{\alpha_o}{\sigma_o^4} + \frac{1}{\sigma_o^2}\,\zeta + \frac{\alpha_o - 1}{\sigma_o^2}\cdot\frac{\zeta - \gamma}{2} + \frac{\zeta - \gamma}{2}\cdot\frac{\zeta + \gamma}{2} - \left(\sum_i \frac{\sin 2\theta_i}{2\sigma_{v_i}^2}\right)^2 \\
&= \frac{\alpha_o}{\sigma_o^4} + \frac{\alpha_o + 1}{2\sigma_o^2}\,\zeta - \frac{\alpha_o - 1}{2\sigma_o^2}\,\gamma + \frac{\zeta^2}{4} - \frac{\gamma^2}{4} - \left(\sum_i \frac{\sin 2\theta_i}{2\sigma_{v_i}^2}\right)^2 \\
&= \frac{1}{4}\left[\left(\frac{\alpha_o + 1}{\sigma_o^2} + \zeta\right)^2 - \left(\frac{\alpha_o - 1}{\sigma_o^2} + \gamma\right)^2 - \left(\sum_i \frac{\sin 2\theta_i}{\sigma_{v_i}^2}\right)^2\right] + \frac{1}{4}\left[\frac{4\alpha_o}{\sigma_o^4} - \left(\frac{\alpha_o + 1}{\sigma_o^2}\right)^2 + \left(\frac{\alpha_o - 1}{\sigma_o^2}\right)^2\right] \\
&= \frac{1}{4}\left[\left(\frac{\alpha_o + 1}{\sigma_o^2} + \zeta\right)^2 - \left(\frac{\alpha_o - 1}{\sigma_o^2} + \gamma\right)^2 - \left(\sum_i \frac{\sin 2\theta_i}{\sigma_{v_i}^2}\right)^2\right].
\end{aligned}$$

After substituting this into Eq. D.4, we get

$$\mathrm{MSE} = \frac{4\left(\dfrac{\alpha_o + 1}{\sigma_o^2} + \displaystyle\sum_{i=1}^{N} \frac{1}{\sigma_{v_i}^2}\right)}{\left(\dfrac{\alpha_o + 1}{\sigma_o^2} + \displaystyle\sum_{i=1}^{N} \frac{1}{\sigma_{v_i}^2}\right)^2 - \left(\dfrac{\alpha_o - 1}{\sigma_o^2} + \displaystyle\sum_{i=1}^{N} \frac{\cos 2\theta_i}{\sigma_{v_i}^2}\right)^2 - \left(\displaystyle\sum_{i=1}^{N} \frac{\sin 2\theta_i}{\sigma_{v_i}^2}\right)^2}.$$
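The closed-form expression above can be checked numerically against the matrix form $\mathrm{Tr}\big((\Sigma^{-1} + A^T\Sigma_v^{-1}A)^{-1}\big)$. The sketch below does this for an arbitrary random camera configuration; all numbers are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 5
theta = rng.uniform(0, np.pi, size=N)          # camera orientations
sigma_v2 = rng.uniform(0.5, 2.0, size=N)       # per-camera noise variances
sigma_o2, alpha_o = 4.0, 3.0                   # prior: Sigma = sigma_o^2 diag(1, 1/alpha_o)

# Matrix form of the MSE
A = np.column_stack([np.sin(theta), -np.cos(theta)])
Sigma = sigma_o2 * np.diag([1.0, 1.0 / alpha_o])
Sigma_v = np.diag(sigma_v2)
mse_matrix = np.trace(np.linalg.inv(np.linalg.inv(Sigma) + A.T @ np.linalg.inv(Sigma_v) @ A))

# Closed-form expression derived above
zeta = np.sum(1.0 / sigma_v2)
gamma = np.sum(np.cos(2 * theta) / sigma_v2)
s2 = np.sum(np.sin(2 * theta) / sigma_v2)
num = 4 * ((alpha_o + 1) / sigma_o2 + zeta)
den = ((alpha_o + 1) / sigma_o2 + zeta) ** 2 - ((alpha_o - 1) / sigma_o2 + gamma) ** 2 - s2 ** 2
print(mse_matrix, num / den)                   # the two values should match
```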

Bibliography

[1] C. Norris, M. McCahill, and D. Wood, “Editorial. The growth of CCTV: a global perspective on the international diffusion of video surveillance in publicly accessible space,” Surveillance and Society, CCTV Special, vol. 2(2/3), pp. 110–135, 2004.

[2] “AlertVideo,” http://careers.northropgrumman.com/ExternalHorizonsWeb/college/discover info cool.html.

[3] “A user's guide to digital video,” http://www.ati247.com/PDF/digitalvideoguidefortheweb.pdf.

[4] T. Kanade, R. T. Collins, A. J. Lipton, H. Fujiyoshi, D. Duggins, Y. Tsin, D. Tolliver, N. Enomoto, O. Hasegawa, P. Burt, and L. Wixson, “A system for video surveillance and monitoring, CMU VSAM final report,” Carnegie-Mellon University and The Sarnoff Corporation, Tech. Rep., Nov. 1999.

[5] “VSAM project home page,” http://www.cs.cmu.edu/~vsam/vsamhome.html.

[6] “ADVISOR project,” http://www.thalesresearch.com/Default.aspx?tabid=178.

[7] R. L. Bruce, “Loop detector for traffic signal control,” US Patent 4,430,636, 1984.

[8] V. Kastrinaki, M. Zervakis, and K. Kalaitzakis, “A survey of video processing techniques for traffic applications,” Image and Vision Computing, vol. 21, no. 4, pp. 359–381, April 2003.

[9] S. H. Park, K. Jung, J. K. Hea, and H. J. Kim, “Vision-based traffic surveillance system on the internet,” in Proceedings of ICCIMA, Los Alamitos, CA, USA, 1999.



[10] R. Holman and T. Ozkan-Haller, “Applying video sensor networks to nearshore environment monitoring,” IEEE Pervasive Computing, vol. 2, no. 4, pp. 14–21, 2003. [11] “Jacarta,” http://www.jacarta.co.uk. [12] “Netbotz,” http://www.netbotz.com. [13] “Sensorsoft Corporation,” http://www.sensorsoft.com. [14] T. Gu, H. K. Pung, and D. Q. Zhang, “Toward an OSGi-based infrastructure for context-aware applications,” IEEE Pervasive Computing, vol. 3, no. 4, pp. 66–74, October-December 2004. [15] J. Krumm, S. Harris, B. Meyers, B. Brumitt, M. Hale, and S. Shafer, “Multi-camera multi-person tracking for easy living,” in IEEE Workshop on Visual Surveillance, 2000. [16] “Eye Vision,” http://www.ri.cmu.edu/events/sb35/tksuperbowl.html. [17] G. J. Pottie, W. J. Kaiser, L. Clare, and H. Marcy, “Wireless integrated network sensors,” Communications of the ACM, vol. 43, no. 5, pp. 51–58, 2000. [18] I. F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci, “Wireless sensor networks: A survey,” Computer Networks, vol. 38, pp. 393–422, 2002. [19] F. Zhao and L. Guibas, Wireless Sensor Networks.

Elsevier Inc., 2004.

[20] D. E. Culler and H. Mulder, “Smart sensors to network the world,” Scientific American, pp. 84–91, June 2004. [21] D. Culler, D. Estrin, and M. Srivastava, “Overview of sensor networks,” IEEE Computer Magazine, pp. 41–49, August 2004. [22] M. Rahimi, R. Baer, O. I. Iroezi, J. C. Garcia, J. Warrior, D. Estrin, and M. Srivastava, “Cyclops: in situ image sensing and interpretation in wireless sensor networks,” in Proceedings of SenSys’05. New York, NY, USA: ACM Press, 2005, pp. 192–204. [23] “Mica2 datasheet,” http://www.xbow.com/Products/productdetails.aspx?sid=174.


[24] A. M. Tekalp, Digital video processing. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1995.

Upper

Saddle River, NJ: Prentice Hall, 1998. [26] M. R. Stevens and J. R. Beveridge, Integrating Graphics and Vision for Object Recognition. Springer, 2000. [27] D. B.-R. Yang, H. Gonzales-Banos, and L. J. Guibas, “Counting people in crowds with a real-time network of image sensors,” in Proceedings of ICCV, October 2003. [28] M. Maroti, G. Simon, A. Ledeczi, and J. Sztipanovits, “Shooter localization in urban terrain,” Computer, vol. 37, no. 8, pp. 60–61, 2004. [29] J. C. Chen, L. Yip, J. Elson, H. Wang, D. Maniezzo, R. E. Hudson, K. Yao, and D. Estrin, “Coherent acoustic array processing and localization on wireless sensor networks,” Proceedings of the IEEE, vol. 91, no. 8, August 2003. [30] J. Byers and G. Nasser, “Utility-based decision-making in wireless sensor networks, Tech. Rep. 2000-014, 1 2000. [Online]. Available:

citeseer.ist.psu.edu/article/

byers00utilitybased.html [31] H. Wang, K. Yao, G. Pottie, and D. Estrin, “Entropy-based sensor selection heuristic for localization,” in Proceedings of IPSN, April 2004. [32] A. O. Ercan, A. El Gamal, and L. J. Guibas, “Object tracking in the presence of occlusions via a camera network,” in Proceedings of IPSN, April 2007, pp. 509–518. [33] D. Li, K. D. Wong, Y. H. Hu, and A. M. Sayeed, “Detection, classification and tracking of targets,” IEEE Signal Processing Magazine, pp. 17–29, March 2002. [34] J. Aslam, Z. Butler, F. Constantin, V. Crespi, G. Cybenko, and D. Rus, “Tracking a moving object with a binary sensor network,” in Proceedings of SENSYS, November 2003.



[35] R. R. Brooks, P. Ramanathan, and A. M. Sayeed, “Distributed target classification and tracking in sensor networks,” Proceedings of the IEEE, vol. 91, no. 8, pp. 1163–1171, August 2003. [36] S. Pattem, S. Poduri, and B. Krishnamachari, “Energy-quality tradeoffs for target tracking in wireless sensor networks,” in Proceedings of IPSN, April 2003, pp. 32– 46. [37] F. Zhao, J. Liu, J. Liu, L. J. Guibas, and J. Reich, “Collaborative signal and information processing: an information-directed approach,” Proceedings of the IEEE, vol. 91, no. 8, pp. 1199–1209, August 2003. [38] W. Kim, K. Mechitov, J.-Y. Choi, and S. Ham, “On target tracking with binary proximity sensors,” in Proceedings of IPSN, April 2005. [39] C. Taylor, A. Rahimi, J. Bachrach, H. Shrobe, and A. Grue, “Simultaneous localization, calibration and tracking in an ad-hoc sensor network,” in Proceedings of IPSN, April 2006. [40] N. Shrivastava, R. Mudumbai, and U. Madhow, “Target tracking with binary proximity sensors: Fundamental limits, minimal descriptions, and algorithms,” in Proceedings of SENSYS, November 2006. [41] P. V. Pahalawatta, D. Depalov, T. N. Pappas, and A. K. Katsaggelos, “Detection, classification, and collaborative tracking of multiple targets using video sensors,” in Proceedings of IPSN, April 2003, pp. 529–544. [42] S. Funiak, C. Guestrin, M. Paskin, and R. Sukthankar, “Distributed localization of networked cameras,” in Proceedings of IPSN, April 2006. [43] P. F. Gabriel, J. G. Verly, J. H. Piater, and A. Genon, “The state of the art in multiple object tracking under occlusion in video sequences,” in Proceedings of ACIVS, September 2003.



[44] A. Yilmaz, X. Li, and M. Shah, “Contour-based object tracking with occlusion handling in video acquired using mobile cameras,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 11, pp. 1531–1536, November 2004. [45] Q. Cai and J. K. Aggarwal, “Tracking human motion in structured environments using a distributed-camera system,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 12, pp. 1241–1247, 1999. [46] S. Khan, O. Javed, Z. Rasheed, and M. Shah, “Human tracking in multiple cameras,” in Proceedings of ICCV, July 2001. [47] A. Dick and M. J. Brooks, “A stochastic approach to tracking objects across multiple cameras,” in Proceedings of Australian Conference on Artificial Intelligence, December 2004, pp. 160–170. [48] W. Zajdel, A. T. Cemgil, and B. J. A. Krose, “Online multicamera tracking with a switching state-space model,” in Proceedings of ICPR, August 2004. [49] A. Utsumi, H. Mori, J. Ohya, and M. Yachida, “Multiple-view-based tracking of multiple humans,” in Proceedings of the ICPR, 1998. [50] K. Otsuka and N. Mukawa, “Multiview occlusion analysis for tracking densely populated objects based on 2-d visual angles,” in Proceedings of CVPR, 2004. [51] S. L. Dockstander and A. M. Tekalp, “Multiple camera tracking of interacting and occluded human motion,” Proceedings of the IEEE, vol. 89, no. 10, pp. 1441–1455, October 2001. [52] A. Doucet, B.-N. Vo, C. Andrieu, and M. Davy, “Particle filtering for multi-target tracking and sensor management,” in Proceedings of ISIF, 2002, pp. 474–481. [53] C. Tomasi and T. Kanade, “Detection and tracking of point features,” Carnegie Mellon University, Technical Report CMU-CS-91-132, April 1991. [54] Y. Bar-Shalom, X. R. Li, and T. Kirubarajan, Estimation with Applications to Tracking and Navigation. New York, NY: John Wiley & Sons Inc., 2001.



[55] B. Ristic, S. Arulampalam, and N. Gordon, Beyond the Kalman Filter, Particle Filters for Tracking Applications. Artech House, 2004. [56] M. Pitt and N. Shephard, “Filtering via simulation: Auxiliary particle filters,” Journal of the American Statistical Association, vol. 94, no. 446, pp. 590–599, 1999. [57] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf, Computational Geometry: Algorithms and Applications. Berlin: Springer-Verlag, 1997. [58] X. Sheng and Y.-H. Hu, “Maximum likelihood multiple-source localization using acoustic energy measurements with wireless sensor networks,” IEEE Transactions on Signal Processing, vol. 53, no. 1, pp. 44–53, 2005. [59] D. B. Yang, “Counting and localizing targets with a camera network,” Ph.D. dissertation, Stanford University, December 2005. [60] A. O. Ercan, A. El Gamal, and L. J. Guibas, “Camera network node selection for target localization in the presence of occlusions,” in Distributed Smart Cameras, October 2006. [61] M. Chu, H. Haussecker, and F. Zhao, “Scalable information-driven sensor querying and routing for ad hoc heterogeneous sensor networks,” The International Journal of High Performance Computing Applications, vol. 16, no. 3, pp. 293–313, 2002. [62] E. Ertin, J. W. Fisher III, and L. C. Potter, “Maximum mutual information principle for dynamic sensor query problems,” in Proceedings of IPSN, April 2003. [63] S. Slijepcevic and M. Potkonjak, “Power efficient organization of wireless sensor networks,” in Proceedings of IEEE International Conference on Communications, June 2001. [64] F. Bian, D. Kempe, and R. Govindan, “Utility-based sensor selection,” in Proceedings of IPSN, April 2006, pp. 11–18. [65] P.-P. Vazquez, M. Feixas, M. Sbert, and W. Heidrich, “Viewpoint selection using viewpoint entropy,” in Proceedings of the Vision Modeling and Visualization’01, November 2001.



[66] L. Wong, C. Dumont, and M. Abidi, “Next best view system in a 3d object modeling task,” in Proceedings of Computational Intelligence in Robotics and Automation, November 1999. [67] D. Roberts and A. Marshall, “Viewpoint selection for complete surface coverage of three dimensional objects,” in Proceedings of the British Machine Vision Conference, September 1998. [68] D. B.-R. Yang, J.-W. Shin, A. O. Ercan, and L. J. Guibas, “Sensor tasking for occupancy reasoning in a network of cameras,” in BASENETS, October 2004. [69] V. Isler and R. Bajcsy, “The sensor selection problem for bounded uncertainty sensing models,” in Proceedings of IPSN, April 2005, pp. 151–158. [70] A. O. Ercan, D. B.-R. Yang, A. El Gamal, and L. J. Guibas, “Optimal placement and selection of camera network nodes for target localization,” in Proceedings of DCOSS, June 2006. [71] X. Chen and J. Davis, “Camera placement considering occlusion for robust motion capture,” Stanford University Computer Science Technical Report, CS-TR-2000-07, December 2000. [72] G. Olague and R. Mohr, “Optimal camera placement for accurate reconstruction,” Pattern Recognition, vol. 35, no. 4, pp. 927–944, 2002. [73] J. Wu, R. Sharma, and T. Huang, “Analysis of uncertainty bounds due to quantization for three-dimensional position estimation using multiple cameras,” Optical Engineering, vol. 37, no. 1, pp. 280–292, 1998. [74] H. Zhang, “Two-dimensional optimal sensor placement,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 25, no. 5, May 1995. [75] C. Welman, “Inverse kinematics and geometric constraints for articulated figuremanipulation,” Master’s Thesis, Simon Fraser University, 1993.



[76] A. A. Goldenberg, B. Benhabib, and R. G. Fenton, “A complete generalized solution to the inverse kinematics of robots,” IEEE Journal of Robotics and Automation, vol. RA-1, no. 1, pp. 14–20, 1985.

[77] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.

[78] S. Hengstler, D. Prashanth, S. Fong, and H. Aghajan, “Mesheye: A hybrid-resolution smart camera mote for applications in distributed intelligent surveillance,” in Proceedings of IPSN/SPOTS, 2007.

[79] T. Kailath, A. H. Sayed, and B. Hassibi, Linear Estimation. Upper Saddle River, NJ: Prentice Hall, 1999.
