2013 American Control Conference (ACC) Washington, DC, USA, June 17-19, 2013
A Method of Camera Selection Based on Partially Observable Markov Decision Process Model in Camera Networks*

Qian Li, Zhengxing Sun, Songle Chen and Yudi Liu

* Research supported by the National Natural Science Foundation of China (61272219, 61100110 and 61021062), the National High Technology Research and Development Program of China (2007AA01Z334), the Program for New Century Excellent Talents in University of China (NCET-04-04605) and the Science and Technology Program of Jiangsu Province (BE2010072, BE2011058 and BY2012190). Qian Li is with the State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, and with the College of Meteorology and Oceanography, PLA University of Science and Technology, Nanjing 211101, China (e-mail: [email protected]). Zhengxing Sun is with the State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China (corresponding author; phone: 86-25-89686099; e-mail: [email protected]). Songle Chen is with the State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China (e-mail: [email protected]). Yudi Liu is with the College of Meteorology and Oceanography, PLA University of Science and Technology, Nanjing 211101, China (e-mail: [email protected]).
Abstract—Camera selection in camera networks is a dynamic decision-making process based on the analysis and evaluation of visual content. In this paper, a novel camera selection method based on a partially observable Markov decision process (POMDP) model is proposed, in which the belief states of the model are used to represent noisy visual information and an innovative evaluation function is defined to identify the most informative of several multi-view video streams. Our experiments show that the proposed visual evaluation criteria successfully measure changes in scenes and that our camera selection method effectively reduces camera switching compared with other state-of-the-art methods.

Index Terms—Camera Selection, POMDP, Video Analysis, Camera Networks.
I. INTRODUCTION

The development of computer vision has promoted the extensive use of cameras for security, surveillance, human–computer interaction, navigation, and positioning. The limited field of view (FOV) of a single camera makes it difficult to resolve partial and full occlusions and to track activity in an area continuously; multiple cameras can monitor areas of interest more effectively. However, the use of multiple cameras introduces issues related to the deployment and control of the cameras, the real-time fusion of video streams with high resolution and high frame rates, and the selection and coordination of the cameras [1]. Camera selection, which involves selecting one or more cameras from a group of cameras to extract essential information, is a particularly challenging task in camera networks. In contrast to node selection in general sensor networks [2], the nearest camera cannot be identified merely by directional measurements. Camera selection is a dynamic decision-making process based on the analysis and evaluation of visual content.
Moreover, camera views are affected by factors such as illumination and occlusion, and the sequential results produced by visual computing can be unstable. Therefore, camera selection must aim at obtaining the most representative video stream while reducing the jitter effect caused by frequent switching under noisy and unstable visual computing.

A number of previous studies have investigated issues related to camera selection. Gupta, Mittal, and Davis [3] proposed a greedy approach to camera selection based on the probabilistic distribution of camera positions and the appearance of individuals in the FOV. Li and Bhanu [4] proposed a game-theoretic approach to camera hand-off in which the global utility, camera utility, and person utility are determined by user-supplied criteria such as the size, position, and view of the individual being tracked. Daniyal, Taj, and Cavallaro [5] proposed a Dynamic Bayesian Network approach that uses object- and frame-level features. Bimbo and Pernici [6] selected optimal parameters for the active camera on the basis of the appearance of objects and predicted motions by solving a traveling salesman problem. Park, Bhat, and Kak [7] built a camera lookup table for selecting the appropriate camera based on the position of the target object. Monari and Kroschel [8] proposed a method to identify the task-relevant camera by evaluating geometrical attributes related to the last observed object position. Shen et al. [9] calculated the distance and angle of each camera with respect to the target and selected the camera with the best resolution immediately in front of the target. Tessens et al. [10] used face detection and the calculated spatial position of the target to select a primary view and a number of additional views; however, face detection is time consuming, prone to errors, and requires calibration of all cameras, so this approach is usually difficult to apply.

All of the methods mentioned above are based either on optimization or on prior knowledge or learning. Methods based on optimization define a utility or cost function to integrate multiple constraints effectively, but their computational burden is too heavy for a real-time decision-making selection process. In contrast, methods based on prior knowledge or learning, which schedule cameras according to accumulated knowledge, require less computation and are easy to implement, but they are less flexible and cannot select cameras appropriately in complex situations. In addition, while the low-level image features used in these methods to evaluate visual information are useful, the high-level information in video streams, such as local salient movement details and specific events, is more informative and should be considered at the same time.

In this paper, we present a novel method for camera selection based on a partially observable Markov decision process (POMDP) model. The rest of the paper is organized as follows. Section II describes the use of the belief states of the model to represent noisy visual information.
By considering current states and anticipated transition trends together with the cost generated by camera switching, the visual jitter that arises from frequent switching can be effectively reduced. Section III presents our evaluation function for visual information, which is designed to reflect the richness of information in each view by extracting global motion, the properties of moving objects in the scene, and specific events. Section IV reports our experimental results, including camera selection results for several video datasets and a comparison of our method with state-of-the-art methods.

II. DYNAMIC CAMERA SELECTION BASED ON A POMDP MODEL
The camera selection problem can be described as follows. A camera network has N cameras (C_1, C_2, ..., C_N) with partially or completely overlapping FOVs, and one node is designated as the central controller, which performs scheduling according to a selection policy computed offline. In our method, the central controller selects online only one camera C* as the optimal camera at fixed time intervals of duration Δt. At each time step t, visual features indicating global motion, the properties of objects, and specific events in the view of each camera are extracted and scored (see Section III). The scores are then sent to the central controller, which makes dynamic selection decisions based on current and previous camera view scores.

Although the camera selection problem can usually be modeled as a finite-state Markov decision process (MDP), when an observed state contains errors caused by factors such as illumination, occlusion, and camera shake and thus does not reflect the actual state, we must make sequential decisions based on partially observable states. Owing to this uncertainty, we model the dynamic process as a POMDP, which is an extended version of an MDP.

A. Definition of POMDPs

POMDPs [11] generalize the MDP model and provide a natural and principled framework for sequential planning that allows additional forms of uncertainty to be accounted for in the process. The system states of the POMDP model, which are used for decision making, are belief states: probability distributions over the state space that represent the probability that the process is in a given state. Because of the abstraction, adaptability, and robustness of the model, POMDPs are widely used for robot navigation, search and rescue activities, resource allocation, and other purposes.
A POMDP can be formally defined as a 6-tuple ⟨S, A, Ω, T, O, R⟩, where S is a finite set of all possible underlying states; A is a finite set of actions, i.e., the control choices available at each time instant; Ω is a finite set of all possible observations that the process can provide; T: S × A → Π(S) is a state transition function that encodes the uncertainty about the evolution of the states of the process; O: S × A → Π(Ω) is an observation function that relates the process outputs (camera observations) to the true underlying state of the process; and R: S × A → ℝ is an immediate reward function that assigns real-valued rewards to the actions that may be performed in each of the underlying process states. Unlike in the MDP model, an observation in a POMDP can correspond to several system
states, and thus belief states are usually used to represent the probability that the process is in a given state through a probability distribution over the state space of the process. Therefore, the POMDP model can be used to derive a control policy π: (s_0, s_1, ..., s_t, a_0, a_1, ..., a_t) → a_{t+1}, where s_i ∈ S and a_i ∈ A, that yields the greatest utility over the decision steps according to the history of states and actions, i.e., π* = argmax_a R^π(b_t), where b_t is the belief distribution over the system states at time step t.

B. Camera Selection Policy

On the basis of our description of the camera selection problem, we formulate the POMDP model as follows.

1) System state vector

The system state vector consists of the currently selected result and the visual information scores. At time step t, the system state is represented as S_t = [c_t^i, s_t^1, s_t^2, ..., s_t^N], where c_t^i indicates that camera i is selected as the best camera at time t, and s_t^k, k ∈ {1, 2, ..., N}, is the actual visual information score for camera k. If the state space were modeled as a continuous space, the POMDP would be computationally intractable. Therefore, the score values s_t^k are normalized to [0, 1] and uniformly discretized with m quantization levels, producing the range s_t^k ∈ {0, 1, ..., m−1}.
2) Actions

An action is a vector represented as a_t = [a_t^1, ..., a_t^N], where a_t^i = 1 if the i-th camera is selected at time step t, and a_t^i = 0 otherwise.

3) Observation state

The observation state is a collection of observations from all cameras and is defined as the vector O_t = [c'_t^i, o_t^1, o_t^2, ..., o_t^N], where c'_t^i indicates that camera i is selected at time t. Because this component carries no error, it is the same as the component c_t^i of the system state, so the observation state mirrors the definition of the system state. Each observation o_t^k, k ∈ {1, 2, ..., N}, is the visual score computed for camera k at time step t by the method described in Section III. According to the definition of POMDPs, the number of observations is equal to the number of system states.

4) State transitions

The state transition probability p(s' | a, s) describes the differences between the views taken at adjacent time steps. Because the selection action does not affect the visual measures or the state transitions, p(s' | a, s) = p(s' | s), and the state transition is based on the visual score components s_t^k. We assume that the visual scores for the cameras are independent, i.e.,
Pr(S' = (s_{t+1}^1, s_{t+1}^2, ..., s_{t+1}^N) | S = (s_t^1, s_t^2, ..., s_t^N)) = Pr(s_{t+1}^1 | s_t^1) · Pr(s_{t+1}^2 | s_t^2) · ... · Pr(s_{t+1}^N | s_t^N).   (1)
For each discrete state component s_t^k ∈ {0, 1, ..., m−1}, the probability of transition to a neighboring state is higher than that to a more distant state. Thus, we set the transition probabilities on the basis of the distances between states as follows:

Pr(s_{t+1}^i = u | s_t^i = v) =
    [1 − (u − v)²/(m − 1)²] / Σ_{r=0}^{m−1} [1 − (r − v)²/(m − 1)²],   if u, v ∈ {0, 1, ..., m−1};
    0,   if u ∉ {0, 1, ..., m−1} or v ∉ {0, 1, ..., m−1}.   (2)
This equation defines the transition probabilities between all states in the state space.
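As an illustration of (2), the per-camera transition matrix can be tabulated offline once m is fixed. The sketch below is illustrative code rather than the implementation used in our system; it assumes the quadratic distance normalization over the m valid levels shown in (2), with m = 8 as in the experiments. The observation probabilities in (3) below have the same form and can be tabulated by the same routine.

```cpp
#include <cstdio>
#include <vector>

// Per-camera transition matrix for the m discretized score levels, Eq. (2).
// T[v][u] = Pr(s_{t+1} = u | s_t = v); each row sums to 1 by construction.
std::vector<std::vector<double>> scoreTransitionMatrix(int m) {
    std::vector<std::vector<double>> T(m, std::vector<double>(m, 0.0));
    double span = double(m - 1) * double(m - 1);    // (m-1)^2 normalizes the squared distance
    for (int v = 0; v < m; ++v) {
        double Z = 0.0;                             // normalizing constant for row v
        for (int r = 0; r < m; ++r) Z += 1.0 - (r - v) * (r - v) / span;
        for (int u = 0; u < m; ++u)
            T[v][u] = (1.0 - (u - v) * (u - v) / span) / Z;
    }
    return T;
}

int main() {
    const int m = 8;                                // eight quantization levels, as in the experiments
    auto T = scoreTransitionMatrix(m);
    // Nearby score levels receive higher probability than distant ones.
    for (int u = 0; u < m; ++u) std::printf("Pr(%d | 3) = %.3f\n", u, T[3][u]);
    return 0;
}
```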
5) Observation function

The observation function p(o' | a, s') indicates the likelihood of receiving observation o' when the system reaches state s' after performing action a. Because the selection actions do not affect the camera observations or the scores computed for the static cameras, we set p(o' | a, s') = p(o' | s'). We also assume that the observations for different cameras are independent. The observation probability is defined as follows:

Pr(o_t^i = u | s_t^i = v) =
    [1 − (u − v)²/(m − 1)²] / Σ_{r=0}^{m−1} [1 − (r − v)²/(m − 1)²],   if u, v ∈ {0, 1, ..., m−1};
    0,   if u ∉ {0, 1, ..., m−1} or v ∉ {0, 1, ..., m−1}.   (3)
6) Immediate rewards

For each action, we define an immediate reward or cost that measures the degree of optimization resulting from that action. We use the visual score of the selected camera as a positive reward and the camera switching cost c_cost, which represents the visual jitter caused by frequent switching, as a negative reward. Therefore, the immediate reward after camera selection is defined as

R(S_t, a_t^i) = τ s_t^i − (1 − τ) c_cost (1 − δ(a_{t−1}^i)),   (4)

where τ ∈ [0, 1] is the weight coefficient between the positive and negative rewards, and δ(a_{t−1}^i) = 1 if the camera selected at time step t was also selected at time step t−1, and δ(a_{t−1}^i) = 0 otherwise; the switching cost is therefore incurred only when the selected camera changes.
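As a small, hedged sketch of the reward in (4): the function below applies the switching penalty only when the selected camera differs from the previous selection, which is our reading of the reconstructed equation; tau = 0.8 matches the weight used later in the experiments, while cCost = 1.0 and the scores in main are illustrative values.

```cpp
#include <cstdio>
#include <vector>

// Immediate reward of Eq. (4): weighted visual score of the chosen camera,
// minus a switching penalty when the selection differs from the previous step.
double immediateReward(const std::vector<double>& scores,  // normalized scores s_t^k
                       int chosenCam, int previousCam,
                       double tau = 0.8, double cCost = 1.0) {
    double sameCamera = (chosenCam == previousCam) ? 1.0 : 0.0;   // delta(a_{t-1}^i)
    return tau * scores[chosenCam] - (1.0 - tau) * cCost * (1.0 - sameCamera);
}

int main() {
    std::vector<double> s = {0.62, 0.40, 0.55, 0.71};
    std::printf("stay on C0  : %.3f\n", immediateReward(s, 0, 0));
    std::printf("switch to C3: %.3f\n", immediateReward(s, 3, 0));
    return 0;
}
```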
Given the belief state b(s) for the camera system state s at time step t, the scheduling agent attempts to maximize the total reward V^π*(b(s)) by selecting the best camera a_t^k according to the optimal policy π*. This condition is represented as

V_t^π*(b(s)) = max_{a_t^k ∈ A} { R(b(s), a_t^k) + γ Σ_{s'∈S} p(s, a_t^k, s') V_{t−1}^π*(b(s')) },   (5)

where γ ∈ [0, 1] is a discount factor that controls the impact of future rewards so that the effect of a reward decays exponentially with elapsed time. If o is the observation received after action a, the next belief state b_a^o(s') is calculated on the basis of Bayes' theorem as follows:

b_a^o(s') = p(o | s', a) Σ_{s∈S} p(s' | s, a) b(s) / p(o | a, b),   (6)

where p(o | a, b) = Σ_{s'∈S} p(o | s', a) Σ_{s∈S} p(s' | s, a) b(s) is a normalizing constant.
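A minimal sketch of the belief update in (6), written against generic |S| × |S| transition and observation tables; the row/column conventions and the two-state toy example in main are our own illustration rather than the implementation used in our system.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// One Bayesian belief update, Eq. (6): b'(s') = O[s'][o] * sum_s T[s][s'] * b(s),
// normalized by p(o | a, b). Here T[s][s'] = p(s'|s) and O[s'][o] = p(o|s').
std::vector<double> beliefUpdate(const std::vector<double>& b,
                                 const std::vector<std::vector<double>>& T,
                                 const std::vector<std::vector<double>>& O,
                                 int o) {
    const std::size_t n = b.size();
    std::vector<double> next(n, 0.0);
    double norm = 0.0;                               // p(o | a, b)
    for (std::size_t sp = 0; sp < n; ++sp) {
        double predicted = 0.0;                      // sum_s p(s'|s) b(s)
        for (std::size_t s = 0; s < n; ++s) predicted += T[s][sp] * b[s];
        next[sp] = O[sp][o] * predicted;
        norm += next[sp];
    }
    if (norm > 0.0) for (double& x : next) x /= norm;
    return next;
}

int main() {
    // Two-state toy example: a sticky transition model and a noisy observation model.
    std::vector<std::vector<double>> T = {{0.8, 0.2}, {0.3, 0.7}};
    std::vector<std::vector<double>> O = {{0.9, 0.1}, {0.2, 0.8}};
    std::vector<double> b = {0.5, 0.5};
    b = beliefUpdate(b, T, O, /*observation=*/0);
    std::printf("b = [%.3f, %.3f]\n", b[0], b[1]);
    return 0;
}
```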
Formula (5) can usually be computed iteratively using dynamic programming, but the computational complexity increases exponentially with the scale of the problem. Therefore, a direct solution of our POMDP is infeasible, and we instead use the Perseus method [12], a point-based approximation. We randomly sample a number of belief points b as a belief set and compute the reward values for this set. The results are saved as a set of value vectors {α_1, α_2, ..., α_n}, in which each vector α_i is associated with a selected action a_i and contains the same number of components as the state space. When the central controller selects the best camera online, it transforms the observed states into belief states and makes its decision with the following relation:

a* = argmax_{a_i ∈ A} (b(s') · α_i).   (7)

The value vector α_i whose inner product with the current belief state is maximal is selected, and the corresponding action a_i designates the best camera.
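The online lookup in (7) reduces to an inner-product search over the stored value vectors. The sketch below illustrates that step with one alpha vector per action for simplicity; it is not the Perseus solver itself, and the numbers in main are toy values.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Online selection step of Eq. (7): return the index of the action whose alpha
// vector maximizes the inner product with the current belief.
int selectAction(const std::vector<double>& belief,
                 const std::vector<std::vector<double>>& alpha) {
    int best = 0;
    double bestValue = -1e300;
    for (std::size_t i = 0; i < alpha.size(); ++i) {
        double value = 0.0;
        for (std::size_t s = 0; s < belief.size(); ++s) value += alpha[i][s] * belief[s];
        if (value > bestValue) { bestValue = value; best = static_cast<int>(i); }
    }
    return best;
}

int main() {
    // Three candidate cameras/actions over a three-state belief (toy numbers).
    std::vector<std::vector<double>> alpha = {{1.0, 0.2, 0.1},
                                              {0.3, 0.9, 0.2},
                                              {0.1, 0.3, 0.8}};
    std::vector<double> belief = {0.1, 0.3, 0.6};
    std::printf("selected camera index: %d\n", selectAction(belief, alpha));
    return 0;
}
```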
III. VISUAL INFORMATION MEASURE

In this section, we propose a measure that evaluates the quality of the image captured by a camera by extracting features indicating motion, properties of objects, and special events, which are expressed as a motion score S_m^i, an object score S_obs^i, and an event score S_e^i, respectively. The final visual score is then calculated as

S^i = w_1 S_m^i + w_2 S_obs^i + w_3 S_e^i,   (8)

where w_1, w_2, and w_3 are weight coefficients such that Σ_{i=1}^{3} w_i = 1. To simplify the exposition that follows, we denote the motion score, object score, and event score without the camera superscript as S_m, S_obs, and S_e, respectively.
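For concreteness, a one-line helper for the weighted combination in (8); the default weights are the values used later in the experiments (w1 = 0.5, w2 = 0.2, w3 = 0.3) and are otherwise application dependent.

```cpp
// Overall per-camera visual score of Eq. (8); weights are assumed to sum to 1.
double visualScore(double Sm, double Sobs, double Se,
                   double w1 = 0.5, double w2 = 0.2, double w3 = 0.3) {
    return w1 * Sm + w2 * Sobs + w3 * Se;
}
```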
A. Global Motion

The degree of motion in a video stream reflects the ability of the camera to capture real-world changes. We adopt the method presented in [13] for detecting moving objects while mitigating the effects of illumination and shadows. We then determine the degree of global motion in a video frame by calculating the ratio of the foreground area to the area of the entire image. We assume that this ratio increases significantly when a new object enters the scene or when objects are close to the camera. However, a ratio exceeding a certain threshold implies that noise has been introduced by factors such as illumination. Thus, we score the global motion contained in a binary motion image as follows:
S_m = r/λ,   if r ≤ λ;
S_m = (1 − r)/(1 − λ),   if r > λ,   (9)
where r is the ratio of the area of the moving objects to that of the entire image, and λ is the best ratio required by the application.
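As a hedged sketch of this score, the snippet below computes the foreground ratio from a binary motion mask and applies (9). OpenCV's MOG2 background subtractor stands in for the background model of [13], λ = 0.6 matches the value used in the experiments, and the video source and thresholding are illustrative choices.

```cpp
#include <opencv2/opencv.hpp>
#include <cstdio>

// Global motion score of Eq. (9) from a binary foreground mask.
double globalMotionScore(const cv::Mat& fgMask, double lambda = 0.6) {
    double r = static_cast<double>(cv::countNonZero(fgMask)) / fgMask.total();
    return (r <= lambda) ? r / lambda : (1.0 - r) / (1.0 - lambda);
}

int main() {
    cv::VideoCapture cap("camera0.avi");                     // illustrative input stream
    auto bg = cv::createBackgroundSubtractorMOG2();          // stand-in for the model of [13]
    cv::Mat frame, mask;
    while (cap.read(frame)) {
        bg->apply(frame, mask);
        cv::threshold(mask, mask, 127, 255, cv::THRESH_BINARY); // drop MOG2 shadow labels
        std::printf("Sm = %.3f\n", globalMotionScore(mask));
    }
    return 0;
}
```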
B. Object Properties

Individuals and special objects are the most attractive elements in video surveillance applications. To properly measure the properties of the objects in a video, we focus on the motion saliency of individual objects and the degree to which the objects in a scene occlude one another. To detect the objects in a scene and properly separate objects that partially overlap, we use the method proposed by Hu et al. [14] to extract bounding boxes for objects even when they overlap. We then use the boxes to assign scores on the basis of local motion saliency and the occlusion between overlapping objects. Finally, we calculate an overall score for the object properties as a weighted sum of the two scores:

S_obs = (β / |OBS|) Σ_{k∈OBS} S_l^k + (1 − β) S_oc,   (10)

where S_l^k is the local motion saliency score for object k, |OBS| is the number of objects in the scene, S_oc is the occlusion score, and β is a weight factor.
1) Local motion

For general purposes, frame-to-frame changes can be calculated to detect motion in a video, but this approach cannot measure comprehensive changes across multiple time steps, particularly salient details at the local spatial–temporal level. In addition, methods based on pixels or blobs are subject to noise caused by illumination. Therefore, we use the Harris 3D space-time interest point method [15] to detect salient motion within the bounding boxes of objects and set the saliency measure S_l^k of the k-th object to the normalized number of 3D interest points detected in its bounding box. Although the interest point method is computationally expensive, we can control the computation by limiting the search to a small region. Our experiments show that this method can be applied effectively in video surveillance applications.

2) Object occlusion

One of the most serious issues in video surveillance, tracking, and other applications is occlusion between objects. In camera networks, selecting the camera with the least occlusion is an effective solution. For this purpose, we measure the degree of occlusion between the objects in a scene on the basis of the intersections of the bounding boxes of different targets and use it to assign a score reflecting the occlusion: the larger the intersection areas, the lower the score. Therefore, we define the occlusion score as

S_oc = 1,   if |OBS| ≤ 1;
S_oc = 1 − Σ_{i=1}^{|OBS|} Σ_{j=i+1}^{|OBS|} #(Rc_i ∩ Rc_j) / min(#Rc_i, #Rc_j),   if |OBS| > 1,   (11)

where Rc_i denotes the bounding box of the i-th object and # denotes the area of a box.
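A minimal sketch of the occlusion score in (11) using OpenCV rectangles as the bounding boxes; the example boxes in main are made-up values, and the score is not clamped here, so it can become negative when many boxes overlap heavily.

```cpp
#include <opencv2/core.hpp>
#include <algorithm>
#include <cstdio>
#include <vector>

// Occlusion score of Eq. (11): 1 when at most one object is present, otherwise
// reduced by the pairwise box overlap relative to the smaller box of each pair.
double occlusionScore(const std::vector<cv::Rect>& boxes) {
    if (boxes.size() <= 1) return 1.0;
    double overlap = 0.0;
    for (std::size_t i = 0; i < boxes.size(); ++i)
        for (std::size_t j = i + 1; j < boxes.size(); ++j) {
            double inter   = static_cast<double>((boxes[i] & boxes[j]).area());  // intersection area
            double minArea = static_cast<double>(std::min(boxes[i].area(), boxes[j].area()));
            if (minArea > 0.0) overlap += inter / minArea;
        }
    return 1.0 - overlap;
}

int main() {
    std::vector<cv::Rect> boxes = { {100, 80, 40, 90}, {120, 85, 40, 95} }; // two overlapping people
    std::printf("Soc = %.3f\n", occlusionScore(boxes));
    return 0;
}
```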
C. Event Detection

It is necessary to detect events of interest in videos; in this paper, we focus on the entrance of new objects. We determine an entrance region in the image, called the "inner region," either by predefining it or through training, and we monitor the ratio r1 of the area of motion in the inner region to the entire area of the inner region. When r1 exceeds a threshold Th1, an object has possibly entered the scene. To avoid false detections caused by the movement of individuals already present in the scene, we introduce an external region around the entrance, called the "outer region," and determine the ratio r2 of the area of motion in the outer region to the area of the entire outer region. When r2 exceeds a threshold Th2, the motion at the entrance is assumed to be the motion of an object that is already in the scene. An entrance event is also related to time. Therefore, we set

S_e = e^(−σT²),   if r1 > Th1 and r2 < Th2;
S_e = 0,   otherwise,   (12)

where σ is a decay parameter and T is the time that has elapsed since the condition for a new object entering the scene was satisfied, so that the event score is gradually lowered as time elapses.
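A hedged sketch of the entrance-event score in (12) is given below. The inner and outer rectangles are the regions used in the Terrace1 experiment; the thresholds Th1 and Th2, the decay parameter sigma, and the timing bookkeeping are illustrative assumptions, and in practice the outer ratio may be computed on the ring outside the inner region rather than on the full outer rectangle.

```cpp
#include <opencv2/core.hpp>
#include <cmath>

// Entrance-event score of Eq. (12), driven by a binary motion mask fgMask.
struct EntranceDetector {
    cv::Rect inner{257, 14, 32, 80};     // inner region used in the Terrace1 experiment
    cv::Rect outer{245, 9, 57, 102};     // outer region used in the Terrace1 experiment
    double Th1 = 0.3, Th2 = 0.2;         // assumed thresholds, not taken from the paper
    double sigma = 0.05;                 // assumed decay parameter
    double T = -1.0;                     // time since the entrance condition last fired

    double score(const cv::Mat& fgMask, double dt) {
        double r1 = static_cast<double>(cv::countNonZero(fgMask(inner))) / inner.area();
        double r2 = static_cast<double>(cv::countNonZero(fgMask(outer))) / outer.area();
        if (r1 > Th1 && r2 < Th2) T = 0.0;        // a new object has entered the scene
        else if (T >= 0.0)        T += dt;        // keep decaying an earlier detection
        return (T >= 0.0) ? std::exp(-sigma * T * T) : 0.0;
    }
};
```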
IV. EXPERIMENTS AND RESULTS

A. Experimental Setup

We conducted experiments on a personal computer to simulate the camera selection process. We generated the optimal policies offline using the Perseus point-based method, implemented in C++, quantized the camera scores on an eight-level scale (m = 8), and saved the optimal policies as value vectors. The visual feature extraction and evaluation were also implemented in C++ with the OpenCV library. We used publicly available datasets composed of sequences from POM [16], PETS 2006 [17], and HUMANEVA [18] to evaluate the performance of our method. The POM and PETS 2006 datasets use four cameras, while the HUMANEVA dataset uses seven cameras.

B. Visual Information Measurement and Evaluation

In this section, we evaluate the visual information scores obtained with our method for the POM Terrace1 sequence. We denote the images from the four cameras as C0, C1, C2, and C3. In the scene captured by frames 1 through 700 of the Terrace1 sequence, no individual is initially present; thereafter, one person enters the scene, followed by two people with no occlusion, two people occluding each other, and eventually three people present in the scene. As an example, Fig. 1(a)-(d) shows the original images for frame 190, in which one person is present, Fig. 1(e)-(h) shows the binary motion images for this frame, and Fig. 1(i)-(l) shows the results of salient point detection. The respective saliency scores for the four cameras are 0.76, 0.04, 0.4, and 0.56, which accurately reflect the significant local motion in the frame. Figure 2 shows the results of scoring the occlusion in frame 498 (in which occlusion is present) using the method presented in Section III.B.
[Figure 1. Results of motion detection and salient point detection for the Terrace1 sequence (frame 190): (a)-(d) original images, (e)-(h) motion detection results, and (i)-(l) salient points for cameras C0-C3.]
[Figure 2. Occlusion scores for the Terrace1 sequence (frame 498): (a) C0 S_oc = 0.56, (b) C1 S_oc = 1, (c) C2 S_oc = 1, (d) C3 S_oc = 0.72.]
[Figure 3. Frames 232 and 513 from the POM Terrace1 sequence: (a)-(d) frame 232 and (e)-(h) frame 513 for cameras C0-C3.]
The occlusion scores for both C1 and C2 are 1, which means that these cameras did not detect any occlusion, while the scores for C0 and C3 are 0.56 and 0.72, respectively.

To detect an object entrance event, we manually set the inner region to a rectangle with top-left corner (257, 14) and width and height of 32 and 80 pixels, respectively, and the outer region to a rectangle with top-left corner (245, 9) and width and height of 57 and 102 pixels, respectively. For frames 70 and 330, the number of motion pixels varies significantly; someone is entering the scene in frame 70, whereas frame 330 contains only the motion of an existing object. Figure 4 shows the event detection results and scores for these two frames.

After several experiments and analyses, the parameters λ, β, w_1, w_2, and w_3 were set to the values reported below. Figure 5 shows the trends of the resulting scores for global motion, the properties of objects in the scene, and special events.
[Figure 4. Detection results for the entrance of a new object (frames 70 and 330): (a) S_e = 0.641 for frame 70, (b) S_e = 0 for frame 330.]
The global motion score curves with λ = 0.6, shown in Fig. 5(a), reflect the ability of each camera to capture the motion in the scene, as well as the number of objects and their distance from the camera. The numbers of salient points for the different views are shown in Fig. 5(b). The object property scores with β = 0.6 in Fig. 5(c) indicate that the cameras detect the same significant object properties when global motion plays a dominant role in the videos. Moreover, when the cameras provide similar global information, the local motion saliencies differ owing to the different camera orientations and the relative directions of motion of the objects from these perspectives (for example, the scores for frames 160 to 200 in Fig. 5(c)), and the occlusion between objects is appropriately detected. The curves in Fig. 5(d) show the entrance event scores, which correctly indicate that two people entered the scene during this period. Finally, Fig. 5(e) displays the overall visual information score curves with the weights w_1, w_2, and w_3 set to 0.5, 0.2, and 0.3, respectively. These results imply that our evaluation of visual information in the analyzed video streams appropriately reflects changes in the scene as well as the details and special events of interest to observers.
[Figure 5. Visual information measurement results: (a) global motion scores, (b) numbers of local salient points, (c) object property scores (β = 0.6), (d) entrance event scores for C2, (e) overall visual information scores.]
C. Selection Results and Analysis

Our method selects the best camera on the basis of the visual scores obtained in Section III.B and the optimal policy computed offline. First, we discuss cases in which occlusion is not present in any view. For example, frame 190 has the visual information scores 0.2412, 0.1302, 0.1584, and 0.1649 for the four cameras shown in Fig. 1. In this case, our method selected camera C0. Fig. 3 shows that frame 232 has visual scores of 0.1378, 0.1382, 0.1593, and 0.1372 for the four cameras, and our method selected camera C2. Thus, in the absence of occlusion, a view with significant motion or a clear, frontal view is selected. Next, we consider cases in which occlusion occurs in some views. For example, for frame 498, shown in Fig. 2, the visual scores of the cameras are 0.2940, 0.1692, 0.1713, and 0.2028. Our method selected camera C0, which has a clearer display and more significant motion, even though partial occlusion occurs. For frame 513 in Fig. 3(e)-(h), camera C0 offers the highest ratio of foreground pixels (the global motion scores of the cameras are 0.1777, 0.1367, 0.1074, and 0.0901), and occlusion occurs in some camera views (the occlusion scores are 0.4022, 1, 1, and 0.0882). The final visual scores in this case are 0.1754, 0.1919, 0.1961, and 0.0745. According to these scores, camera C1 or C2 should be selected, and camera C2 has the higher score; however, our method selected camera C1 on the basis of the previous selection results and the trends of the visual scores, which favors smoother visual effects.

We compared our camera selection method (POMDP) with state-of-the-art camera selection methods based on greedy selection of the maximum visual score (Max), game theory [4] (LYM), and Dynamic Bayesian Networks [5] (DBN). Our experiments used the visual information scores presented in Section III as the measures for Max and POMDP, and the weight coefficient for the immediate reward was τ = 0.8. Figures 6(a) and 6(b) show the camera selection results for the video sequences POM Terrace1 and HUMANEVA Walk1; the best camera was selected for each frame of the POM Terrace1 sequence on the basis of the camera quality measures described in Section III.B, and the best camera was selected for the other sequences at intervals of 5 frames. The camera selection results indicate that LYM and DBN produce some frequent camera switches owing to false selections, because errors were introduced in the motion detection process when two people are present in the scene. Moreover, the LYM method was especially prone to frequent switching when the person utility approaches zero. In contrast, our method effectively predicted the future trends of the visual information scores on the basis of their history, which reduced the number of false selections and resulted in smoother visual effects.
To confirm the effectiveness of our camera selection method, we conducted a user study involving 20 volunteers. Each participant was asked to select a camera for each 5-frame interval in the POM, PETS 2006, and HUMANEVA sequences, and we used the user preference for each interval as the selected camera for that interval. A camera scored 0.05 for each individual vote, and we summed the scores for the cameras, so a higher objective score reflects a closer correspondence to the preferences of the testers. Table I compares the objective scores for the different methods. In this table, the row labeled "Total" represents the total objective score for an entire sequence and equals the total number of frames of the sequence. Our proposed method has higher objective scores than the other methods, and its selections are closer to the user preferences.
While the objective scores of all methods for the PETS 2006 dataset are low because several objects in the scene dispersed the votes, the selections of our method correspond to the majority of the testers' choices.
[Figure 6. Selection results for the different methods: (a) POM Terrace1 selection results, (b) HUMANEVA Walk1 selection results.]

TABLE I. OBJECTIVE SCORES FOR DIFFERENT CAMERA SELECTION METHODS

             POM                       PETS2006               HUMANEVA
Method       Terrace1   Passage Way1   S1-T1-C    S7-T6-B     Walk1      Jog1
Total        700        2500           3020       3400        1200       740
DBN          565.25     1504.50        1770.25    1731.50     868.75     523.25
LYM          555.25     1454.50        1646.00    1695.25     802.50     505.00
Max          551.00     1598.50        1751.75    1693.00     833.00     525.50
POMDP        596.75     1677.75        1809.75    1765.50     872.25     546.50
With regard to computational efficiency, because the optimal policy is computed offline, real-time selection can be achieved with a reasonable choice of the time step interval Δt, for example, one or two seconds. Although the offline policy computation requires considerable computation and memory, we sampled the belief set five times, randomly drawing 60 points each time, for the video sequence datasets with four cameras; thus, 300 points were sampled in total for these datasets. With approximately 35 iterations required for the convergence of each sampling, an optimal policy could be calculated in approximately four hours. When we reduced the number of sampled points to 30 and increased the number of samplings to ten for the HUMANEVA dataset, which uses seven cameras, the optimal policy was computed in approximately 12 hours.

V. CONCLUSIONS
In this paper, a novel camera selection method for camera networks is presented, and the experimental results show that our proposed POMDP-based method has a higher degree of accuracy and is more stable than other methods. Our future research plans include studying the selection of multiple cameras under the constraints of network bandwidth and computing power. Also, because our method focuses on local details and does not consider the different levels of semantic information captured by cameras with different scales and angles, we plan to investigate the integration of global monitoring and local detail information.
REFERENCES

[1] S. Soro and W. Heinzelman, "A survey of visual sensor networks," Advances in Multimedia, pp. 1-22, 2009.
[2] Y. Mo, R. Ambrosino, and B. Sinopoli, "Sensor selection strategies for state estimation in energy constrained wireless sensor networks," Automatica, vol. 47, no. 7, pp. 1330-1338, 2011.
[3] A. Gupta, A. Mittal, and L. S. Davis, "COST: An approach for camera selection and multi-object inference ordering in dynamic scenes," in Proc. IEEE Int. Conf. on Computer Vision, Rio de Janeiro, 2007, pp. 1-8.
[4] Y. Li and B. Bhanu, "Utility-based camera assignment in a video network: A game theoretic framework," IEEE Sensors Journal, vol. 11, no. 3, pp. 676-687, 2009.
[5] F. Daniyal, M. Taj, and A. Cavallaro, "Content and task-based view selection from multiple video streams," Multimedia Tools and Applications, vol. 46, pp. 235-258, 2010.
[6] A. D. Bimbo and F. Pernici, "Towards on-line saccade planning for high-resolution image sensing," Pattern Recognition Letters, vol. 27, no. 15, pp. 1826-1834, 2006.
[7] J. Park, P. C. Bhat, and A. C. Kak, "A look-up table based approach for solving the camera selection problem in large camera networks," in Proc. ACM Workshop on Distributed Smart Cameras, 2006.
[8] E. Monari and K. Kroschel, "A knowledge-based camera selection approach for object tracking in large sensor networks," in Proc. Third ACM/IEEE Int. Conf. on Distributed Smart Cameras, Como, 2009.
[9] C. Shen, C. Zhang, and S. Fels, "A multi-camera surveillance system that estimates quality-of-view measurement," in Proc. IEEE Int. Conf. on Image Processing, San Antonio, 2007, pp. III-193-III-196.
[10] L. Tessens, M. Morbee, L. Huang, et al., "Principal view determination for camera selection in distributed smart camera networks," in Proc. Second ACM/IEEE Int. Conf. on Distributed Smart Cameras, 2008, pp. 1-10.
[11] M. Littman, "A tutorial on partially observable Markov decision processes," Journal of Mathematical Psychology, vol. 53, no. 3, pp. 119-125, 2009.
[12] M. Spaan and N. Vlassis, "Perseus: Randomized point-based value iteration for POMDPs," Journal of Artificial Intelligence Research, vol. 24, pp. 195-220, 2005.
[13] S. Zhang, H. Ding, and W. Zhang, "Background modeling and object detection based on two-model," Journal of Computer Research and Development, vol. 48, no. 11, pp. 1983-1990, 2011.
[14] W. Hu, M. Hu, X. Zhou, T. Tan, et al., "Principal axis-based correspondence between multiple cameras for people tracking," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 663-671, 2006.
[15] I. Laptev, "On space-time interest points," International Journal of Computer Vision, vol. 64, pp. 107-123, 2005.
[16] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua, "Multi-camera people tracking with a probabilistic occupancy map," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 267-282, 2008.
[17] X. Desurmont, R. Sebbe, F. Martin, et al., "Performance evaluation of frequent events detection systems," in Proc. 9th IEEE Int. Workshop on Performance Evaluation of Tracking and Surveillance, 2006.
[18] L. Sigal, A. O. Balan, and M. J. Black, "HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion," International Journal of Computer Vision, vol. 87, pp. 4-27, 2010.