Cross-Layer Optimization for Automated Video Surveillance

Mohammad Alsmirat
Department of Computer Science
Jordan University of Science and Technology
Irbid, Jordan
Email: [email protected]

Nabil J. Sarhan
Department of Electrical and Computer Engineering
Wayne State University
Detroit, Michigan 48202, USA
Email: [email protected]

Abstract—This paper develops an accuracy-based cross-layer optimization solution for wireless automated video surveillance systems, in which multiple sources stream videos to a central proxy station. The proposed solution manages the application rates and transmission opportunities of various video sources based on the dynamic network conditions in such a way that maximizes the overall detection accuracy of the computer vision algorithm(s). The proposed solution utilizes a novel Proportional Integral Differential (PID) controller approach for estimating the effective medium airtime. We demonstrate the effectiveness of the proposed solution through extensive experiments.

Index Terms—Automated Video Surveillance, Bandwidth Allocation, Computer Vision Algorithm, Cross-Layer Optimization, Detection Accuracy, Effective Airtime Estimation, Wireless Networks, WLAN.


I. INTRODUCTION

The interest in video surveillance systems has grown dramatically. Automated Video Surveillance (AVS) serves as an efficient approach for the real-time detection of threats and for monitoring their progress. Most research on AVS has focused on developing robust computer vision algorithms for the detection, tracking, and classification of objects and the detection and classification of unusual events [1], [2] (and references therein). Much less work, however, has considered the scalability and cost of video surveillance systems. The scalability-cost problem arises because increasing the coverage by employing additional video sources increases the required bandwidth and thus the computational capability needed to process all these video streams. (The fact that increasing the bandwidth also increases the computational cost applies to most practical circumstances.) Even when a distributed processing architecture is used, the cost of such a system can still be a major concern because computer vision algorithms are computationally intensive. Power consumption is another major problem, especially in battery-powered (wireless) video sources. Considering that video sensors consume orders of magnitude more resources than scalar sensors, reducing power consumption is essential even when power is readily available [3].

This paper considers an AVS system in which multiple video sources capture and send video streams to a central station over a single-hop IEEE 802.11e wireless LAN (WLAN). As shown in Figure 1, the wireless video sources (video cameras or sensors) share the same medium and can be either battery-powered or outlet-powered. The central proxy station is connected with a high-bandwidth link to the access point, and thus this link is not deemed a bottleneck. The proxy station runs computer vision algorithms to generate automated alerts whenever suspicious threats, events, or subjects are detected in the monitored site. Large systems can be composed of multiple such systems or cells.

This work was supported in part by NSF grant CNS-0834537. Part of the work was done when the first author was at Wayne State University.

Fig. 1: A Wireless Video Surveillance System [The figure shows wireless surveillance cells connected through a backbone network; within each cell, the video sources share the wireless medium with an access point, which connects to the proxy station through a high-speed link.]

The main objective of this study is to provide a cross-layer optimization solution that dynamically distributes and allocates the available network bandwidth among the various video sources in such a way that optimizes the overall weighted detection accuracy. This solution utilizes and controls parameters in the application, link, and physical layers. A few studies [4], [5] considered distortion-based optimization in general video streaming systems, where videos are delivered from many sources to one destination. Our approach differs in that we seek to optimize the overall threat, event, or subject detection or recognition accuracy, which is the single most important metric in AVS. To the best of our knowledge, this is the first solution that optimizes the detection/recognition accuracy. Recent studies suggest that computer vision algorithms are not as sensitive as humans to video quality [6], [7], and thus our proposed accuracy-based optimization approach is expected to be highly effective in AVS systems. Therefore, we formulate the bandwidth allocation problem as a cross-layer optimization problem of the sum of the weighted detection accuracy (or alternatively the sum of the weighted detection error), subject to a constraint on the total available bandwidth. The weights represent the current importance levels of the various video sources and can depend on many factors, including the potential threat level, the placement of the video sources, and the importance of the monitored locations. The total available bandwidth can be modeled by the effective medium airtime, which is the fraction of the network time that is used for delivering useful data. Study [5] developed an online and dynamic effective airtime estimation algorithm, which addresses the limitations of the analytical models in [4], [8], but it may not always converge to an accurate value.

The rest of this paper is organized as follows. Section II presents the proposed cross-layer optimization framework. Subsequently, Section III discusses the performance evaluation methodology. Finally, Section IV presents and analyzes the main results.

II. PROPOSED ACCURACY-BASED CROSS-LAYER OPTIMIZATION SOLUTION

Our main objective is to provide a cross-layer optimization solution that dynamically distributes and allocates the available network bandwidth among the various video sources in such a way that optimizes the overall weighted detection accuracy. This solution utilizes and controls parameters in the application, link, and physical layers. As shown in Figure 1, the system has S ≥ 1 video sources, and each source s streams a different encoded video at rate r_s. Each video source s may have a different physical rate (y_s) and importance level or weight (w_s). The weight represents the importance level of a video source at the current time and depends on many factors, including the potential threat level, the placement of the video sources, and the importance of the monitored locations. To keep the paper focused, we initially assume that the weights are predetermined.

A. Optimization Problem Formulation

The main idea of this study is to formulate the bandwidth-management problem as a cross-layer optimization problem of the overall weighted error in the detection accuracy of the computer vision algorithm running at the central proxy station. Since all video sources share the same medium, the bandwidth allocation solution should determine the fraction of airtime (f_s) that each video source receives, with the total not exceeding the effective medium airtime (A_{eff}). Specifically, the problem can be formulated as follows:

$$\text{Find } F^* = \arg\min_{F} \sum_{s=1}^{S} w_s \times AccuracyError_s(r_s), \tag{1a}$$

subject to

$$\sum_{s=1}^{S} f_s = A_{eff}, \tag{1b}$$
$$r_s = f_s \times y_s, \tag{1c}$$
$$0 \le f_s \le 1, \tag{1d}$$
$$s = 1, 2, 3, \ldots, S, \tag{1e}$$

where F^* is the set of optimal fractions (f_s^*) of the airtime of all sources. The optimal application-layer rate of video source s is referred to as r_s^*. Since the solution requires characterizing the accuracy error and estimating the effective medium airtime, let us now discuss these two aspects and then the problem solution and the proposed pruning mechanism.
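To make the formulation concrete, the following Python sketch solves Equation (1) numerically for a toy configuration. It plugs the rate-accuracy model of Equation (2) into the objective; all constants (weights, physical rates, a, b, c, and A_eff) are hypothetical placeholders, and a general-purpose solver stands in for the closed-form solution derived later in Section II-D.

    # A minimal numerical sketch of Equation (1); not the paper's implementation.
    import numpy as np
    from scipy.optimize import minimize

    S = 4                                   # number of video sources
    w = np.array([1.0, 0.8, 0.5, 0.3])      # importance weights w_s (assumed)
    y = np.array([54e6, 48e6, 36e6, 24e6])  # physical rates y_s in bps (assumed)
    tau = np.full(S, 20.0)                  # frame rates tau_s (fps)
    a, b, c = 0.02, -0.9, 0.05              # hypothetical model constants
    A_eff = 0.6                             # estimated effective airtime

    def objective(f):
        Z = f * y / tau / 8e6               # frame size Z_s = r_s / tau_s, in MB
        return np.sum(w * (a * Z**b + c))   # weighted accuracy error (1a)

    res = minimize(
        objective,
        x0=np.full(S, A_eff / S),           # start from an equal airtime split
        bounds=[(1e-9, 1.0)] * S,           # constraint (1d)
        constraints=[{"type": "eq", "fun": lambda f: f.sum() - A_eff}],  # (1b)
        method="SLSQP",
    )
    f_opt = res.x                           # optimal airtime fractions f_s*
    r_opt = f_opt * y                       # optimal application rates (1c)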

B. Rate-Accuracy Characterization

We seek to develop a model characterizing the relationship between the video bitrate and the accuracy error of computer vision algorithms. To keep the paper focused, we analyze only face detection. Experimenting with other computer vision algorithms is left for another study. We use the Viola-Jones algorithm [10] as implemented in OpenCV. We experiment with MJPEG surveillance video streams. Due to power consumption and other considerations, MJPEG is still used in many surveillance applications, especially in battery-operated cameras.

We determine the rate and accuracy error relationship based on a variety of image datasets, as described in Section III. For each image set, we use the IJG JPEG library to compress each image with quality factors ranging from 1 to 100, with 1 being the lowest, and then apply the computer vision algorithm on each image to calculate the accuracy using predefined ground truth about the locations of the faces in the image. We use two metrics for the detection accuracy: positive index and negative index. The positive index is the number of correctly detected faces divided by the total number of faces, whereas the negative index is the number of incorrectly detected faces divided by the total number of faces. Subsequently, we can find the average size, the average positive index, and the average negative index of all images with the same quality factor. The accuracy error can then be calculated as the sum of the total error: AccuracyError = (1 − PositiveIndex) + NegativeIndex.

For comparative purposes, we also perform rate-distortion characterization in a similar manner. We assess the distortion of each video frame against the uncompressed video frame using the Root Mean Square Error (RMSE) metric. The video distortion can then be calculated as the average frame distortion of all the frames in the video.

Figure 2 shows the results of the rate-accuracy and rate-distortion characterizations for MJPEG. The results with the other considered datasets, which are not shown in the figure, exhibit the same behavior. We determine that the accuracy error (AccuracyError) can be characterized as follows:

$$AccuracyError = a \times Z^{b} + c, \tag{2}$$

where Z is the frame size and a, b, and c are constants. Interestingly, the same model, but with different constant values, can be used to represent the distortion. This model is referred to as “Model” in Figure 2. The frame size Z can be calculated as Z = R/τ, where R is the video playback rate and τ is the video frame rate. The accuracy-based formulation in Equation (1a) uses the AccuracyError shown in Equation (2). The distortion model will be used to compare distortion-based optimization (the older approach proposed in [11]) with the accuracy-based optimization proposed in this paper.
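As an illustration of the per-image accuracy metrics described above, here is a minimal Python sketch using OpenCV's Viola-Jones cascade. The IoU-based matching of detections to ground-truth boxes is our assumption; the paper does not specify the matching criterion.

    # A sketch of the positive/negative index computation; requires opencv-python.
    import cv2

    def iou(boxA, boxB):
        """Intersection-over-union of two (x, y, w, h) boxes."""
        ax, ay, aw, ah = boxA
        bx, by, bw, bh = boxB
        ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
        iy = max(0, min(ay + ah, by + bh) - max(ay, by))
        inter = ix * iy
        union = aw * ah + bw * bh - inter
        return inter / union if union > 0 else 0.0

    def accuracy_error(image_path, ground_truth, iou_thresh=0.5):
        """Return (1 - PositiveIndex) + NegativeIndex for one image."""
        detector = cv2.CascadeClassifier(
            cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
        gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        detections = detector.detectMultiScale(gray, 1.1, 3)
        correct = sum(1 for d in detections
                      if any(iou(d, g) >= iou_thresh for g in ground_truth))
        positive_index = correct / len(ground_truth)
        negative_index = (len(detections) - correct) / len(ground_truth)
        return (1.0 - positive_index) + negative_index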


Fig. 2: Rate-Accuracy Characterization for MJPEG: (a) CMU/MIT; (b) Georgia Tech at 50% Resolution. [Each panel plots accuracy error versus frame size (MB), comparing the real data with the fitted model.]
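The constants a, b, and c in Equation (2) can be obtained by fitting the model to the measured (size, error) pairs, as in Figure 2. A minimal sketch with SciPy follows; the sample arrays are hypothetical placeholders for the averaged per-quality-factor measurements described above.

    # Fitting the rate-accuracy model of Equation (2) to measured data.
    import numpy as np
    from scipy.optimize import curve_fit

    def model(Z, a, b, c):
        return a * np.power(Z, b) + c       # AccuracyError = a * Z^b + c

    sizes = np.array([0.02, 0.04, 0.06, 0.08, 0.10])   # frame sizes Z (MB)
    errors = np.array([0.42, 0.24, 0.17, 0.13, 0.11])  # measured accuracy error

    (a, b, c), _ = curve_fit(model, sizes, errors, p0=(0.01, -1.0, 0.05),
                             maxfev=10000)
    print(f"a={a:.4f}, b={b:.4f}, c={c:.4f}")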

C. Effective Airtime Estimation

We propose an enhanced online and dynamic effective airtime estimation algorithm for wireless networks in the infrastructure mode. The algorithm utilizes a Proportional Integral Differential (PID) controller for adjusting the currently estimated value of the effective airtime to achieve a faster, more stable, and more accurate estimation. In addition, it runs continuously so that it can react quickly to any network changes. The PID controller adjusts the current effective airtime value based on the history and the rate of change of the error. The error depends on the dropping rate in the network. The PID controller has proportional, integral, and differential components, which are weighted by the constants K_P, K_I, and K_D, respectively. The proportional component changes the effective airtime based on the immediate value of the error. The integral component considers the past values of the error, whereas the differential component anticipates its future values, and thus they help reduce the steady-state error and the overshoot, respectively.

As shown in Figure 3, the algorithm proceeds as follows. First, it sets the throughput (t_s) of each video source s to the maximum medium physical rate divided by the number of sources. It then uses these throughput values to determine the initial value of the effective airtime as $A_{eff} = \sum_{s=1}^{S} t_s / y_s$. Subsequently, during a period called the estimation period, each video source s assesses its own data dropping rate (d_s) while sending its video stream, and then sends this information to the access point (AP). The AP then determines the overall average dropping ratio as $A_{\Delta} = \sum_{s=1}^{S} d_s / y_s$. The PID error is then calculated as $A_{thresh} - A_{\Delta}$, and A_eff is adjusted by the PID controller to eliminate the error. A_thresh controls the allowable dropping in the network, and it should be preset properly.

    while initialization period is not expired {
        Force each source to send at the max. physical rate divided by # sources;
        Update t_s value for each source;
    }
    A_eff = sum_{s=1..S} t_s / y_s;
    At the end of each estimation period {
        A_Delta = sum_{s=1..S} d_s / y_s;
        BeforeLastError = LastError;
        LastError = error;
        error = A_thresh - A_Delta;
        A_eff = A_eff + K_P * error - K_I * LastError + K_D * BeforeLastError;
    }

Fig. 3: Simplified PID Algorithm for Dynamically Estimating the Effective Airtime
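For concreteness, the following Python sketch mirrors the estimator in Figure 3. The state-report collection plumbing is omitted; the class only reproduces the arithmetic of the update rule, and the gain values are placeholders to be tuned with the procedure described next.

    # A sketch of the PID-based effective airtime estimator in Figure 3.
    class EffectiveAirtimeEstimator:
        def __init__(self, throughputs, phy_rates, a_thresh=0.005,
                     kp=1.0, ki=0.25, kd=0.25):
            # Initial estimate: A_eff = sum(t_s / y_s) from the warm-up period.
            self.a_eff = sum(t / y for t, y in zip(throughputs, phy_rates))
            self.a_thresh = a_thresh
            self.kp, self.ki, self.kd = kp, ki, kd
            self.error = 0.0
            self.last_error = 0.0

        def update(self, drop_rates, phy_rates):
            """Run once per estimation period with the sources' state reports."""
            a_delta = sum(d / y for d, y in zip(drop_rates, phy_rates))
            before_last_error = self.last_error      # shift the error history
            self.last_error = self.error
            self.error = self.a_thresh - a_delta     # PID error
            self.a_eff += (self.kp * self.error
                           - self.ki * self.last_error
                           + self.kd * before_last_error)
            return self.a_eff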

We use the following well-established procedure in control theory to tune the three PID parameters. (1) Set K_I and K_D to zero and increase K_P until the output oscillates, and then set K_P to half of that value. (2) Increase K_I until any offset is corrected in adequate time. (3) Increase K_D until the reference can be reached quickly after a load disturbance.

D. Cross-Layer Optimization Solution

After determining the accuracy error function and the effective airtime, we can now solve the problem formulated in Equation (1). Since the problem meets the conditions of budget-constrained convex programming problems, it can be solved using Lagrangian relaxation. Assuming that all video

sources have the same b_s, which is empirically valid, we get the following solution:

$$f_s^* = \left( \frac{-\lambda^* \tau_s}{w_s\, a_s\, b_s\, y_s\, (y_s/\tau_s)^{b_s - 1}} \right)^{1/(b_s - 1)}, \tag{3}$$

where

$$\lambda^* = \left( \frac{A_{eff}}{\sum_{s=1}^{S} \left( \frac{-\tau_s}{w_s\, a_s\, b_s\, y_s\, (y_s/\tau_s)^{b_s - 1}} \right)^{1/(b_s - 1)}} \right)^{b_s - 1}. \tag{4}$$
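A sketch of how the AP-side and source-side computations of Equations (3) and (4) might look is given below. The parameter values are hypothetical; b must be negative (as in the fitted rate-accuracy model), so that the bracketed terms stay positive.

    # A sketch of Equations (3) and (4); not the paper's implementation.
    import numpy as np

    def lambda_star(a_eff, w, a, b, y, tau):
        """AP side, Equation (4): assumes a common exponent b for all sources."""
        terms = (-tau / (w * a * b * y * (y / tau) ** (b - 1))) ** (1.0 / (b - 1))
        return (a_eff / terms.sum()) ** (b - 1)

    def airtime_fraction(lam, w, a, b, y, tau):
        """Source side, Equation (3); the application rate is then r_s = f_s * y_s."""
        base = -lam * tau / (w * a * b * y * (y / tau) ** (b - 1))
        return base ** (1.0 / (b - 1))

    w = np.array([1.0, 0.8, 0.5, 0.3])      # weights w_s (hypothetical)
    y = np.array([54e6, 48e6, 36e6, 24e6])  # physical rates y_s (bps)
    tau = np.full(4, 20.0)                  # frame rates tau_s (fps)
    a, b, A_eff = 0.02, -0.9, 0.6           # hypothetical model constants

    lam = lambda_star(A_eff, w, a, b, y, tau)
    f = airtime_fraction(lam, w, a, b, y, tau)
    assert np.isclose(f.sum(), A_eff)       # fractions sum to the effective airtime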

Therefore, the AP determines λ* using Equation (4) and sends it to all video sources in the network using the beacon packet. When a video source s receives λ*, it determines its fraction of the airtime f_s^* using Equation (3) and changes the application data rate (i.e., the video encoding rate) according to the equation r_s^* = f_s^* × y_s. The optimization results can be enforced by the link layer through changing the TXOP limit. As in [5], each video source determines its TXOP limit as the time required to transmit the packets that belong to a single video frame, along with all associated overhead.

III. PERFORMANCE EVALUATION METHODOLOGY

We use OPNET to simulate the network, using two types of video traffic: MJPEG and MPEG-4 Part 2. We implement a traffic source in OPNET that streams MJPEG videos as Real-time Transport Protocol (RTP) packets. Moreover, we implement a realistic video streaming client at the application layer of the proxy station. This client receives and reassembles the RTP packets from the various video streams, and then carries out error concealment [12] to mitigate the impact of packet loss. The use of MJPEG enables the use of standard image datasets that are suitable for surveillance applications. We use the following standard image sets to assemble the video streams: CMU/MIT, Georgia Tech., and SCFace. The frames are chosen randomly from the dataset used. The Georgia Tech. image set produces a highly limited range of bitrates for the studied network. To produce a better variety, we generate two image sets from the original Georgia Tech. image set by changing the resolution of each image in the set to 30% and 50% of the original resolution, respectively. At each video source, the streamer takes a bitrate, a frame rate, and an image set as inputs and produces a corresponding MJPEG video stream. We assume a frame rate of 20 fps.

We conduct experiments with networks of one AP, one proxy station co-located with the AP, and a variable number of video sources. The video sources start sending the video streams to the proxy station randomly within 1 second of the beginning of the simulation. Every 1 second thereafter, each video source sends its status update as a small packet, called a state report, to the AP. The state report includes the physical rate, the dropping rate (buffer + retransmission), and possibly the local accuracy error model parameters. These data are utilized by the AP to execute the effective airtime estimation algorithm and to determine λ* in the optimization solution. Table I summarizes the main simulation parameters.

We compare the proposed accuracy-based optimization solution, referred to in the results as Weighted Accuracy Optimization (WAO), with the following two solutions: (1) the cross-layer solution in [5], called Enhanced Distortion Optimization (EDO), which is the best performer among existing solutions; and (2) a hypothesized version of EDO that uses weights for the various video sources, which is referred to here as Weighted Distortion Optimization (WDO).

We analyze the following performance metrics: Weighted Accuracy, Overall Network Load, and Power Consumption. The weighted accuracy is the weighted detection accuracy of the video sources. The accuracy for each source is determined as the average accuracy of the received frames sent by that source, where the accuracy of a dropped video frame is taken to be 0. The overall network load is defined as the total load sent by the application layers of all video sources. Finally, the power consumption is the average power consumption of the wireless interfaces of the video sources and is determined by using the power consumption model in [13].
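To illustrate the weighted accuracy metric just defined, here is a minimal sketch. The paper does not state whether the weighted sum is normalized; normalizing by the total weight is our assumption, and the frame-accuracy values are hypothetical.

    # A sketch of the weighted accuracy metric. A dropped frame is recorded with
    # accuracy 0.0 and still counts in the per-source average. Normalizing by
    # the total weight is an assumption not stated in the text.
    def weighted_accuracy(per_source_frames, weights):
        """per_source_frames: one list of frame accuracies per source,
        with 0.0 recorded for each dropped frame."""
        total = sum(w * (sum(frames) / len(frames))
                    for frames, w in zip(per_source_frames, weights))
        return total / sum(weights)

    # Example: two sources; the second source's third frame was dropped.
    acc = weighted_accuracy([[0.9, 0.85, 0.8], [0.7, 0.75, 0.0]], [1.0, 0.5])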

TABLE I: Summary of Simulation Parameters

Parameter                 | Model/Value(s)
--------------------------|----------------------------------------------------
# video sources           | 16-76
Simulation Time           | 10 min
Packet Size               | 1024 bytes
Application Rate          | Optimized; Default = max physical rate / # sources
Video Frame Rate          | 20 frames/s
Physical characteristics  | Extended Rate (802.11g)
Physical Data Rate        | Random from {12, 18, 24, 36, 48, 54} Mbps
Weight                    | Random from five levels in [0, 1]
Buffer size               | 256 Kb
Video TXOP limit          | Optimized; Default = 3008 µs
Video CW                  | min = 15, max = 31
Retry Limit               | short = 7, long = 4
Others                    | Beacon Interval = 0.02 s; State Report Interval = 1 s

IV. RESULT PRESENTATION AND ANALYSIS

Through extensive experimentation, we observe that the best values of the PID parameters depend on the workload and the desired value of A_thresh. Table II summarizes these results.

TABLE II: Summary of PID Parameter Tuning

Workload                        | A_thresh | K_P | K_I  | K_D
--------------------------------|----------|-----|------|-----
Georgia Tech. at 30% Resolution | 0.001    | 1   | 0.25 | 0.25
Georgia Tech. at 30% Resolution | 0.005    | 1.5 | 0.25 | 0.25
Georgia Tech. at 50% Resolution | 0.001    | 0.5 | 0.25 | 0.25
Georgia Tech. at 50% Resolution | 0.005    | 1.5 | 0.25 | 0.25
CMU/MIT                         | 0.005    | 0.5 | 0.25 | 0.25

Let us now compare the effectiveness of the proposed effective airtime estimation algorithm, which utilizes the PID controller, with the estimator in [5]. Figure 4 compares the estimated effective airtime values over time for both algorithms, referred to as “PID Estimator” and “Existing Estimator”, respectively. Only the results for Georgia Tech. at 50% of the original resolution are shown because the other results exhibit similar behavior. The results demonstrate that the PID estimator converges much faster and has smaller overshooting and undershooting.

Figure 5 shows the average effective airtime versus the number of video sources for the three bandwidth allocation solutions. These results show that the effective airtime increases with the network size up to a point and then starts to decrease. The peak happens when the proper balance between node contention and network utilization is reached. The peak point in the figure is at a network size of 16, but this value varies with the total sending rate of the sources. In particular, the peak should occur at a smaller network size if the sources demand more bandwidth and at a larger network size if they demand less. Note that the three solutions yield close airtimes, especially for larger networks.


Fig. 4: Effectiveness of the Proposed PID-Based Estimation [Georgia Tech. at 50% Res., 32 sources, A_thresh = 0.005; effective airtime versus time (s) for the PID Estimator and the Existing Estimator]


Fig. 5: Comparing Various Solutions in Effective Airtime [MJPEG, Georgia Tech. at 30% Res., A_thresh = 0.005; effective airtime versus network size for EDO, WDO, and WAO]


Fig. 6: Impact of A_thresh with the Proposed Weighted Accuracy Optimization [MJPEG, Georgia Tech. at 30% Res.; (a) weighted accuracy and (b) normalized power consumption versus A_thresh for 52 and 56 sources]

Let us now discuss the impact of A_thresh, which determines the allowable packet dropping in the network, on the weighted accuracy and power consumption. Figure 6 shows the results of running the proposed WAO solution for two different network sizes. The results demonstrate that the weighted accuracy improves with A_thresh up to a certain point and then starts to worsen. The peak happens when A_thresh is smaller than 0.01, suggesting that the optimal accuracy is achieved when the dropping is very small. As expected, the power consumption increases with A_thresh because the sending rate increases with A_thresh. Therefore, A_thresh should be selected based on the proper tradeoff between accuracy and power consumption.

Figure 7 compares the effectiveness of the proposed cross-layer optimization solution (WAO) with EDO and WDO for Georgia Tech. at 50% resolution. The results with other combinations of datasets and A_thresh values exhibit the same behavior and thus are not shown. These results demonstrate that the proposed accuracy-based optimization solution outperforms the other solutions in all metrics. In particular, it improves both the weighted accuracy and the power consumption by up to 10%. The improvement in power consumption is due to reducing the application rates (network loads) of the various video sources. In addition, the results indicate that incorporating weights for the various video sources into the distortion-based optimization yields no noticeable improvement. On the contrary, it may worsen the weighted accuracy in some cases. Through a detailed analysis of the results, we observe that applying weights in WDO forces the video sources with low importance factors to send their video streams at extremely low bitrates, and in some cases, it prevents these sources from sending any video, leading to unacceptable levels of accuracy for these sources.


Fig. 7: Comparing Various Bandwidth Allocation Solutions [MJPEG, Georgia Tech. at 50% Resolution, A_thresh = 0.001; (a) weighted accuracy, (b) overall network load rate (bps), and (c) normalized power consumption versus network size for EDO, WDO, and WAO]

V. CONCLUSIONS AND FUTURE WORK

The main results can be summarized as follows. (1) The proposed solution significantly enhances both the detection accuracy and the power consumption, compared to distortion-based optimization. The reduction in power consumption is due to sending and dropping much less data. (2) The proposed effective airtime estimation algorithm is accurate and converges faster than the existing online estimation algorithm. In future work, we will develop a full AVS system prototype and experiment with a variety of computer vision algorithms.

REFERENCES

[1] Qiang Chen, Zheng Song, Jian Dong, Zhongyang Huang, Yang Hua, and Shuicheng Yan, “Contextualizing object detection and classification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 1, pp. 13–27, Jan. 2015.
[2] Kai-Wen Cheng, Yie-Tarng Chen, and Wen-Hsien Fang, “Gaussian process regression-based video anomaly detection and localization with hierarchical feature representation,” IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5288–5301, Dec. 2015.
[3] Wu-Chi Feng, Ed Kaiser, Wu Chang Feng, and Mikael Le Baillif, “Panoptes: Scalable low-power video sensor networking technologies,” ACM Trans. Multimedia Comput. Commun. Appl., vol. 1, no. 2, pp. 151–167, 2005.
[4] Cheng-Hsin Hsu and Mohamed Hefeeda, “A framework for cross-layer optimization of video streaming in wireless networks,” ACM Trans. Multimedia Comput. Commun. Appl. (TOMCCAP), vol. 7, no. 1, p. 5, 2011.
[5] Mohammad A. Alsmirat and Nabil J. Sarhan, “Cross-layer optimization and effective airtime estimation for wireless video streaming,” in Proc. of the 21st Int'l Conf. on Computer Communications and Networks (ICCCN 2012), July–Aug. 2012, pp. 1–7.
[6] Pavel Korshunov and Wei Tsang Ooi, “Video quality for face detection, recognition, and tracking,” ACM Trans. Multimedia Comput. Commun. Appl., vol. 7, no. 3, pp. 14:1–14:21, Sept. 2011.
[7] Yousef Sharrab and Nabil J. Sarhan, “Accuracy and power consumption tradeoffs in video rate adaptation for computer vision applications,” in Proc. of the IEEE Int'l Conf. on Multimedia & Expo (ICME 2012), Melbourne, Australia, July 2012, pp. 410–415.
[8] Ye Ge, Jennifer C. Hou, and Sunghyun Choi, “An analytic study of tuning systems parameters in IEEE 802.11e enhanced distributed channel access,” Computer Networks, vol. 51, no. 8, pp. 1955–1980, 2007.
[9] N. S. Shankar and M. van der Schaar, “Performance analysis of video transmission over IEEE 802.11a/e WLANs,” IEEE Transactions on Vehicular Technology, vol. 56, no. 4, pp. 2346–2362, July 2007.
[10] Paul Viola and Michael Jones, “Rapid object detection using a boosted cascade of simple features,” in Proc. of the IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR), 2001, vol. 1, pp. 511–518.
[11] Mohammad A. Alsmirat and Nabil J. Sarhan, “Cross-layer optimization and effective airtime estimation for wireless video streaming,” in Proc. of the 21st Int'l Conf. on Computer Communications and Networks (ICCCN 2012), Munich, Germany, July 2012.
[12] Shahram Shirani, Faouzi Kossentini, Samir Kallel, and Rabab Ward, “Reconstruction of JPEG coded images in lossy packet networks,” Technical Report, 1997.
[13] Yousef Sharrab and Nabil J. Sarhan, “Aggregate power consumption modeling of live video streaming systems,” in Proc. of ACM Multimedia Systems (MMSys 2013), Oslo, Norway, Feb. 2013.
