Automated Activity Detection as a Pre-processing Stage of Video Camera Feeds

G. Xiao 1, J. Jiang 1,2, and K. Qiu 1

1 Faculty of Computer & Information Science, Southwest University, China
2 Digital Media & Systems Research Institute, University of Bradford, UK
Abstract: In this paper, we describe a simple but effective and fast video activity detection algorithm that provides an efficient pre-processing tool for video camera feeds. The algorithm supports a range of applications, including post-production of TV camera video streams, analysis of surveillance videos, visual scene classification, and camera control for automated recordings. Given a video sequence, the algorithm detects moving pixels by modeling frame differences with a Gaussian distribution and estimating their statistics on-line to determine an adaptive threshold for the final classification of active or static scenes. The originality of our contribution can be highlighted as: (i) a powerful activity detection tool for videos to facilitate efficient visual content analysis; (ii) on-line estimation of statistics and adaptive determination of a threshold for automated classification of active or static scenes. Extensive testing supports that the proposed algorithm achieves excellent performance.
Indexing terms: automated video activity detection, video processing
1. Introduction
With the proliferation of video data and the growing application of multimedia and camera monitoring systems, there is an increasing need for advanced technologies and algorithms
for indexing, searching, analyzing, and pre-processing vast amounts of video, such as event detection, activity detection, and semantic concept extraction [1,2]. Practical content production from camera video streams normally requires significant pre-processing and editing effort, since raw videos often record relatively static scenes without much meaningful activity for long periods, and large sections of video sequences therefore need to be eliminated before any event or activity can be identified and edited into acceptable video programmes. At present, such issues are primarily dealt with by TV broadcasters via a manual process, which is not only time consuming but also cost intensive. Under LIVE [3], an EU-funded integrated project within the FP6 research programme, a new concept of interactive TV is introduced and viewers are provided with facilities for interacting with broadcasters. One such facility allows viewers to select camera feeds, access video content directly captured by cameras, and hence view scenes from different angles. To make such interaction attractive and appealing to viewers, one of the essential requirements is to pre-process those camera feeds in real time and ensure that the pre-processed raw videos accessed by viewers contain sufficient activity for entertainment purposes. Under HERMES [4], another EU-funded STREP project within the FP7 research programme, indoor activities such as meeting friends and domestic parties must be recorded on a 24/7 basis in order to generate sufficient visual content for extracting semantics and metadata towards computer-aided memory management for elderly people. Without any automated control, however, the recordings could consume enormous storage space, creating problems not only of high cost but also for their content analysis. Based on these application scenarios, we propose an automated activity detection algorithm to pre-process the camera-captured video
streams automatically and eliminate frames containing no activity from the process of content analysis and production. Relevant work reported in the existing literature can be roughly categorized into low-level and high-level approaches. While the former relies on low-level image processing techniques such as segmentation [5,6], content description [7], and feature extraction [8], the latter essentially adopts machine learning techniques, such as neural networks and SVMs, to perform the classification and detection. None of these reported techniques, however, can be directly applied to activity detection without some revision. The low-level approaches often use one or multiple thresholds for detection, and how to determine such thresholds adaptively for the input video frames remains an ongoing research problem. The machine learning approaches, on the other hand, rely on a training process to learn from the input and complete the detection. Their weakness lies in the fact that the training design is often difficult and subject to a number of uncertain factors, including the selection of training videos, the coverage of training aspects, and the characterization of various inputs. To this end, we propose a simple but effective activity detection algorithm via a statistics-based approach to resolve the problem of video activity detection for pre-processing of camera feeds, such as in the applications described in both the HERMES and LIVE projects.
2. The Proposed Algorithm Design

Given the input video frame sequence $\{I_1, I_2, \ldots, I_i, \ldots, I_N\}$, we first calculate the difference frame between two neighboring frames:
$$\Delta_{i-t,\,i} = I_i - I_{i-t} \qquad (1)$$

where $\Delta_{i-t,\,i} = (p_{k,j})$ is the difference frame and $p_{k,j}$ is the corresponding differential pixel; $t$ is a step value indicating which earlier frame is differenced against the current frame $I_i$.
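For illustration only, the following Python sketch shows how the difference frame of (1) might be computed from a camera feed; the input file name, the step value $t$, and the use of OpenCV/NumPy are assumptions made for the example rather than details taken from the paper.

```python
import collections
import cv2

# Hypothetical example: the file name and step value t are assumptions for
# illustration; the paper treats t simply as a configurable frame step.
t = 2
cap = cv2.VideoCapture("camera_feed.avi")
buffer = collections.deque(maxlen=t + 1)  # holds frames I_{i-t} ... I_i

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Work on the luminance channel so each differential pixel p_{k,j} is a scalar;
    # a signed type avoids wrap-around when subtracting unsigned 8-bit values.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype("int16")
    buffer.append(gray)
    if len(buffer) == t + 1:
        # Difference frame of Eq. (1): Delta_{i-t,i} = I_i - I_{i-t}
        delta = buffer[-1] - buffer[0]

cap.release()
```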
To identify those active pixels, we apply a simple condition test by examining every differential pixel obtained from (1), which can be described as follows:

$$p_{k,j} = \begin{cases} \text{active} & \text{if } p_{k,j} > T_p \\ \text{static} & \text{else} \end{cases} \qquad (2)$$
As can be seen, the condition test depends on the selection of the threshold $T_p$: a larger value of $T_p$ could wrongly classify active pixels as static, while a smaller value could wrongly classify static pixels as active. To ensure that the activity detection is accurate and useful, the selection of $T_p$ should be automated and adaptive to the input video content. Even within the same video sequence, frame content can differ due to a range of factors, such as lighting changes and the difference between day and night. As a result, the threshold for each frame should be determined individually according to its content, and thus differ from frame to frame, in order to achieve the best possible performance. To illustrate this point, we carried out experiments in which we manually adjusted the threshold $T_p$ to optimize the activity detection for two video clips from our test set, LeftBag and OneStopEnter1front, which are publicly downloadable for evaluation and validation purposes [9]. The results are summarized in Table-I, which shows that, while $T_p = 12$ delivers the best performance for the video clip LeftBag, with a precision rate of 96% and a recall rate of 98%, the video clip OneStopEnter1front requires $T_p = 10$ for its best performance, with a precision rate of 94% and a recall rate of 99%. In practice,
however, it is not appropriate to manually select the best possible threshold value for each individual video clip. To determine the threshold $T_p$ adaptively, we assume that the differential pixel $p_{k,j}$ obeys a Gaussian probability distribution. To validate this assumption, we randomly selected 10 frames from the videos publicly available from the Internet [9] and produced 10 histograms of their differential pixel data for comparison with the standard Gaussian distribution, as shown in Figure 1. As can be seen, all the histograms are close to a Gaussian distribution but with different variance values, reflected by their peaks. Therefore, to enable the activity detection to genuinely reflect the local content changes across neighboring video frames, we propose an on-line estimation of the mean and variance of all differential pixels as follows:
$$\mu = \frac{1}{W \times H} \sum_{k,j} p_{k,j} \qquad (3)$$

$$\sigma^2 = \frac{1}{W \times H} \sum_{k,j} \left(p_{k,j} - \mu\right)^2 \qquad (4)$$

where $W$ and $H$ denote the frame width and height, respectively.
Correspondingly, the threshold can be adaptively calculated via the following condition test:
$$P\left(p_{k,j} > T_p\right) = \alpha \qquad (5)$$
where $\alpha$ is a significance level, controlled within $0 < \alpha < 1$.
For a standard Gaussian distribution $X \sim N(0,1)$, the relationship between its threshold $z_\alpha$ and its significance level $\alpha$ (i.e. $P\{X > z_\alpha\} = \alpha$) is normally tabulated in mathematics handbooks. A typical example is illustrated in Table-II. Therefore, for a non-standard normal distribution $p_{k,j} \sim N(\mu, \sigma^2)$, we have $\frac{p_{k,j} - \mu}{\sigma} \sim N(0,1)$, and hence the threshold $T_p$ can be determined as:

$$T_p = \mu + \sigma \times z_\alpha \qquad (6)$$
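As a minimal sketch of how the on-line statistics (3)-(4), the adaptive threshold (6), and the pixel test (2) might fit together, assuming NumPy and SciPy are available (the function name, default $\alpha$, and data types are illustrative assumptions rather than the authors' implementation):

```python
import numpy as np
from scipy.stats import norm

def detect_active_pixels(frame_curr, frame_prev, alpha=0.05):
    """Classify differential pixels as active/static with an adaptive threshold.

    Sketch of the paper's per-pixel test: the differential pixels are assumed
    Gaussian, so T_p = mu + sigma * z_alpha (Eq. 6), where z_alpha satisfies
    P{X > z_alpha} = alpha for X ~ N(0, 1).
    """
    # Difference frame, Eq. (1); floating point keeps the signed differences
    diff = frame_curr.astype(np.float64) - frame_prev.astype(np.float64)

    # On-line statistics over all W x H differential pixels, Eqs. (3)-(4)
    mu = diff.mean()
    sigma = diff.std()

    # Upper-tail quantile z_alpha of the standard normal (cf. Table-II)
    z_alpha = norm.ppf(1.0 - alpha)

    # Adaptive threshold, Eq. (6), and per-pixel condition test, Eq. (2)
    T_p = mu + sigma * z_alpha
    active_mask = diff > T_p
    return active_mask, T_p
```

With $\alpha = 0.05$, for instance, $z_\alpha = 1.645$ as tabulated in Table-II.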
Consequently, the implementation of the proposed active pixel detection can be summarized as: (i) for each given input video frame, its statistics $(\mu, \sigma^2)$ are estimated via (3) and (4) to determine a threshold $T_p$ through (6); (ii) each pixel is then classified as active or static via the condition test described in (2). To determine the active frames, regions of active pixels are taken into consideration. From (5), it can be seen that, proportionally, the number of active pixels inside a video frame, divided by the total number of pixels in the frame, should remain close to $\alpha$. In practice, not every active pixel contributes to an active frame, especially isolated active pixels, which can be regarded as noise under visual inspection. In other words, an active frame should be determined only from those active pixels that form continuous neighborhood regions. As a result, we apply a region-growing technique [12], using the detected active pixels as seeds, to identify all active regions, and then determine the active frames via the following condition test:
$$\frac{\sum_{p_{k,j} \in \Omega} \eta_{p_{k,j}}}{M_i} \geq \lambda \alpha \qquad (7)$$
where $\eta_{p_{k,j}}$ counts the total number of active pixels that form continuous neighborhood regions, $M_i$ is the total number of pixels inside the $i$th frame, and $\lambda$ is an adjustment parameter controlling the effect of isolated active pixels. Our experiments show that $\lambda = 0.8 \sim 0.95$ gives reasonable performance without noticeable differences. Essentially, the above condition test monitors the distribution of isolated active pixels and the convergence of active pixels within continuous neighborhood regions. When video frames present little activity, such as waving tree leaves or flashing lights and flags, the active pixels detected via (5) are primarily isolated over the video frame, and hence the ratio in (7) tends to be smaller than the adjusted significance level. In contrast, when video frames present strong activity, most active pixels detected via (5) converge into large continuous neighborhood regions, and thus the ratio on the left-hand side of (7) tends to be closer to $\alpha$. Therefore, the entire process of activity detection is made adaptive to individual frame content through on-line estimation of its statistics.
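The frame-level decision of (7) could then be sketched as follows. Note that the paper grows regions from the detected active pixels using seeded region growing [12], whereas this illustration substitutes connected-component labeling from SciPy, and the `min_region_size` parameter is a hypothetical device for discarding isolated active pixels.

```python
import numpy as np
from scipy.ndimage import label

def is_active_frame(active_mask, alpha=0.05, lam=0.9, min_region_size=2):
    """Decide whether a frame is active in the spirit of Eq. (7).

    Counts only active pixels that fall inside continuous neighborhood
    regions and compares their proportion against lambda * alpha.
    """
    # Label 8-connected regions formed by the detected active pixels
    eight_connected = np.ones((3, 3), dtype=int)
    labels, num_regions = label(active_mask, structure=eight_connected)

    # Sizes of the active regions (label 0 is the static background)
    region_sizes = np.bincount(labels.ravel())[1:]
    clustered_pixels = region_sizes[region_sizes >= min_region_size].sum()

    # Condition test of Eq. (7): ratio of clustered active pixels vs. lambda * alpha
    M_i = active_mask.size
    return (clustered_pixels / M_i) >= lam * alpha
```

Chaining `detect_active_pixels` and `is_active_frame` over consecutive frame pairs would then flag the active frames to retain during pre-processing.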
3. Experiments
To evaluate the proposed algorithm, we carried out extensive experiments with a test set of around 10 video clips downloaded from the publicly available CAVIAR web site (http://homepages.inf.ed.ac.uk/rbf/CAVIAR). This makes it convenient to benchmark the proposed algorithm against any new algorithm or other relevant work developed in the future, so that comparisons can be carried out without repeating the work described in this paper. To measure performance, we use the precision $\rho$ and recall $\gamma$, which are widely adopted in the relevant research communities; their definitions are given below.
$$\rho = \frac{N_c}{N_c + N_e} \qquad (8)$$

$$\gamma = \frac{N_c}{N_c + N_m} \qquad (9)$$
where $N_c$ is the number of correctly detected active frames, $N_e$ the number of wrongly detected active frames, and $N_m$ the number of missed active frames. Our experiment platform is a PC with a 2.80 GHz CPU and 512 MB of RAM, running the Windows XP operating system and the Microsoft VC++ programming environment. Following the example of the TRECVID video processing competition [11], organized by the US National Institute of Standards and Technology, a ground truth was established by manual inspection of each tested video, frame by frame, each time its automated activity detection was completed. By activity, it is meant that: (i) there exist significant pixel changes, which can be visually inspected as active content changes rather than noise such as waving tree leaves; (ii) there exist sufficient active regions inside the frames, where such regional change means something to visual inspection and perception. All experimental results are summarized in Table-III, from which it can be seen that the proposed algorithm performs very well, with consistently high values of both precision and recall, ranging from 90% to 99%. Across all the video clips tested, the average precision is 0.97 and the average recall is 0.96. A closer evaluation of the proposed algorithm requires detailed examination of both $\rho$ and $\gamma$. From their definitions in (8) and (9), it can be seen that a higher value of $\rho$ may be paid for by a lower value of $\gamma$: a small $N_e$ could lead to a large number of missed detections, i.e., a higher value of $N_m$ and hence a lower value of $\gamma$. To make a balanced measurement of the performance, the TRECVID series of competition events [11] introduced the F1 value, which takes both $\rho$ and $\gamma$ into consideration and is defined as follows:
$$F_1 = \frac{2\rho\gamma}{\rho + \gamma} \qquad (10)$$
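For reference, the three measures can be computed from the frame-level counts as in the small helper below (the function name is illustrative; the counts are the $N_c$, $N_e$, $N_m$ defined above):

```python
def precision_recall_f1(n_correct, n_false, n_missed):
    """Precision (8), recall (9) and F1 (10) from frame-level detection counts."""
    precision = n_correct / (n_correct + n_false)        # rho, Eq. (8)
    recall = n_correct / (n_correct + n_missed)          # gamma, Eq. (9)
    f1 = 2 * precision * recall / (precision + recall)   # Eq. (10)
    return precision, recall, f1
```

For example, the reported average precision of 0.97 and average recall of 0.96 correspond to an F1 value of about 0.965, as listed in Table-IV.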
Accordingly, we also list all the experimental results in terms of F1 values in Table-IV. From the experimental results given in Table-I, the best manually selected threshold for the video LeftBag is 12, and its corresponding F1 value can be calculated as 0.970. Similarly, the best manually selected threshold for the video clip OneStopEnter1front is 10, with a corresponding F1 value of 0.964. Compared with the results given in Table-IV, the proposed algorithm actually outperforms the manually optimized process.
4. Conclusions
In this paper, we presented a simple yet effective algorithm for automated activity detection inside camera-feed videos. In summary, the proposed algorithm can be characterized as: (i) frame-by-frame estimation of the mean and variance, enabling the threshold to be determined adaptively to the input; (ii) detection of active pixels via a simple condition test governed by the adaptive threshold; and (iii) consistency maintained between active pixel detection and active frame detection, via region-growing techniques, to complete the requirement of activity detection inside videos. Experimental results verify that the proposed algorithm performs well, meets our project needs, and outperforms the manually optimized threshold selection process. Such a video processing tool can also provide solutions and building blocks for a range of other applications, for example: (i) video summarization for broadcasting camera feeds to be edited and post-produced, where static video segments can be automatically removed to save both time and cost; (ii) video retrieval and copy detection, where content can be classified according to its activity level to improve retrieval efficiency and effectiveness; and (iii) semantics-based video content interpretation and analysis, where semantic features can be established and extracted from the active pixels, and thus active regions, inside the video frames, taking into consideration both spatial and temporal activities. Finally, the authors wish to acknowledge the financial support of the EU IST Framework Research Programme under both the LIVE and HERMES projects (HERMES Contract No IST-027312).
References
[1] M. Shyu, Z. Xie et al., 'Video semantic event/concept detection using a subspace-based multimedia data mining framework', IEEE Trans. Multimedia, Vol. 10, No. 2, 2008, pp. 252-263.
[2] A. Adam, E. Rivlin et al., 'Robust real-time unusual event detection using multiple fixed-location monitors', IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 30, No. 3, 2008, pp. 555-560.
[3] http://www.ist-live.org/
[4] http://www.fp7-hermes.eu/
[5] S. Chien, S. Ma and L. Chen, 'Efficient moving object segmentation algorithm using background registration technique', IEEE Trans. Circuits and Systems for Video Technology, Vol. 12, No. 7, 2002, pp. 577-586.
[6] B. Zhang, J. Jiang and G. Xiao, 'Video object tracking via central macroblocks and directional vectors', Lecture Notes in Computer Science, Springer, Vol. 4633, 2007, pp. 593-601.
[7] J. Jiang, A. Armstrong and G.C. Feng, 'Web-based image retrieval in JPEG compressed domain', Multimedia Systems Journal, Vol. 9, No. 5, 2004, pp. 424-432.
[8] Z. Li, J. Jiang and G.Q. Xiao, 'Fast scene change detection in MPEG compressed videos', Lecture Notes in Computer Science, Image Analysis & Recognition, Springer, Vol. 4141, No. 1, 2006, pp. 206-214.
[9] http://homepages.inf.ed.ac.uk/rbf/CAVIAR
[10] C. Burges, 'A tutorial on support vector machines for pattern recognition', Data Mining and Knowledge Discovery, Vol. 2, 1998, pp. 121-167.
[11] http://www-nlpir.nist.gov/projects/trecvid/
[12] R. Adams and L. Bischof, 'Seeded region growing', IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 16, No. 6, 1994, pp. 641-647.
Table-I: Experimental results summary for manual threshold selection (threshold values 1 to 26); each entry is precision/recall

Threshold    LeftBag            OneStopEnter1front
1            70.8% / 100%       67.7% / 100%
2            71.1% / 100%       67.7% / 100%
3            74.2% / 100%       68.4% / 100%
4            76.9% / 100%       70.9% / 100%
5            81.5% / 100%       67.7% / 100%
6            85.1% / 99.8%      77.3% / 100%
7            89.5% / 99.8%      85.6% / 100%
8            90.3% / 99.6%      89.1% / 100%
9            92.3% / 99.4%      92.3% / 99.2%
10           92.8% / 99.4%      93.5% / 99.2%
11           94.2% / 99.2%      93.8% / 96.6%
12           95.9% / 98.8%      95.0% / 94.1%
13           97.4% / 96.6%      95.2% / 93.9%
14           97.1% / 94.4%      96.2% / 91.3%
15           97.5% / 93.8%      98.7% / 89.1%
16           98.9% / 91.7%      98.9% / 87.7%
17           98.9% / 89.7%      98.9% / 87.3%
18           98.9% / 89.1%      99.8% / 84.4%
19           99.1% / 88.5%      99.8% / 82.2%
20           99.5% / 86.1%      99.8% / 81.2%
21           99.5% / 83.3%      99.7% / 78.2%
22           99.5% / 79.9%      99.7% / 74.7%
23           99.7% / 77.0%      99.7% / 73.9%
24           99.7% / 75.3%      99.7% / 71.7%
25           100.0% / 72.1%     99.7% / 70.1%
26           100.0% / 71.1%     99.7% / 68.7%
Table-II: Examples of the tabulated relationship between z_α and α

α      0.001    0.005    0.01     0.025    0.05     0.10
z_α    3.090    2.576    2.327    1.960    1.645    1.282
Table-III: Summary of experimental results in terms of (precision, recall)
Video clips                  (ρ, γ)          Video clips                    (ρ, γ)
1) LeftBag                   (0.98, 0.98)    6) TwoLeaveShop2front          (0.99, 0.91)
2) OneStopEnter1front        (0.98, 0.96)    7) ThreePastShop1front         (0.99, 0.99)
3) Rest_FallOnFloor          (0.98, 0.97)    8) LeftBag_PickedUp            (0.97, 0.92)
4) Browse4                   (0.96, 0.97)    9) OneShopOneWait1front        (0.99, 0.98)
5) Rest_InChair              (0.97, 0.90)    10) OneLeaveShopReenter1cor    (0.92, 0.99)
Total average                (0.97, 0.96)
Table-IV: Summary of experimental results in terms of the F1 measurement

Video clips                  F1       Video clips                    F1
1) LeftBag                   0.984    6) TwoLeaveShop2front          0.948
2) OneStopEnter1front        0.970    7) ThreePastShop1front         0.990
3) Rest_FallOnFloor          0.975    8) LeftBag_PickedUp            0.944
4) Browse4                   0.965    9) OneShopOneWait1front        0.985
5) Rest_InChair              0.934    10) OneLeaveShopReenter1cor    0.954
Total average                0.965
[Figure 1 (plot residue removed): histograms of differential pixel data (horizontal axis: differential pixel value, roughly -10 to 10; vertical axis: number of pixels) for ten frame pairs — LeftBag (F767−F765), Browse4 (F307−F305), LeftBag_PickedUp (F233−F231), OneLeaveShopReenter1cor (F79−F77), OneShopOneWait1front (F211−F209), OneStopEnter1front (F489−F487), Rest_FallOnFloor (F787−F785), Rest_InChair (F741−F739), ThreePastShop1front (F227−F225), TwoLeaveShop2front (F257−F255) — overlaid on a Gaussian reference curve.]

Figure 1: Illustration of differential pixel histograms in comparison with Gaussian distribution