
Maximum a Posteriori Probability Estimation for Online Surveillance Video Synopsis Chun-Rong Huang, Member, IEEE, Pau-Choo (Julia) Chung, Fellow, IEEE, Di-Kai Yang, Hsing-Cheng Chen, and Guan-Jie Huang

Abstract— To reduce the human effort required to browse long surveillance videos, synopsis videos have been proposed. Traditional synopsis video generation, which applies optimization on video tubes, is very time consuming and infeasible for real-time online generation. This dilemma significantly reduces the feasibility of synopsis video generation in practical situations. To solve this problem, the synopsis video generation problem is formulated as a maximum a posteriori probability (MAP) estimation problem in this paper, where the positions and appearing frames of video objects are chronologically rearranged in real time without the need to know their complete trajectories. Moreover, a synopsis table is employed with MAP estimation to decide the temporal locations of the incoming foreground objects in the synopsis video without needing an optimization procedure. As a result, the computational complexity of the proposed video synopsis generation method can be significantly reduced. Furthermore, as it does not require prescreening the entire video, this approach can be applied to online streaming videos.

Index Terms— Maximum a posteriori (MAP) estimation, video summarization, video surveillance, video synopsis.

Manuscript received June 11, 2013; revised October 11, 2013; accepted February 24, 2014. Date of publication February 26, 2014; date of current version August 1, 2014. This work was supported by the National Science Council of Taiwan under Grants NSC-100-2221-E-005-085, NSC-101-2221-E-005-086-MY3, and NSC-101-2221-E-006-262-MY3. This paper was recommended by Associate Editor C. Shan. C.-R. Huang is with the Department of Computer Science and Engineering, and the Institute of Networking and Multimedia, National Chung Hsing University, Taichung 402, Taiwan (e-mail: [email protected]). P.-C. Chung is with the Institute of Computer and Communication Engineering, and the Department of Electrical Engineering, National Cheng Kung University, Tainan 70101, Taiwan (e-mail: [email protected]). D.-K. Yang and H.-C. Chen are with the Institute of Computer and Communication Engineering, National Cheng Kung University, Tainan 70101, Taiwan (e-mail: [email protected]; [email protected]). G.-J. Huang is with the Department of Computer Science and Engineering, National Chung Hsing University, Taichung 402, Taiwan (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCSVT.2014.2308603

I. INTRODUCTION

WITH the development of surveillance cameras, recording the daily events that happen in an environment has become possible. However, how to efficiently browse such a huge number of videos has become one of the most important issues in visual surveillance. To provide a fast browsing mode, some surveillance systems [1], [2] record key frames of foreground objects in the scene. However, these static key frames do not contain continuous object motions, which are important for event analysis [3]–[8]. Thus, video abstraction

methods have been proposed recently, in which significant images or video segments are retrieved to represent the original videos. Among various abstraction methods, video synopsis [9]–[16] suggests an alternative approach that rearranges all foreground objects into a condensed video to allow fast browsing of each foreground object. Compared with other abstraction methods, the activities and the dynamics of the objects can be genuinely revealed in the synopsis video. However, existing video synopsis methods require screening the entire video first to obtain the complete trajectories of the foreground objects. Then, the spatial and temporal positions of these objects are rearranged using optimization methods. The computational complexities of these optimization methods are extremely high and grow exponentially with the total number of foreground objects in the surveillance video. Because of the requirement of screening the entire video and the high computational complexity, these methods cannot be applied to online streaming surveillance videos. Although some recent works [11], [12] claim that their synopsis methods can be applied to online cameras, their approaches still require an additional offline phase for synopsis video generation. For example, in [11] and [12], two phases are required for synopsis video generation. The first phase is an online phase, which is used to detect and track foreground objects. Then, the second, offline response phase generates the synopsis videos. Because their methods require the complete trajectories of foreground objects for optimization, synopsis video generation usually requires a few minutes or more, depending on the number of foreground objects and the length of the video. Moreover, the appearing (chronological) order of each foreground object is not retained. In other words, despite the results in [11] and [12], a truly real-time and online process for synopsis video generation is still lacking. Please refer to Section V-C for more details. With the drastic growth in the installation of surveillance cameras, the quantity of video data increases significantly. Thus, a video synopsis generation approach that can achieve real-time efficiency is imperative to process such a huge amount of data. Moreover, current IP cameras usually transmit video data online to the client during monitoring. This also increases the need to develop video synopsis generation algorithms that can compile online videos into synopsis videos in real time. Thus, in this paper, a new video synopsis generation algorithm is proposed to generate synopsis videos from


online streaming videos in real time. To the best of our knowledge, [17] presents the only method that can obtain synopsis videos from endless streaming video sources while achieving real-time efficiency. In this paper, we propose an improvement to [17] that uses maximum a posteriori (MAP) probability estimation with a new synopsis table to generate synopsis videos from online streaming videos in real time. With MAP estimation, the need for thresholds on the foreground instance representation models in [17] can be alleviated. Moreover, integrating all of the models into MAP estimation provides an overall consideration that achieves better accuracy. With the new synopsis table, the foreground instances of a foreground object can be more efficiently arranged into the synopsis video. Based on MAP estimation with the synopsis table, our method can chronologically arrange objects frame by frame in real time. Moreover, our method does not need to screen the video in advance to obtain the complete trajectories of all foreground objects, which is necessary for most of the existing optimization approaches. As a result, no offline steps are required. This paper is organized as follows. Section II describes the related work. Section III presents synopsis video generation using MAP estimation with the synopsis table. Foreground instance representation is presented in Section IV. Experiments are shown in Section V. Finally, the conclusion is drawn in Section VI.

II. RELATED WORK

Video abstraction methods can be divided into three categories: 1) video summarization; 2) video skimming; and 3) video synopsis. Video summarization methods usually partition a video into shots via shot [18] or scene detection [19]. Zhang et al. [20] use color-histogram differences to find key frames of shots, but their approach may mistakenly choose frames in the transitions between video shots. The same problem also occurs in [21]. Zhuang et al. [22] cluster a video segment by the similarity of the color histograms. Besides considering color differences, Lagendijk et al. [23] consider the motion directions of pixels to find key frames. Their approach fails when the camera translates or rotates. Ju et al. [24] apply motion estimation to retrieve key frames from school teaching videos by first performing shot detection and then retrieving key frames from each shot. Ma et al. [25] propose using user attention models to extract key frames of target videos. However, video shots rarely appear in surveillance videos, which makes the selection of key frames impossible. Furthermore, key frames also lack continuous foreground motion information. Thus, key frame-based video summarization methods are not applicable to surveillance systems. Li et al. [26] propose summarizing foreground objects in uniform motion into mosaic images. Although this is helpful to reduce the sizes of the surveillance videos, the images obtained by their approach contain only stationary snapshots and lack continuous activity and motion information. The second category is video skimming, which produces a shorter video representing the original video based on different sampling methods. Pfeiffer et al. [27] propose the VAbstract system, which segments scenes from different parts of a movie and then recomposes them into a new video

regarded as a trailer. In [28]–[30], video objects are clustered by comparing the visual similarity between frames of the video. Then, frames of different clusters are combined to form a skimming video. Sundaram and Chang [31] partition a video into different scenes according to the lighting and sound of video frames. A skimming video is constructed based on the partitioned scene periods. Ma et al. [32] find out fragments that attract the attention of viewers by attention models. They propose motion, static, face, and camera attention models to obtain foregrounds, and then combine the results into a summarized video. Li et al. [33] use long-term and short-term audiovisual tempo analyses to detect different substories of a video and combine them for video skimming. Similar to video summarization, most of the video skimming approaches are mainly based on video shots, so they are hard to be applied to surveillance videos. Uniform sampling [34] of frames is also adopted to obtain the condensed time-lapse video. However, the events of fast moving objects may be lost in the uniformly sampled time-lapse video. To avoid this disadvantage, Bennett and McMillan [35] propose nonuniform sampling, which uses higher sampling rates for objects with larger motions and lower sampling rates for objects with smaller motions. Besides directly sampling images from a video, Chen and Sen [36] propose video carving, which defines a cost function by the space-time volume of the video to remove pixels with lower cost via 2-D graph cuts. The remaining pixels are then recombined to a new video. Similar to [36], Li et al. [37] propose the video ribbon to reduce the duration of the video. Because these methods lack the complete foreground information, false or incomplete foreground extraction will occur when fast moving objects suddenly stop. Moreover, ghost shadows also occur in the shortened video. Instead of using frame sampling [34], video synopsis methods are developed to rearrange spatial and temporal locations of foreground objects to generate a new condensed video. Because foreground objects are tracked in advance, video synopsis methods can ensure that important foreground information will not be lost. By this approach, nonoverlapping foreground objects are condensed in the synopsis video. In [9]–[12], Markov random field (MRF) [38] is used to rearrange foreground objects of a surveillance video, and then a shorter time-lapse synopsis video is generated for fast browsing. With the same concept, Xu et al. [13] propose an object-based method of the video synopsis. They consider an object as a spatial–temporal collection, and rearrange objects by maximizing visible information to shorten the video. Vural and Akgul [14] use the results of the eyegaze tracker to construct the energy matrix for minimization. Wang et al. [15] add a constraint of importance of foreground objects to the energy function proposed in [11] to realize multiscale scalable browsing. Unlike the methods mentioned above, Kang et al. [16] combine the first-fit and graph cut optimization techniques to shorten the video volume. Their approaches would result in obvious seams in the new videos. Furthermore, the established synopsis video will also contain ghost shadows because of directly merging video volumes.


III. TABLE DRIVEN MAP ESTIMATION FOR SYNOPSIS VIDEO GENERATION

A. Preliminary

Synopsis video generation rearranges the foreground objects of the original video into a condensed video, called the synopsis video. The original video is represented as VO, while the condensed video is represented as VS in this paper. A foreground object contains many foreground instances. Each foreground instance is a region (which may be delineated as a bounding box) of the foreground object appearing in a frame of the original video and is extracted by the background subtraction method [39]. The foreground instances of the same object form a trajectory of the object, which will be referred to as the foreground object. In general, a synopsis video VS should have the following properties, as indicated in [9] and [10]: 1) VS should be substantially shorter than VO; 2) the activities and dynamics of the foreground objects should be preserved in VS as completely as possible; and 3) fragmentation among objects should be avoided. In summary, the activities of the objects appearing in VO should also be preserved in VS, and VS is a condensation of VO. As shown in [9]–[12], the rearrangement of the foreground objects of VO into VS is formulated as a minimization problem with an activity cost, a collision cost, and a temporal consistency cost of the trajectories of the foreground objects. To solve the problem, an MRF [38] with a simple greedy algorithm is used to evaluate the space of all possible temporal mappings. Each state of the MRF describes foreground instances and their mappings in the synopsis video. Because of the computational complexity of the MRF, they restrict the temporal shifts of foreground tubes to be in jumps of ten frames to accelerate computation. Locally optimal solutions are obtained due to their greedy algorithm. Before the synopsis video generation, their methods also need to obtain the tubes (trajectories) of the foreground objects in advance. By computing the three costs of these tubes, the minimization problem can then be solved. Using such an optimization-based approach, it is hard to apply their methods in real time or on endless streaming videos. As a result, they have to separate their synopsis video generation into two steps: 1) the online phase, which is used to obtain the tubes of foreground objects, and 2) the response phase, which actually generates the synopsis videos in an offline stage. Thus, their methods have limited capabilities for processing real-time and online streaming videos. This highly limits their practicability because the incoming frames of surveillance cameras are endless and require real-time processing.

B. Problem Formulation and Symbol Definition

Let On,t represent the nth foreground instance extracted from the tth frame of VO. Each foreground instance On,t belongs to either: 1) an existing foreground object, some of whose foreground instances have been arranged in the synopsis video, or 2) a new foreground object, which has never appeared previously. In order to differentiate these two cases, each foreground instance On,t is characterized by three models, which are the appearance model A^O_{n,t}, the motion prediction model M^O_{n,t}, and the temporal continuity model T^O_{n,t}. Similarly, an existing foreground object in the synopsis video is also characterized by the three models. Let Pm be the mth existing foreground object in VS, where m is the foreground object index of Pm. Pm contains many foreground instances. Each foreground instance of Pm is represented by Pm,s, which is the foreground instance of Pm appearing in the sth frame of VS. Then, the three models describing Pm are denoted as the appearance model A^P_m, the motion prediction model M^P_m, and the temporal continuity model T^P_m, respectively. If On,t matches a certain existing foreground object Pm in the synopsis video well with regard to the three models, On,t is considered as a foreground instance of Pm. Then, On,t should be arranged temporally following the latest foreground instance of Pm in VS to maintain properties 2 and 3. If On,t does not match any existing foreground object in VS, it is then the first foreground instance of a new foreground object. In this situation, the temporal location of On,t in VS needs to be computed and a new foreground object index needs to be given. To achieve the above-mentioned purpose, the posterior probability function p(Pm | On,t) is employed to identify the most possible Pm* from all of the existing foreground objects for On,t as

Pm* = arg max_{Pm} p(Pm | On,t)    (1)

where Pm* is the most similar existing foreground object, i.e., the one with the maximal posterior probability with respect to On,t. If the posterior probability p(Pm* | On,t) is higher than a threshold τ, On,t is considered as a foreground instance of the existing object Pm*. Otherwise, On,t is a foreground instance of a newly appearing foreground object. For a better understanding of this paper, the mathematical notations are listed in Table I.

TABLE I: SYMBOLS OF THE MAP SYNOPSIS METHOD

C. Synopsis Table

Once the MAP formulation finds the most possible existing foreground object Pm* for On,t, the remaining question is how the location of the foreground instance in the synopsis video can be efficiently determined in real time while properties 1–3 are maintained. In order to achieve this purpose, rather than going through a time-consuming optimization process, we propose the use of a 2-D synopsis table to record the latest situation during the synopsis video generation. Each entry in the synopsis table corresponds to a spatial position of the video frame and contains two elements: 1) Sp(x, y) and 2) St(x, y), where (x, y) is associated with the pixel position (x, y) of the video frame and is also used as the index into the table. Sp(x, y) records the index of the foreground object whose foreground instance most recently occupies the pixel (x, y) in VS. St(x, y) records the frame of VS to which the latest foreground instance of the existing foreground object indicated in Sp(x, y) is assigned at (x, y). For example, if the foreground instance most recently occupying (x, y) is a foreground instance of Pm and this latest foreground instance of Pm is placed in the sth frame of VS, the values of Sp(x, y) and St(x, y) are then set to m and s, respectively, in the synopsis table. By this design, the synopsis table keeps updating with the most up-to-date spatial and temporal arrangements of foreground objects during the synopsis video generation process. Thus, when a new foreground instance is detected and it is determined whether it is an instance of an existing foreground object Pm, we can efficiently determine its location in the synopsis video by looking up Sp and St in the synopsis table without going through an optimization process [10].

D. MAP Formulation With Synopsis Table

To identify whether a foreground instance belongs to an existing foreground object in VS, we formulate the MAP problem. The posterior probability function p(Pm | On,t) can then be derived using Bayes' theorem as

p(Pm | On,t) ∝ p(On,t | Pm) p(Pm)    (2)

where p(On,t | Pm) is the likelihood function, which indicates the similarity between On,t and Pm, and p(Pm) is the prior probability function of Pm in the synopsis video. As described, each foreground instance and each existing foreground object in the synopsis video is characterized by the appearance model, the motion prediction model, and the temporal continuity model. Thus, the likelihood function p(On,t | Pm) can be rewritten as

p(On,t | Pm) = p(A^O_{n,t}, M^O_{n,t}, T^O_{n,t} | A^P_m, M^P_m, T^P_m)    (3)

where A^O_{n,t}, M^O_{n,t}, and T^O_{n,t} and A^P_m, M^P_m, and T^P_m are the appearance, motion prediction, and temporal continuity models of On,t and Pm, respectively. Since these three models are independent, (2) can be rewritten as

p(Pm | On,t) ∝ p(A^O_{n,t} | A^P_m) p(M^O_{n,t} | M^P_m) p(T^O_{n,t} | T^P_m) p(Pm).    (4)

The prior knowledge is important to reduce the computation time of the MAP estimation. When an existing foreground object Pm leaves the scene, it is not necessary to compare the new incoming foreground instance with Pm. To reduce the computation time, the prior function p(Pm) is therefore defined as

p(Pm) = { 0, if Pm has disappeared from the original video; 1, otherwise }    (5)

that is, the prior probability is set to zero for any Pm that has disappeared from the frames. Thus, we do not have to compare the new incoming foreground instance with all of the foreground objects that have ever appeared. As a result, the computational complexity is significantly reduced. Then, by computing the maximum posterior probability using (4), the most similar foreground object Pm* can be retrieved. As mentioned, synopsis video generation determines the (spatial and temporal) location of each foreground instance On,t and the foreground object (either an existing one or a new one) to which the foreground instance belongs in the synopsis video. As also mentioned in Section III-B, On,t can be a foreground instance either of an existing foreground object Pm* appearing in the synopsis video or of a new foreground object. Let ID(On,t) denote the foreground object index to which On,t belongs in the synopsis video. If p(Pm* | On,t) is larger than a threshold τ, On,t is considered as a foreground instance of Pm*. Then, ID(On,t) is naturally assigned as m*. To retain properties 2 and 3, the temporal location of On,t, denoted as TL(On,t), is arranged following the last instance of Pm* in VS. Conversely, if On,t is a foreground instance of a new foreground object, its temporal location should be chosen so as to best avoid collisions with the foreground instances of existing foreground objects.
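As an illustration only, the decision implied by (1)–(5) can be sketched in a few lines of Python; the class and function names below are hypothetical, and the three likelihood terms of (4) are assumed to be supplied as callables.

```python
# Hedged sketch of the MAP matching step in (1)-(5).
# The ForegroundObject fields and the `likelihoods` callables are
# illustrative assumptions, not the authors' data structures.
from dataclasses import dataclass
from typing import Callable, List, Optional

TAU = 0.5  # threshold on the posterior, as used in (6) and (7)

@dataclass
class ForegroundObject:
    index: int              # m, the foreground object index in V_S
    departed: bool = False  # True once the object has left the scene

def posterior(instance, obj: ForegroundObject,
              likelihoods: List[Callable]) -> float:
    """Unnormalized posterior p(P_m | O_{n,t}) of (4) with the prior (5)."""
    if obj.departed:         # prior (5): departed objects score zero
        return 0.0
    score = 1.0
    for lik in likelihoods:  # appearance, motion, temporal terms of (4)
        score *= lik(instance, obj)
    return score

def match(instance, objects: List[ForegroundObject],
          likelihoods: List[Callable]) -> Optional[ForegroundObject]:
    """Return P_{m*} of (1) if its posterior exceeds tau, else None."""
    best, best_score = None, 0.0
    for obj in objects:
        s = posterior(instance, obj, likelihoods)
        if s > best_score:
            best, best_score = obj, s
    return best if best_score > TAU else None
```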


In the following, we describe the equations that efficiently compute the temporal location TL(On,t) and the foreground object index ID(On,t) of On,t in VS based on the synopsis table. If On,t belongs to Pm*, as described, ID(On,t) is m*. If On,t belongs to a new foreground object, a new ID(On,t) needs to be determined based on Sp(x, y) in the synopsis table. Because Sp(x, y) records the most up-to-date foreground object indices, simply assigning ID(On,t) as max{Sp(x, y)} + 1 guarantees that ID(On,t) is an index not in use. Based on these descriptions, ID(On,t) is determined as

ID(On,t) = { m*, if p(Pm* | On,t) > τ; m′, otherwise }    (6)

where τ is set to 0.5, m* is the index of the existing foreground object Pm* to which On,t is most similar, and m′ = max{Sp(x, y)} + 1. On the other hand, when assigning the temporal location TL(On,t) of On,t in the synopsis video, care should be taken that properties 2 and 3 are maintained as well as possible. The activities and dynamics of the foreground objects should be preserved in VS, and fragmentation among objects should be avoided. Based on this consideration, as mentioned previously, if On,t is considered a foreground instance of an existing foreground object Pm*, it is placed right after the latest foreground instance of Pm*. Let s* denote the synopsis frame of the latest foreground instance of Pm* in VS. TL(On,t) shall then be assigned as s* + 1 for continuity. Conversely, if On,t is a foreground instance of a new foreground object, its temporal location should be assigned such that it does not collide with existing foreground objects, which would cause foreground object fragmentation. Thus, TL(On,t) is determined as

TL(On,t) = { s* + 1, if p(Pm* | On,t) > τ; s′, otherwise }    (7)

where

s′ = max_{(x,y)} {St(x, y) | ∀(x, y) ∈ On,t} + 1.    (8)

Please note that St(x, y) records the temporal location (the frame number) in the synopsis video of the latest foreground object occupying the spatial location (x, y). Thus, (8) indicates that On,t is placed in a temporal location after all of the instances that might collide with On,t, and therefore On,t will not overlay any previously appearing foreground object in the synopsis video. In this way, given a foreground instance, its temporal location can be decided by (7) and (8) immediately without complicated optimization procedures. In addition, the computational complexity depends only on the number of foreground instances in frame t and the number of existing foreground objects recorded in the synopsis table. Please note that after a foreground object leaves the field of view of the surveillance camera, it is no longer considered in the MAP estimation because of (5). As a result, the computational complexity of our method is much less than that of optimization-based approaches, and therefore real-time efficiency can be achieved. Each time a foreground instance is arranged into the synopsis video, the synopsis table is updated to keep the

latest up-to-date values of Sp(x, y), the index of the foreground object occupying (x, y), and St(x, y), the temporal location to which that foreground instance is assigned in the synopsis video. The updating of Sp(x, y) and St(x, y) is performed as

Sp(x, y) = { ID(On,t), ∀(x, y) ∈ On,t; Sp(x, y), otherwise }    (9)

and

St(x, y) = { TL(On,t), ∀(x, y) ∈ On,t; St(x, y), otherwise }    (10)

respectively. Based on the descriptions above, we can see that with the synopsis table containing Sp(x, y) and St(x, y), the assignment of the foreground instances in the synopsis video can be efficiently performed in real time.
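The bookkeeping of (6)–(10) amounts to two integer maps and a handful of maximum lookups. The following sketch, with hypothetical names and NumPy-backed maps, is one possible rendering of the described synopsis table rather than the actual implementation.

```python
# Hedged sketch of the synopsis table and of the assignment rules (6)-(8)
# and the update rules (9)-(10). Class and method names are illustrative.
from typing import Optional, Tuple

import numpy as np

class SynopsisTable:
    def __init__(self, width: int, height: int):
        self.S_p = np.zeros((height, width), dtype=np.int32)  # object index map
        self.S_t = np.zeros((height, width), dtype=np.int32)  # synopsis frame map

    def assign(self, mask: np.ndarray, matched_index: Optional[int],
               matched_last_frame: Optional[int]) -> Tuple[int, int]:
        """Return (ID, TL) for the instance whose pixels are True in `mask`.

        matched_index / matched_last_frame are m* and s* when the MAP step
        matched an existing object, or None for a new foreground object.
        """
        if matched_index is not None:
            obj_id = matched_index                  # (6), matched case
            tl = matched_last_frame + 1             # (7), follow P_{m*}
        else:
            obj_id = int(self.S_p.max()) + 1        # (6), unused index m'
            tl = int(self.S_t[mask].max()) + 1      # (7)-(8), avoid collisions
        return obj_id, tl

    def update(self, mask: np.ndarray, obj_id: int, tl: int) -> None:
        """Apply (9) and (10): record the occupying object and its frame."""
        self.S_p[mask] = obj_id
        self.S_t[mask] = tl
```

A per-frame driver would then run the MAP matching for every extracted instance, pass the result to assign, stitch the instance into frame TL(On,t), and call update, mirroring the walk-through given in Section III-E.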

E. Illustrating Example

An example illustrating the operation of the proposed MAP estimation with the synopsis table is shown in Fig. 1. A white motorcycle is extracted as a foreground instance O1,t in frame t, as shown in Fig. 1(a). According to the MAP estimation, O1,t is considered as a foreground instance of a new foreground object, and it is therefore labeled as belonging to a new foreground object P1. It is then arranged to the s′th frame of the synopsis video based on (7). P1,s′ represents the foreground instance of P1 appearing at frame s′ of the synopsis video, and the synopsis table at t + 1 records the most recently occupied pixels and the latest synopsis time of each occupied pixel. In the following frame, a foreground instance O2,t+1 is extracted from frame t + 1. Via (1), we can find that O2,t+1 is also a foreground instance of the same white motorcycle as P1 in the synopsis video. Based on properties 2 and 3, O2,t+1 will be connected after P1,s′. Thus, the synopsis frame of O2,t+1 is s′ + 1 and O2,t+1 is represented as P1,s′+1, which is the second foreground instance of P1 in the synopsis video. Then, the synopsis table is updated to record the most recently occupied pixels and the latest synopsis time of each occupied pixel, as shown in the synopsis table at t + 2. After k frames, the synopsis table records the recent foreground instances of the existing foreground objects and their temporal locations in the synopsis video, as shown in Fig. 1(b). Then, a new black car appears from the right side of the video at frame t + k. It is extracted as O12,t+k. After MAP estimation using (1), O12,t+k is considered as a foreground instance of a new foreground object P2 in the synopsis video. According to the synopsis table, the spatial locations of O12,t+k are not occupied by any other existing foreground object. Thus, O12,t+k can also be arranged to frame s′ of the synopsis video. As a result, the synopsis frame of O12,t+k is s′ and O12,t+k is determined as the foreground instance P2,s′. After the update, we can obtain the synopsis table at frame t + k + 1, as shown in Fig. 1(b). With (1) and the synopsis table, O13,t+k+1 is matched to P2 and becomes the second foreground instance P2,s′+1 of P2 at frame s′ + 1 of the synopsis video. In frame t + k + 2, two foreground instances are extracted. The first one is O14,t+k+2, which represents the black car, and the other one is O15,t+k+2, which represents a following motorcycle. Again, based on (1) and



the synopsis table, O14,t+k+2 will be represented as P2,s′+2, because it is matched to P2. O15,t+k+2 is a foreground instance of a new foreground object and cannot be matched to any existing foreground object. In addition, according to the synopsis table, the pixel locations occupied by O15,t+k+2 are already occupied by P2,s′. Thus, according to (7), O15,t+k+2 will not be stitched to frame s′ of the synopsis video, but will be placed in frame s′ + 1, and O15,t+k+2 is represented by P3,s′+1, meaning the foreground instance of P3 at frame s′ + 1 of the synopsis video. Finally, the synopsis table is updated for frame t + k + 3, as shown in Fig. 1(c).

Fig. 1. Example showing the operation of the MAP estimation and synopsis table. (a) White motorcycle is extracted from continuous frames, t, t + 1, and t + 2. (b) New black car appears and it is followed by a black motorcycle. (c) Synopsis table of the frame t + k + 3.

IV. FOREGROUND INSTANCE REPRESENTATION

To represent each foreground instance, three models, including the appearance model, the motion prediction model, and the temporal continuity model, are used. In the following, we introduce these three models.

Fig. 2. CCH representation of the foreground object.

To build an appearance model A^O_{n,t} of a foreground instance On,t, we consider a descriptor-based approach, which is robust to geometric and photometric transformations. We modify the intensity-based descriptor, the contrast context histogram (CCH) [40], which has been shown to be an effective and efficient descriptor compared with state-of-the-art descriptors, such as the scale-invariant feature transform [41] and the local binary pattern [42], in many applications [18], [39], to represent the appearance model. Given a foreground instance On,t, the region in the bounding box of On,t is defined as RQ. RQ is quartered into four quadrants: 1) q1; 2) q2; 3) q3; and 4) q4. Let pc be the center pixel located in RQ. We obtain the color of pc by computing the mean of the four center pixels, as shown in the yellow regions of Fig. 2. The color contrast value C_{j,k}(p) of the pixel p in RQ is defined as

C_{j,k}(p) = C_j(p) − C_k(pc)    (11)

where C_j(p) is the jth color channel of p, j ∈ {R, G, B}, and C_k(pc) is the kth color channel of pc, k ∈ {R, G, B}. As shown in [39], we simplify the color combinations of j and k to (j, k) ∈ {(R, R), (R, G), (R, B), (G, G), (G, B), (B, B)}. In each quadrant qi, where i ∈ {1, 2, 3, 4}, we calculate positive and negative contrast value histograms. The positive contrast histogram of the quadrant qi is defined as

CCH^{j,k}_{qi+}(pc) = Σ{ C_{j,k}(p) | p ∈ qi and C_{j,k}(p) ≥ 0 } / #qi+    (12)

and the negative contrast histogram of the quadrant qi is defined as

CCH^{j,k}_{qi−}(pc) = Σ{ C_{j,k}(p) | p ∈ qi and C_{j,k}(p) < 0 } / #qi−    (13)

where #qi+ and #qi− are the numbers of pixels with positive and negative contrast values in qi, respectively. Since there are four quadrants in RQ, six color combinations for computing the color contrast, and two contrast histograms for each quadrant, the dimension of the descriptor CCH(On,t) of On,t is 48 (= 6 × 4 × 2). To reduce the effect of linear lighting changes, the descriptor is normalized to unit length.
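For illustration, the 48-D descriptor described by (11)–(13) can be computed as follows; this NumPy sketch follows the quadrant and channel-pair construction above but is not the reference CCH implementation of [40].

```python
# Hedged sketch of the 48-D contrast context histogram of (11)-(13).
# `patch` is an H x W x 3 RGB crop of the foreground bounding box R_Q.
import numpy as np

CHANNEL_PAIRS = [(0, 0), (0, 1), (0, 2), (1, 1), (1, 2), (2, 2)]  # (R,R)...(B,B)

def cch_descriptor(patch: np.ndarray) -> np.ndarray:
    h, w, _ = patch.shape
    cy, cx = h // 2, w // 2
    # color of the center pixel p_c: mean of the four central pixels
    center = patch[cy - 1:cy + 1, cx - 1:cx + 1].reshape(-1, 3).mean(axis=0)
    # the four quadrants q1..q4 of R_Q
    quadrants = [patch[:cy, :cx], patch[:cy, cx:], patch[cy:, :cx], patch[cy:, cx:]]
    feats = []
    for q in quadrants:
        pix = q.reshape(-1, 3).astype(np.float64)
        for j, k in CHANNEL_PAIRS:
            c = pix[:, j] - center[k]                   # contrast values (11)
            pos, neg = c[c >= 0], c[c < 0]
            feats.append(pos.sum() / max(len(pos), 1))  # positive histogram (12)
            feats.append(neg.sum() / max(len(neg), 1))  # negative histogram (13)
    desc = np.asarray(feats)                            # 4 x 6 x 2 = 48 bins
    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc            # unit-length normalization
```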


TABLE II: TESTING VIDEOS

Let the descriptor of the first foreground instance Pm,a of Pm be CCH(Pm,a), where a is the first frame in which Pm appears in VS; the CCH descriptors of the foreground instances of Pm can then be represented as {CCH(Pm,a), CCH(Pm,a+1), . . . , CCH(Pm,s)}. The appearance model A^P_m of Pm is represented by a Gaussian mixture model, which is composed of K weighted Gaussian distributions. In our case, K is three for real-time computation. The likelihood function p(A^O_{n,t} | A^P_m) is defined as

p(A^O_{n,t} | A^P_m) = Σ_{k=1}^{K} π_{k,s} G(CCH(On,t) | μ_{k,s}, Σ_{k,s})    (14)

where π_{k,s}, μ_{k,s}, and Σ_{k,s} are the weight, the mean vector, and the covariance matrix of the kth Gaussian used to model the appearance of Pm at the sth frame of VS, and G is a Gaussian probability density function. The covariance matrix Σ_{k,s} is assumed to be uniform diagonal for fast computation, i.e.,

Σ_{k,s} = σ²_{k,s} I    (15)

where σ²_{k,s} is the variance of the kth Gaussian and I is the identity matrix.

Besides the appearance model, the motion prediction model takes advantage of the prediction of the position of the foreground object. Such a model can restrict the searching region in the synopsis table and reduce the chance of mismatching. To predict the possible foreground instance motions, we build a motion prediction model M^P_m for each existing foreground object Pm based on the previous tracking results, similar to [17] and [46]. Since the velocities and sizes of a foreground object are similar between adjacent frames, the position d(Pm,s+1) of the next foreground instance Pm,s+1 of Pm can be predicted as

d(Pm,s+1) = [xs + (xs − xs−1), ys + (ys − ys−1)]^T    (16)

where xs and ys are the x and y coordinates of Pm,s at frame s. The likelihood function p(M^O_{n,t} | M^P_m), which assesses the motion likelihood between M^O_{n,t} and M^P_m, is then defined as

p(M^O_{n,t} | M^P_m) = exp(−‖d(Pm,s+1) − d(On,t)‖)    (17)

where d(On,t) = [xt, yt]^T are the x and y coordinates of the center of On,t at frame t.

The third model is the temporal continuity model, which represents the temporal connectivity of foreground instances between adjacent frames. Because of the continuous motion of a foreground object in the surveillance video, the size of the foreground instance of a foreground object in frame t will be similar to that of the same object in frame t − 1. Thus, if On,t follows Pm in the synopsis video, they should appear in nearby foreground regions. This can be evaluated based on the number of overlapping pixels with respect to the latest foreground instance Pm,s of Pm appearing at frame s of the synopsis video. We define the temporal continuity likelihood function p(T^O_{n,t} | T^P_m) between the temporal continuity models T^O_{n,t} of On,t and T^P_m of Pm as

p(T^O_{n,t} | T^P_m) = exp(−|F^O_{n,t} ∪ F^P_{m,s}| / |F^O_{n,t} ∩ F^P_{m,s}|)    (18)

where F^O_{n,t} and F^P_{m,s} are the foreground masks of On,t and Pm,s, respectively.
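For reference, the three likelihood terms (14), (17), and (18) can be sketched as below; the parameter containers are hypothetical, and the predicted position of (16) and the foreground masks are assumed to be maintained per object.

```python
# Hedged sketch of the likelihood terms (14), (17) and (18).
# Parameter containers are illustrative; K = 3 as stated in the text.
import numpy as np

def appearance_likelihood(cch: np.ndarray, weights, means, variances) -> float:
    """GMM likelihood of (14) with uniform diagonal covariances (15)."""
    d = cch.shape[0]
    total = 0.0
    for pi, mu, var in zip(weights, means, variances):
        diff = cch - mu
        norm = (2.0 * np.pi * var) ** (d / 2.0)
        total += pi * np.exp(-0.5 * np.dot(diff, diff) / var) / norm
    return total

def motion_likelihood(predicted_xy: np.ndarray, observed_xy: np.ndarray) -> float:
    """Likelihood (17): exponential of the negative prediction error."""
    return float(np.exp(-np.linalg.norm(predicted_xy - observed_xy)))

def temporal_likelihood(mask_obs: np.ndarray, mask_prev: np.ndarray) -> float:
    """Likelihood (18): exponential of the negative union-over-intersection."""
    union = np.logical_or(mask_obs, mask_prev).sum()
    inter = np.logical_and(mask_obs, mask_prev).sum()
    return float(np.exp(-union / max(inter, 1)))
```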

With these three models, Pm* can be retrieved using (1), and On,t is then arranged into the synopsis video. If the incoming foreground instance On,t is matched to Pm*, the appearance model, the motion prediction model, and the temporal continuity model of Pm* need to be updated. The appearance model A^P_{m*} of Pm* is updated according to CCH(On,t) as

μ_{k,s*+1} = (1 − ρ) μ_{k,s*} + ρ CCH(On,t)    (19)

and

σ²_{k,s*+1} = (1 − ρ) σ²_{k,s*} + ρ (CCH(On,t) − μ_{k,s*})^T (CCH(On,t) − μ_{k,s*})    (20)

where ρ = αG(CCH(On,t) | μ_{k,s*}, Σ_{k,s*}) and α = 0.1 is a learning rate. The new position of Pm*,s*+2 is updated as

d(Pm*,s*+2) = [xt + (x_{s*+1} − x_{s*}), yt + (y_{s*+1} − y_{s*})]^T    (21)

where xt and yt are the x and y coordinates of On,t at frame t. The temporal continuity model represents the temporal connectivity of a foreground object between adjacent frames. Once On,t is matched to Pm*, the silhouette of Pm*,s*+1 is then updated by On,t, i.e., F^P_{m*,s*+1} = F^O_{n,t}, where F^P_{m*,s*+1} and F^O_{n,t} are the foreground masks of Pm*,s*+1 and On,t, respectively.

V. EXPERIMENTS

A. Experiment Setting and Performance Metrics

In order to evaluate the performance of our MAP-based surveillance video synopsis method, we recorded four testing videos, including three outdoor scenes and one indoor scene. The first video presents the traffic situation of a cross road. The second video captures a street scene, which contains groups of pedestrians and parked cars. The third video captures a street scene with pedestrians and some illegally moving motorcycles. The fourth video captures a hall scene with several entrances. The resolution of these videos is 320 × 240. The characteristics of the videos, including the duration, the number of frames, and the total number of foreground objects of each video, are shown in Table II. The proposed method is implemented on an Intel Core i7 computer with a 3.07-GHz CPU and 6-GB memory. Since our method aims to process online streaming videos, it does not screen the entire video. To extract foreground instances from the original surveillance video, several background modeling methods [39], [43]–[45] have been proposed. In the experiment, we used the


hierarchical background modeling approach [39], which has been shown to provide real-time and effective performance for foreground instance extraction. The hierarchical background modeling method applies descriptor-based background models to retrieve foreground candidates and then applies a GMM [43] to the candidates to extract the foreground instances. For quantitative comparisons, four performance metrics are used. The first metric is the frame reduction rate (FR), which is computed as the ratio of the number of frames of the synopsis video to the number of frames of the original video. The smaller the FR, the fewer frames the synopsis video contains. The second one is the average computation time per frame (AT), which indicates the efficiency for real-time applications. The third one is the average frame compact rate (CR), which is used to evaluate whether the frames of the synopsis video are well utilized to arrange the foreground objects. A higher CR indicates that the generated synopsis video has more compact content. CR is computed as

CR = (1 / (w · h · #VS)) Σ_{s=1}^{#VS} Σ_{x=1}^{w} Σ_{y=1}^{h} 1{ v(x, y, s) ∈ foreground in VS }    (22)

where v(x, y, s) represents a pixel (x, y) of the sth frame in VS, w and h are the width and the height of the frame, and #VS is the number of frames of VS. The fourth one is the chronological disorder (CD), which is used to evaluate whether the foreground objects appearing in the synopsis video follow the chronological order. If the temporal order of a foreground object with respect to the remaining foreground objects is consistent between the original and the synopsis videos, the foreground object has a low CD. The CD is computed as

CD = (1 / #F) Σ_{Pm ∈ F} Σ_{Pm′ ∈ F} 1{ D_O(Pm, Pm′) · D_S(Pm, Pm′) < 0 }    (23)

where F is the union of all of the foreground objects in the synopsis video, #F is the total number of foreground objects in the synopsis video, and D_O(Pm, Pm′) and D_S(Pm, Pm′) are the differences between the first appearing frame indices of Pm and Pm′ in the original and the synopsis videos, respectively.
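As an illustration, both metrics can be computed directly from the synopsis foreground masks and the first-appearance frame indices; the inputs in the following sketch are hypothetical placeholders.

```python
# Hedged sketch of the compact rate (22) and chronological disorder (23).
# `fg_masks` holds one H x W boolean foreground mask per synopsis frame;
# `first_orig` / `first_syn` map each foreground object to its first
# appearing frame in the original and in the synopsis video, respectively.
from itertools import permutations
from typing import Dict, Hashable, List

import numpy as np

def compact_rate(fg_masks: List[np.ndarray]) -> float:
    """CR of (22): fraction of synopsis pixels covered by foreground."""
    h, w = fg_masks[0].shape
    covered = sum(int(m.sum()) for m in fg_masks)
    return covered / float(w * h * len(fg_masks))

def chronological_disorder(first_orig: Dict[Hashable, int],
                           first_syn: Dict[Hashable, int]) -> float:
    """CD of (23): per-object count of chronologically inverted pairs."""
    objects = list(first_orig)         # objects are assumed present in both maps
    inversions = 0
    for a, b in permutations(objects, 2):
        d_o = first_orig[a] - first_orig[b]
        d_s = first_syn[a] - first_syn[b]
        if d_o * d_s < 0:              # the pair appears in the opposite order
            inversions += 1
    return inversions / float(len(objects)) if objects else 0.0
```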

TABLE III: COMPARISONS OF NUMBERS OF SYNOPSIS FRAMES AND FR

TABLE IV: AVERAGE COMPUTATION TIME PER FRAME (SECONDS)

B. Comparative Baselines

For comparison, we implement [10] and [17]. As mentioned in [10], the computational complexity of the MRF algorithm is high. To reduce the computational complexity, they suggest reducing the number of foreground objects in the video. Please note that the computational complexity of [10] is related to the number of foreground objects and the number of frames of the synopsis video. Thus, a long video, which contains thousands of foreground objects, is downsampled to multiple short videos, which contain the same number of foreground objects. In this way, each short video has the same computational complexity. Please note that, because of the false tracking of foreground objects using median filters in [10], the numbers of foreground objects of each video are larger than those of the ground truth of each video. In the following experiments, the number of foreground objects of each downsampled video is set to κ#F, where κ = 0.25 or κ = 0.5, to increase the computation speed. Moreover, we also arrange all of the detected foreground objects of [10] into the synopsis video for comparison. Table III shows the comparison results of the proposed method and [17] in terms of the number of frames obtained in the synopsis video and the FR, respectively. Compared with [17], the proposed method can achieve a better (lower) FR in most of the cases. As shown in Table III, the numbers of frames of the synopsis videos are reduced to about one-fifth to one-tenth of those of the original surveillance videos with a relatively busy scene. In Table III, we do not show the FR of [10], because it is a user-specified number in [10]. That is, the method in [10] performs the condensation of the video based on a manually assigned FR (or an assigned number of frames in the synopsis video). For this reason, in the following comparisons of computational complexity and average frame CR, the required number of frames obtained by [10] is set to the same number generated by the proposed method. In this way, we can fairly compare the computation time and visual quality between [10] and the proposed method under the same setting. The average computation time per frame of [10], [17], and the proposed method is shown in Table IV. As mentioned in [10], the computational complexity grows exponentially with respect to the number of foreground objects. Thus, the AT for κ = 0.5 increases significantly compared with that for κ = 0.25 for all of the testing videos. As expected, arranging all of the foreground objects using the MRF requires much more AT for each frame, which limits the applicability of [10] in practical situations. Because the computational complexity of the proposed method depends only on the number of foreground instances in each frame of the original video and the number of foreground objects in the synopsis table, the AT of the proposed method is less than 0.015 s. It is much shorter than that of the state-of-the-art optimization-based approaches. Such results show that our


TABLE V: AVERAGE FRAME CR

TABLE VI: CHRONOLOGICAL DISORDER

TABLE VII: FILE SIZE UNDER H.264 COMPRESSION (BYTES)

Fig. 3. Example synopsis results on the cross road data set. (a) Four input frames from the original video. (b) Synopsis video frame.

method can be applied to process online streaming videos in real time. Please note that the AT of our method contains the computation time of MAP and synopsis video generation using the synopsis table. In contrast, the AT of [10] only contains the computation time of synopsis video generation. Thus, the proposed method is more efficient than [10]. As shown in Table V, the proposed method has similar average CRs compared with those of [17] and those of [10] with different settings. Such results indicate that the proposed method can effectively utilize the synopsis frames as [10] under the same number of frames in the synopsis video. The results are expectable, since the proposed method can efficiently comply with properties 2 and 3 in arranging foreground objects. Table VI shows the average CDs of each method. Because [10] arranges foreground objects to the synopsis video by MRF, the foreground objects appearing at the later frames are possibly moved to the previous frames of the synopsis video. Thus, the CDs of [10] are higher than those of [17] and the proposed method. The results also show that when more foreground objects are arranged by MRF (with larger κ), the CDs will increase. Compared with [17], the proposed method can achieve better (lower) CDs in most of the cases. Table VII shows the file sizes of the original and the synopsis videos under H.264 compression. As expected, the file sizes of the synopsis videos are smaller than those of the original videos. It implies that the synopsis video not only can provide a fast browsing, but also can reduce the file sizes for storage. The example images of the synopsis videos are shown in Figs. 3–6. As shown in Fig. 3, the taxi, the black car, and three motorcycles in Fig. 3(a) are detected and stitched with other vehicles from other frames into a synopsis frame shown in

Fig. 4. Example synopsis results on the street data set. (a) Four input frames from the original video. (b) Synopsis video frame.

Fig. 3(b) in the synopsis video. As a result, these vehicles appearing in different frames are condensed to a synopsis frame. Similar results can also be found in Figs. 4–6 for the street, hall, and sidewalk videos, respectively. For detailed results, please refer to the demo video.1 In addition to the synopsis videos obtained by the proposed method, the synopsis videos obtained by [10] and the original video are also posted for reference. 1 Demo video is available at http://www.cs.nchu.edu.tw/∼crhuang/file/MAP_ synopsis_demo.wmv


Fig. 5. Example synopsis results on the hall data set. (a) Four input frames from the original video. (b) Synopsis video frame.

Fig. 6. Example synopsis results on the sidewalk data set. (a) Four input frames from the original video. (b) Synopsis video frame.

C. Discussion 1) Relation to the Previous Work: In [17], the motion prediction model is applied first to obtain candidate foreground objects for each foreground instance. On the obtained candidate foreground objects, the temporal continuity model and the appearance model are further applied to determine the foreground object for the foreground instance. Such a cascade approach implies that different thresholds are required for each model to obtain the matched foreground object. It is hard to find unified thresholds of each model with respect to different videos. In contrast, this paper is an improvement

to the earlier work by integrating the three models into MAP estimation to obtain the most similar foreground object. With MAP estimation, the decision is based on a probability resulting from an overall consideration, which can achieve better accuracy. Furthermore, the need for a threshold for each model can be alleviated. Another improvement of this paper over [17] is the design of the synopsis table. The table in [17] has only one entry, recording only the time without the foreground object information. For this reason, object trajectories cannot be strictly retained. However, in this paper, the synopsis table is designed with two entries, so as to link foreground objects with the time (the frame number in the synopsis video). Therefore, we can achieve more efficient and effective synopsis video generation. Because of the above reasons, the mechanism for assigning foreground instances into the synopsis video is different. This paper also adds the mechanisms for the maintenance of the synopsis table. It is worth noting that the FR of the proposed method is slightly worse than that of [17] in the street video. In this video, many visually similar pedestrians walk in the street and become occluded by each other. Thus, using the motion prediction model to filter out visually similar pedestrians provides better tracking results, and thus [17] leads to a better FR. Nevertheless, the CR and CD of [17] remain worse compared with those of the proposed method. 2) Computational Complexity: In our approach, the computation time of the MAP estimation problem depends on the number N of detected foreground instances and the number M of existing foreground objects appearing in the synopsis table. Thus, the computational complexity of our method for each frame is O(MN). Although more foreground objects require a longer computation time, the complexity is linear with respect to the number of foreground instances in each frame. The total computational complexity is O(TMN), where T is the length of the original video. This highly contrasts with Pritch et al.'s [9]–[12] approach, which requires a time complexity of O(T^K) with a total number of K foreground objects and T time steps for synopsis video generation. Thus, the computational complexity of their algorithm grows exponentially. For a relatively busy scene and a relatively long video, K can be very large (K ≫ M and K ≫ N), i.e., the foreground object density of the original video is high; it is hard to achieve real-time performance with their approach, as shown in Table III, without modification. Because of this problem, it was mentioned in [9] and [10] that older or smaller foreground objects may be considered less interesting and removed in advance to accelerate the optimization. However, how to decide which foreground objects are older and smaller is a challenging issue for different surveillance videos. In [9] and [10], one of their test videos, a parking scene recorded for 24 h, has 262 foreground objects, and another video, an airport scene recorded for 30 h, has 500 foreground objects. The foreground object densities in these two videos are relatively low compared with some practical situations. For example, the cross road video contains more than 1600 objects in only an hour; the foreground object density is much higher than that of the test videos in [9] and [10]. For videos of such high foreground object densities, it would be very challenging,


if not impossible, to apply the existing methods. In comparison, the complexity of our method in the synopsis arrangement depends only on the number of foreground instances in each individual frame, which is much smaller than the number of foreground objects in the entire video. As a result, our method can achieve real-time performance; the processing speed of our method reaches more than 60 frames per second for the cross road video with its high foreground object density. In addition, their approach requires screening the entire video to retrieve the complete trajectory of each foreground object in advance, which more or less limits the real-time capability. In contrast, our method can directly generate the synopsis video from each incoming frame incrementally, and therefore can be applied to either streaming videos or stored videos. 3) Sparsity of Synopsis: Based on properties 2 and 3 of the synopsis video, our method arranges the first foreground instance of a foreground object in the synopsis video and lets the remaining foreground instances of the foreground object follow the previous ones. Using the synopsis table, the first foreground instances of different foreground objects will not collide with any other objects. However, as the motions and locations of the following foreground instances are unpredictable, occlusions among foreground instances of different foreground objects may occur, i.e., the subsequent trajectories of different foreground objects may intersect. Indeed, such conflicts and overlaps between objects also occur in [10]. The method in [10] leaves the length l of the synopsis video as a user-specified value. Given the length of the synopsis video, a collision term is added to the optimization cost to handle the collisions. However, more often than not the collision is unavoidable. If the weight of the collision term is set high so as to reduce collisions, in our experience, the optimization often becomes difficult to converge. In our method, a background image is stored and the colors of its pixels are updated for stitching foreground instances. The foreground instances are stitched into the synopsis video based on TL(On,t) computed from the synopsis table. Regarding this question, it is possible for the user to control a parameter so as to reduce the collisions. However, this will increase the length of the synopsis video and reduce the condensation rate. Fig. 7 shows sampled frames of the synopsis video with different densities of objects. 4) Failure Cases: Fig. 8 shows two example frames of the synopsis video. In Fig. 8(a), a white jeep stops at the top-right location of the frame and waits to turn left. Because it stops for a long time, it is updated as background by the background modeling method. Thus, after the jeep moves, the background still contains the white jeep. As a result, a ghost shadow occurs. This can be solved by adjusting the updating parameters of the background modeling method. Fig. 8(b) shows another failure situation, in which the same white hatchback appears twice in the same synopsis frame. This is due to an occlusion of the white hatchback by the white sedan, causing the detected foreground instance to contain the foreground instances of both the hatchback and the sedan. After these two cars separate from the occlusion, the tracking


Fig. 7. Effect of the synopsis density parameter. (a) Dense synopsis. (b) Sparse synopsis.

Fig. 8. Examples of synopsis errors. (a) Ghost shadows due to the foreground instance detection. (b) Repeated appearing cars due to occlusions.

of the hatchback fails. Thus, the foreground instances of the hatchback extracted in the following frames are considered as foreground instances of a new foreground object. The temporal location of the mistracked hatchback is decided by the synopsis table, which happens to be the same synopsis frame in which the hatchback appeared previously. As a result, the white hatchback appears twice. To solve the tracking problem due to occlusion, occlusion detection methods [46]–[48] that aim to identify individual foreground objects from occlusions can be applied. However, the computation time will increase. Please refer to [46]–[48] for the analysis of computational complexity.

VI. CONCLUSION

This paper proposes MAP estimation with the synopsis table to achieve video synopsis. According to the synopsis table, each foreground instance can be arranged to a proper temporal location in the synopsis video without using a complicated optimization procedure. Moreover, the proposed synopsis table


and MAP estimation can also preserve three properties of synopsis video generation. As a result, real-time and online synopsis video generation can be achieved. Since our method is not necessary to screen the entire video, it can be applied on surveillance cameras for preprocessing and condensing the data before transmission. In the future, we will focus on analyzing the foreground behaviors according to the results of synopsis videos. R EFERENCES [1] R. Feris, Y.-L. Tian, and A. Hampapur, “Capturing people in surveillance video,” in Proc. IEEE Conf. CVPR, Jun. 2007, pp. 1–8. [2] A. Hampapur et al., “Searching surveillance video,” in Proc. IEEE Conf. AVSS, Sep. 2007, pp. 75–80. [3] Y.-T. Chen and C.-S. Chen, “Fast human detection using a novel boosted cascading structure with meta stages,” IEEE Trans. Image Process., vol. 17, no. 8, pp. 1452–1464, Aug. 2008. [4] C.-R. Huang, P.-C. Chung, K.-W. Lin, and S.-C. Tseng, “Wheelchair detection using cascaded decision tree,” IEEE Trans. Inf. Technol. Biomed., vol. 14, no. 2, pp. 292–300, Mar. 2010. [5] W.-H. Cheng et al., “Semantic analysis for automatic event recognition and segmentation of wedding ceremony videos,” IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 11, pp. 1639–1650, Nov. 2008. [6] C. Piciarelli, C. Micheloni, and G. L. Foresti, “Trajectory-based anomalous event detection,” IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 11, pp. 1544–1554, Nov. 2008. [7] G. Zhu et al., “Event tactic analysis based on broadcast sports video,” IEEE Trans. Multimedia, vol. 11, no. 1, pp. 49–67, Jan. 2009. [8] A. Utasi and C. Benedek, “A Bayesian approach on people localization in multicamera systems,” IEEE Trans. Circuits Syst. Video Technol., vol. 23, no. 1, pp. 105–115, Jan. 2013. [9] A. Rav-Acha, Y. Pritch, and S. Peleg, “Making a long video short: Dynamic video synopsis,” in Proc. IEEE Conf. CVPR, Jun. 2006, pp. 435–441. [10] Y. Pritch, A. Rav-Acha, and S. Peleg, “Non chronological video synopsis and indexing,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 11, pp. 1971–1984, Nov. 2008. [11] Y. Pritch, A. Rav-Acha, A. Gutman, and S. Peleg, “Webcam synopsis: Peeking around the world,” in Proc. 11th ICCV, Oct. 2007, pp. 1–8. [12] Y. Pritch, S. Ratovitch, A. Hendel, and S. Peleg, “Clustered synopsis of surveillance video,” in Proc. 6th IEEE Int. Conf. AVSS, Sep. 2009, pp. 195–200. [13] M. Xu, S. Z. Li, B. Li, X.-T. Yuan, and S.-M. Xiang, “A set theoretical method for video synopsis,” in Proc. 1st ACM Int. Conf. MIR, 2008, pp. 366–370. [14] U. Vural and Y. S. Akgul, “Eye-gaze based real-time surveillance video synopsis,” Pattern Recognit. Lett., vol. 30, no. 12, pp. 1151–1159, 2009. [15] S. Wang, J. Yang, Y. Zhao, A. Cai, and S. Z. Li, “A surveillance video analysis and storage scheme for scalable synopsis browsing,” in Proc. IEEE ICCV Workshops, Nov. 2011, pp. 1947–1954. [16] H. Kang, Y. Matsushita, X. Tang, and X. Chen, “Space-time video montage,” in Proc. IEEE Conf. CVPR, Jun. 2006, pp. 1331–1338. [17] C.-R. Huang, H.-C. Chen, and P.-C. Chung, “Online surveillance video synopsis,” in Proc. IEEE ISCAS, May 2012, pp. 1843–1846. [18] C.-R. Huang, H.-P. Lee, and C.-S. Chen, “Shot change detection via local keypoint matching,” IEEE Trans. Multimedia, vol. 10, no. 6, pp. 1097–1108, Oct. 2008. [19] P. Sidiropoulos, V. Mezaris, I. Kompatsiaris, H. Meinedo, M. Bugalho, and I. Trancoso, “Temporal video segmentation to scenes using highlevel audiovisual features,” IEEE Trans. Circuits Syst. Video Technol., vol. 21, no. 8, pp. 


Chun-Rong Huang (M’05) received the B.S. and Ph.D. degrees in electrical engineering from National Cheng Kung University, Tainan, Taiwan, in 1999 and 2005, respectively. He was a Post-Doctoral Fellow with the Institute of Information Science, Academia Sinica, Taipei, Taiwan, beginning in 2005. In 2010, he joined National Chung Hsing University, Taichung, Taiwan, where he is an Assistant Professor with both the Institute of Networking and Multimedia and the Department of Computer Science and Engineering. His research interests include computer vision, computer graphics, multimedia signal processing, image processing, and medical image processing. Dr. Huang is a member of the IEEE Circuits and Systems Society and the Phi Tau Phi Honor Society.

Pau-Choo (Julia) Chung (S’89–M’91–SM’02–F’08) received the Ph.D. degree in electrical engineering from Texas Tech University, Lubbock, TX, USA, in 1991. She joined the Department of Electrical Engineering, National Cheng Kung University (NCKU), Tainan, Taiwan, in 1991 and became a Full Professor in 1996. She applies most of her research results to healthcare and medical applications. Her research interests include image/video analysis and pattern recognition, biosignal analysis, computer vision, and computational intelligence. Dr. Chung is a member of the Phi Tau Phi Honor Society, was a member of the Board of Governors of the CAS Society from 2007 to 2009 and from 2010 to 2012, and is currently an ADCOM member of the IEEE CIS and the Chair of the CIS Distinguished Lecturer Program. She was an IEEE CAS Society Distinguished Lecturer from 2005 to 2007. She was the Director of the Institute of Computer and Communication Engineering from 2008 to 2011, the Vice Dean of the College of Electrical Engineering and Computer Science in 2011, the Director of the Center for Research of E-life Digital Technology from 2005 to 2008, and the Director of the Electrical Laboratory from 2005 to 2008, all with NCKU. She was a Distinguished Professor at NCKU in 2005. She is currently the Chair of the Department of Electrical Engineering, NCKU. Dr. Chung has served as a Program Committee Member for many international conferences. She was a member of the International Steering Committee of the IEEE Asia Pacific Conference on Circuits and Systems from 2006 to 2008, the Special Session Co-Chair of ISCAS in 2009 and 2010, the Special Session Co-Chair of ICECS in 2010, and a Technical Program Committee member of APCCAS in 2010. She was the Chair of the IEEE Computational Intelligence Society, Tainan Chapter, from 2004 to 2005. She was the Chair of the IEEE Life Science Systems and Applications Technical Committee from 2008 to 2009 and a member of the BioCAS Technical Committee and the Multimedia Systems and Applications Technical Committee of the CAS Society. She is also an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS, an Editor of the Journal of Information Science and Engineering, a Guest Editor of the Journal of High Speed Networks, a Guest Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I, and the Secretary General of the Biomedical Engineering Society of China. She is one of the co-founders of the Medical Image Standard Association (MISA) in Taiwan and is currently on the Board of Directors of MISA.

Di-Kai Yang received the B.S. degree in electrical engineering from National Cheng Kung University, Tainan, Taiwan, in 2012, where he is currently working toward the M.S. degree with the Institute of Computer and Communication Engineering.

Hsing-Cheng Chen received the B.S. degree in computer science and information engineering from National Chung Cheng University, Chiayi, Taiwan, in 2009, and the M.S. degree from the Institute of Computer and Communication Engineering, National Cheng Kung University, Tainan, Taiwan, in 2011.

Guan-Jie Huang received the B.S. degree in computer science and information engineering from Fu Jen Catholic University, Taipei, Taiwan, in 2010, and the M.S. degree from the Department of Computer Science and Engineering, National Chung Hsing University, Taichung, Taiwan, in 2013.