Multimed Tools Appl (2010) 46:175–205 DOI 10.1007/s11042-009-0348-y
A robust framework for joint background/foreground segmentation of complex video scenes filmed with freely moving camera

Slim Amri · Walid Barhoumi · Ezzeddine Zagrouba
Published online: 15 September 2009
© Springer Science + Business Media, LLC 2009
Abstract This paper explores a robust region-based general framework for discriminating between background and foreground objects within a complex video sequence. The proposed framework works under difficult conditions such as dynamic background and a nominally moving camera. The originality of this work lies essentially in our use of the semantic information provided by regions while simultaneously identifying novel objects (foreground) and non-novel ones (background). The information of background regions is exploited to make moving-object detection more efficient, and vice versa. In fact, an initial panoramic background is modeled using region-based mosaicing in order to be sufficiently robust to noise from lighting effects and shadowing by foreground objects. After the elimination of the camera movement using motion compensation, the resulting panoramic image should essentially contain the background and the ghost-like traces of the moving objects. Then, by comparing the panoramic image of the background with the individual frames, a simple median-based background subtraction permits a rough identification of foreground objects. Joint background-foreground validation, based on region segmentation, is then used for a further examination of individual foreground pixels, intended to eliminate false positives and to localize shadow effects. Thus, we first obtain a foreground mask from a slow-adapting algorithm, and then validate foreground pixels (moving visual objects + shadows) by a simple moving-object model built using both background and foreground regions. The tests performed on various well-known challenging real videos (across a variety of domains) clearly show the robustness of the suggested solution. This solution, which is relatively computationally inexpensive, can be used under difficult conditions such as dynamic background, nominally moving camera and shadows. In addition to the visual evaluation, spatial-based evaluation statistics, given hand-labeled ground truth, have been used as a performance measure of moving visual object detection.

Keywords Video segmentation · Motion compensation · Moving objects · Background · Shadow identification

S. Amri · W. Barhoumi (*) · E. Zagrouba
Equipe de Recherche Systèmes Intelligents en Imagerie et Vision Artificielle (SIIVA), Institut Supérieur d’Informatique, 2 Rue Abou Rayhane El Bayrouni, 2080 Ariana, Tunisia
e-mail: [email protected]
URL: www.isi.rnu.tn

S. Amri
e-mail: [email protected]

E. Zagrouba
e-mail: [email protected]
1 Introduction

The ability to identify and analyze background and foreground objects in indoor/outdoor environments under a moving camera is a fundamental prerequisite for the design and implementation of numerous intelligent video content analysis (VCA) systems. There are various applications of such systems, like automated visual surveillance (security and traffic monitoring), smart video data mining (semantic indexing and retrieval), video conferencing, robotics and virtual reality, vision-based human-machine interaction, sports enhancement, tracking and object-based video coding (MPEG-4 and MPEG-7 standardization). These systems are often used to save human resources and to reduce false alarms. However, there is no definite way of doing this, and much research has been done to find efficient and reliable systems to accomplish this crucial task. While many efficient background-foreground segmentation solutions have been proposed for the case of a static or constrained moving camera, little attention has been paid to the general case of a freely moving camera [48,54,55,59]. The task is challenging in the general case due to:

– Camera motion (also referred to as ego-motion or visual motion [37]): a freely moving camera often creates background image changes due to its own motion [28], especially when the camera rotates and/or changes its viewpoint.
– Background-foreground similar appearance: portions of the foreground and background objects may share similar geometric and photometric features.
– Foreground motion: foreground objects in motion can be very rapid or very slow. Additionally, these objects can have a non-rigid appearance (such as an athlete) and can be followed by their shadows. Moreover, a moving object may be relatively large, so substantial occlusions/disocclusions may frequently occur, either partially or completely, between a foreground object and the background or between two (or more) foreground objects [20]. Last, relatively small foreground objects (such as a tennis ball) should not be misclassified as outliers, particularly in noisy videos.
– Dynamic background: objects may be added to or removed from the filmed scene (such as starting and stopping vehicles); the proposed solution should react quickly to consider these changes in the background model (non-novel objects). Moreover, non-stationary and constantly moving natural background objects (such as the micro-motion of fluttering vegetation, snow, rain and rippling water) and/or man-made objects (such as clocks, fountains and escalators) should not be modeled as moving objects.
– Fluctuations caused by the acquisition conditions: if the context of the video acquisition is dynamic (noise, lighting, coloring, contrast, resolution…), many non-meaningful objects may appear as global (such as the sun being covered/uncovered by clouds) or local (such as shadows, highlights and inter-reflections cast by moving objects) changes [16].
Considering these challenges, video background-foreground segmentation has been given more and more attention lately. Existing solutions can be classified into two main classes. The first one concerns layer-based motion segmentation (LBMS) methods which
assume that the background and foreground objects have different motion patterns. The main disadvantages of such solutions are the expensive computational cost and the production of many over-segmented objects which are very difficult to post-process into meaningful objects [56]. To resolve the former drawback, many authors have proposed to perform off-line supervised learning using ground truth for different environments [11,61], which further intensifies the computational cost. Besides, LBMS methods frequently misclassify the scene objects in the case of a freely moving camera, especially in the presence of rapidly moving foreground objects, low-resolution objects and moving background objects. Many of the investigations undertaken used both intensity and motion information in order to overcome challenges due to differences between the intensities of background and foreground objects which do not constitute significant objects and/or sufficient target pixels [45]. Nevertheless, it remains hard to efficiently and precisely segment complex videos filmed with a freely moving camera using LBMS-based techniques.

The second class includes background subtraction (BS) methods, which apply an appropriate thresholding procedure to the difference map between each frame of the studied video and a predefined model of the background [47]. Most existing BS solutions assume that the background scene is known a priori or is at least static, which is not the case in many real-world applications. In the absence of any prior knowledge about scene objects and environment, the most widely adopted approaches for video background-foreground segmentation are based on background subtraction [12]. These approaches are acknowledged to provide the best compromise between performance and reliability [47]. Nevertheless, video segmentation based on background modeling may still be confused by moving background objects or motionless foreground objects. In fact, in the case of freely moving cameras, BS approaches often fail to model the background (and therefore the foreground objects) accurately, particularly in the presence of sudden scene changes (reactivity) [12]. This is reflected by the detection of false moving objects (often called “ghosts”). Among the proposed BS methods, we distinguish pixel-level modeling and region-level (even object-level) modeling techniques. The major advantage of pixel-based techniques is that they do not require the detection of special 2D features [43]. However, they fail to take into account the substantial degree of correlation between neighboring pixels [42] and are therefore more sensitive not only to noise [12] but also to sudden changes in the scene and to multiple non-rigid foreground objects [47]. Note that we have not considered the methods based on successive image difference (SID), since they only focus on tracking the foreground objects [49,54,57]. Moreover, SID solutions often break down when dealing with the identification of small moving objects [45], with simultaneous rapid/slow foreground objects, and in the presence of unmodeled illumination changes or noisy background motion clutter [43]. This work presents a robust BS-based framework for joint background-foreground segmentation, which makes use of region information to overcome the aforementioned drawbacks of BS-based approaches.
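To make the thresholding principle behind BS methods concrete, the sketch below (in Python with OpenCV/NumPy; the function name and the threshold value are illustrative assumptions, not taken from this paper) labels as foreground every pixel whose difference from a background model exceeds a fixed threshold:

    import cv2
    import numpy as np

    def subtract_background(frame, background, threshold=30):
        # Basic background subtraction: threshold the absolute
        # difference between a frame and a background model.
        # (threshold=30 is an illustrative value.)
        gray_f = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        gray_b = cv2.cvtColor(background, cv2.COLOR_BGR2GRAY)
        diff = cv2.absdiff(gray_f, gray_b)
        # Foreground mask: 255 where the difference exceeds the threshold.
        _, mask = cv2.threshold(diff, threshold, 255, cv2.THRESH_BINARY)
        return mask

Everything that follows can be read as making this naive scheme robust: the background model becomes a motion-compensated mosaic, and the raw mask is refined with region information.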
The background is modeled using a mosaic representation of the video, to fully exploit the spatio-temporal information in the video scene and thereby achieve precise moving-object detection and localization. The background modeling is based on the compensation of the camera motion using image alignment. It provides an electronically stabilized view in which ego-motion is eliminated for points that lie on the surface, leaving only residual motion due to parallax and to objects moving independently of the camera [15]. Then, the suggested framework jointly identifies background and foreground objects, while discriminating between moving visual objects and shadows, by the use of region-level information as well as pixel-level
information aggregated over a group of frames. The contribution of the proposed framework is significant, especially in that most of the above-mentioned challenges are handled gracefully. We assume that the camera is freely moving (random unknown motion, zoom, viewpoint change…), the background is non-stationary (significant and insignificant moving background objects, illumination change, occlusion…) and the motion of the foreground objects can be very complex (rapid, non-rigid, occlusion…). To the best of our knowledge, the proposed framework is much less constrained than the majority of existing solutions [20]. Indeed, we only assume the scene-planarity hypothesis, which holds in most real-world video sequences, while assuming that the models of the foreground and background objects and their eventual motions (often called velocities) are random, which achieves maximum application independence. Note that the few solutions proposed for non-planar scene video segmentation require depth information to be available [24,31]. Such a requirement would call for either data from two image sources or complex estimations of 3D object placement from 2D images [51].

The rest of this paper is structured as follows. In Section 2, we present a general overview of the framework proposed for joint background-foreground segmentation, while summarizing its contributions. Section 3 is devoted to a brief description of the motion compensator used, based on multi-feature registration. For a more detailed description of this motion compensation approach, interested readers can refer to our previous paper [58], where an objective evaluation of this approach on challenging video sequences is offered. The joint video background-foreground segmentation is presented in Section 4. The experimental study and the assessment of the proposed framework are presented in Section 5. Lastly, Section 6 summarizes the contributions of this paper and suggests directions for future work.
2 Overview of the proposed framework

The proposed framework belongs to the class of background-foreground segmentation methods based on background subtraction. Nevertheless, our solution is partially unconstrained regarding the background scene. In fact, except for the planarity hypothesis, the filmed background is not static, and there is neither a priori knowledge nor a bootstrapping process [29], which would assume that a short sequence with no moving foreground objects is available [26]. The suggested framework aims to segment non-ideal videos captured with a freely moving camera filming a complex background and large moving non-rigid foreground objects.

The first phase of our framework consists in region-based motion compensation, in an attempt to separate apparent changes resulting from camera motion from real changes [16]. This phase allows an efficient estimation of the camera motion between each couple (I_t, I_{t+1}) of successive frames (1 ≤ t < n) belonging to the input video V (Card(V) = n). Indeed, after automatically segmenting each input frame into a set of salient regions, a set of quadruplets of couples (C_t, C_{t+1}) ∈ I_t × I_{t+1} of highly correlated regions verifying the relative position constraint is defined in order to estimate the salient motion of the camera between the two frames considered. This motion is first modeled by a rigid transformation. Indeed, the translations in the x and y directions, the rotation around the optical axis of the camera and the scale factor relating the two successive frames are preliminarily estimated to model a rigid transformation between I_t and I_{t+1}. We only focus on the potential correlated regions which verify a spatial likeness between the quadrilaterals formed by their centers of gravity, in order to keep only meaningful background regions for estimating the salient motion of the camera, while rejecting non-stationary and
constantly insignificant motions of background objects. Note that our goal is not to match all the regions; we are only concerned with matching some regions correctly. The produced estimation is further refined using cost-efficient interest-point matching in order to derive a precise model of the camera motion, between each couple of successive frames, with eight projective transformation parameters (since we assume that the scene is planar). In fact, the preliminarily estimated translations of the camera are used to substantially delimit the search window of the potential homologous interest points. This increases the reliability of the interest-point matching, and thereby of the projective motion estimation, and decreases the combinatorial complexity of the projective motion estimation process.

Once the motion is reliably estimated between each couple of successive images, the second phase of the proposed framework is the joint background-foreground segmentation. It is mainly based on dynamic pixel/region grouping and splitting, which takes into account the spatial correlations of image pixels [14]. In fact, a preliminary panoramic-based model of the background I_B is first produced by applying the estimated motion to only a low-cardinality subset of key-frames. The resulting panorama can be considered only an initial estimation of the video background, since it deviates from the correct background color. Indeed, some parts of the moving objects, particularly the most rapid ones and those occluded by other moving objects, can partially appear in the produced mosaic as ghost-like traces, since they cover parts of the background which are never, or only partially, visible in the video stream. The motion relating two (not necessarily adjacent) frames (I_t, I_{t′}) (1 ≤ t < t′ ≤ n) is defined as the composition of the successive motions relating the couples of successive images (I_t, I_{t+1}), …, (I_{t′−1}, I_{t′}). This allows the overlap differences to be well characterized, thereby reducing parallax effects in moving-object detection [39]. Besides, the decomposition of the region-based motion estimation between two images into a set of successive motions relating successive images partially overcomes the challenge of foreground objects which are added to or removed from the scene. Then, given the initially estimated background I_B, each input key-frame I_t is re-sampled before being projected onto a re-sampled version of the initial background, and the difference map between the projected image I′_t and I_B approximately defines a binary mask F_t isolating the foreground objects in frame t. The parts of I′_t that differ significantly from I_B define an initialization of the foreground object masks, which will be refined later; where this condition is unfulfilled, the pixels are claimed to be background. In order to minimize the effects of the intrinsic capture noise and of the presence of small regions belonging to the ghost-like traces of the moving objects in the preliminary background image, we resize the images I′_t and I_B before generating the binary mask F_t. Next, F_t and I_B are cooperatively refined while considering the region map of the image I′_t. In fact, each foreground object mask can be reduced or augmented according to the regions belonging to this mask, which considerably reduces the semantic gap between the produced mask and the real foreground objects.
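As a rough illustration of two of the operations just described, composition of the successive motions and median-based initial background modeling followed by mask extraction, consider the following sketch (Python with OpenCV/NumPy). The helper names, the composition-order convention, the caller-supplied panorama size and the threshold are our assumptions, and the paper's re-sampling step is omitted:

    import cv2
    import numpy as np

    def chain_homographies(successive_H):
        # Compose the successive motions H_{t,t+1} into homographies
        # mapping every frame onto the reference plane of the first one
        # (the composition order depends on the adopted convention).
        chained = [np.eye(3)]
        for H in successive_H:
            H_acc = chained[-1] @ H
            chained.append(H_acc / H_acc[2, 2])  # keep a33 = 1
        return chained

    def initial_background(key_frames, chained_H, pano_size):
        # Warp the key-frames onto the common plane and take the
        # per-pixel median: moving objects visible in only a minority
        # of frames are suppressed to ghost-like traces at worst.
        warped = [cv2.warpPerspective(f, np.float64(H), pano_size)
                  for f, H in zip(key_frames, chained_H)]
        return np.median(np.stack(warped), axis=0).astype(np.uint8)

    def foreground_mask(projected_frame, background, thresh=40):
        # Rough binary mask F_t: pixels of the projected key-frame
        # that differ significantly from the background model
        # (thresh=40 is an illustrative value).
        diff = cv2.absdiff(projected_frame, background)
        return (diff.max(axis=-1) > thresh).astype(np.uint8) * 255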
This update process of the binary foreground masks is then followed by a similar process allowing a precise modeling of the final background. Moreover, contrary to previously proposed approaches dealing with shadow processing [12], shadows are not handled separately within the proposed framework. Indeed, regions representing shadow effects are first located inside the foreground object masks and then considered as semantic outliers, since they must be attributed neither to the background class (as their color properties differ from the background) nor to the foreground one. Moving shadows are thus iteratively detected, removed and replaced in the final background-foreground segmentation (Fig. 1).
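Our shadow handling is region-based; as a pixel-level point of comparison, a common HSV heuristic (in the spirit of Cucchiara et al., shown here as an illustrative stand-in rather than our method) flags foreground pixels that merely darken the background while roughly preserving its chromaticity. All thresholds below are indicative:

    import cv2
    import numpy as np

    def shadow_candidates(frame, background, fg_mask,
                          lum_lo=0.4, lum_hi=0.9, sat_tol=30, hue_tol=10):
        # Flag foreground pixels whose value (V) is attenuated within
        # [lum_lo, lum_hi] of the background's while hue and saturation
        # stay close; hue wrap-around is ignored for simplicity.
        hsv_f = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV).astype(np.int16)
        hsv_b = cv2.cvtColor(background, cv2.COLOR_BGR2HSV).astype(np.int16)
        ratio = (hsv_f[..., 2] + 1) / (hsv_b[..., 2] + 1)  # value ratio
        shadow = ((ratio >= lum_lo) & (ratio <= lum_hi) &
                  (np.abs(hsv_f[..., 1] - hsv_b[..., 1]) <= sat_tol) &
                  (np.abs(hsv_f[..., 0] - hsv_b[..., 0]) <= hue_tol) &
                  (fg_mask > 0))
        return shadow.astype(np.uint8) * 255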
180
Multimed Tools Appl (2010) 46:175–205
[Fig. 1 Flowchart of the proposed framework for joint background/foreground segmentation of complex videos: original sequence → region segmentation → camera motion compensation → initial background modeling and initial foreground modeling → joint background-foreground model refinement (foreground model refinement, background model refinement, shadow processing) → final refined moving visual objects model and final background model]
By using image segmentation as a pre-processing step, we combine, in the joint background-foreground segmentation phase, the traditional pixel-wise labeling procedure with a lower-dimensional binary labeling procedure. This bridges the modeling gap between the semantic background-foreground scene model and the produced video segmentation, since regions efficiently capture the semantic content of an object, particularly when it is in motion. We note that the final background-foreground segmentation depends strongly on the initial background, which is itself dependent on the camera motion compensation. As we only considered quadruplets of regions achieving high correlation scores and verifying the relative position constraint, in most cases only salient background regions respecting the general camera motion are used to estimate the salient motion parameters. This minimizes the portions of foreground objects that can appear in the initial estimated background, which thereafter optimizes and accelerates the joint background-foreground separation.

Notice that, in what follows, we illustrate the results of each step of the proposed framework using the “volley-ball” video sequence (composed of 52 frames of resolution 680×425), since it illustrates the main aforementioned challenges of background-foreground segmentation of videos filming large non-rigid foreground objects under a freely moving camera. In fact, the camera makes many motions (translation, rotation, zoom and viewpoint changes). Besides, the background and foreground of this sequence are composed of differently colored and textured objects, and the color ranges of the different background and/or foreground objects strongly overlap. This sequence is also characterized by a variety of moving objects (small vs. large and rapid vs. slow), with different color and texture features, on a complex background. As a matter of fact, each player in this sequence is a highly deformable non-rigid object that can suddenly change its appearance and its speed, and it often overlaps with other moving objects (players, referees and/or supporters). It is also a good test sequence for shadow identification and removal under different lighting and coloring effects. Finally, the static-background restriction is relaxed in the “volley-ball” sequence, since a great proportion of the background (supporters) is in micro-motion, which is semantically non-meaningful and must not be considered during the background modeling.
3 Camera motion compensation

The goal of this phase is to compensate the camera motion H_{t,t+1} (1 ≤ t < n) between each couple of successive frames (I_t, I_{t+1}) ∈ V × V, while combining the pixel and region levels. The motion models most commonly used are rigid, affine and projective. Other, more complicated transformations, producing better estimations of the global motion model, are much more complex and more sensitive to errors [1]. We adopt here the projective transformation, which is frequently used to describe global object motion as a geometric transformation in the image plane [17], since we exploit the assumption of a planar background. Thus, the mapping from the world coordinate system to the image coordinate system can be described by a plane-to-plane mapping (homography) [23]. Choosing the coordinate system such that the background plane is located at z = 0, and using homogeneous coordinates, which are scaling invariant (a_{33} = 1), the camera motion is modeled by the 8-parameter projective motion model, which is the model most used for arbitrary rotation and zoom [35], guaranteeing a good trade-off between complexity and representativity [27]. Indeed, the transformation H_{t,t′} (homography) between two input frames I_t and I_{t′} is the composition (1) of the homographies H_t and H_{t′} describing respectively the mapping of I_t and I_{t′} onto the background plane:

H_{t,t'} = H_t \circ H_{t'} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & 1 \end{pmatrix}    (1)
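Numerically, Eq. (1) amounts to a 3×3 matrix product followed by a re-normalization that exploits the scale invariance of homogeneous coordinates. A minimal sketch (helper names are ours):

    import numpy as np

    def compose(H_a, H_b):
        # Compose two planar homographies (Eq. 1) and re-normalize
        # so that the bottom-right element a33 stays equal to 1.
        H = H_a @ H_b
        return H / H[2, 2]

    def apply_homography(H, x, y):
        # Map an image point (x, y) through a homography, dividing
        # by the homogeneous coordinate.
        p = H @ np.array([x, y, 1.0])
        return p[0] / p[2], p[1] / p[2]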
To estimate the projective transformation relating two images, the existing approaches can be classified into two main classes. The first one, namely direct methods, does not use a feature detection process: the estimation operates directly on the image intensities. The second class, known as feature-based methods, is based on the extraction and matching of salient and distinctive features. Feature-based registration methods have many advantages over direct ones. In fact, they do not require initialization steps, they can handle small overlapping regions and they are robust to lighting changes. Moreover, they are more flexible regarding image rotation, noise, zoom and moving objects [13]. The most used features in the literature are regions and interest points. The majority of the existing approaches use interest points, due to the precision of their detection. Nevertheless, their matching is the most critical task, because of its huge computational cost. Besides, the radiometric attributes associated with a point are not sufficiently discriminative and can vary a lot with lighting changes. Furthermore, the geometric position of a point moves unpredictably when the motion is undefined (particularly for rotation around the optical axis and for scale factor changes) and when no a priori knowledge is available [58]. On the other hand, regions are the 2D structures with the richest semantic attributes and the most stable under any transformation. However, region extraction methods are not sufficiently precise at the borders. Thus, our idea is to exploit the duality between regions and interest points in order to benefit from the complementary advantages of these two primitives. Indeed, for the estimation of the transformation H_{t,t+1} relating two successive frames, we used our efficient automatic registration approach based on the matching of regions and Harris points [58]. Salient region matching is first used to estimate the global motion of the camera, while considering relationships and correlations amongst nearby pixels. This motion is preliminarily modeled as a rigid transformation (2), which is the composition of a rotation α around z (the optical axis), a translation along x, a translation along y and a scale factor s (due to translation along z):
\begin{pmatrix} x_{t+1} \\ y_{t+1} \end{pmatrix} = s \cdot \begin{pmatrix} \cos(\alpha) & -\sin(\alpha) \\ \sin(\alpha) & \cos(\alpha) \end{pmatrix} \begin{pmatrix} x_t \\ y_t \end{pmatrix} + \begin{pmatrix} \Delta x \\ \Delta y \end{pmatrix}    (2)
To estimate the rigid transformation, the original color images are first coarsely segmented into regions (Fig. 2). To do this, we first use an unsupervised, unseeded watershed algorithm on the gradient image [52], which has proved its efficiency in obtaining initial partitions for a wide range of image types [22]. Since general videos are acquired by standard color cameras, the color space used during the watershed pre-segmentation process is the basic RGB space. Besides, in order to overcome the over-segmentation effects usually caused by watershed-based segmentation techniques, a post-processing step merges neighboring over-segmented regions which are “similar”. To characterize the inter-region similarity, we compare region features in the L*a*b* color space (also called CIELAB or CIE L*a*b*), which enables us to quantify visual color differences. This color space is obtained through a non-linear transformation designed to achieve perceptual uniformity, which makes perceptual color differences easy to determine and provides a good simulation of the color discrimination of the human visual system [32]. Then, a prediction/validation registration process is applied to the two region maps obtained, relative to the couple of images (I_t, I_{t+1}), in order to keep only the set Ω of couples (C_t, C_{t+1}) ∈ I_t × I_{t+1} achieving high correlation scores and moving in a consistent direction for a period of time. These regions combine spatial and photometric coherency to define the salient rigid motion of the camera.
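A minimal sketch of this pre-segmentation pipeline, using scikit-image (assuming the ≥ 0.20 module layout; the merging threshold is indicative, and our actual similarity criterion is richer than a plain mean-color distance):

    import numpy as np
    from skimage import color, filters, graph, segmentation

    def coarse_regions(rgb_image, merge_thresh=15.0):
        # Unseeded watershed on the gradient image gives an
        # over-segmented initial partition.
        gradient = filters.sobel(color.rgb2gray(rgb_image))
        labels = segmentation.watershed(gradient)
        # Merge neighboring regions whose mean colors, compared in
        # L*a*b*, are closer than merge_thresh (a Delta-E-like value),
        # via a region adjacency graph.
        lab = color.rgb2lab(rgb_image)
        rag = graph.rag_mean_color(lab, labels)
        return graph.cut_threshold(labels, rag, merge_thresh)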
The prediction is made by measuring the correlation scores, and the validation is achieved by verifying the relative position constraint for four couples of highly correlated regions which exhibit a spatial likeness between the quadrilaterals formed by their centers of gravity [58] (Fig. 2). As regions “touching” the image borders can often produce false matches, since they may not have correspondents in the next frame, they are not considered during the prediction/validation process. The retained regions (⊂ Ω) are then used to describe the salient rigid motion of the background while discarding uninteresting motion. In fact, from the analysis of the relative positions of the matched regions, we can partially estimate the translations in the x and y directions, the angle α of the camera rotation around the optical axis and the scale factor s relating the frames I_t and I_{t+1}. In fact, the angle α (3) between the two vectors \vec{G_t^q G_t^u} and \vec{G_{t+1}^p G_{t+1}^v} (where (R_t^q, R_{t+1}^p) and (R_t^u, R_{t+1}^v) are the two couples of regions recording the best correlation scores among the set Ω, and G_j = (x_j, y_j) denotes the gravity center of a given region R_j) represents a first estimation of the rotation angle. Besides, the ratio (4) of the distances between the centers of gravity (G_t^q, G_t^u) and (G_{t+1}^p, G_{t+1}^v) can be considered as a first estimation of the scale factor s between the two consecutive frames:

\alpha = \arctan\left(\frac{x_{t+1}^v - x_{t+1}^p}{y_{t+1}^v - y_{t+1}^p}\right) - \arctan\left(\frac{x_t^u - x_t^q}{y_t^u - y_t^q}\right)    (3)

s = \frac{\left\| \vec{G_t^q G_t^u} \right\|_2}{\left\| \vec{G_{t+1}^p G_{t+1}^v} \right\|_2}    (4)
Then, the combination of the rotation by the angle α with the scale factor s is applied to each pixel of the image I_{t+1}, using a bi-cubic intensity interpolation (the aligned image is denoted I′_{t+1}). This is done by projecting the image I_{t+1} onto the plane of the image I_t in order to cancel the rotation and zoom effects caused by the camera motion between these two frames. Last, once the two frames are defined on the same plane (that of the image I_t), the translations in the x (Δx) and y (Δy) directions are defined by (5):
\Delta_{\text{direction}} = \max_{(R_t^j, R_{t+1}^k) \in \Omega} d_{\text{direction}}^{jk} \; - \; \min_{(R_t^j, R_{t+1}^k) \in \Omega} d_{\text{direction}}^{jk}    (5)
where d_x^{jk} (resp. d_y^{jk}) denotes the Euclidean distance between the x-coordinates (resp. y-coordinates) of the gravity centers of the couple of associated regions (R_t^j, R_{t+1}^k) ∈ Ω. The suggested region-based method for the preliminary estimation of the camera motion between two successive frames makes it possible to maintain a plausible rigid motion model despite the possible concurrent motions of foreground and background pixels. It has proved its efficiency under many changes in imaging conditions (scale and rotation changes, illumination, image blur and viewpoint changes) while minimizing the intervention of moving foreground objects during the camera-motion compensation process. However, the rigid motion model is used only as an approximation of the general motion of the camera, since it cannot handle camera pan and tilt and is only practical for very small pan and tilt angles at large focal lengths [17]. Thus, to refine the global motion obtained at the first stage, Harris points of interest [58] in the images I_t and I′_{t+1} are employed to derive a precise projective model approximating the camera motion. In fact, for each point of interest (x, y) in I_t, we use the already estimated translations of the camera, Δx and Δy, to predict a reduced search window of its most likely homologous correspondents.
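Pulling Eqs. (3)-(5) together, the preliminary rigid parameters can be computed from the matched gravity centers roughly as follows. Variable names are ours, np.arctan2 replaces the plain arctangent of Eq. (3) for numerical robustness, and the equations are reconstructed from the text, so treat this as a sketch:

    import numpy as np

    def rigid_from_regions(G_t, G_t1, best=(0, 1)):
        # G_t, G_t1: (N, 2) arrays of gravity centers of the matched
        # region couples of the set Omega, in frames t and t+1;
        # `best` indexes the two couples with the highest correlation
        # scores (regions q/u in frame t, matched to p/v in frame t+1).
        q, u = best
        v_t = G_t[u] - G_t[q]      # vector between best centers, frame t
        v_t1 = G_t1[u] - G_t1[q]   # corresponding vector, frame t+1
        # Eq. (3): rotation angle as the difference of orientations.
        alpha = np.arctan2(v_t1[1], v_t1[0]) - np.arctan2(v_t[1], v_t[0])
        # Eq. (4): scale factor as the ratio of the vector lengths.
        s = np.linalg.norm(v_t) / np.linalg.norm(v_t1)
        # Eq. (5): translation per axis from the spread of the
        # coordinate distances over all matched couples.
        d = np.abs(G_t - G_t1)
        delta_x = d[:, 0].max() - d[:, 0].min()
        delta_y = d[:, 1].max() - d[:, 1].min()
        return alpha, s, delta_x, delta_y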
[Fig. 2 Segmentation into regions and region matching. a Two successive frames of the input video sequence, b corresponding region maps, c the set Ω of couples achieving high correlation scores]
The search is restricted to a reduced-size window in the image I′_{t+1}, centered on (x′, y′) = H_{t,t+1}(x, y) and of dimensions Δx and Δy. This framing of the search space of the homologous interest points optimizes the interest-point matching results, while avoiding a blind search for correspondences between the interest points of two successive images. Indeed, when the motion around the optical axis of the camera is significant, interest-point matching using one of the traditional correlation scores is impossible. Even for a reduced rotation angle, interest-point matching remains a very difficult task because of the blind search for homologous interest points over the whole image. Advantageously, using the proposed registration approach, and for a large set of 640×480 videos, a reduction of the search window to a mean size of 5×5 has been achieved, which makes the estimation of the motion parameters more reliable [58]. Next, given the estimated search space of each interest point, the matching is based on computing zero-mean normalized cross-correlation scores between the potential homologous points, followed by the verification of the uniqueness constraint (Fig. 3). Moreover, the RANSAC consensus [19] is used to identify and remove matching outliers (Fig. 3). Last, to estimate the eight projective motion parameters, we adopt the QR-factorization technique followed by a relaxation algorithm to iteratively minimize the sum of distances between all matching pairs (Fig. 3).
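A sketch of this guided matching step (Python with OpenCV): cv2.matchTemplate with TM_CCOEFF_NORMED computes zero-mean normalized cross-correlation, and cv2.findHomography with cv2.RANSAC fits the 8-parameter model. The acceptance threshold is illustrative, and the uniqueness check and the final QR/relaxation refinement are left out:

    import cv2
    import numpy as np

    def match_and_estimate(points_t, img_t, img_t1, H_rigid, dx, dy, r=3):
        # For each Harris point of I_t (assumed grayscale images and
        # points at least r pixels from the borders), search its
        # ZNCC-best match inside a small window of I'_{t+1} predicted
        # by the preliminary rigid estimate (H_rigid as a 3x3 matrix).
        src, dst = [], []
        h, w = img_t1.shape[:2]
        for (x, y) in points_t:
            p = H_rigid @ np.array([x, y, 1.0])   # predicted position
            px, py = int(p[0] / p[2]), int(p[1] / p[2])
            # Reduced search window of half-sizes dx, dy around it.
            x0, x1 = max(px - dx, r), min(px + dx, w - r - 1)
            y0, y1 = max(py - dy, r), min(py + dy, h - r - 1)
            if x1 <= x0 or y1 <= y0:
                continue
            template = img_t[y - r:y + r + 1, x - r:x + r + 1]
            window = img_t1[y0 - r:y1 + r + 1, x0 - r:x1 + r + 1]
            scores = cv2.matchTemplate(window, template, cv2.TM_CCOEFF_NORMED)
            _, score, _, loc = cv2.minMaxLoc(scores)
            if score > 0.8:                        # illustrative threshold
                src.append((x, y))
                dst.append((x0 + loc[0], y0 + loc[1]))
        if len(src) < 4:
            return None, None                      # not enough matches
        # RANSAC rejects the remaining outliers while fitting the
        # 8-parameter projective model.
        return cv2.findHomography(np.float32(src), np.float32(dst),
                                  cv2.RANSAC, 3.0)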
[Fig. 3 Interest point detection for the estimation of the projective motion model. a Interest points extracted with the Harris detector from the two successive images used in Fig. 2, b homologous interest points, c the homologous interest points kept after applying RANSAC]
This approach has proved, both subjectively and objectively [58], its capability of precise projective global motion estimation and its robustness against camera motions, the presence of moving objects, variations in illumination, acquisition conditions and imaging noise (Fig. 4).

[Fig. 4 Difference map between the two successive frames of Fig. 2. a The camera motion is not compensated, b the projective-modeled motion is compensated]

The adopted region-based motion compensation method fails only if it does not
match at least four couples of regions correctly, which rarely occurs in video streams, as the overlapping area between two successive images is usually sufficient [58]. Theoretically, if this were the case, we would focus only on interest-point matching. Indeed, the matching is immediate for the residual points in the small overlapping area, since the number of these points is extremely reduced and there is almost no ambiguity about the matching of each interest point.
4 Joint background-foreground segmentation

The first stage of the proposed joint background-foreground segmentation phase consists in aligning a set of m (m