Unsupervised Tracking of Stereoscopic Video Objects Employing Neural Networks Retraining

Anastasios D. Doulamis, Klimis S. Ntalianis, Nikolaos D. Doulamis, Kostas Karpouzis and Stefanos D. Kollias
National Technical University of Athens, Electrical and Computer Engineering Department
9, Heroon Polytechniou str., Zografou 15773, Athens, Greece
E-mail: {adoulam, kntal, ndoulam}@image.ntua.gr

Abstract

A novel approach is presented in this paper for improving the performance of neural network classifiers in video object tracking applications, based on a retraining procedure at the user level. The procedure includes (a) a retraining algorithm for adapting the network weights to the current conditions, (b) semantically meaningful object extraction, which plays the role of the retraining set, and (c) a decision mechanism for determining when network retraining should be activated. The retraining algorithm takes into consideration both the former and the current network knowledge in order to achieve good generalization and reduce retraining time. Object extraction is accomplished by utilizing depth information, provided by stereoscopic video, and incorporating a multiresolution implementation of the Recursive Shortest Spanning Tree (RSST) segmentation algorithm. Finally, the decision mechanism in this framework depends on a scene change detection algorithm. Results are presented which illustrate the performance of the proposed approach in real life experiments.

Keywords: neural network, tracking, stereoscopic video sequences

1 Introduction

The success of the new emerging multimedia applications, such as video editing, content-based image retrieval, video summarization, object-dependent transmission and video surveillance, depends on the development of new sophisticated algorithms for efficient description, segmentation and representation of the visual content [1]. Such a content-based approach offers a new range of capabilities in terms of access, identification and manipulation of the visual information [2]. In particular, (a) it provides high compression ratios by allowing the encoder to place more emphasis on objects of interest [2], [3], (b) it offers multimedia capabilities and interactivity, since an object can be handled independently, and (c) it facilitates sophisticated video queries and content-based retrieval operations on image/video databases [4].

The MPEG-4 standard introduced the concept of Video Objects (VOs) for content-oriented description and coding of video sequences. Each VO consists of arbitrarily shaped regions with different color, texture or motion. Content-based segmentation therefore remains a challenging task for many applications, apart, perhaps, from the case of video sequences produced in a studio environment using the chroma-key technology. In stereoscopic video, however, the problem of content-based segmentation can be addressed more effectively. This is due to the fact that depth information can be estimated more reliably and provides an efficient content description, since a video object is usually located on a specific depth plane [5]. Furthermore, neural networks have not played a significant role in the development of video coding standards, such as MPEG-1 and MPEG-2. Nevertheless, their superior non-linear classification abilities can make neural networks a major analysis tool in the forthcoming multimedia-oriented standards (MPEG-4 and MPEG-7).

Several techniques and algorithms have been proposed in the literature for image segmentation and tracking. Some color-oriented methods have recently been proposed, based on the morphological watershed [6] or on split and merge techniques [7]. However, an intrinsic property of video objects is that they usually consist of regions of totally different color characteristics; consequently, the main problem of any color-oriented segmentation scheme is that it oversegments an object into multiple color regions. Other segmentation schemes are motion-oriented, such as the algorithms proposed in [8], [9] and [10]. Although motion description provides a more reliable content representation than color information, object boundaries cannot be identified with high accuracy, mainly due to erroneous estimation of the motion vectors. Other, much faster, semi-automatic tracking approaches are presented in [11], [12], [13] and [14]. In these methods the current dynamic state of an object is based on the estimate at the previous time instance: the user initially selects an object of interest and the algorithm then follows the object through time, exploiting motion information of the curve describing the object.

In this paper, an efficient algorithm for content-based tracking is proposed, oriented to stereoscopic video sequences. The adopted method is a fully automatic (unsupervised) scheme. Initially, a video sequence is processed and scene change detection is performed. After finding the different shots, some of the first pairs of frames of a given shot are analyzed (without supervision or by user interaction) and a depth map without occlusions is produced for each stereo pair. Then, a multiresolution implementation of the Recursive Shortest Spanning Tree segmentation algorithm (M-RSST) is applied to the depth map and to the left or right image channel to produce the segmented depth and color maps. Next, a segmentation fusion algorithm is employed, based on the projection of color segments onto depth segments, so that video objects with reliable boundaries are extracted. After finding the video objects, the retraining algorithm uses the new information to optimally adjust the network weights [15]. Finally, the retrained network is applied to the rest of the frames within the shot, to classify whether image blocks belong to the object of interest or to the background. The decision mechanism simply decides whether retraining is needed or not; more specifically, when a scene change is detected, the decision mechanism signals a new retraining phase. Experimental results on real-life stereoscopic video sequences indicate the prominent performance of the proposed scheme.

2 Problem Formulation

Let x_{i,j}, i = 0,…,M_1−1, j = 0,…,M_2−1 denote the image intensity at pixel (i, j). In most image/video coding and analysis applications the image is processed in partitions, or blocks, of, say, 8x8 pixels. For classification purposes it is required to classify each block, or a transformed version of it of the same size, to one of, say, p available classes ω_i, i = 1,2,…,p. Let x_i be a vector containing the lexicographically ordered values of the ith block to be classified. A neural network classifier will produce a p-dimensional output vector y(x_i) defined as follows

y(x_i) = [ p^i_{ω_1}  p^i_{ω_2}  …  p^i_{ω_p} ]^T     (1)

where p^i_{ω_j} denotes the probability that the ith image block belongs to the jth class. Let us first consider that a neural network has been initially trained to perform the previously described classification task using a specific training set, say, S_b = {(x'_1, d'_1), …, (x'_{m_b}, d'_{m_b})}, where vectors x'_i and d'_i with i = 1,2,…,m_b denote the ith input training vector, i.e. the ith image block or a transformed version of it, and the corresponding desired output vector consisting of p elements. Let y(x_i) denote the network output when applied to the ith block of an image outside the training set.

Whenever a change of the environment occurs, new network weights should be estimated through a retraining procedure, taking into account both the former network knowledge and the current situation. Let us consider retraining in more detail. Let w_b include all weights of the network before retraining, and w_a the new weight vector that is obtained through retraining. A training set S_c is assumed to be extracted from the current operational situation, composed of, say, m_c image blocks of 8x8 pixels; S_c = {(x_1, d_1), …, (x_{m_c}, d_{m_c})}, where x_i and d_i with i = 1,2,…,m_c similarly correspond to the ith input and desired output retraining data. The retraining algorithm, whenever a change of the environment is detected, computes the new network weights w_a by minimizing the following error criterion with respect to the weights,

E_a = E_{c,a} + η E_{f,a}     (2)

with

E_{c,a} = (1/2) Σ_{i=1}^{m_c} || z_a(x_i) − d_i ||^2     (2a)

and

E_{f,a} = (1/2) Σ_{i=1}^{m_b} || z_a(x'_i) − d'_i ||^2     (2b)

where E_{c,a} is the error performed over training set S_c ("current" knowledge), E_{f,a} the corresponding error over training set S_b ("former" knowledge); z_a(x_i) and z_a(x'_i) are the outputs of the retrained network, consisting of weights w_a, corresponding to input vectors x_i and x'_i respectively. Similarly, z_b(x_i) would represent the output of the network, consisting of weights w_b, when accepting vector x_i at its input; when retraining the network for the first time, z_b(x_i) is identical to y(x_i). Parameter η is a weighting factor accounting for the significance of the current training set compared to the former one.
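To make the combined criterion (2) concrete, the following minimal sketch evaluates E_a for a candidate network output function; the NumPy representation and the toy data are illustrative assumptions, not part of the paper.

```python
import numpy as np

def retraining_error(z, current_set, former_set, eta):
    """Eq. (2): E_a = E_{c,a} + eta * E_{f,a} for a candidate network z(.).

    current_set: list of (x_i, d_i) pairs from S_c ("current" knowledge)
    former_set:  list of (x_i, d_i) pairs from S_b ("former" knowledge)
    """
    e_current = 0.5 * sum(np.sum((z(x) - d) ** 2) for x, d in current_set)  # Eq. (2a)
    e_former = 0.5 * sum(np.sum((z(x) - d) ** 2) for x, d in former_set)    # Eq. (2b)
    return e_current + eta * e_former

# Toy usage with a dummy "network" acting on 8x8 blocks flattened to 64 values
z = lambda x: np.tanh(x.mean(keepdims=True))
S_c = [(np.random.rand(64), np.array([1.0]))]
S_b = [(np.random.rand(64), np.array([-1.0]))]
print(retraining_error(z, S_c, S_b, eta=0.3))
```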

3 The Retraining Technique

3.1 The Network Structure

Let us, for simplicity, consider i) a two-class classification problem, where classes ω_1, ω_2 refer, for example, to foreground and background objects in an image, and ii) a feedforward neural network classifier which includes a single output neuron providing classification in two categories, one hidden layer consisting of q neurons, and accepts image blocks of, say, J pixels at its input. Let us also ignore neuron thresholds. Extension to classification problems and networks of higher complexity can be performed in a similar way. Let w^1_{a,b} denote the q × 1 vector of weights between the output and hidden neurons, and w^0_{k,{a,b}}, k = 1,2,…,q, denote the J × 1 vector of weights between the kth hidden neuron and the network inputs, where subscripts {a, b} refer to the situation either "after" or "before" retraining respectively. Then

w_{a,b} = [ (w^0_{1,{a,b}})^T … (w^0_{q,{a,b}})^T (w^1_{a,b})^T ]^T     (3)

is a vector containing all network weights. For a given input vector x_j, corresponding to the jth image block, the output of the final neural network classifier is given by

z_{a,b}(x_j) = f( (w^1_{a,b})^T · u_{a,b}(x_j) )     (4)

where f(·) denotes the sigmoidal function and u_{a,b}(x_j) = [ u_{1,{a,b}}(x_j) … u_{q,{a,b}}(x_j) ]^T is a vector containing the hidden neuron outputs when the network weights are w_b (before retraining) or w_a (after retraining). The network output in (4) is scalar, since we have assumed a single network output. The output of the ith neuron of the first hidden layer can be written in terms of the input vector and the weights w^0_{i,{a,b}} as u_{i,{a,b}}(x_j) = f( (w^0_{i,{a,b}})^T · x_j ). Thus,

u_{a,b}(x_j) = f( (W^0_{a,b})^T · x_j )     (5)

where W^0_{a,b} is a J × q matrix defined as W^0_{a,b} = [ w^0_{1,{a,b}} … w^0_{q,{a,b}} ] and f(·) is a vector-valued function whose elements represent the activation function of the corresponding hidden neuron. The hyperbolic tangent sigmoidal function is used in this paper.
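The forward pass of (3)-(5) can be sketched as follows; the layer sizes, the use of NumPy and the random weights are illustrative assumptions only.

```python
import numpy as np

def hidden_outputs(W0, x):
    # Eq. (5): u(x) = f((W0)^T x), with f the hyperbolic tangent applied element-wise
    return np.tanh(W0.T @ x)

def network_output(W0, w1, x):
    # Eq. (4): scalar output z(x) = f((w1)^T u(x)) of the single output neuron
    return np.tanh(w1 @ hidden_outputs(W0, x))

# Illustrative dimensions: J = 64 inputs (an 8x8 block), q = 10 hidden neurons
J, q = 64, 10
rng = np.random.default_rng(0)
W0 = rng.normal(size=(J, q))   # weights between inputs and hidden neurons
w1 = rng.normal(size=q)        # weights between hidden and output neuron
x = rng.normal(size=J)         # lexicographically ordered block values
print(network_output(W0, w1, x))
```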

3.2 The Retraining Algorithm

The goal of the training procedure is to minimize (2) and estimate the new network weights w_a, i.e., W^0_a and w^1_a. Let us first assume that a small perturbation of the network weights (before retraining) w_b is enough to achieve good classification performance. Then,

W^0_a = W^0_b + ΔW^0  and  w^1_a = w^1_b + Δw^1     (6)

where ΔW^0 and Δw^1 are small increments. This assumption leads to an analytical and tractable solution for estimating w_a, since it permits linearization of the non-linear activation function of the neuron, using a first-order Taylor series expansion. To stress the importance of the current training data in (2), one can replace (2a) by the constraint that the actual network outputs are equal to the desired ones, that is,

z_a(x_i) = d_i,  i = 1,…,m_c,  for all data in S_c     (7)

It can be shown through linearization that solution of (7) with respect to the weight increments is equivalent to a set of linear equations [15]

c = A · Δw     (8)

where Δw = [ (Δw^0)^T (Δw^1)^T ]^T, with Δw^0 = vec{ΔW^0} denoting a vector formed by stacking up all columns of ΔW^0; vector c and matrix A are appropriately expressed in terms of the previous network weights. In particular,

c = [ z_a(x_1) … z_a(x_{m_c}) ]^T − [ z_b(x_1) … z_b(x_{m_c}) ]^T

expressing the difference between the network outputs after and before retraining for all input vectors in S_c. Based on (7), vector c can be written as

c = [ d_1 … d_{m_c} ]^T − [ z_b(x_1) … z_b(x_{m_c}) ]^T     (9)

Equation (8) is valid only when the weight increments Δw are small quantities. It can be proved that, given a tolerated error value, proper bounds ϑ and φ can be computed for the weight increments, for each input vector x_i in S_c:

−ϑ(x_i) ≤ (Δw^0_k)^T · x_i ≤ ϑ(x_i),  k = 1,2,…,q  and  −φ(x_i) ≤ (Δw)^T · a(x_i) ≤ φ(x_i)     (10)

where a is a vector depending on the former weights and on the input vector. Equation (10) should be satisfied for all x_i in S_c, so as to assure that each linear equation in (8) is valid within the given tolerated error. A less strict approach would require that the mean value of the inner products in (10), over all x_i, be smaller than the mean value of the bounds.

The dimension of vector c is in general smaller than the number of unknown weights Δw, since a generally small number of training data m_c is chosen. Uniqueness is imposed by an additional requirement, which is due to the term E_{f,a} in (2b). In particular, Eq. (8) is solved with the requirement of minimum degradation of the previous network behavior, i.e., of minimization of the following error criterion

E_S = E_{f,a} − E_{f,b}     (11)

with E_{f,b} defined similarly to E_{f,a}, with z_a replaced by z_b in the right-hand side of (2b). It can be shown [15] that (11) takes the form

E_S = (1/2) (Δw)^T · K^T · K · Δw     (12)

where the elements of matrix K are expressed in terms of the previous network weights w_b and the training data in S_b. Thus, the problem results in minimization of (12) subject to the constraints (8) and (10). The error function defined by (12) is convex since it is of squared form. The constraints in (8) are linear equalities, while the constraints in (10) are linear inequalities. Thus, the solution should lie on the hypersurface defined by (8), satisfy the inequalities (10) and simultaneously minimize the error function given in (12). A variety of methods can be used to estimate the weight increments based on minimization of (12) subject to (8) and (10). The gradient projection method has been adopted in this paper [16].

As described in Section 5, the decision mechanism ascertains whether a small adaptation of the network weights is sufficient to provide accurate classification. In case such a small adaptation is not adequate, the above training algorithm cannot be used, since the activation function cannot be approximated by a first-order Taylor series. In this case, neither (7) nor (11) can be expressed in a simple form and thus the new network weights should be estimated through conventional minimization of (2) by applying, for example, the backpropagation algorithm.
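The constrained minimization of (12) subject to (8) and (10) can be illustrated with a generic quadratic-programming call; note that this sketch uses SciPy's SLSQP solver instead of the gradient projection method of [16], and it assumes the matrices K, A, c and the inequality bounds (written here as a generic matrix G and vector h) are already available.

```python
import numpy as np
from scipy.optimize import LinearConstraint, minimize

def retraining_increment(K, A, c, G, h):
    """Minimize (1/2) dw^T K^T K dw  subject to  A dw = c  and  -h <= G dw <= h.

    K, A, c follow Eqs. (8)-(12); G and h stand for the inequality constraints
    of Eq. (10) written in matrix form (an assumption made purely for illustration).
    """
    n = K.shape[1]
    objective = lambda dw: 0.5 * dw @ (K.T @ K) @ dw
    gradient = lambda dw: (K.T @ K) @ dw
    constraints = [
        LinearConstraint(A, c, c),   # equality constraints of Eq. (8)
        LinearConstraint(G, -h, h),  # inequality bounds of Eq. (10)
    ]
    result = minimize(objective, np.zeros(n), jac=gradient,
                      method="SLSQP", constraints=constraints)
    return result.x

# Small synthetic instance, just to show the call pattern
rng = np.random.default_rng(1)
n, m = 8, 3
K = rng.normal(size=(5, n))
A = rng.normal(size=(m, n))
c = 0.1 * rng.normal(size=m)
G = np.eye(n)
h = np.full(n, 0.5)
dw = retraining_increment(K, A, c, G, h)
print(np.linalg.norm(A @ dw - c))   # residual of the equality constraints
```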

4 Retraining Set Extraction

In the previous section the retraining algorithm was presented; once the retraining set has been estimated, the network can be directly retrained. In this section the extraction of the retraining set is described for a given shot.

4.1 The M-RSST Segmentation Algorithm

Let us consider that we have extracted a reliable depth plane of a stereo pair, after performing the depth estimation method presented in [17]. A multiresolution implementation of the Recursive Shortest Spanning Tree (RSST [18]) algorithm, called M-RSST, is then used both for color and depth segmentation. The M-RSST recursively applies the RSST to images of increasing resolution. Consider an image I of size M_0 × N_0 pixels. Initially, a multiresolution decomposition of the image I is performed down to a lowest resolution level L_0, so that a hierarchy of images I(0) = I, I(1), …, I(L_0) is constructed. I(1) is of size M_0/2 × N_0/2, while I(L_0) is of size M_0/2^{L_0} × N_0/2^{L_0}.
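A small sketch of the decomposition step described above; decimation by 2x2 block averaging is an assumption made here for illustration, as the paper does not specify the decimation filter.

```python
import numpy as np

def build_pyramid(image, L0):
    """Build the hierarchy I(0)=I, I(1), ..., I(L0), halving each dimension per level.

    Decimation by 2x2 block averaging is an illustrative choice only.
    """
    pyramid = [image.astype(float)]
    for _ in range(L0):
        prev = pyramid[-1]
        h, w = prev.shape[0] // 2 * 2, prev.shape[1] // 2 * 2
        blocks = prev[:h, :w].reshape(h // 2, 2, w // 2, 2)
        pyramid.append(blocks.mean(axis=(1, 3)))
    return pyramid  # pyramid[L0] is the lowest-resolution image

# Example: a 512x512 image decomposed down to level L0 = 3 (64x64)
levels = build_pyramid(np.zeros((512, 512)), L0=3)
print([p.shape for p in levels])
```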

Next, the image at the lowest resolution level is partitioned into regions (segments) of size 1x1 pixel, i.e., M_0/2^{L_0} × N_0/2^{L_0} segments are first created. Then, links are generated for all 4-connected region pairs and a weight is assigned to each link, equal to the distance between the two adjacent regions, say S_1 and S_2. The Euclidean distance of the color or depth values of the two adjacent segments, weighted by the harmonic mean of their areas, is used as the weight function for color and depth segmentation respectively:

δ_c(S_1, S_2) = || c(S_1) − c(S_2) || · a(S_1) a(S_2) / ( a(S_1) + a(S_2) )     (13a)

δ_d(S_1, S_2) = | d(S_1) − d(S_2) | · a(S_1) a(S_2) / ( a(S_1) + a(S_2) )     (13b)

where δ_c(·) expresses the distance for color segmentation, while δ_d(·) expresses the distance for depth segmentation. The function a(·) returns the area, i.e., the number of pixels, of a segment, d(·) corresponds to the average depth and c(·) is a 3 × 1 vector containing the average color components in the RGB or YCrCb color space. Finally, the weights of all links are sorted in ascending order.
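A minimal sketch of the link-weight functions (13a) and (13b), assuming each segment is summarized by its pixel count, average color and average depth (the data structure is illustrative, not the paper's implementation):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Segment:
    area: int            # a(S): number of pixels
    color: np.ndarray    # c(S): 3x1 average color (RGB or YCrCb)
    depth: float         # d(S): average depth

def color_link_weight(s1, s2):
    # Eq. (13a): Euclidean color distance weighted by a1*a2/(a1+a2)
    w = s1.area * s2.area / (s1.area + s2.area)
    return np.linalg.norm(s1.color - s2.color) * w

def depth_link_weight(s1, s2):
    # Eq. (13b): absolute depth difference with the same area weighting
    w = s1.area * s2.area / (s1.area + s2.area)
    return abs(s1.depth - s2.depth) * w

s1 = Segment(4, np.array([120.0, 60.0, 30.0]), 2.5)
s2 = Segment(9, np.array([118.0, 64.0, 28.0]), 2.6)
print(color_link_weight(s1, s2), depth_link_weight(s1, s2))
```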

Figure 1: Reliable depth map estimation. (a) The left image channel, (b) the right image channel and (c) the depth map.

At this point phase 1 is completed and phase 2 is activated. Initially, the RSST iteration takes place. In particular, the following steps are involved (see the sketch below):
1. The two closest segments (the segments corresponding to the least-weight link) are merged, and the average intensity as well as the size of the newly created segment are calculated.
2. The link weights of the new segment to all its neighbors are recalculated and sorted.
3. Any duplicate links are removed.
The RSST iteration phase terminates when either the total number of segments or the minimum link weight (distance) reaches a target threshold. The minimum link weight is preferable, since it results in a different number of segments according to the image content. In the following steps, the segmentation results of the lowest resolution level are propagated to the next resolution level, and the tasks below are repeated until the full resolution is reached.
1. Each boundary pixel of all resulting segments at the current resolution level is split into four new segments.
2. New link weights are calculated and sorted.
3. Segments are recursively merged using the RSST iteration phase.
The RSST, which is the basis of the M-RSST, was selected because it is considered one of the most powerful tools for image segmentation [19]. As far as the computational cost is concerned, the RSST is also the fastest algorithm among all the examined ones [19]. However, the complexity of the RSST still remains very high, especially for images of large size. Instead, the proposed M-RSST algorithm for color and depth segmentation requires a much smaller computational cost compared to the conventional RSST. This is because the number of segments propagated to the following resolution level is considerably reduced. However, the computational complexity of the M-RSST cannot easily be calculated, since it depends on the number, shape and size of the extracted segments.
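The merging loop referred to above can be sketched as follows; the bookkeeping is simplified (a linear scan instead of a sorted link list) and the stopping rule uses a target segment count, both assumptions made for brevity.

```python
def link_weight(seg1, seg2):
    # Area-weighted distance between two segments, as in Eqs. (13a)/(13b)
    (v1, a1), (v2, a2) = seg1, seg2
    return abs(v1 - v2) * a1 * a2 / (a1 + a2)

def rsst_iteration(segments, adjacency, target_count):
    """Merge 4-connected segments until `target_count` remain.

    segments:  dict  id -> (mean value, area)
    adjacency: dict  id -> set of neighboring ids
    """
    while len(segments) > target_count:
        # Step 1: pick the least-weight link (linear scan for simplicity)
        i, j = min(((i, j) for i in segments for j in adjacency[i] if i < j),
                   key=lambda e: link_weight(segments[e[0]], segments[e[1]]))
        (v1, a1), (v2, a2) = segments[i], segments[j]
        segments[i] = ((v1 * a1 + v2 * a2) / (a1 + a2), a1 + a2)  # merged mean and area
        # Steps 2-3: rewire the neighbors of j to i and drop duplicate links
        for n in adjacency.pop(j):
            adjacency[n].discard(j)
            if n != i:
                adjacency[n].add(i)
                adjacency[i].add(n)
        adjacency[i].discard(i)
        del segments[j]
    return segments

# Toy example: four 1x1 "pixels" on a 2x2 grid merged down to 2 segments
segs = {0: (10.0, 1), 1: (11.0, 1), 2: (50.0, 1), 3: (52.0, 1)}
adj = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2}}
print(rsst_iteration(segs, adj, target_count=2))
```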

Figure 2: Segmentation results. (a) Color map and (b) depth map.

Another benefit of the proposed M-RSST segmentation scheme is that it eliminates regions of small size, which are in general not desirable in the framework of an efficient visual content representation. This is due to the fact that the algorithm begins at a low image resolution and no segments are created or destroyed at the higher resolution levels. Thus, small segments cannot be propagated from the lowest to the highest layer, resulting in a kind of filtering or smoothing in the segmentation domain. Depending on the initial resolution level, objects smaller than this level are destroyed and consequently oversegmentation is avoided. In Figures 1(a) and 1(b) the left and right channel of a stereo shot from the sequence "Eye to Eye" are presented, while in Figure 1(c) the depth map can be seen. Additionally, Figures 2(a) and 2(b) illustrate the segmented color and depth planes for the left image channel at full resolution. As can be observed, color segments describe the objects' contours accurately, while depth segments provide only a rough approximation of the semantic object.

4.2 Color and Depth Information Fusion

In this section, the color and depth information provided by the M-RSST algorithm is exploited so that objects with accurate contours are extracted. This is possible because color segments provide reliable contours, while a depth segment represents the whole object in a rough form. For this reason, color and depth segments are appropriately fused together.

In particular, a video object is identified by merging several color segments using the information provided by the depth segmentation. Let us assume that K_c color segments and K_d depth segments have been extracted by the aforementioned M-RSST algorithm, denoted as S^c_i, i = 1,2,…,K_c and S^d_i, i = 1,2,…,K_d respectively. The segments S^c_i and S^d_i are mutually exclusive, i.e., S^c_i ∩ S^c_k = ∅ for any i, k = 1,2,…,K_c, i ≠ k and, similarly, S^d_i ∩ S^d_k = ∅ for any i, k = 1,2,…,K_d, i ≠ k. Let us also denote by G^c and G^d the output masks of color and depth segmentation, which are defined as the sets of all color and depth segments respectively:

G^c = { S^c_i, i = 1,2,…,K_c },   G^d = { S^d_i, i = 1,2,…,K_d }     (14)

Color segments are projected onto depth segments so that video objects provided by depth segmentation are retained and, at the same time, object boundaries given by color segmentation are accurately extracted. For this reason, each color segment S^c_i is associated with a depth segment so that the area of intersection between the two segments is maximized. This is accomplished by means of a projection function:

p(S^c_i, G^d) = arg max_{g ∈ G^d} { a(g ∩ S^c_i) },   i = 1,2,…,K_c     (15)

where we recall that a(·) is the area, i.e., the number of pixels, of a segment. Based on the previous equation, K_d sets of color segments, say C_i, i = 1,2,…,K_d, are defined, each of which contains all color segments that are projected onto the same depth segment S^d_i:

C_i = { g ∈ G^c : p(g, G^d) = S^d_i },   i = 1,2,…,K_d     (16)

Then, the final segmentation mask, G, consists of K = K_d segments S_i, i = 1,2,…,K, each of which is generated as the union of all elements of the corresponding set C_i:

S_i = ∪_{g ∈ C_i} g,   i = 1,2,…,K     (17)

G = { S_i, i = 1,2,…,K }     (18)

In other words, color segments are merged together into K = K_d new segments according to depth similarity. The final segmentation consists of segments that contain the same image regions as the corresponding depth segments, but with accurate contours obtained from the color segments. Segmentation fusion results are presented in Figure 3 for the analyzed shot. The depth segmentation, shown with two different gray levels as in Figure 2(b), is overlaid in Figure 3(a) with the white contours of the color segments, as obtained from Figure 2(a). Fusion of the two segmentation results provides one segment for each object with the correct boundaries, as shown in Figures 3(b) and 3(c). As is observed, the final segmentation provides a very reliable training set.
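A minimal sketch of the projection and fusion steps (15)-(17), representing the color and depth masks as integer label maps, which is a representational assumption made here for illustration:

```python
import numpy as np

def fuse_color_depth(color_labels, depth_labels):
    """Merge color segments into K_d final segments according to Eqs. (15)-(17).

    color_labels, depth_labels: integer label maps of the same size (one label
    per pixel). Each color segment is assigned to the depth segment with which
    it shares the largest intersection area, and the fused map keeps the depth
    labels but with the color-accurate borders.
    """
    fused = np.empty_like(depth_labels)
    for c in np.unique(color_labels):
        mask = color_labels == c
        # Eq. (15): depth segment maximizing the intersection area with this color segment
        overlapping, counts = np.unique(depth_labels[mask], return_counts=True)
        fused[mask] = overlapping[np.argmax(counts)]
    return fused

# Toy 4x4 example: two depth planes (top/bottom) and three color segments
depth = np.array([[0, 0, 0, 0],
                  [0, 0, 0, 0],
                  [1, 1, 1, 1],
                  [1, 1, 1, 1]])
color = np.array([[2, 2, 3, 3],
                  [2, 2, 3, 3],
                  [2, 4, 4, 4],
                  [4, 4, 4, 4]])
print(fuse_color_depth(color, depth))
```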

Figure 3: Segmentation fusion results. (a) Projection of the color map onto the depth map, (b) foreground object and (c) background object.

5 The Decision Mechanism

The purpose of this mechanism is to detect when a change of the environment occurs and consequently to activate the retraining algorithm. Let us index images or video frames in time, denoting by x(k, N) the kth image or frame following the image at which the Nth network retraining occurred. Index k is therefore reset each time retraining takes place, with x(0, N) corresponding to the image where the Nth retraining of the network was accomplished. Figure 4 indicates a scenario with two retraining phases (at frames 3 and 6 of a video sequence composed of 8 frames) and the corresponding values of the indices k and N. It can be seen that x(0, N+1) = x(k_0, N), where k_0 indicates that after k_0 images from the Nth retraining phase a new retraining phase, i.e., the (N+1)th, takes place.


Figure 4: Scenario of a video sequence consisting of 8 frames, in which retraining has been performed at frames 3 and 6.

In the framework of this paper, retraining is performed every time the beginning of a new scene is detected. For this reason, a shot-cut detection algorithm is applied so that frames or stereo pairs of similar visual characteristics are gathered together. Several algorithms have been reported in the literature for scene change detection in 2-D video sequences, which deal with the detection of cuts, fades or dissolves either in the compressed or the uncompressed domain [20], [21]. Since a shot change occurs at the same frame instant in the two channels, the aforementioned algorithms can be applied to one image channel, e.g., the left channel. In our approach, the algorithm proposed in [20] has been adopted for shot detection due to its efficiency and low computational complexity.
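A minimal sketch of how the decision mechanism could drive tracking over a sequence; the helper callables stand in for the shot-cut detector of [20], the retraining-set extraction of Section 4 and the retraining algorithm of Section 3, and their names and signatures are hypothetical.

```python
def track_sequence(stereo_pairs, network, detect_shot_change,
                   extract_retraining_set, retrain, classify):
    """Decision mechanism: retrain whenever a new shot begins, then track.

    All callables are placeholders for the components described in the paper
    (shot-cut detection [20], color/depth fusion, weight retraining); their
    signatures here are assumptions made for illustration only.
    """
    labels = []
    k, N = 0, 0                      # frame index within the shot, retraining counter
    for t, pair in enumerate(stereo_pairs):
        if t == 0 or detect_shot_change(pair):
            # New scene detected: build S_c from the first frames of the shot
            retraining_set = extract_retraining_set(pair)
            network = retrain(network, retraining_set)
            k, N = 0, N + 1          # reset k, count the Nth retraining
        else:
            k += 1
        # Classify each block of the current frame as object or background
        labels.append(classify(network, pair))
    return labels
```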

6 Experimental Results

In this section, the performance of the proposed scheme is investigated. The results have been obtained using a stereo shot of 151 frames, taken from the 3D television program "Eye to Eye", which was produced in the framework of the ACTS MIRAGE project [22] in collaboration with AEA Technology and ITC. Experimental evaluations have shown that the proposed method can provide very good results, in many cases, by using the extracted semantic objects of the first three frames of a shot to retrain the network. In Figures 5(a), 5(b) and 5(c) the retraining set of the foreground object is presented, while in Figures 5(d), 5(e) and 5(f) the corresponding set for the background can be observed. Finally, tracking results are shown in Figure 6. For presentation purposes, 6 frames are presented, located at equidistant time instances (every 25 frames) within the sequence.

Figure 5: Retraining sets. (a), (b), (c) Retraining set for the foreground object and (d), (e), (f) retraining set for the background object.

Figure 6: Tracking results for the stereo shot beginning at frame #8028 and ending at frame #8179 (frames #8053, #8078, #8103, #8128, #8153 and #8178 are shown).

7 Conclusions

Semantic video object extraction has created a new and challenging research direction, and several obstacles must still be confronted before a generally applicable scheme can be produced. In this paper a novel unsupervised method has been presented, based on a retraining procedure. When a scene change is detected, the decision mechanism activates the retraining procedure. The first stereo pairs of a shot are analyzed and the segmented depth and color planes are generated by incorporating a multiresolution implementation of the RSST algorithm. The retraining set is then extracted by effectively fusing color and depth segments. Having extracted the video objects, the network is retrained to adjust its weights to the current conditions without forgetting the previous knowledge. Finally, the retrained network is applied to the rest of the frames of the shot in order to follow the semantic object. Promising experimental results have been demonstrated.

Future work should focus on methods that exploit object motion, so as to provide effective constraints to the network and accomplish faster classification. Additionally, a small perturbation of the weights has been assumed, which is valid in many situations, especially when the scene evolves slowly. To handle rapidly changing shots, a more sophisticated decision mechanism should be implemented that detects environment changes within a scene; network retraining could then be activated inside a shot, so that scenes with rapid evolution, significant motion or illumination changes are handled more efficiently. Finally, a scheme with multiple neural networks for tracking multiple objects could be tested, which would benefit the classification procedure.

Acknowledgements

This research is funded by the Institute of Governmental Scholarships. The authors wish to thank Mr. Chas Girdwood, project manager of the ITC (Winchester), for providing the 3D video sequence "Eye to Eye", which was produced in the framework of the ACTS MIRAGE project. The authors also want to express their gratitude to Professor Siegmund Pastoor of the HHI (Berlin), for providing the video sequences of the DISTIMA project.

References

[1] K. N. Ngan, S. Panchanathan, T. Sikora and M.-T. Sun, "Guest Editorial: Special Issue on Representation and Coding of Images and Video," IEEE Trans. Circuits and Systems for Video Technology, Vol. 8, No. 7, pp. 797-801, November 1998.
[2] T. Sikora, "The MPEG-4 Video Standard Verification Model," IEEE Trans. Circuits and Systems for Video Technology, Vol. 7, No. 1, pp. 19-31, February 1997.
[3] N. Doulamis, A. Doulamis, D. Kalogeras and S. Kollias, "Very Low Bit-Rate Coding of Image Sequences Using Adaptive Regions of Interest," IEEE Trans. Circuits and Systems for Video Technology, Vol. 8, No. 8, pp. 928-934, December 1998.
[4] B. Furht, S. W. Smoliar and H. Zhang, Video and Image Processing in Multimedia Systems, Kluwer Academic Publishers, 1995.
[5] N. Doulamis, A. Doulamis, Y. Avrithis, K. Ntalianis and S. Kollias, "Efficient Summarization of Stereoscopic Video Sequences," IEEE Trans. Circuits and Systems for Video Technology, Vol. 10, No. 4, June 2000 (to appear).
[6] F. Meyer and S. Beucher, "Morphological Segmentation," Journal of Visual Communication and Image Representation, Vol. 1, No. 1, pp. 21-46, September 1990.
[7] M. Kunt, A. Ikonomopoulos and M. Kocher, "Second Generation Image Coding Techniques," Proc. IEEE, Vol. 73, pp. 549-574, April 1985.
[8] W. B. Thompson and T. G. Pong, "Detecting Moving Objects," Int. J. Comput. Vision, Vol. 4, pp. 39-57, 1990.
[9] J. Wang and E. Adelson, "Representing Moving Images with Layers," IEEE Trans. Image Processing, Vol. 3, pp. 625-638, September 1994.
[10] G. Adiv, "Determining Three-Dimensional Motion and Structure from Optical Flow Generated by Several Moving Objects," IEEE Trans. Pattern Anal. Machine Intell., Vol. PAMI-7, pp. 384-401, 1985.
[11] C. Gu and M.-C. Lee, "Semiautomatic Segmentation and Tracking of Semantic Video Objects," IEEE Trans. Circuits and Systems for Video Technology, Vol. 8, pp. 572-584, 1998.
[12] F. Bremond and M. Thonnat, "Tracking Multiple Nonrigid Objects in Video Sequences," IEEE Trans. Circuits and Systems for Video Technology, Vol. 8, No. 5, pp. 585-591, September 1998.
[13] I. K. Sethi and R. Jain, "Finding Trajectories of Feature Points in a Monocular Image Sequence," IEEE Trans. Pattern Anal. Machine Intell., Vol. PAMI-9, No. 1, pp. 56-73, 1987.
[14] Y. S. Yao and R. Chellappa, "Tracking a Dynamic Set of Feature Points," IEEE Trans. Image Processing, Vol. 4, No. 10, 1995.
[15] A. Doulamis, N. Doulamis and S. Kollias, "On-Line Retrainable Neural Networks: Improving the Performance of Neural Networks in Image Analysis Problems," IEEE Trans. on Neural Networks, Vol. 11, No. 1, January 2000.
[16] D. G. Luenberger, Linear and Nonlinear Programming, Addison-Wesley, 1984.
[17] A. D. Doulamis, N. D. Doulamis, K. S. Ntalianis and S. D. Kollias, "Unsupervised Semantic Object Segmentation of Stereoscopic Video Sequences," Proc. of IEEE Int. Conf. on Intelligence, Information and Systems, Washington D.C., USA, November 1999.
[18] O. J. Morris, M. J. Lee and A. G. Constantinides, "Graph Theory for Image Analysis: An Approach Based on the Shortest Spanning Tree," IEE Proceedings, Vol. 133, pp. 146-152, April 1986.
[19] A. Alatan, L. Onural, M. Wollborn, R. Mech, E. Tuncel and T. Sikora, "Image Sequence Analysis for Emerging Interactive Multimedia Services - The European COST 211 Framework," IEEE Trans. Circuits and Systems for Video Technology, Vol. 8, No. 7, pp. 802-813, November 1998.
[20] B. L. Yeo and B. Liu, "Rapid Scene Analysis on Compressed Videos," IEEE Trans. Circuits and Systems for Video Technology, Vol. 5, pp. 533-544, December 1995.
[21] N. V. Patel and I. K. Sethi, "Video Shot Detection and Characterization for Video Databases," Pattern Recognition, Vol. 30, No. 4, pp. 583-592, April 1997.
[22] C. Girdwood and P. Chiwy, "MIRAGE: An ACTS Project in Virtual Production and Stereoscopy," IBC Conference Publication, No. 428, pp. 155-160, September 1996.