In Proc. of Intl. Conf. on Information Fusion, 1999, July 6-8, Sunnyvale, CA

Fusion of multiple cues for video segmentation

Bikash Sabata and Moises Goldszmidt
SRI International, 333 Ravenswood Ave, Menlo Park, CA 94025, U.S.A.

sabata, [email protected]

Abstract

The segmentation of video into contiguous scenes is becoming an important problem in many applications. Since video data is a rich source of spatio-temporal information, different types of features can be computed from the video data. Each of these features provides a cue for the segmentation of video and is usually sufficient to perform an approximate segmentation of video. However, the features many times provide conflicting evidence for segmentation. Further, since there is strong correlation between the different features, it is not easy to fuse the information from the features to make segmentation decisions. We present a method based on Bayesian networks that models the dependence between the segmentation decision and the different features. This framework using Bayesian networks is promising and provides an extensible mechanism for fusion of information.

Keywords: Video Processing, Bayesian Networks

1 Introduction

The information in video data is being used increasingly for many decision and interpretation tasks. For example, we would like to determine when one scene ended and a new one started so that the relevant segments of the video may be retrieved for display. There is a critical need for efficient management and processing of video data. However, the sheer volume of information in the video data makes it difficult to devise algorithms that are efficient and robust. The decisions and interpretations based on video use the feature vectors extracted from the video data. Each of these features provides a cue for the understanding of video and is usually sufficient to make an approximate interpretation about the content of video. However, the features many times provide conflicting evidence. Further, since there is strong correlation between the different features, it is not easy to fuse the information from the features to make interpretations. This problem of correlated and conflicting evidence from different sources is a common occurrence in multisensor systems and other complex systems. Our goal is to develop a framework to address the problems of information fusion when the features are noisy and highly correlated. The source of these features may be the processing of different sensors and/or different filters applied to sensor data such as video. In this paper we address the specific problem of segmenting structured video, as arises in broadcast video and movies. However, the techniques for fusion we develop are more general. We present a method based on Bayesian networks that models the dependence between the segmentation decision and the different features. The Bayesian network model also explicitly represents the correlations between the different features. These correlations may be known a priori (because of domain knowledge) or may be learned from the data. We incorporate the prior knowledge into the model and learn the other dependency structures by learning the Bayesian networks from the data.

In section 2 we present a brief overview of the diverse set of techniques used to segment video data. The important lesson here is that many of the techniques that segment video perform reasonably well for restricted classes of videos; in different situations different techniques perform better than others. An important consequence of our framework is that we are able to select the best set of techniques that are sufficient to make reliable and robust decisions for a given class of video data. In section 3 we present the specific set of features we compute in the video to make the segmentation decision. A common problem we have seen in past approaches to computing feature vectors in video is that the feature vectors are computed on individual frames and the temporal dimension is added in as a difference operation. We depart from this approach and present a novel multi-scale filter to compute the color and texture features for each individual frame and their variations along the time axis. In addition, we compute features that are unique to video. These include the spatio-temporal motion tracks of points in the video and edge features in the spatio-temporal volumes. The Bayesian network based framework for fusion of information from the different features is presented in section 4. In sections 5 and 6 we present the results of our experiments and conclude with some discussion of current and future work.

2 Background

The majority of the techniques for video segmentation use low-level image features such as pixel differences, differences in the statistical properties of the feature values, histogram comparisons, edge differences, and motion vectors. The key problem is that there are many events in the video that have the same low-level feature characteristics as scene changes. For example, fast camera panning in the scene may have the same color histogram characteristics as a dissolve. Reducing the number of false positive triggers is the main objective of research activities in this area.

A large class of segmentation algorithms computes the boundary between two segments

by examining the local pixel values and their statistical properties in frames in a temporal window of width 2δ around the candidate boundary frame b. Zhang, Kankanhalli and Smoliar [1] compute the number of pixels that change value by more than a threshold to decide if a boundary has been detected at frame i. Yeo [2] improves this technique by taking the difference on spatially reduced frames over a symmetric temporal window [b - δ, b + δ] around the candidate frame b. Yeo also detects the gradual change regions by detecting the "plateaus" in the distance measure over temporal windows. Kasturi and Jain [3] segment each frame into regions and compare statistical measures of the pixels in the regions over the frames. To improve the computational efficiency, Taniguchi, Akutsu and Tonomura [4] take temporal samples to process the frames and incrementally increase the sampling where a candidate scene change is detected. Another approach to boundary detection is to model the transition in terms of the statistics of the pixel values of the frames in the temporal window constituting the transition. Aigrain and Joly [5] and Hampapur, Jain, and Weymouth [6] use such model based methods to capture the different shot transitions.

An alternative class of approaches uses the pixel information more compactly through histograms of the frames. The histograms may be intensity histograms or color histograms. Histograms, of course, lose the spatial information of the frames entirely, and are therefore robust with respect to camera motions and reasonable amounts of noise. Zhang, Kankanhalli and Smoliar [1], and Yeo [2] compare the histograms by using a bin-wise difference of the histograms of two consecutive frames. Ueda, Miyatake and Yoshizawa [7] use the rate of change of the color histograms to find the shot boundaries. Nagasaka and Tanaka [8] compare many different statistical measures for the histogram distributions. They report that partitioning the frames into 16 regions and using a χ² test on color histograms of the regions over the frames performs best for shot boundary detection.

Their method is robust against zooming and panning but fails to detect gradual changes. Prior statistical properties of the video are incorporated into the decision process by Swanberg, Shu and Jain [9]. They use intensity histogram differences in regions, weighted with the likelihood of the region changing in the video.

In addition to the features in the individual frames, the video data also has motion related features. Zabih, Miller and Mai [10] use a method based on edge tracking over frames to determine shot boundaries. The edge distances are measured using the Hausdorff distance measure. Shahraray [11] uses a method similar to the motion vector computation algorithms in most MPEG codecs to compute scene change points. Hsu and Harashima [12] model the scene changes and activities as motion discontinuities. The motion is characterized by considering the sign of the Gaussian and mean curvature of the spatio-temporal surfaces. Clustering and split-and-merge approaches are then used to segment the video.

In summary, the past literature has a large collection of techniques that extract one of the many features from video to detect the segment boundaries. Beyond simple ad hoc schemes that try to integrate the information from the different features using some kind of weighting procedure, we have not seen any effort to use sophisticated techniques to fuse the evidence from the different cues.

3 Video Features

The lesson learnt from past work and our experiments with the various techniques in the literature is that although the individual mechanisms do not perform well in all situations, they perform well in a subset of the situations. Our approach is to select a set of these methods and fuse the output of each module to make the final decision about segmentation. In the process, we modified some of these algorithms to make them more robust and efficient. In addition, we propose some new features that capture information in video that has not been addressed in the past approaches.

There are four basic types of features we compute and use in the fusion module for making the final segmentation. The first set of features is based on the color distributions in each frame of the video. The second feature set is based on the response of each frame to a set of texture filters. The third feature set scores the frames for a segmentation boundary based on the tracking of significant point features in the video. Finally, the fourth feature class computes the likelihood of a segmentation boundary by detecting edges in the spatio-temporal volume that represents the video. The segmentation is computed by comparing the change across the candidate segment boundary. The change can be measured as a distance D = F(S_{b^-}, S_{b^+}) between the video features S_{b^-} and S_{b^+} in the two temporal intervals [b - δ, b] and [b, b + δ] around the boundary b. Every procedure for segmentation that has been discussed in the past literature can be mapped to this formulation. The distance measure is then compared with a threshold value to determine if there exists a boundary at b. The different approaches select different properties S_{b^-} and S_{b^+} to represent the intervals and different functions F to evaluate the distance. In section 4 we will present fusion techniques that use either the distance D from each module or the individual decisions from each module to make the fused decision.

3.1 Color Features

The most common features used to segment video are the color histograms of the frames. Using the results from Nagasaka and Tanaka [8], we designed a distance measure based on the χ² measure between two histograms. We define the histogram of the video frames over the temporal intervals [b - δ, b] and [b, b + δ] around the candidate frame b. Further, we weigh the contributions of the frames to the joint histogram using a Gaussian mask. This method of computing the color feature vector for video is novel and is inspired by the scale space methods in computer vision and the multi-scale filters for edge detection. The multi-scale Gaussian windowing approach provides a robust mechanism to reliably estimate the scene boundaries in the presence of noise. Let h_t(i) be the normalized histogram of the t-th frame in the video (normalization makes \sum_i h_t(i) = 1, so h_t(i) can be interpreted as the probability of a pixel taking value i in frame t). The weighted histogram for a temporal window W = [t_s, t_e] is then computed as

h_{[t_s, t_e]}(i) = \sum_{t \in [t_s, t_e]} w_t \, h_t(i)

where w_t is the weight associated with the t-th frame in the window. The weights are computed using the one-dimensional Gaussian function G(x) = e^{-x^2 / (2\sigma^2)} multiplied by a normalization factor. The Gaussian window size is different for the different scales. The distance measure for the Gaussian windowed histograms at scale s is given by

D_{color}(s, t) = \sum_i \frac{(h_{[l]}(i) - h_{[r]}(i))^2}{h_{[l]}(i) + h_{[r]}(i)}    (1)

where [l] = [t - δ(s), t] and [r] = [t, t + δ(s)]. The set of distance measures for the different scales and colors together forms the set of features that capture the relevant color information in video. Figure 1 shows the distance measure at three different scales as a function of time for an example video. Yeo [2] proposed a flash detector based on the observation that flashes produce two closely spaced sharp peaks in the χ² distance scores of the video frames. We implemented this detector to give the candidate frames at which flashes were detected.

Figure 1: Distance score at each frame for three different scales. At noise, fast camera motion, and flashes the maxima reduce with increasing scale, while at true boundaries the maxima remain the same or increase.

3.2 Texture Features

The texture information in each frame can also be used to evaluate the continuity of a segment. The use of texture information for segmentation in video is a novel application of texture filters. We propose a novel distance measure that uses the texture energy to compute the distance between temporal windows across candidate segment boundaries.


The texture energy is computed using the Gabor filters proposed in [13]. The Gabor energy method measures the similarity between neighborhoods in an image and Gabor masks. Each Gabor mask consists of Gaussian windowed sinusoidal waveforms with parameters of wavelength λ, orientation θ, phase shift φ, and standard deviation σ. The filter is given by

G(x, y) = e^{-\frac{(x - X)^2 + (y - Y)^2}{2\sigma^2}} \sin\!\left(\frac{2\pi (x \cos\theta - y \sin\theta)}{\lambda} + \phi\right)

A set of filters is generated by varying λ and θ. The texture energy for a filter (fixed λ and θ) is calculated as the sum over the phases of the squared convolution values. We implement a total of twelve filters by quantizing λ into four values and θ into three values. Next, similar to the case of the color distance measure, we use the texture energy response of each frame to find the difference between adjacent groups of frames. To compute the distance measure at frame i, a Gaussian window at scale s is selected around the frame, and the weighted average texture energy is calculated to the left and right of the frame.


The normalized distance between the average texture energies is used as the estimate of the change across the segment boundary. Figure 2 shows the distance measure computed at the different scales for one of the texture filters on an example video.

Figure 2: Distance score at each frame for three different scales for one texture filter.
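The sketch below (illustrative only; the specific λ and θ values, the Gaussian envelope size, and the phase set are assumptions rather than the authors' settings) builds a Gabor filter bank and computes the per-frame texture energy as the sum over phases of the squared convolution responses.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(lam, theta, phi, sigma, size=15):
    """Gaussian-windowed sinusoid G(x, y) as in the text (centered at X = Y = 0)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    carrier = np.sin(2.0 * np.pi * (x * np.cos(theta) - y * np.sin(theta)) / lam + phi)
    return envelope * carrier

def texture_energy(frame, lam, theta, sigma=4.0, phases=(0.0, np.pi / 2)):
    """Sum over phases of the squared convolution values for one (lambda, theta) filter."""
    frame = frame.astype(np.float64)
    energy = 0.0
    for phi in phases:
        response = fftconvolve(frame, gabor_kernel(lam, theta, phi, sigma), mode="same")
        energy += np.sum(response ** 2)
    return energy

def energy_features(frame, lams=(4, 8, 16, 32), thetas=(0.0, np.pi / 3, 2 * np.pi / 3)):
    """Twelve-dimensional texture-energy vector (4 wavelengths x 3 orientations)."""
    return np.array([texture_energy(frame, lam, th) for lam in lams for th in thetas])

# These per-frame vectors can then be averaged over Gaussian-weighted windows on either
# side of a candidate boundary and differenced, as with the color features.
rng = np.random.default_rng(1)
print(energy_features(rng.integers(0, 256, (120, 160))))
```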

3.3 Tracking Based Features

Motion features are important because they correspond to the real physical phenomena captured by the video sensor. Using tracks of objects and points in the video for detecting scene changes was explored in the past by Zabih, Miller and Mai [10], who tracked edge segments over frames and used a Hausdorff distance measure to evaluate the segment boundaries. A critical problem in tracking based systems is that of feature selection. We use the method proposed by Shi and Tomasi [14] to detect good point features to track. The features are selected from the frames and tracked across the video. We assign a score to the tracking of the features and use it to evaluate the segment boundaries. The score is computed by weighting the contribution of each feature that is tracked from the last frame to the current frame. The weighting function looks at the history of the feature and assigns a weight that is proportional to the history of the track; however, the incremental increase in the weight for the feature decreases exponentially with the history length. The weight for the i-th feature is computed as


w_i = 1 - e^{-p_i / k}    (2)

where p_i is the number of frames in the past through which the i-th feature was tracked and k is a constant that determines the sensitivity to the history of the tracks. In addition to evaluating the feature tracks in a frame, we also measure the fraction of features that cannot be tracked into each frame from the previous frame. This fraction also gives evidence about the boundary between video segments. To compute a distance measure across the segment boundary, we compute the difference between the average track scores in windows on either side of the candidate boundary frame. At frame i, a window of size S is selected around the frame and the average track score is calculated to the left and right of the frame. The difference between the average track scores and the fraction of the missed tracks are used as the distance measures of the tracking module. Figure 3 shows the tracking based distance scores and the fraction of the point features missed in the tracking for each frame.

Figure 3: Tracking based distance score and the percentage of missed features at each frame.
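A minimal sketch of the track-based score (assuming per-frame track-history lengths p_i are already available from a point tracker; the value of k, the window size, and the way the two quantities are reported are illustrative choices):

```python
import numpy as np

def track_weight(p, k=10.0):
    """Equation (2): weight that grows with track history but saturates exponentially."""
    return 1.0 - np.exp(-np.asarray(p, dtype=float) / k)

def frame_track_score(track_lengths_in_frame, k=10.0):
    """Sum of weights of all features tracked from the previous frame into this frame."""
    return float(np.sum(track_weight(track_lengths_in_frame, k)))

def tracking_distance(scores, missed_fraction, t, window):
    """Difference of average track scores left/right of frame t, plus missed-track fraction."""
    left = np.mean(scores[max(0, t - window):t])
    right = np.mean(scores[t:t + window])
    return abs(left - right), missed_fraction[t]

# Example: per-frame track-history lists from a hypothetical point tracker.
scores = [frame_track_score(p) for p in ([5, 12, 30], [6, 13, 31], [1, 1], [2, 2, 1])]
missed = [0.0, 0.05, 0.8, 0.3]
print(tracking_distance(np.array(scores), np.array(missed), t=2, window=2))
```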


Figure 4: An x-t section through the spatio-temporal video volume. The t axis is along the horizontal direction.

3.4 Edges in Spatio-Temporal Volumes

Video data is three dimensional, with time as the third dimension. To detect the segmentation boundaries we should study the patterns in the data along the temporal dimension. This idea was investigated by Otsuji and Tonomura [15], who proposed a projection detection filter to detect cuts in video. They projected the video data onto the x-t and y-t planes to generate images that they then used to detect cuts. This construction is based on the work on spatio-temporal surfaces by Baker and Bolles [16]. We use a similar idea and generate sections through the video volume using planes parallel to the x-t and the y-t planes (Figure 4). The edges perpendicular to the t axis in these sections indicate possible video segment boundaries. The fraction of the pixels at any t covered by the horizontal edges is taken as a measure of the segment boundary. Averaging this measure across many sections gives a probability measure of the existence of a segment boundary based on the evidence from edges in spatio-temporal volumes. Figure 5 shows the probability measure evaluated for the different frames in an example video.


Figure 5: Distance score based on the edges in spatio-temporal volumes at each frame.
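A sketch of this measure (illustrative; the edge detector, its threshold, and the particular sections sampled are assumptions, and only x-t sections are shown): take x-t slices of the video volume at a few rows, mark edges that run perpendicular to the t axis, and average the per-t edge fractions across the slices.

```python
import numpy as np
from scipy import ndimage

def xt_edge_fraction(volume, row):
    """Fraction of pixels at each t covered by edges in the x-t slice at a fixed y = row."""
    xt = volume[:, row, :].astype(float)          # shape (T, X): one x-t section
    # Gradient along t; large values mark edges perpendicular to the t axis.
    grad_t = ndimage.sobel(xt, axis=0)
    edges = np.abs(grad_t) > 3.0 * np.abs(grad_t).std()
    return edges.mean(axis=1)                     # fraction of x positions marked at each t

def boundary_measure(volume, rows):
    """Average the edge fraction over several x-t sections to score each frame."""
    return np.mean([xt_edge_fraction(volume, r) for r in rows], axis=0)

# Synthetic volume (T, Y, X) with an intensity jump at t = 40.
T, Y, X = 80, 60, 80
volume = np.concatenate([np.full((40, Y, X), 50.0), np.full((40, Y, X), 200.0)])
volume += np.random.default_rng(2).normal(0, 5, volume.shape)
print(np.argmax(boundary_measure(volume, rows=range(10, 50, 10))))   # peaks near t = 40
```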

4 Bayesian Network Based Fusion

We use capital letters X, Y, Z for variable names, and lower-case letters x, y, z to denote specific values taken by those variables. Sets of variables are denoted by boldface capital letters X, Y, Z and assignments of values to the variables in these sets are denoted by boldface lowercase letters x, y, z. A Bayesian network over a set of variables X = {X_1, ..., X_n} is an annotated directed acyclic graph that encodes a joint probability distribution over X. Formally, a Bayesian network is a pair B = <G, L>. The first component, G, is a directed acyclic graph whose vertices correspond to the random variables X_1, ..., X_n and whose edges represent direct dependencies between the variables. The second component, L, represents a set of local conditional probability distributions (CPDs) L_1, ..., L_n, where the CPD for X_i maps possible values x_i of X_i and pa(i) of Pa(i), the set of parents of X_i in G, to the conditional probability (density) of x_i given pa(i). A Bayesian network B defines a unique joint probability distribution (density) over X given by the product

P_B(X_1, ..., X_n) = \prod_{i=1}^{n} L_i(X_i | pa(i))    (3)

When the variables in X take values from finite discrete sets, we typically represent CPDs as tables that contain parameters θ_{x_i | pa(i)} for all possible values of X_i and pa(i). When the variables are continuous, we can use various parametric and semi-parametric representations for these CPDs.

In this paper we treat information fusion as a pattern classification problem. We assume that there is one variable A_i for each feature, and a distinguished variable Outcome that can take a value from the set {0, 1, 2} depending on whether the frame is "normal," a "boundary," or a "flash." The objective is, given a set of vectors X = {A_1, ..., A_n, Outcome}, to induce a probability distribution Pr(A_1, ..., A_n, Outcome) from this data in the form of a Bayesian network. Given this network, the decision on a new scene is given by

argmax_O Pr(Outcome = O | a_1, ..., a_n),    (4)

which is the classic definition of a Bayesian classifier [17]. Note that we have translated the fusion problem into that of inducing a probability distribution linking the various features with a decision on the nature of the frame. There is a recent substantial body of work on inducing Bayesian networks from data (see [18] for example, and references therein). In [19] Friedman et al. argue convincingly for using specialized graph structures for classification tasks. As an example, consider a graph structure where the Outcome variable is the root, that is, Pa(Outcome) = ∅, and each feature has the Outcome variable as its unique parent, namely, pa(A_i) = {Outcome} for all 1 ≤ i ≤ n. For this type of graph structure, Equation 3 yields

Pr(A_1, ..., A_n, Outcome) = Pr(Outcome) \prod_{i=1}^{n} Pr(A_i | Outcome).

From the definition of conditional probability, we get

Pr(Outcome | A_1, ..., A_n) = α Pr(Outcome) \prod_{i=1}^{n} Pr(A_i | Outcome),

where α is a normalization constant. This is the definition of the naive Bayesian classifier commonly found in the literature [17]. The naive Bayesian classifier has been used extensively for classification.
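As a concrete reading of Equation 4 under the naive Bayes factorization, here is a small sketch (the CPD values are made up for illustration; in the system they are estimated from the discretized training data):

```python
import numpy as np

# Made-up discrete CPDs for two features and Outcome in {0: normal, 1: boundary, 2: flash}.
p_outcome = np.array([0.98, 0.01, 0.01])
# p_feature[o, v] = Pr(A_i = v | Outcome = o) for a feature discretized into 3 bins.
p_color = np.array([[0.80, 0.15, 0.05],
                    [0.10, 0.30, 0.60],
                    [0.05, 0.25, 0.70]])
p_track = np.array([[0.70, 0.25, 0.05],
                    [0.05, 0.25, 0.70],
                    [0.30, 0.40, 0.30]])

def naive_bayes_posterior(observed):
    """Pr(Outcome | a_1, ..., a_n) = alpha * Pr(Outcome) * prod_i Pr(a_i | Outcome)."""
    post = p_outcome.copy()
    for cpd, value in observed:
        post *= cpd[:, value]
    return post / post.sum()          # alpha normalizes over the Outcome values

posterior = naive_bayes_posterior([(p_color, 2), (p_track, 2)])
print(posterior, "decision:", int(np.argmax(posterior)))   # Equation 4: argmax over Outcome
```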


It has the attractive properties of being robust and easy to learn: we only need to estimate the CPDs Pr(Outcome) and Pr(A_i | Outcome) for all attributes. Nonetheless, the naive Bayesian classifier embodies the strong independence assumption that, given the value of Outcome, the features are independent of each other. Friedman, Geiger and Goldszmidt [19] suggest removing these independence assumptions by considering a richer class of networks. They define the TAN Bayesian classifier, which learns a network in which each attribute has the class and at most one other attribute as parents. Thus, the dependence among attributes in a TAN network is represented via a tree structure. Figure 6 shows an example of a TAN network. In a TAN network, an edge from A_i to A_j implies that the influence of A_i on the assessment of Outcome also depends on the value of A_j. For example, in Figure 6, the influence of the feature "color1" on Outcome depends on the value of "color7," while in the naive Bayesian classifier the influence of each feature on Outcome is independent of the other features. These edges affect the classification process in that a value of "color1" that is typically surprising (i.e., P(color1 | Outcome) is low) may be unsurprising if the value of its correlated attribute, "color7," is also unlikely (i.e., P(color1 | Outcome, color7) is high). In this situation, the naive Bayesian classifier will overpenalize the probability of the class by considering two unlikely observations, while the TAN network of Figure 6 will not do so, and thus will achieve better accuracy. TAN networks have the attractive property of being learnable in polynomial time [19]. In the next section we show the results of using the TAN classifier for fusing the information provided by the various filters. As a control, we also used the naive Bayes classifier introduced above. As illustrated in the next section, the lack of correlation modeling between the different features causes a substantial increase in the number of false positives (e.g., classifying normal frames as boundaries).

Figure 6: A TAN model learned using only features that take color into account. The numbers on the arcs indicate the conditional mutual information between the features.
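The core of TAN structure learning in [19] is a Chow-Liu style construction: weight each feature pair by the conditional mutual information I(A_i; A_j | Outcome) and keep a maximum-weight spanning tree over the features. The sketch below is our own condensed illustration on synthetic discretized data, not the authors' code.

```python
import numpy as np

def cond_mutual_info(x, y, z, eps=1e-12):
    """I(X; Y | Z) for integer-coded discrete arrays x, y, z."""
    cmi = 0.0
    for zv in np.unique(z):
        mask = z == zv
        pz = mask.mean()
        joint = np.zeros((x.max() + 1, y.max() + 1))
        for xv, yv in zip(x[mask], y[mask]):
            joint[xv, yv] += 1
        joint /= joint.sum()
        px = joint.sum(axis=1, keepdims=True)
        py = joint.sum(axis=0, keepdims=True)
        ratio = joint / np.maximum(px * py, eps)
        cmi += pz * np.sum(np.where(joint > 0, joint * np.log(np.maximum(ratio, eps)), 0.0))
    return cmi

def max_spanning_tree(weights):
    """Prim's algorithm; returns the feature-feature arcs of the maximum-weight tree."""
    n = weights.shape[0]
    sym = np.maximum(weights, weights.T)
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        i, j = max(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: sym[e])
        edges.append((i, j))
        in_tree.add(j)
    return edges

# Synthetic discretized features: a0 and a1 are correlated even given Outcome; a2 is noise.
rng = np.random.default_rng(3)
outcome = rng.integers(0, 3, 500)
a0 = ((outcome > 0).astype(int) + (rng.random(500) < 0.2)) % 2
a1 = (a0 + (rng.random(500) < 0.1)) % 2
a2 = rng.integers(0, 2, 500)
A = np.stack([a0, a1, a2], axis=1)

n_feat = A.shape[1]
weights = np.zeros((n_feat, n_feat))
for i in range(n_feat):
    for j in range(i + 1, n_feat):
        weights[i, j] = cond_mutual_info(A[:, i], A[:, j], outcome)

# In the full TAN classifier, Outcome is then added as a parent of every feature.
print("feature-feature arcs kept by TAN:", max_spanning_tree(weights))
```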

5 Results

Several experiments were conducted with the Bayesian network model induction and segmentation. However, for the sake of brevity, in this paper we present only an outline of our experiments and indicate some of the results. This work is still in progress, and the next section presents the future directions. The video segmentation experiments were performed on samples of broadcast news video. The video segments were processed using the different color filters, texture filters, the tracking algorithm, and the spatio-temporal edge detectors. We also ran the "flash detector" on all the data. In all there were 51 features: 9 color features, 3 from the flash detector output, 36 from the texture filters, 2 from the tracker, and 1 from the spatio-temporal edge detector. In video data the fraction of frames that are segment boundaries or flashes is extremely small, because video has 30 frames per second and scene changes do not typically occur more than once every 4-5 seconds.

This is a tough problem, as only approximately 1% of the data consists of breaks and flashes. We are not interested in accuracy (the percentage of successfully classified frames), since the vast majority of the frames are normal (approximately 99%). Our criteria must be based on the number of false negatives (how many segment boundaries or flashes were missed) and the number of false positives (how many normal frames were confused by our model for segment boundaries or flashes).

The first Bayesian network model was generated with the data discretized following the method of Fayyad and Irani [20], using the routines in the MLC++ package [21]. We first trained on the whole data set and tested classification on the same data set. We run the risk of over-fitting the data, but given the nature of the problem (so few instances) we wanted a sanity check. The results were very encouraging. Only 3 of the 36 events we were interested in were missed, that is, only 3 false negatives, and there were 27 false positives. The same experiment with the naive Bayes classifier returned 0 false negatives (i.e., not a single segment boundary or flash was missed); however, the number of false positives jumped to 214 (57 of the normal frames were labeled as boundaries and 157 were labeled as scenes with a flash). We performed a five fold test to check how the model would behave against unseen data. The folds maintained the proportions of the interesting cases in the training data, but naturally reduced the number of instances. The results show that about 1 in every 6.4 segment boundaries is missed, and that about 1 in every 50.3 normal frames is considered to be a boundary or a flash. To test whether the model was indeed fusing information, we tested the performance of the four feature classes in isolation. The results reveal that fusion indeed took place. The number of false negatives decreased significantly for the filters based on "color," "track," and "flash"; for "texture," even though the number of false negatives for segment boundaries increased by 2, the number of false negatives for flashes decreased by the same amount.

It is worth noting that in the case of "texture" the number of false positives decreased by almost 40%. We also attempted to induce a classifier without discretization, using Gaussians and linear Gaussians as the family of distributions. The results were poor and the model failed to identify the majority of the scenes of interest (25 false negatives). This was a surprising result for us, and we are trying to characterize it further at this time. As described in the next section, future work includes the exploration of more sophisticated models, such as those described in [22].

6 Conclusions & Future Work

Fusion of information from multiple sensors or from different computational modules is becoming important in many applications. In particular, for multimedia applications (with audio and video) the fusion of cues from the different media channels and from different processing modules is becoming increasingly critical. For example, in the domain of multimedia information processing, applications requiring content based search and retrieval require interpretation of features in the data from all the media sources. In this paper we presented a framework based on Bayesian networks for the fusion of information from multiple sources. This framework is very general and extensible. The preliminary results of our fusion experiments are very encouraging. Currently we are applying this framework to the integration of multiple cues resulting from the processing of audio, video/imagery, speech, and text in broadcast news video.

In addition to the fusion framework we also presented novel features to evaluate the content of video. These included the multi-scale color and texture filters and the edges in the spatio-temporal volume of data representing video. The usual approach to feature detection in video is repeated application of image feature detectors to every frame of the video. Our approach was to design feature detectors specific to video data.

This approach, we believe, is the key to characterizing the structure of video data and extracting features that are relevant to the content of video. In our experiments with the Bayesian network models we were able to design different networks that performed better on one of the performance metrics at the expense of others. We are currently exploring methods to characterize the different models and quantify the tradeoffs. We are also exploring the use of more sophisticated models, such as those in [22], that include mixtures of Gaussians and also a mix of discrete and continuous features. A significant step will be to use models that do not consider the data to be independent and identically distributed but rather sequences in time. The Bayesian network model also allows us to evaluate the contribution of the different features towards the final decision using value of information computations in Bayesian networks. This results in a minimal set of features necessary to reliably segment video. In addition, we will generate a decision tree where the output of one feature detector directs the test with the next feature detector. Feature computations tend to be computationally expensive; therefore our goal is to provide a decision procedure that determines the redundant computations and the most significant computations.

Acknowledgements

We would like to thank Claire Monteleoni for her help with the experiments.

References

[1] H. Zhang, A. Kankanhalli, and S. Smoliar, "Automatic partitioning of full-motion video," Multimedia Systems, 1(1):10-28, 1993.
[2] B.-L. Yeo, Efficient Processing of Compressed Images and Video. PhD thesis, Princeton University, 1996.
[3] R. Kasturi and R. Jain, "Dynamic vision," in Computer Vision: Principles (R. Kasturi and R. Jain, eds.), vol. 1, 1991.
[4] Y. Taniguchi, A. Akutsu, and Y. Tonomura, "PanoramaExcerpts: Extracting and packing panoramas for video browsing," in Proc. of the ACM Multimedia Conference, Seattle, Nov. 1997.
[5] P. Aigrain and P. Joly, "The automatic real-time analysis of film editing and transition effects and its applications," Computers & Graphics, vol. 18, pp. 93-103, Jan. 1994.
[6] A. Hampapur, R. Jain, and T. Weymouth, "Digital video segmentation," in Proc. of ACM Multimedia, San Francisco, pp. 357-364, Oct. 1994.
[7] H. Ueda, T. Miyatake, and S. Yoshizawa, "IMPACT: An interactive natural motion picture dedicated multimedia authoring system," in Proceedings of CHI, New Orleans, pp. 343-350, April 1991.
[8] A. Nagasaka and Y. Tanaka, "Automatic video indexing and full-video search for object appearances," in Visual Database Systems II (E. Knuth and L. Wegner, eds.), pp. 113-127, Elsevier Science Publishers, 1992.
[9] D. Swanberg, C. Shu, and R. Jain, "Knowledge guided parsing and retrieval in video databases," in Storage and Retrieval for Image and Video Databases, vol. 1908 of Proc. of SPIE, pp. 173-187, Feb. 1993.
[10] R. Zabih, J. Miller, and K. Mai, "A feature-based algorithm for detecting and classifying scene breaks," in Proc. of ACM Multimedia 95, San Francisco, pp. 189-200, Nov. 1995.
[11] B. Shahraray, "Scene change detection and content-based sampling of video sequences," in Digital Video Compression: Algorithms and Technologies, vol. 2419 of Proc. SPIE, pp. 2-13, February 1995.
[12] P. Hsu and H. Harashima, "Detecting scene changes and activities in video databases," in Proc. of ICASSP 94, pp. 13-36, April 1994.
[13] I. Fogel and D. Sagi, "Gabor filters as texture discriminator," Journal of Biological Cybernetics, vol. 61, pp. 103-113, 1989.
[14] J. Shi and C. Tomasi, "Good features to track," in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Seattle, June 1994.
[15] K. Otsuji and Y. Tonomura, "Projection detecting filter for video cut detection," in Proc. of ACM Multimedia 93, Anaheim, pp. 251-257, ACM, August 1993.
[16] H. Baker and R. C. Bolles, "Generalizing epipolar-plane image analysis on the spatiotemporal surface," International Journal of Computer Vision, 3:33-49, 1989.
[17] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. New York: John Wiley & Sons, 1973.
[18] D. Heckerman, D. Geiger, and D. M. Chickering, "Learning Bayesian networks: The combination of knowledge and statistical data," Machine Learning, 20:197-243, 1995.
[19] N. Friedman, D. Geiger, and M. Goldszmidt, "Bayesian network classifiers," Machine Learning, 29:131-163, 1997.
[20] U. Fayyad and K. Irani, "Multi-interval discretization of continuous-valued attributes for classification learning," in Proc. of Intl. Joint Conf. on AI, pp. 1022-1027, 1993.
[21] R. Kohavi, G. John, R. Long, D. Manley, and K. Pfleger, "MLC++: A machine learning library in C++," in Proc. 6th Intl. Conf. on Tools with Artificial Intelligence, pp. 740-743, 1994.
[22] N. Friedman, M. Goldszmidt, and T. Lee, "Bayesian network classification with continuous attributes: Getting the best of both discretization and parametric fitting," in Proc. Intl. Conf. on Machine Learning, 1998.
