Recursive estimation of generative video models (Video file “VideoBrowsing.wmv” is an integral part of this submission)
Nemanja Petrovic Beckman Institute Urbana, IL 61801
Nebojsa Jojic Microsoft Research Redmond, WA 98052
Sumit Basu Microsoft Research Redmond, WA 98052
Thomas Huang Beckman Institute, Urbana
Abstract We present a generative model of video for which the learning procedure automatically clusters video frames into scenes and objects. The learning algorithm is based on a recursive hierarchical procedure which can be shown to converge and for which some guarantees on the local optimality of the resulting log-likelihood of the data can be given. In previous work we presented a generative model for video which assumes that most of the variability in the video comes from scene changes and simple camera translations. In this paper we present a generalization of that algorithm that allows for more complicated and realistic transformations within a computationally very efficient framework. The final result is a static or dynamic visual representation of the entire input video that allows visualizing all scenes at once and easily navigating through a long input video. We show results on the analysis of a typical home video.
1 INTRODUCTION
Motivation. The amount of video data available to an average consumer has already become overwhelming. Still, there is a lack of efficient general-purpose tools for navigating this vast amount of information. We strongly believe that a successful video browsing and summarization system has to accomplish two goals. First, it must correctly model the sources of the vast information content in the video. Second, it must provide the user with an intuitive and fast video navigation interface that is compatible with, if not jointly optimized with, the analysis algorithm. As a solution to the first problem we propose the clustering of related but non-sequential frames into scenes. We propose a novel generative model for video and a hierarchical, fast, on-line clustering algorithm that groups frames that are similar up to translation and change in scale. As a solution to the second problem we propose a "video visualization" tool based on the visualization of variables in the generative model, which ideally reflects all frames in the video. The video navigation tool uses this visualization as an efficient, visually meaningful index into the video or a set of video sequences.

Previous work. Video summarization is one of the most difficult problems in automatic video understanding. It aims to produce a short, yet representative, synopsis of the video by extracting pertinent information or highlights that enable the viewer to quickly grasp the general story or navigate to a specific segment. Related to this, a large amount of work has previously been reported on building interactive navigational tools. The approaches to video summarization can be broadly grouped into two distinct groups: 1. static summarizations using shots, mosaics and storyboards, and 2. dynamic summarizations using video skimming. Numerous shot and key-frame detection approaches are based on extracting and tracking low-level features over time and detecting their abrupt changes. These approaches have two drawbacks: they tend to be profuse in unimportant details yet insufficient for overall video understanding (only a frame-sized thumbnail per shot). Long lists of shots result in another information flood rather than an abstraction, while key frames alone are not sufficient for a user to judge the relevance of the content, especially if the content is dynamic. Several shot-independent approaches have been proposed recently, like hierarchical key-frame clustering [4, 5], and methods that integrate face or object detectors. While promising, they have several drawbacks, ranging from the lack of a robust similarity measure to the fact that above a certain level in the cluster hierarchy, visually different shots start to be merged together.
There have been several interesting approaches to video browsing and summarization focused on mosaicking. In this paper we extend our previous work to include a probabilistic mosaic representation. Similar mosaicking representations have been used before, but were overly constrained with respect to camera motions, requiring that target scenes be approximately planar and contain no moving objects. For example, [2] studied mosaics in the context of video browsing and a synoptic view of a scene. The video summaries work [3] uses mosaics for summarization, but ignores foreground motion, relies on background invariance for scene clustering, and uses an ad-hoc scene similarity measure that exploits special properties of particular video genres. Our work is most closely related to the mosaicking work [14] on mosaic-based representations of video sequences. There are a few important differences. Our method, having video clustering and compact video presentation as the ultimate goals, has a notion of variability of appearance (moving objects in the scene and blemishes in the background are allowed), automatically estimates the number of parameters (e.g., the number of different scenes), explains the causes of variability of the data, and recognizes a scene that has already appeared. Also, other methods were used in highly regimented settings (e.g., aerial surveillance, "sitcoms"), whereas ours is intended for the general class of unconstrained home videos. Recently, a lot of work has been reported in the field of dynamic video summarization, often referred to as skimming. It consists of collecting representative or desirable sub-clips from the video. In this context, several authors have proposed interactive navigational tools. The most direct approach to video skimming is compression using video fast forward [10]. One approach is to use generative models [9] to construct a similarity measure between a query sequence provided by the user and the target sequence, and thus adapt the playback speed to the content. The browsing tool we introduce in this paper has the distinctive feature that it includes both a static and a dynamic summary of the video. Finally, more sophisticated methods have been used for video summarization, including generative models and linear dynamical systems [6], Bayesian decision theory [7] and singular value decomposition [8]. Despite numerous efforts, the results are still not impressive. In general, some of the methods rely on learning the semantic content of a general class of videos, which is an arduous task to begin with. Others are so computationally intensive that they are impractical for real applications. The uniqueness of our work is in three aspects: statistical interpretation of video scenes using generative
models; the application of variational inference in a new recursive, hierarchical clustering algorithm; and a powerful new navigational and browsing tool.

Overview. After briefly reviewing our previous work, we will introduce the Masked Scene Model. We will present techniques for speeding up the naive implementation by orders of magnitude. We will introduce a new hierarchical clustering algorithm based on a variational EM algorithm. Finally, we will present the results of clustering and the new video browsing tool.
2 PREVIOUS MODEL: TRANSFORMED HIDDEN MARKOV MODEL
To discover scenes, we reformulate the space of videos as a generative model. Following the pattern recognition ideas of [1], we argue that the vast majority of video objects and scenes can be modelled by a small set of ideal (normalized) objects and scenes that have been subjected to geometric transformations. In the generative sense, while isolated transformations can generate (or explain) little, a small number of combined causes of variability applied to a small set of normalized video scenes produces the rich video content. We define the video scene clustering problem as the partitioning of the video space into a transformation set and a class of normalized scenes. Furthermore, we define the transformations explicitly: out of all transformations, camera pan, tilt, and zoom seem to introduce the most variability in the video. Once this set is defined, the class of normalized scenes can be inferred from the training data. Normalized objects, sometimes called exemplars, are often just a subset of the data set. A feature of our approach is that exemplars are generated from the data, for example by weighted averaging. While the inference problem in this model may appear daunting, we have developed a variety of techniques for speeding up a naive implementation by orders of magnitude. Based on these principles, in our previous work [11, 12] we introduced the transformed hidden Markov model (THMM) and techniques for real-time learning of THMMs. In order to utilize the fast Fourier transform (FFT) for inference and learning (Sec. 4), the shifts in our model are defined as wrapping around the image boundaries. While this model is sufficient to cover a wide range of real-world videos, it ignores the fact that the viewing field of the camera often covers only a small portion of the actual scene. When the camera makes large pans, the previously unseen part of the scene can best be explained by introducing a new cluster in the mixture model. When there is a lot of camera motion, the total cost of introducing an excessive number of clusters overwhelms the efficiency gains. The situation gets even more complicated when there are changes in scale. The natural way to address these issues is to include new variables in the model: a cropping transform and a discrete scale variable. This demonstrates the flexibility of generative modeling, where new causes of variability in the data are modeled simply by introducing new variables into the model. While this model may look like a simple extension of the previous model with just a new type of transformation, the naive implementation is infeasible because FFTs cannot be used for composite transformations. Additionally, the complexity of the model would grow rapidly as new variables (e.g., scale, rotation, lighting) are included. It is therefore desirable to treat different transformations separately. It is also desirable to perform the clustering using a hierarchy of models, starting with the simplest (translation-invariant only) model and feeding the results of the lower level to the higher level. These are the issues we address in this paper.
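The model above relies on shifts that wrap around the image boundaries so that the FFT can be used for inference and learning. As a small self-contained illustration of that property (our own sketch, not from the paper; values and sizes are arbitrary), a circular shift is a pure phase factor in the Fourier domain, which is what allows sums and searches over all shifts to be carried out with FFTs instead of by testing every shift:

```python
import numpy as np

# A circular (wrap-around) shift of a signal equals multiplication of its
# DFT by a phase ramp; this is the property exploited throughout the paper.
x = np.random.default_rng(0).standard_normal(8)
shift = 3
k = np.arange(x.size)
phase = np.exp(-2j * np.pi * k * shift / x.size)
shifted_via_fft = np.real(np.fft.ifft(np.fft.fft(x) * phase))
assert np.allclose(shifted_via_fft, np.roll(x, shift))
```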
Figure 1: Top: Generative masked scene model. The pair c-z is a Gaussian mixture. Observation x is obtained by scaling by Z, transforming and cropping the latent image z by a translation indexed by T. Bottom: Nine typical frames from the video, three corresponding cluster means, and, far right, the mean of a cluster obtained using scale/translation-invariant distribution clustering (see Section 5).

3 THE MASKED SCENE MODEL
Similar to our previous work [12], the analysis algorithm we present here is based on a simple generative model in Fig. 1 (right). The appearance of the scene is modelled by a Gaussian appearance map. The probability density of the vector of pixel values z for the latent image corresponding to cluster c is

p(z|c) = \mathcal{N}(z; \mu_c, \Phi_c),   (1)
where µc is the mean of the latent image z, and Φc is a diagonal covariance matrix that specifies the variability of each pixel in the latent image. The variability Φc is necessary to capture various causes not captured by variability in the scene class and transformation, e.g., slight blemishes in appearance or changes in lighting. We do not model the full covariance matrix, as there is never enough data to estimate it from the data. It is possible, however, to use a subspace modelling technique to capture some correlations in this matrix. The observable image is modelled as a cropped region of the latent scene image. Before cropping, the latent image undergoes a transformation composed of a zoom and a translation. The motivation for this is that, of the global camera motion types, zoom and pan are the most frequent, while rotations are fairly rare. Leaving out more complex motion, such as that produced by perspective effects, several dominant motion vector fields, nonuniform motion, etc., speeds up the algorithm (real time in our implementation), but makes the above defined variance maps a crucial part of the
model, as they can capture the extra variability, although in a cruder manner. In addition, the nonuniform variance map has to capture some other causes of variability we left out, such as small illumination changes, variable contrast, etc. The probability density of the observable vector of pixel values x for the image corresponding to the zoom Z, translation T, latent image z and fixed cropping transform W is

p(x|T, Z, z) = \delta(x - WTZz),   (2)
where T and Z come from a finite set of possible transformations. We consider only several different levels of zoom, and the computational burden of searching over all integer translations is relieved by performing the computations in the Fourier domain using the fast Fourier transform (FFT). While we can view this model as a special case of the model in [13], with the transformation T replaced by the composition WTZ, it is necessary, as we will show, to keep these transforms separate in order to derive an efficient inference algorithm based on FFTs, which is several orders of magnitude faster than an algorithm that tests all possible transformations sequentially. The joint likelihood of a single video frame x and latent image z, given c, T and Z, is

p(x, z|c, T, Z) = \delta(x - WTZz)\, \mathcal{N}(z; \mu_c, \Phi_c).   (3)
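As a concrete reading of (1)-(3), the following sketch (our own illustration; the array shapes, zoom levels and variable names are assumptions, not from the paper) generates one frame from the model: sample a scene class, sample a latent image from its diagonal Gaussian, zoom, translate with wrap-around, and crop the upper-left window.

```python
import numpy as np

def generate_frame(pi, mu, var, latent_hw=(240, 320), obs_hw=(120, 160), rng=None):
    """Sample one observed frame x from the masked scene model (Eqs. 1-3).

    pi : (C,) scene priors; mu, var : (C, H, W) latent means and pixel variances."""
    if rng is None:
        rng = np.random.default_rng()
    H, W = latent_hw
    h, w = obs_hw
    c = rng.choice(len(pi), p=pi)                                # scene class c ~ p(c)
    z = mu[c] + np.sqrt(var[c]) * rng.standard_normal((H, W))    # z ~ N(mu_c, Phi_c)

    # Zoom Z: one of a few discrete scale levels (nearest-neighbour resampling for brevity).
    scale = rng.choice([1.0, 1.5, 2.0])
    rows = (np.arange(H) / scale).astype(int) % H
    cols = (np.arange(W) / scale).astype(int) % W
    z = z[np.ix_(rows, cols)]

    # Translation T: a random integer shift, wrapped around the image boundaries.
    dy, dx = rng.integers(0, H), rng.integers(0, W)
    z = np.roll(z, (dy, dx), axis=(0, 1))

    # Cropping W: the camera sees only the upper-left h x w window of the latent image.
    x = z[:h, :w]
    return x, c
```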
Note that the distribution over z can be integrated out in closed form:

p(x|c, T, Z) = \mathcal{N}(x;\, WTZ\mu_c,\, WTZ\Phi_c Z^\top T^\top W^\top).   (4)
Under the assumption that each frame is independently generated in this fashion, the joint distribution over all variables is

p(\{x_t, c_t, Z_t, T_t\}_{t=1}^{T}) = \prod_t p(x_t|c_t, Z_t, T_t)\, p(c_t)\, p(T_t, Z_t).   (5)

The model is parameterized by the scene means µc, the pixel variances stored on the diagonal of Φc, and the scene probabilities πc = p(ct = c), and as such provides a summary of what is common in the video. The hidden variables ct, Zt, Tt describe the main causes of variability in the video, and as such vary from frame to frame. The prior distribution over Z, T is assumed uniform. To capture correlations in the hidden variables of neighboring frames, we can use the above model as a stage in a hidden Markov model in which the video sequence X = {xt}t=1,...,T forms the observations and the hidden states S = {ct, Zt, Tt}t=1,...,T form a Markov chain. For simplicity, we will treat the case without the Markov chain on the hidden states, which can help when the inference is done at very low resolutions.
4 INFERENCE OF CLASSES AND TRANSFORMATIONS. LEARNING THE SCENES IN THE MASKED SCENE MODEL
Inference (posterior optimization). For our model it is possible to derive an exact EM algorithm that optimizes the free energy [15] of the form

F = \sum_t \sum_{c_t, Z_t, T_t} q(c_t, Z_t, T_t) \log \frac{p(x_t|c_t, Z_t, T_t)\, \pi_{c_t}}{q(c_t, Z_t, T_t)}.   (6)

For given parameters, we can optimize the free energy with respect to the posterior q. We express the posterior as q(c_t, Z_t)\, q(T_t|c_t, Z_t) and optimize F under the normalization constraints \sum_{c_t, Z_t} q(c_t, Z_t) = 1 and \sum_{T_t} q(T_t|c_t, Z_t) = 1, which yields the same result as applying Bayes' rule:

q(T_t|c_t, Z_t) \propto p(x_t|c_t, Z_t, T_t), \qquad q(c_t, Z_t) \propto p(c_t)\, e^{\sum_{T_t} q(T_t|c_t, Z_t) \log p(x_t|c_t, Z_t, T_t)}.   (7)
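To make this E-step concrete, here is a minimal sketch (our own illustration, not the authors' code; array names and shapes are assumptions): given the per-configuration log-likelihoods log p(x_t|c, Z, T) for one frame, it computes q(T|c, Z) and q(c, Z) as in (7). The uniform prior over (T, Z) drops out as a constant. In practice the per-shift log-likelihoods are not formed by looping over shifts; the FFT techniques described below are used instead.

```python
import numpy as np

def e_step(loglik, log_prior_c):
    """Posterior updates of Eq. (7) for a single frame.

    loglik      : (C, Nz, H, W) array with log p(x | c, Z, T) for every class c,
                  zoom level Z and integer shift T.
    log_prior_c : (C,) array with log p(c).
    Returns q(T | c, Z) of shape (C, Nz, H, W) and q(c, Z) of shape (C, Nz)."""
    # q(T | c, Z) is proportional to p(x | c, Z, T): normalize over shifts for each (c, Z).
    q_T = np.exp(loglik - loglik.max(axis=(2, 3), keepdims=True))
    q_T /= q_T.sum(axis=(2, 3), keepdims=True)

    # q(c, Z) is proportional to p(c) exp( sum_T q(T | c, Z) log p(x | c, Z, T) ).
    log_qcz = log_prior_c[:, None] + np.sum(q_T * loglik, axis=(2, 3))
    q_cz = np.exp(log_qcz - log_qcz.max())
    q_cz /= q_cz.sum()
    return q_T, q_cz
```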
Parameter optimization. Taking the derivative of F with respect to the cluster mean µk (i.e., for ct = k) and setting it to zero, we get

\sum_{t=1}^{T} \sum_{Z_t, T_t} q(c_t = k, Z_t, T_t)\, (W T_t Z_t)^\top (W T_t Z_t \Phi_k Z_t^\top T_t^\top W^\top)^{-1} (x_t - W T_t Z_t \mu_k) = 0.   (8)

It can be shown that

Z^\top T^\top W^\top (W T Z \Phi_c Z^\top T^\top W^\top)^{-1}\, x_t = Z \Phi_c^{-1} Z^\top (T^\top X_t), \qquad Z^\top T^\top W^\top (W T Z \Phi_c Z^\top T^\top W^\top)^{-1}\, W T Z = Z \Phi_c^{-1} Z^\top \operatorname{diag}(T^\top m),   (9)
where m \triangleq \operatorname{diag}(W^\top W) is the binary mask that shows the position of the observable image within the latent image (upper left corner), and X \triangleq W^\top x is the frame x zero-padded to the resolution of the latent image. Thus, assuming that the zoom has a small effect on the inverse variances, i.e., (Z\Phi Z^\top)^{-1} \approx Z\Phi^{-1} Z^\top (to avoid degenerate solutions, the likelihood is scaled by the factor by which the zoom increases the number of pixels), we obtain simple update rules, e.g., for µ̃k:

\tilde{\mu}_k = \frac{\sum_{t=1}^{T} \sum_{Z_t} q(c_t = k, Z_t)\, Z_t^{-1} \sum_{T_t} q(T_t|c_t = k, Z_t)\, (T_t^\top X_t)}{\sum_{t=1}^{T} \sum_{Z_t} q(c_t = k, Z_t)\, Z_t^{-1} \sum_{T_t} q(T_t|c_t = k, Z_t)\, (T_t^\top m)},   (10)

where Z^{-1} is the pseudoinverse of the matrix Z, i.e., the inverse zoom, and the division is carried out pixel by pixel. In a similar fashion, we obtain the derivatives of F with respect to the other two types of model parameters, Φk and πk, and derive their update equations. It may seem at first that zero-padding the original frame to the size of the latent image constitutes an unjustified manipulation of the data. However, since zero is the neutral element in the summation of the sufficient statistics in (10), it is simply a mathematical convenience that allows us to treat all variables as being of the same size (resolution). The intuition behind (10) is that the mean latent (panoramic) image is the weighted sum of the properly shifted and scaled frames, normalized by a "counter" that keeps track of how many times each pixel was visited.

Speeding up inference and parameter optimization using FFTs. The inference and update equations (7) and (8) involve either testing all possible transformations or summing over all possible transformations T. If all possible integer shifts are considered (which is desirable, since one can then handle arbitrary interframe shifts), these operations can be performed efficiently in the Fourier domain by identifying them as either convolutions or correlations. For example, (10) can be computed efficiently using the two-dimensional FFT as

\sum_{T} q(T|c, Z)\, (T^\top X) = \mathrm{IFFT2}\big[\operatorname{conj}(\mathrm{FFT2}(q(T))) \circ \mathrm{FFT2}(X)\big],   (11)
where ◦ denotes pointwise multiplication and "conj" denotes the complex conjugate. This is done for each combination of the class and scale variables, and a similar convolution of the transformation posterior is also applied to the mask m. Similarly, FFTs are used during inference to compute the Mahalanobis distance in (4). This reduces the computational complexity of both inference and parameter updates from N^2 to N log N (N is the number of pixels), allows us to analyze video frames of higher resolution, and demonstrates the benefit of keeping the translation variable T separate from the cropping W and zoom Z in the model. The computation is still proportional to the number of classes, as well as the number of zoom levels we search and sum over in the E and M steps, but the number of these configurations is typically much smaller than the number of possible shifts in the image.

On-line learning. The learning described in the previous section still suffers from two drawbacks: the need to preset the number of scenes C, and the need to iterate. The structure of realistic video allows the development of more efficient algorithms. Frames in video typically come in bursts of a single class, which means that the algorithm does not need to test all classes against all frames all the time. We propose an on-line EM algorithm that passes through the data only once and introduces new classes as new data is observed. We refer readers to our previous work [12] for the details. In order to dynamically learn the number of scenes in the on-line EM algorithm, we introduce a log-likelihood threshold γ: whenever the log-likelihood falls below the threshold, a new scene is introduced. The sensitivity to the choice of threshold is overcome by setting a high threshold that guarantees the likelihood of the dataset remains high, which may lead to over-clustering. The resulting problem of merging a large number of clusters – still much smaller than the number of frames – is addressed by the EM algorithm for clustering of distributions described in the next section. We make two remarks here: 1. On-line clustering that checks the current frame against all existing clusters has complexity linear in the number of clusters, which can be quite large. A suboptimal but efficient initial solution is to check the frame only against the current cluster, i.e., there is no notion of a repeating scene. 2. The learning procedure is fastest for the translation-invariant model (Z = I). Taking into account that camera/object shifts are by far the most common transformations in video, ignoring the scale variable leads to efficient initial clustering. Hence, we propose a multi-pass, hierarchical clustering algorithm which, in the first pass of on-line learning, tests each data point against the current cluster only and does not test for the scale variable. This leads to real-time performance (20 frames per second on a 2 GHz PC). In the following passes we use a generalized EM algorithm (derived in the next section), which tests each data point against previous clusters and includes the translation and zoom variables. It simplifies the model by re-estimating it, compacting the summary into a smaller number of clusters. When only a handful of clusters is left, we can further cluster them using the batch EM algorithm with the number of clusters determined by an MDL criterion. We note that the joint optimization over scale and translation is not approximated by independent optimizations: while optimization over scale is not done in the first pass, subsequent passes optimize over both variables.
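As a concrete illustration of the correlation in (11), the sketch below (our own, with assumed variable names, not the authors' code) accumulates the shift-weighted, zero-padded frame with a single 2-D FFT-based correlation instead of looping over all integer shifts.

```python
import numpy as np

def shift_weighted_sum(q_T, X):
    """Evaluate sum_T q(T|c,Z) (T' X) over all circular integer shifts T, as in Eq. (11).

    q_T : (H, W) posterior over 2-D wrap-around shifts for one (class, zoom) pair
    X   : (H, W) observed frame zero-padded to the latent-image resolution
    Returns the (H, W) shift-weighted image used in the numerator of Eq. (10)."""
    return np.real(np.fft.ifft2(np.conj(np.fft.fft2(q_T)) * np.fft.fft2(X)))

# Naive equivalent (O(N^2) instead of O(N log N)); T' X is X rolled by the negative shift:
#   sum(q_T[dy, dx] * np.roll(X, (-dy, -dx), axis=(0, 1))
#       for dy in range(X.shape[0]) for dx in range(X.shape[1]))
```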
5 RECURSIVE MODEL ESTIMATION
For simplicity, we will formally derive the reestimation algorithm using a mixture model as an example, with the understanding that the same derivation is carried out for the more complex transformed model. Given that a large number of images xt have been summarized using a mixture model p(x) = \sum_c p(x|c) p(c) with C clusters (data centers), each described by a prior p(c) = πc, a mean µc and a diagonal covariance matrix Φc, we want to estimate another mixture model p¹, defined by a smaller number of clusters S with parameters πs, µs, Φs, on the same data. The reparameterized model p¹ can then be reparameterized again, giving a hierarchy of models of decreasing complexity. By doing this we access the original data exactly once and avoid costly repeated processing of the hundreds of thousands of video frames in a typical video. Assuming that p summarizes the data well, we can replace the real data {xt} with similar data {yt} randomly generated from the obtained mixture model, and estimate the parameters of the model p¹ using the virtual data {yt}. While a particular set {yt} may or may not be a good replacement for the original data, we assume that if p is a good model, training another model on the random data will on average give results similar to training on the original data. We can do this without actually generating the data, dealing instead only with expectations under the model p. We optimize the expectation of the likelihood of the generated data, \frac{1}{T} \sum_t \log p^1(y_t) for large T, where
yt ∼ p(y) (in our example, the mixture model with parameters {πc, µc, Φc}):

\frac{1}{T} \sum_{t=1}^{T} \log p^1(y_t) \;\rightarrow\; E[\log p^1(y)] = \int_y p(y) \log p^1(y) = \int_y \sum_c p(y|c)\, p(c) \log p^1(y) = \sum_c p(c) \int_y p(y|c) \log p^1(y) \;\geq\; \sum_c p(c) \int_y p(y|c) \sum_s q_c(s) \log \frac{p^1(y|s)\, p^1(s)}{q_c(s)} \;=\; -EF,
where the inequality follows by the same convexity argument as in the case of the standard free energy, and the new bound on the free energy EF would be tight if q_c(s) were exactly equal to the posterior, i.e., q_c(s) = p¹(s|y). However, we assume that the posterior is the same for all y once the class c is chosen, and we emphasize this with the notation q_c(s). Under this assumption the bound further simplifies into

-EF = \sum_c p(c) \Big\{ \sum_s q_c(s) \int_y p(y|c) \log p^1(y|s) + \sum_s q_c(s) \big[\log p^1(s) - \log q_c(s)\big] \Big\}
= \sum_c p(c) \sum_s q_c(s) \Big[ -\tfrac{1}{2} (\mu_s - \mu_c)^\top \Phi_s^{-1} (\mu_s - \mu_c) - \tfrac{1}{2} \operatorname{tr}(\Phi_s^{-1} \Phi_c) - \tfrac{1}{2} \log|2\pi\Phi_s| \Big] + \sum_c p(c) \sum_s q_c(s) \big[\log p^1(s) - \log q_c(s)\big].
Minimizing the free energy under the usual constraints, e.g., \sum_s q_c(s) = 1, yields an iteration of an EM algorithm that reparameterizes the model; e.g., for the plain mixture model,

q_c(s) \propto p^1(s)\, e^{-\frac{1}{2}(\mu_s - \mu_c)^\top \Phi_s^{-1}(\mu_s - \mu_c) - \frac{1}{2}\operatorname{tr}(\Phi_s^{-1}\Phi_c) - \frac{1}{2}\log|2\pi\Phi_s|},   (12)

\mu_s = \frac{\sum_c p(c)\, q_c(s)\, \mu_c}{\sum_c p(c)\, q_c(s)}, \qquad \Phi_s = \frac{\sum_c p(c)\, q_c(s)\big[(\mu_s - \mu_c)(\mu_s - \mu_c)^\top + \Phi_c\big]}{\sum_c p(c)\, q_c(s)}, \qquad \pi_s = p^1(s) = \frac{\sum_c p(c)\, q_c(s)}{\sum_c p(c)} = \sum_c p(c)\, q_c(s).   (13)

The EM algorithm above will converge to a local maximum, and the quality of the results will depend on the
validity of the assumption that the posterior q(s) is shared among all virtual data samples from the same class c. When the model p captures the original data with many narrow components p(y|c), and S ≪ C, the approximation is reasonable and reduces the computation by a factor of T/C in comparison with retraining directly on the original data. From these update equations we derive on-line update rules similar to those before. The result is a hierarchy of models which can be elegantly presented to the user through an appropriate user interface, discussed in the next section and shown in the video submission. Fig. 1 (right) illustrates the hierarchical clustering of three distributions into a single hyper-cluster, using both translation- and scale-invariant clustering. Finally, let us note that the optimization we performed is equivalent to minimizing the KL divergence between p and p¹. Hence, this result is an elegant approximation to the KL divergence between two mixtures of Gaussians, which is intractable in closed form.
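To make the reestimation step concrete, the sketch below (our own illustration, not the authors' code; function and variable names are assumptions) collapses a C-component diagonal-covariance Gaussian mixture into S components by iterating the updates (12)-(13); for diagonal covariances the trace and determinant terms reduce to sums over pixels.

```python
import numpy as np

def collapse_mixture(pi_c, mu_c, var_c, S, n_iter=50, seed=0):
    """Collapse a C-component diagonal Gaussian mixture (pi_c, mu_c, var_c)
    into S components by iterating Eqs. (12)-(13).
    pi_c: (C,) priors, mu_c: (C, D) means, var_c: (C, D) diagonal covariances."""
    rng = np.random.default_rng(seed)
    C, D = mu_c.shape
    # Initialize the coarse model from randomly chosen fine components.
    idx = rng.choice(C, size=S, replace=False)
    mu_s, var_s, pi_s = mu_c[idx].copy(), var_c[idx].copy(), np.full(S, 1.0 / S)

    for _ in range(n_iter):
        # E-step, Eq. (12): log q_c(s) up to an additive constant per c.
        diff = mu_c[:, None, :] - mu_s[None, :, :]               # (C, S, D)
        log_q = (np.log(pi_s)[None, :]
                 - 0.5 * np.sum(diff ** 2 / var_s[None, :, :], axis=2)
                 - 0.5 * np.sum(var_c[:, None, :] / var_s[None, :, :], axis=2)
                 - 0.5 * np.sum(np.log(2 * np.pi * var_s), axis=1)[None, :])
        log_q -= log_q.max(axis=1, keepdims=True)
        q = np.exp(log_q)
        q /= q.sum(axis=1, keepdims=True)                        # q_c(s), rows sum to 1

        # M-step, Eq. (13): weights are p(c) q_c(s).
        w = pi_c[:, None] * q                                    # (C, S)
        norm = w.sum(axis=0)                                     # (S,)
        mu_s = (w.T @ mu_c) / norm[:, None]
        diff = mu_c[:, None, :] - mu_s[None, :, :]
        var_s = np.einsum('cs,csd->sd', w, diff ** 2 + var_c[:, None, :]) / norm[:, None]
        pi_s = norm / pi_c.sum()
    return pi_s, mu_s, var_s
```

Applied recursively, with the output mixture collapsed again into a still smaller number of components, this produces the hierarchy of models that the browser exposes to the user.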
6 EXPERIMENTS WITH THE VIDEO BROWSER
We tested our system on an 18-minute-long home video. In the first pass of on-line learning, the video is summarized by 192 clusters, many of them repeating. We reestimate this model until we end up with a handful of classes. In all but the first pass we search over all configurations of the zoom and class variables, as explained earlier. The complexity of the learning drops exponentially as we go higher in the hierarchy, so we do not need to be careful about the selection of the sequence of thresholds or numbers of classes - we simply train a large number of models, since the user can choose any one of them quickly at browsing time. The enclosed screenshots and the video submission demonstrate the novel user interface that helps the user quickly search for scenes and objects. The interface features a time-line and an active panel containing the scene images - the set of 31 recognized scenes at one level of the hierarchy. We chose this level so that there would be few enough scenes to fit them all into a single view, but enough to represent the major elements of the video. The generative model is fit to a video, and meta-data are readily available: class membership and position within the class for each frame. Conversely, we label each pixel in the latent image by inferring all the frames to which it belongs. In order to explore how our representation can be used to browse and search through video, we developed the user interface shown in Fig. 2 (top, left). When the interface starts up, the right pane shows the set of top-level panoramas, i.e., the panoramas at the highest level of the hierarchy. The user can then zoom in
and out of these images. By moving the mouse and hovering over a pixel in a panorama, the user causes automatic playback of the set of frames in the video that see this point (see Fig. 2 top, right). The location of these frames is highlighted in green on the timeline at the bottom of the interface, and the video pane on the left loops through the selected frames, showing the current position with a red indicator. In this way, the user can browse through space as she moves the mouse pointer over the panoramas, or through time as she moves the cursor over the timeline. She can make "virtual pans" through the scene simply by moving the mouse around. In addition, the user can also explore the hierarchy through this interface. When the user clicks on a panorama, it will split into its children, i.e., the component panoramas that were clustered together into the current one. Fig. 2 (middle) shows the result of splitting a panorama into its children. The user can now mouse over these panoramas and navigate the video in the same way as with the original view. In Fig. 2 (bottom) we compare our method and shot detection based on color histograms against the manually labelled ground truth. The interface is especially useful for browsing home video, which typically consists of little short gems and long uninteresting segments that form no coherent story. The most important feature of our browser is that no frame of the video is hidden as the visual representation is estimated to represent all frames. The user easily finds the moment she is looking for and selectively plays parts of interest with the ability to switch to a different scene at any moment. This transforms our indexing/searching into a selective video consumption tool.

Acknowledgements This work was supported in part by the Advanced Research and Development Activity (ARDA)'s Video Analysis and Content Extraction Program (VACE) II under contract number MDA904-03-C-1787.
References

[1] D. Mumford. Pattern theory: a unifying perspective. In Perception as Bayesian Inference, Cambridge University Press, 1996.
[2] P. Anandan, M. Irani, M. Kumar, and J. Bergen. Video as an image data source: Efficient representations and applications. In Proceedings of IEEE ICIP, pages 318-321, 1995.
[3] A. Aner and J. R. Kender. Video summaries through mosaic-based shot and scene clustering. In European Conference on Computer Vision, 2002.
[4] M. Yeung, B. Yeo, and B. Liu. Segmentation of video by clustering and graph analysis. Computer Vision and Image Understanding, vol. 71, no. 1, July 1998.
[5] A. Girgensohn and J. Boreczky. Time-constrained keyframe selection technique. In Proc. IEEE Multimedia Computing and Systems, 1999.
[6] X. Orriols and X. Binefa. An EM Algorithm for Video Summarization, Generative Model Approach. In International Conference on Computer Vision, 2001.
[7] X. Orriols and X. Binefa. Online Bayesian Video Summarization and Linking. In International Conference on Image and Video Retrieval, 2002.
[8] Y. Gong and Y. Liu. Generating Optimal Video Summaries. In International Conference on Multimedia and Expo, 2000.
[9] N. Petrovic, N. Jojic, and T. S. Huang. Scene generative models for adaptive video fast forward. In International Conference on Image Processing, 2003.
[10] N. Omoigui, L. He, A. Gupta, J. Grudin, and E. Sanocki. Time-compression: System concerns, usage, and benefits. In Proceedings of ACM CHI, 1999.
[11] N. Jojic, N. Petrovic, B. J. Frey, and T. S. Huang. Transformed hidden Markov models: Estimating mixture models and inferring spatial transformations in video sequences. In Computer Vision and Pattern Recognition, 2000.
[12] N. Petrovic, N. Jojic, B. Frey, and T. Huang. Real-time on-line learning of transformed hidden Markov models from video. In Artificial Intelligence and Statistics, 2003.
[13] B. J. Frey and N. Jojic. Transformation-invariant clustering using the EM algorithm. IEEE Trans. Pattern Analysis and Machine Intelligence, 25(1), Jan 2003.
[14] M. Irani, P. Anandan, and S. Hsu. Mosaic based representations of video sequences and their applications. In Proceedings of the 5th ICCV, pages 605-611, June 1995.
[15] R. M. Neal and G. E. Hinton. A new view of the EM algorithm that justifies incremental, sparse and other variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355-368. Norwell, MA: Kluwer Academic Publishers, 1998.
Figure 2: Top, left: The video browsing interface upon startup. The pane on the right shows the top-level panoramas, i.e., the clusters at the top of the hierarchy. Top, right: Zooming into a particular panorama with the browsing interface. When the user mouses over a particular pixel in the panorama, all of the frames containing that pixel are retrieved, and the video pane on the left loops through all of these frames. The time locations of the frames are shown in green in the timeline at the bottom of the interface, while the current location is shown with a red indicator. Middle: Exploring the hierarchy of panoramas with the browsing interface. Bottom: Comparison of the ground truth, shot detection, and our method. The timeline with the ground-truth segmentation (left) illustrates the positions and labels of 15 scenes in the video. Some of the scenes repeat. The shot detection algorithm based on color histograms (center) is sensitive to sudden swipes and changes in lighting; at the same time, scene changes go undetected if they are due to slow swipes of the camera. Our approach (right) correctly detects scenes in most cases. We label the situations in which it oversegments.