Interactive Similarity Search for Video Browsing and Retrieval
John Boreczky, Jonathan Foote, Andreas Girgensohn, Lynn Wilcox
FX Palo Alto Laboratory
3400 Hillview Avenue, Building 4
Palo Alto, CA 94304 USA
{johnb; foote; andreasg; wilcox}@pal.xerox.com
Abstract

We present an interactive system that allows a user to locate regions of video that are similar to a video query. Thus segments of video can be found by simply providing an example of the video of interest. The user selects a video segment for the query from either a static frame-based interface or a video player. A statistical model of the query is calculated on-the-fly, and is used to find similar regions of video. The similarity measure is based on a Gaussian model of reduced frame image transform coefficients. Similarity in a single video is displayed in the Metadata Media Player. The player can be used to navigate through the video by jumping between regions of similarity. Similarity can be rapidly calculated for multiple video files as well. These results are displayed in MBase, a Web-based video browser that allows similarity in multiple video files to be visualized simultaneously.
1. Introduction

Similarity search is well understood for images and text. In the case of text, one simply specifies a document or a passage of the document, and similar documents are returned based on word frequencies. For images, one selects an image and the system returns similar images (Flickner et al., 1997). The advantage of similarity search is that it does not require indexing of the data to be searched, and it is easy to specify the query by example.

To apply similarity search to video, one could simply represent video as a sequence of keyframes, and then apply the same techniques as are used for image similarity retrieval (Zhang et al., 1997). However, this method is not optimal, as it does not fully exploit the sequential and redundant nature of the media. For example, it does not provide a segmentation of the video according to similarity, and the query is limited to a single frame of the video.

We present a similarity search that uses a segment of video as the query, and provides visual feedback on similar segments in a single video or multiple videos. Using a video segment for the query makes the search more robust. Segment queries allow us to model the invariant features for all frames in the segment. Furthermore, we can model the variation within the segment, which is not possible with a single-frame query (Dimitrova and Abdel-Mottaleb, 1997).

The interactive similarity search presented in this paper can rapidly find video regions similar to those selected by a user. Because each video frame is represented as a small
number of pre-computed coefficients, similarity calculations can be extremely rapid, on the order of thousands of times faster than real-time. This enables the interactive players and browsers described below.

The results of the similarity search are displayed graphically. Similarity within a single video is displayed in the Metadata Media Player, a video player augmented with information from the search. It enables browsing and playback of similar regions of video. Similarity in multiple videos is displayed in the MBase Video Browser. This Web-based application displays the list of the videos along with visual information on the similarity of videos to the query (Girgensohn et al., 1999).

At FX Palo Alto Laboratory, we record most seminars, meetings, and presentations that are held in our media-enhanced conference room (Chiu et al., 1999). The room is equipped with a number of cameras and microphones, which are controlled by a member of the lab. The captured presentations contain video shots of the speaker, the audience, and the presentation material being shown on the projection screen. The recorded meetings and presentations are made available to the staff via the company Intranet.

Users are often interested in retrieving specific information from this video collection. However, finding the desired information in the collection of hour-long videos can be difficult. The interactive similarity search helps users solve this problem, without requiring manual indexing such as transcription. For example, suppose a user wants video clips that show a close-up of each seminar speaker. By providing an example clip of a close-up of one or two speakers, the system can find segments of video containing close-ups of all speakers in the collection of recorded seminars. Another example is finding the presentation material (e.g., PowerPoint slides) shown during a presentation. The user provides a segment of video showing a single PowerPoint slide, and the search returns similar segments. Since most presentations have a similar slide format, such a search yields all presentation slides.

In the next section, we discuss our statistical measure of video similarity. We continue by describing the Metadata Media Player, which shows the results of a similarity search within a single video, and the MBase Video Browser, which provides a visualization of the search results for multiple videos. We then present two user interfaces for selecting video query regions. We conclude by discussing related work and directions for future research.
2. Video Similarity

To compute video similarity, we first perform a Discrete Cosine Transform (DCT) on each frame in the video. Video frames are decimated and converted to grayscale prior to the DCT. The DCT is truncated to reduce the dimensionality. The reduced representation is highly compact and preserves the salient information of the original frames. A Gaussian model is computed for the query by estimating parameters from the reduced DCT representation of all the frames in the query segment. This provides a more robust representation for a query than a single frame. A similarity score is then computed on a frame-by-frame basis over the videos to be searched. The score is based on a Gaussian model for the query, and has values between zero and one that are suitable for display in the browser and player interfaces.

2.1 Discrete Cosine Transform
A 64 by 64 grayscale image of each video frame is transformed using the Discrete Cosine Transform (DCT) (Rosenfeld and Kak, 1982). The transform is taken over the entire image, rather than sub-blocks, so that the coefficients represent the image exactly. The dimensionality of the transformed data is then reduced by discarding the high-frequency coefficients. This results in a 30-dimensional feature vector of reduced DCT coefficients for each video frame. Similarity measures based on transform methods are better for many applications than the more common color-histogram approach (Zhang et al., 1997). In particular, the transform coefficients represent the major shapes and textures in the image, unlike histograms, which are nearly invariant to shape. For example, two images with the same object at the top left and the bottom right will have a very small histogram difference but will be distinctively different in the transform domain. Though the current similarity measure is based on the luminance only, it should be straightforward to extend this technique to use color, as discussed in the Future Work section.
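As an illustration of this feature extraction, the sketch below computes a reduced DCT signature for one frame. It is a minimal sketch rather than the paper's implementation: it assumes frames arrive as grayscale NumPy arrays, decimates by block averaging, and selects the 30 low-frequency coefficients in order of increasing spatial frequency, since the paper does not specify the exact decimation or coefficient-selection scheme.

```python
# Sketch (assumed details) of the per-frame feature extraction in Section 2.1.
import numpy as np
from scipy.fft import dctn


def decimate_to_64x64(gray_frame: np.ndarray) -> np.ndarray:
    """Reduce a grayscale frame (at least 64x64 pixels) to 64x64 by block averaging."""
    h, w = gray_frame.shape
    bh, bw = h // 64, w // 64
    cropped = gray_frame[: bh * 64, : bw * 64]
    return cropped.reshape(64, bh, 64, bw).mean(axis=(1, 3))


def low_frequency_order(n: int = 64):
    """(row, col) coefficient positions sorted by increasing spatial frequency."""
    return sorted(((u, v) for u in range(n) for v in range(n)),
                  key=lambda uv: (uv[0] + uv[1], uv[0]))


def frame_features(gray_frame: np.ndarray, n_coeffs: int = 30) -> np.ndarray:
    """Full-frame 2-D DCT, truncated to the n_coeffs lowest-frequency coefficients."""
    small = decimate_to_64x64(gray_frame.astype(np.float64))
    coeffs = dctn(small, norm="ortho")        # transform over the entire image
    keep = low_frequency_order()[:n_coeffs]   # discard high-frequency coefficients
    return np.array([coeffs[u, v] for u, v in keep])
```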
2.2 Similarity Computation

The similarity score for a query is computed for each frame in the video to be searched. For many applications, full video frame rate is not necessary, and frames can be decimated in time such that only a few frames per second need be considered. This reduces storage cost and computation time. In addition, the DCT features are pre-computed and stored with the video. This enables interactive and on-the-fly similarity measurement. Though future formats such as MPEG-7 might allow including such metadata with the video data, for our applications the DCT features are stored in separate files.

2.3 Gaussian Query Model

The query video segment is modeled as a multidimensional Gaussian distribution over the feature vectors of DCT coefficients. The model is trained on the ensemble of video frames comprising the query segment. The mean µ of the Gaussian distribution captures the average feature values of the example frames, while the covariance matrix Σ models feature variations due to motion or lighting differences. We assume a diagonal covariance matrix, i.e., the off-diagonal elements are zero, so the model will be robust in higher dimensions. These parameters can be computed extremely rapidly in one pass over the query video.

The similarity of an arbitrary video frame to the query is based on the likelihood that the Gaussian model produced the frame. The Z-score for the video frame with DCT feature vector $X$ is computed as
$$Z = \left[ (X - \mu)^{T} \Sigma^{-1} (X - \mu) \right]^{1/2}.$$
Note that the square of the Z-score is just the Mahalanobis distance. For this application, we require that the similarity score be between zero and one. Thus we define the similarity score to be the percentile rank of |Z|, that is, the probability that the absolute value of a Gaussian random variable exceeds the computed Z-score. The probabilities are stored in a simple look-up table. In effect, the score measures how closely the video frame fits the Gaussian model for the query. The score ranges between zero and one. A score of one is the maximum
similarity, and is achieved when the feature vector for the frame is the mean vector for the video query. The score decreases as the feature vector gets farther from the mean.
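The following sketch shows how the diagonal-covariance query model and the frame-by-frame similarity score described above could be computed. It assumes per-frame feature vectors like those from the earlier DCT sketch; the small variance floor and the use of SciPy's normal survival function in place of the paper's look-up table are our own illustrative choices.

```python
# Sketch of the diagonal Gaussian query model (Section 2.3) and the
# frame-by-frame similarity score; the variance floor and the use of
# scipy's survival function instead of a look-up table are assumptions.
import numpy as np
from scipy.stats import norm


def fit_query_model(query_feats: np.ndarray):
    """Per-dimension mean and variance of the query frames (diagonal covariance)."""
    mu = query_feats.mean(axis=0)
    var = query_feats.var(axis=0) + 1e-8   # small floor keeps the covariance invertible
    return mu, var


def similarity_scores(video_feats: np.ndarray, mu: np.ndarray, var: np.ndarray) -> np.ndarray:
    """Similarity of each frame to the query model, between 0 and 1 (1 at the mean)."""
    z = np.sqrt((((video_feats - mu) ** 2) / var).sum(axis=1))  # Z-score per frame
    return 2.0 * norm.sf(z)   # probability that |standard normal| exceeds the Z-score


# Example: scores = similarity_scores(all_frame_feats, *fit_query_model(query_frame_feats))
```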
We have previously used a Gaussian model to classify video into pre-defined video classes such as close-ups of people and presentation material. The technique was tested on our corpus of meeting, presentation, and seminar videos and yielded 90% correct recognition (Girgensohn and Foote, 1999). Thus we expect similar performance with the similarity search. In this work we use a single Gaussian model, since the parameters can be computed in a single pass over the data. However, Gaussian mixture models may be preferable for some applications because they can better model variability in frames of the video query. In this case, an iterative algorithm known as the E-M algorithm (Rabiner and Juang, 1993) is required to estimate the multiple parameters and mixture weights for the model. This may require more computational power in order for the query to be truly interactive. Further, the Z-score technique for obtaining a normalized similarity score would have to be modified.
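For completeness, a hypothetical mixture-model variant is sketched below using scikit-learn's EM-based GaussianMixture; as noted above, the Z-score normalization would have to be rethought, so this sketch simply returns per-frame log-likelihoods rather than scores between zero and one.

```python
# Illustrative only: an EM-trained Gaussian mixture as the query model,
# using scikit-learn; per-frame log-likelihood replaces the normalized score.
import numpy as np
from sklearn.mixture import GaussianMixture


def fit_mixture_query_model(query_feats: np.ndarray, n_components: int = 3) -> GaussianMixture:
    """Fit a diagonal-covariance Gaussian mixture to the query frames via EM."""
    return GaussianMixture(n_components=n_components, covariance_type="diag").fit(query_feats)


def mixture_frame_scores(video_feats: np.ndarray, gmm: GaussianMixture) -> np.ndarray:
    """Log-likelihood of each frame under the mixture; higher means more similar."""
    return gmm.score_samples(video_feats)
```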
3. Metadata Media Player

The similarity scores in a single video can be visualized using the Metadata Media Player. Figure 1 shows the user interface to the Player. Just below the video window are the usual controls for play, pause, stop, fast forward, and fast reverse. Similarity scores are displayed as grayscale bars in a time bar below the video. The slider above the time bar allows the user to position the video at any point in time, using the similarity scores or the time indicators as a guide. The similarity scores are mapped to the intensity of the gray bars, so that the most similar regions are the darkest. Figure 1 shows similarity to a query for the presentation material (PowerPoint slides) in a seminar, as in the displayed frame. The video recording of the seminar contains shots of the speaker alternating with shots of the presentation material and the audience. The location and extent of the regions in the video similar to the query for the presentation material are immediately apparent as black bars in the time line.

The threshold slider at the middle right controls how index points are derived from the similarity scores. Index points are shown as brighter bars in the upper region of dark (similar) regions in the time bar. (This is primarily for the grayscale reproduction of this paper; we find index points are best displayed using contrasting colors.) The slider controls a threshold on similarity such that index points are placed where the similarity exceeds the threshold. The next and previous buttons beneath the time bar automatically advance the playback point to the next or previous index point. In an area of large similarity variation (many index points), the user can select the most significant index points by increasing the threshold. In regions of lesser similarity, the user can still find index points by reducing the threshold, though they may be less reliable.
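The paper does not spell out the exact rule for placing index points, but one plausible reading, sketched below, is to take the first frame of each run of similarity scores that rises above the threshold slider's value, with the next/previous buttons jumping within that list. Function names are hypothetical.

```python
# One plausible (assumed) rule for deriving index points from the scores:
# the first frame of each run whose similarity exceeds the threshold.
from typing import Optional
import numpy as np


def index_points(scores: np.ndarray, threshold: float) -> np.ndarray:
    """Frame indices where the similarity score first rises above the threshold."""
    above = scores > threshold
    starts = above & ~np.concatenate(([False], above[:-1]))  # start of each above-threshold run
    return np.flatnonzero(starts)


def next_index_point(points: np.ndarray, current_frame: int) -> Optional[int]:
    """Jump target for the 'next index point' button; None if no later point exists."""
    later = points[points > current_frame]
    return int(later[0]) if later.size else None
```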
4. MBase Video Browser
While the Metadata Media Player allows the user to visualize regions of similarity in a single video, it does not provide support for multiple videos. The MBase Video Browser displays a summary of similarity in multiple videos. Figure 2 shows a listing of videos along with their similarity displays. The listing shows metadata for each video, such as title, keywords, and date. Clicking on a video opens the Metadata Media Player to view the video.

4.1 Similarity Display

As in the Metadata Media Player, the MBase Browser shows a time bar for each video, with gray bars to indicate similarity. The degree of similarity is represented graphically in the time scale such that the darker the shade of gray, the greater the similarity to the query. Figure 2 shows the results of a similarity search for close-up shots of speakers. The query contained a segment of video from a single seminar showing a speaker close-up. The results of the query are displayed graphically in the MBase Browser. This display helps the user to select the appropriate video. For example, if the user seeks a long video segment showing a close-up of a speaker, they can restrict browsing to those regions of video corresponding to longer similarity bars. Clicking in a similar region in the time bar launches the Metadata Media Player, where the same similarity visualization is available for navigation.

4.2 Keyframes

We enhance the video listing with representative frames from the most similar regions. These keyframes can aid video selection and make the listing more visually appealing. The positions of the keyframes are marked by blue triangles along a mouse-sensitive time scale adjacent to the keyframe (see Figure 2). As the mouse moves over the time scale, the keyframe for the corresponding time is shown and the triangle for that keyframe turns red. This method shows only a single keyframe at a time, preserving screen space while making other frames accessible through simple mouse motion. This interface supports very quick skimming and provides a good impression of the similar content of the video.
5. Query Interface

We have created two different interfaces for selecting video to be used for a query. Figure 3 shows a Web-based application for selecting video regions that allows visualizing the selected regions. This interface also supports selection of non-contiguous regions. In this application, the video is represented as a sequence of keyframes taken at regular intervals and shown with their time in the video. We find a 5-second interval is appropriate for recorded seminars, though other rates may be better for other applications. The user selects multiple keyframes by clicking on the checkbox under each. The query model is trained on all frames of the video between adjacently selected keyframes. This interface allows the user to find regions of interest at a glance because of the compact display. In a normal-sized Web browser, 120 images corresponding to 10 minutes of video can be shown while the rest of the video is easily accessible via scrolling.

While the Web-based interface provides a very good overview, it is not as well suited for quick similarity searches during video playback. Therefore, we have an optional display in the Metadata Media Player that shows the same periodically sampled still images in a
horizontally scrollable window (see bottom of Figure 4). During playback, the window scrolls automatically to stay synchronized with the playback window. Temporal context is shown by placing the still image closest to the current frame in the center of the scrollable window. When the video is stopped, the still images can be used for navigation. Scrolling to an interesting area and double-clicking on a still image positions the video at the corresponding time. Intervals for a similarity search can be selected by dragging the mouse over the still images. Selected areas are indicated by a bar both in the scrollable window and at the bottom of the time bar. Because only a small portion of the video is shown at a time in the scrollable window, the selected area is shown much larger. In Figure 4, the selected area displayed in the scrollable window corresponds to the very small area directly below the thumb of the slider. This difference in scale demonstrates that an interface in which one would simply click-and-drag over the time bar to select a region of video would not provide sufficient control.

Selecting a query in the Metadata Media Player is good for quick searches within a video. The query is selected and the results are shown in the same application. As mentioned above, query selection in the player may be too coarse. For this reason, we have combined the two query interfaces, so that when a query is selected in the player, it is also displayed in the Web-based interface. This provides the capability for expanding the query in a larger context. Similarly, a query selected in the Web-based interface can be shown in the player.
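A small sketch of how the checked keyframes could be mapped to query time ranges, under the reading given above (keyframes every 5 seconds, with runs of adjacently selected keyframes merged into one contiguous query segment). The function name and representation are hypothetical.

```python
# Assumed mapping from checked keyframes to query time ranges: runs of
# adjacently selected keyframes (5 s apart by default) are merged into one segment.
def query_time_ranges(selected_keyframes, interval=5.0):
    """selected_keyframes: indices the user checked. Returns (start, end) pairs in seconds."""
    ranges = []
    for k in sorted(selected_keyframes):
        start, end = k * interval, (k + 1) * interval
        if ranges and start <= ranges[-1][1]:
            ranges[-1] = (ranges[-1][0], end)   # extend the current contiguous run
        else:
            ranges.append((start, end))         # begin a new, non-contiguous region
    return ranges


# Example: query_time_ranges([2, 3, 4, 10]) -> [(10.0, 25.0), (50.0, 55.0)]
```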
6. Use Experience

We have been using the MBase Video Browser and Metadata Media Player at FX Palo Alto Laboratory to access our collection of meeting, seminar, and presentation videos. One of the primary uses of the system has been to locate the presentation material in a particular video. A query is formulated by selecting a segment of video showing a view of the presentation material. The results are then used to skim the video by looking only at the regions that are shown to be similar to the presentation slides.

Another use of the system is in identifying speakers at a meeting. The camera person tends to take a close-up shot of any person who speaks for a significant length of time at a meeting. Thus, by forming a query showing a close-up of a person in a meeting, the system will return all video segments containing close-ups of people, and these are typically the speakers.

A different use of the system has been in creating custom indexes to the seminar database. While a general indexing scheme for video is difficult, indexes for a specific collection of videos such as seminars can be specified more easily. In our case, the indexes are close-ups of the speaker, long shots of the audience, and the presentation material. The results of the query for each of these classes are stored for the entire database. This provides a pre-indexed collection of video for browsing and playing.
7. Related Work
Several image similarity measures have been based on wavelet transforms, for example the Haar basis of (Jacobs et al., 1995). The high-order coefficients are quantized and truncated to reduce the dimensionality. The similarity measure is a count of bitwise similarity. Kobla et al. (1997) compute similarity using the DCT and motion vector information in MPEG-encoded video. A global measure of similarity using histograms of 3-D curvature and orientation is used in (Ravela and Manmatha, 1998).

Many systems use video clips as queries. Some systems represent the video as a single keyframe for both query and retrieval (Flickner et al., 1997). Zhang et al. (1997) characterize video segments by average color and temporal variation of color histograms. A similar approach was used by Günsel et al. (1997). After automatically finding shots, they are compared using a color histogram similarity measure. Swain (1993) presents a video indexing system using color histogram matching and image correlation. Dimitrova et al. (1998) regard the average distance of corresponding frames between two videos as the similarity measure, and take temporal order of frames into account. Mohan (1998) has developed a system for matching video sequences using the temporal correlation of extremely reduced frame image representations. While this can find repeated instances of video shots, for example “instant replays” of sporting events, it is not clear how well it generalizes to unrepeated video.
8. Future Work

We are working on enhancing the current system in a number of ways. Specifically, color is a useful cue for many kinds of video images; for example, in our videos of seminars, computer presentations can often be distinguished by the slide background color alone. It would be straightforward to enhance this system to account for color in the similarity measure, by calculating one or more additional signatures based on the color information. This could be done by computing an additional signature for the chromatic components of the image (the UV components in the YUV color space) to add to the existing luminance (Y) signature. Because the chromatic components need less spatial resolution, they could be represented with fewer coefficients. Alternatively, each RGB color component could be treated as a separate image. Thus three signatures would be calculated and compared for each image. This would allow weighting by overall color in the retrieval.

Yet another way to include color information is to combine this retrieval technique with another based on, for example, color histograms, as in (Mohan, 1998). By breaking the image into regions and computing color histograms on each region, some of the spatial information in the image can be preserved. In the first step, images would be found by luminance signature similarity. The top-ranking images could then be re-scored using color-histogram similarity or a similar approach.
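A minimal sketch of the chroma-signature idea from the first paragraph of this section: convert an already decimated 64 by 64 RGB frame to YUV (BT.601 weights) and concatenate a 30-coefficient luminance signature with shorter U and V signatures. The coefficient counts and helper names are illustrative assumptions, not part of the existing system.

```python
# Sketch of the chroma extension: separate, shorter DCT signatures for the
# U and V channels, concatenated with the luminance signature. Coefficient
# counts and color-conversion details are illustrative assumptions.
import numpy as np
from scipy.fft import dctn


def dct_signature(channel: np.ndarray, n_coeffs: int) -> np.ndarray:
    """Lowest-frequency DCT coefficients of one 2-D image channel."""
    coeffs = dctn(channel.astype(np.float64), norm="ortho")
    order = sorted(((u, v) for u in range(channel.shape[0]) for v in range(channel.shape[1])),
                   key=lambda uv: (uv[0] + uv[1], uv[0]))
    return np.array([coeffs[u, v] for u, v in order[:n_coeffs]])


def color_features(rgb_64x64: np.ndarray) -> np.ndarray:
    """Concatenate a 30-coefficient Y signature with 10-coefficient U and V signatures."""
    r, g, b = rgb_64x64[..., 0], rgb_64x64[..., 1], rgb_64x64[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b   # ITU-R BT.601 luma
    u = 0.492 * (b - y)                     # blue-difference chroma
    v = 0.877 * (r - y)                     # red-difference chroma
    return np.concatenate([dct_signature(y, 30), dct_signature(u, 10), dct_signature(v, 10)])
```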
9. Conclusions

We have presented an interactive system for finding regions of video similar to a query. The video query is easily specified using a static frame-based interface or a video player. A reduced DCT representation of the video allows rapid similarity computation. The similarity measure is based on a Gaussian model and has a value between zero and one that is easily interpreted. The results of the similarity search are displayed in MBase, a Web-based video browser that shows regions of similarity in multiple video files simultaneously. For a single
video, similarity results are shown in the Metadata Media Player, which provides the capability to navigate through similar regions of the video.
Figure 1: Metadata Media Player showing results of similarity search for presentation material.
Figure 2: MBase Video Browser showing results of a query for a close-up of a seminar speaker.
Figure 3: Web-based query interface showing video as a sequence of static images.
Figure 4: Query interface in Metadata Media Player showing static images as video plays.
10. References

Chiu, P., Kapuskar, A., Reitmeier, S., and Wilcox, L. (1999). Meeting Capture in a Media Enriched Conference Room. In Proceedings of the Second International Workshop on Cooperative Buildings (CoBuild’99), Lecture Notes in Computer Science, Vol. 1670 (pp. 79–88). Springer-Verlag.

Dimitrova, N. and Abdel-Mottaleb, M. (1997). Content-based Video Retrieval by Example Video Clip. SPIE Vol. 3022 (pp. 59–71).

Flickner, M. et al. (1997). Query by Image and Video Content: the QBIC System. In Maybury, M. (Ed.), Intelligent Multimedia Information Retrieval (pp. 7–22). AAAI Press/MIT Press.

Girgensohn, A., Boreczky, J., Wilcox, L., and Foote, J. (1999). Facilitating Video Access by Visualizing Automatic Analysis. In Human-Computer Interaction INTERACT ‘99 (pp. 205–212). IOS Press.

Girgensohn, A. and Foote, J. (1999). Video Classification Using Transform Coefficients. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing ‘99, Vol. 6 (pp. 3045–3048). Phoenix, AZ, IEEE.

Günsel, B., Fu, Y., and Tekalp, A.M. (1997). Hierarchical Temporal Video Segmentation and Content Characterization. Multimedia Storage and Archiving Systems II, SPIE, Vol. 3229 (pp. 46–55). Dallas, TX.

Jacobs, C., Finkelstein, A., and Salesin, D. (1995). Fast Multiresolution Image Querying. In Proceedings SIGGRAPH ’95, Los Angeles.

Kobla, V., Doermann, D.S., Lin, K-I., and Faloutsos, C. (1997). Compressed Domain Video Indexing Techniques Using DCT and Motion Vector Information in MPEG Video. In Proceedings SPIE Conference on Storage and Retrieval for Image and Video Databases, Vol. 3022 (pp. 200–211).

Mohan, R. (1998). Video Sequence Matching. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing ’98, Seattle, WA, IEEE.

Rabiner, L. R. and Juang, B. (1993). Fundamentals of Speech Recognition. Prentice Hall.

Ravela, S. and Manmatha, R. (1998). On Computing Global Similarity in Images. In Proceedings IEEE Workshop on Applications of Computer Vision (WACV ’98) (pp. 82–87).

Rosenfeld, A. and Kak, A. (1982). Digital Picture Processing. Academic Press.

Swain, M. (1993). Interactive Indexing into Image Databases. In Proceedings SPIE Storage and Retrieval for Image and Video Databases, Vol. 1908 (pp. 95–103).

Zhang, H-J., Low, C-Y., Smoliar, S., and Wu, J-H. (1997). Video Parsing, Retrieval, and Browsing: an Integrated and Content-Based Solution. In Maybury, M. (Ed.), Intelligent Multimedia Information Retrieval (pp. 139–158). AAAI Press/MIT Press.