Enhancing Video Navigation with Existing Alternate Representations

Ken Yap, Bill Simpson-Young, Uma Srinivasan
Research Data Network CRC / CSIRO Australia, Locked Bag 17, North Ryde 2113, Australia
Email: Ken.Yap, Bill.Simpson-Young, [email protected]
Tel: +61 2 3253100 Fax: +61 2 3253101
Abstract. We describe FRANK, an application that we have developed for remote browsing of film archives. FRANK uses alternate representations of video in the form of text, image, or structured data that have a time-based mapping. We describe the motivation for the project, the work processes of the users, and the operation of the resulting prototype. Finally we suggest some techniques for deriving timecodes in the absence of time-stamped alternate representations.
1 Introduction

Alternate representations of video are material in the form of text, image or structured data that have a time-based mapping to specific video material. Alternate representations are often generated during the work process for a video production and, in the case of documentary videos (the material we are working with), include transcripts and shotlists.1 We describe how this material is integrated into FRANK (Film/TV Researchers Archival Navigation Kit) to enhance the quality of the navigation. Our application, FRANK[1], is a tool that allows filmmakers and researchers to browse a video archive remotely (intercity via a broadband ATM network). These users need to do more than just view a video using standard VCR-type controls. They need to be able to selectively search a video with the help of the alternate representations. In their current work practices they have transcripts and shotlists at hand. Therefore FRANK had to be more than a standard Video On Demand system. Our challenge was to design an architecture and implement a prototype that satisfied the users' requirements.
2 Work processes in film research

In the first part of the project, we did a user requirement study[2]. This involved interviewing a group of typical users, comprising film producers, film researchers, and librarians. Our focus was the research part of the film production process, where the users are evaluating potential material for inclusion in the production.

1 Shotlists are textual descriptions of video sequences.
In the initial stage of a search, a researcher uses catalogs, printed material, personal contacts and accumulated expertise to narrow the field of search. In the next phase, they scan candidate videos, looking for potential material, making shot notes as they go. They may also have at their disposal screenplays, transcripts and shotlists, which were created either in the screenwriting process, the production process (e.g. transcripts of interviews) or in the post-production process (e.g. transcripts for subtitling). In these days of office automation, these resources are available in machine-readable form.[3] Some contain timecodes which can be used for synchronisation with the video. Our architecture takes advantage of this existing material by assimilating it into our database and providing a navigation engine to work with the video server.
3 FRANK functionality and architecture

The FRANK software system allows a film or TV researcher to browse remote video (using start, stop, random access, etc) and to browse remote alternate representations such as transcripts and shotlists (using next page, full-text searching, etc). The researcher can use the alternate representation to locate material of interest (e.g. by performing a search for a desired word in a transcript) and, with the press of a button, request that the video be played from the corresponding position. They can specify that they want tight coupling of the video with the alternate representation so that as they browse and jump through, say, a transcript, the video will always keep synchronised (or vice versa). The architecture of the FRANK system is a client-server architecture and has a separation of the video server from the server providing navigation and searching of the alternate representations (see Figure 1). The FRANK client performs the synchronisation and display of the material provided by the two servers.
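The client-side coupling described above can be illustrated with a minimal sketch. This is not the actual FRANK implementation; the class and method names (NavigationEngine, VideoServer, seek_and_play, etc.) are illustrative assumptions standing in for the two servers and the client that synchronises them.

```python
class NavigationEngine:
    """Stand-in for the server holding alternate representations
    with time-based mappings (e.g. a timecoded transcript)."""

    def __init__(self, transcript):
        # transcript: list of (timecode_seconds, paragraph_text)
        self.transcript = transcript

    def search(self, word):
        """Return the timecode of the first paragraph containing `word`."""
        for timecode, text in self.transcript:
            if word.lower() in text.lower():
                return timecode
        return None


class VideoServer:
    """Stand-in for a video server supporting random access."""

    def __init__(self):
        self.position = 0.0

    def seek_and_play(self, timecode):
        self.position = timecode


class FrankClient:
    """The client couples the two servers: a search hit in the
    alternate representation cues the video to that position."""

    def __init__(self, nav, video):
        self.nav, self.video = nav, video

    def jump_to_word(self, word):
        timecode = self.nav.search(word)
        if timecode is not None:
            self.video.seek_and_play(timecode)
        return timecode
```

For example, searching a timecoded interview transcript for "war" would cue the video to the paragraph where that word first occurs, without the video server knowing anything about transcripts.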
4 Navigation

The key to effective use of alternate representations is to provide tight coupling of the navigation through the video and its representation such that the users feel that they are navigating through a common information space[4]. In FRANK, this is achieved not only by providing the coupled navigation through text/images and video, but also by making the video navigation buttons (e.g. next and prev) context-dependent such that the effect of a button depends on the alternate representation currently in effect. For example, when viewing a shotlist, pressing the next button would advance the video to the next shot specified in the shotlist, whereas when viewing a set of stills indicating positions of interest (e.g. the results of a search), the next button would advance to the next such position. In providing this support, FRANK is allowing navigation through the information space rather than navigation through the video as a sequence of frames as in the VCR or standard video control model.
[Figure 1 shows the FRANK architecture: a Web Server holding static navigation aids (web pages), other web pages, and resources for dynamic navigation aids (transcripts, shotlists, etc) together with the Navigation Engine, and a separate Continuous Media (CM) Server holding the video/audio. Both connect across the network to the client, which provides continuous media control, web-browsing control, CM/Web synchronisation/linking, the continuous media display and the web-browsing display.]
Fig. 1. FRANK architecture

The appearance of the resulting FRANK prototype is shown in Figure 2. The left half is the navigational text while the right half is the video viewer. For the practical use of alternate representations for video navigation in a network environment, it is advantageous to separate them from the video on the network as is done with the FRANK architecture. In doing this, the alternate representations can be used as navigation aids for videos available from existing video servers on a network (assuming the video servers support a protocol for random access) without the video server having any knowledge of the existence of external navigation aids. This allows alternate representations to be provided as a value-added service aiding navigation of existing video services. As well as being useful for material such as transcripts and shotlists, the FRANK architecture is suitable for any representation of video that has a time dimension and can be determined by pre-processing of the video. Thus the FRANK architecture is appropriate for content-based retrieval and navigation such as searching for footage within a video matching specified image characteristics (e.g. texture). In such cases, the alternate representation is the timecoded image metadata obtained by pre-processing the video.
5 Generation of Alternate Representations

The initial material on which FRANK has been used has timecode information already present in the alternate representations. For example, the material from Film Australia's Australian Biography series has timecodes approximately every paragraph. We have found that this is sufficient for use by film researchers, who want to find the part of a video where a topic is being discussed and neither need nor want the video to be cued exactly to a specified word. Transcripts are often available where the material is likely to be exported (and hence translated) or where the material is expected to be of important historical value. In Australia, a filmmaker can also get a greater tax advantage when they donate material to a national archive, and the estimated value is greater if a transcript exists. Whether or not timecodes are included in a transcript will depend on the policy of the body sponsoring the transcript creation.

In situations where there are no timecodes on transcripts, some pre-processing needs to be done in order to synchronise the available information with the corresponding video segment. In such cases the method used is necessarily context dependent, depending on factors such as the structure of the material (e.g. whether it is a narrative documentary, a sequence of newsclips, etc.) and the nature of the available alternate representations (e.g. a transcript of the audio component, a full shotlist of the video component, a short shotlist in a catalogue entry, etc.). We intend to use FRANK on material with a variety of different characteristics. The material that we are using, and some characteristics of each, includes:

1. Material from Film Australia's Australian Biography series. Each half-hour program is primarily a recorded interview with interspersed archival stills and footage. The audio of the interview has been created from a longer taped interview but is intended to appear as a continuous narrative. The audio usually continues over the stills and some archival footage, leading to a close relationship between the transcript and the audio component but usually not the video component. A separate shotlist exists that describes the visual component with descriptions of archival footage used. The transcripts have timecodes corresponding to approximately every paragraph of spoken text and the shotlists have a timecode entry for each shot.

2. Material from the Australian Broadcasting Corporation's "4 Corners" documentary series. The transcript includes a full textual version of the audio component and also a textual description of shots used in the program. The transcripts do not have timecode information.

This section describes the broad categories of options that we intend to investigate for the generation of alternate representations and/or timecode information.
5.1 Alternate Representation Synthesis

This category of techniques is used for the generation of an alternate representation where no use is made of existing textual alternate representations. The category includes techniques such as:

1. Automatic transcript generation using speech analysis techniques.
2. Video segmentation using video analysis techniques to find the first or a representative frame from each shot and/or scene.
3. Context-based sampling techniques to capture the timecodes of corresponding segments, for example: every 2000th frame for an action movie; the first frame of every shot for a news clip[5]; or camera shifts for documentary/interviews[6].
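The video segmentation technique in item 2 can be sketched with the standard histogram-comparison approach of Zhang and Smoliar[5]. This is an illustrative simplification: frames are represented by already-computed, normalised grey-level histograms, and the threshold value is an assumed tuning parameter, not one from FRANK.

```python
def shot_boundaries(histograms, threshold=0.4):
    """Return indices of frames that start a new shot.

    histograms: list of normalised frame histograms (each summing to 1.0).
    A boundary is declared when the L1 distance between consecutive
    frame histograms exceeds `threshold`.
    """
    boundaries = [0]  # the first frame always starts a shot
    for i in range(1, len(histograms)):
        distance = sum(abs(a - b)
                       for a, b in zip(histograms[i - 1], histograms[i]))
        if distance > threshold:
            boundaries.append(i)
    return boundaries
```

The indices returned, converted to timecodes via the frame rate, would give the entries of a synthesised shotlist; a representative still from each boundary frame could then serve as the navigational alternate representation.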
5.2 Content-based Timecode Generation

This category includes using techniques such as those above to enhance the usefulness of existing textual representations, by matching the textual representation with the audio or video content in order to insert timecode information into the textual representation. The techniques include those for:

1. Synchronisation of audio with transcript. The synchronisation process can utilise various cues. For example, an interview usually consists of two persons taking turns to speak. They will have distinctive audio signatures, and the transitions between their parts can be used to time-stamp the transcript.
2. Synchronisation of video with shotlist.
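The speaker-turn cue in item 1 can be sketched as follows. Assuming the audio analysis has already produced the times of detected speaker transitions (that detection step is outside this sketch), matching them in order against the turn boundaries of an untimed interview transcript yields a time-stamped transcript. The input formats are illustrative assumptions.

```python
def timestamp_turns(turns, transition_times):
    """Assign a start timecode to each transcript turn.

    turns: list of (speaker, text) pairs in spoken order.
    transition_times: seconds at which a speaker change was detected
    in the audio; there should be len(turns) - 1 of them.
    Returns a list of (start_time, speaker, text) triples.
    """
    # The first turn starts at 0; each detected transition starts
    # the next turn in the transcript.
    starts = [0.0] + list(transition_times)
    return [(start, speaker, text)
            for start, (speaker, text) in zip(starts, turns)]
```

In practice the detected transitions may be noisy or unequal in number to the transcript turns, in which case a more robust alignment (or the interactive correction of Section 5.4) would be needed.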
5.3 Interpolation on Parallel Media

This category includes techniques from both of the categories above, and additionally uses interpolation in one representation based on relative positions in another representation. For example, the transcript length in words can be used to estimate the approximate duration of a shot, and the transcript and video can be synchronised by interpolation.
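The interpolation idea can be made concrete with a small sketch: between two known timecode anchors, intermediate paragraphs are assigned estimated timecodes in proportion to their cumulative word counts, under the assumption of a roughly constant speaking rate. This is an illustration of the technique, not FRANK's code.

```python
def interpolate_timecodes(paragraphs, start_time, end_time):
    """Estimate a start timecode for each paragraph between two anchors.

    paragraphs: texts of the paragraphs spoken between the anchors.
    start_time, end_time: known timecodes (seconds) bracketing them.
    Returns one estimated start time per paragraph.
    """
    word_counts = [len(p.split()) for p in paragraphs]
    total = sum(word_counts) or 1  # guard against an empty span
    seconds_per_word = (end_time - start_time) / total
    times, elapsed = [], 0
    for count in word_counts:
        times.append(start_time + elapsed * seconds_per_word)
        elapsed += count
    return times
```

The same scheme works in the other direction: given shot durations from a shotlist, relative positions in the video can be mapped back onto an untimed transcript.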
5.4 Interactive Timecode Insertion

This category includes all the techniques discussed above and in addition adds user involvement and user feedback on the heuristics being used. For example, the base FRANK system will support manual insertion of timecodes into a transcript while a video is being viewed (the user clicks the Insert Timecode button when the footage corresponding to the text insertion cursor position is played). A great deal of time can be saved if, after a timecode is entered and the text cursor repositioned by the user, a guess is made for the next timecode needed (using interpolation techniques) and the video cued to the corresponding position. It might also be possible for the user, often an experienced researcher, to estimate the transcript length and corresponding shot duration.
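The guessing step described above can be sketched as a simple extrapolation: the speaking rate observed between the two most recently confirmed timecodes is projected forward to the new cursor position, and the video can then be cued there for the user to confirm or adjust. Function and parameter names are illustrative.

```python
def guess_next_timecode(last_time, prev_time, words_between, words_ahead):
    """Extrapolate the next timecode from the observed speaking rate.

    last_time, prev_time: the two most recently confirmed timecodes
    (seconds), with prev_time earlier than last_time.
    words_between: number of words spoken between those two timecodes.
    words_ahead: number of words between the last confirmed timecode
    and the new text-cursor position.
    """
    if words_between <= 0:
        return last_time  # no rate information; stay put
    seconds_per_word = (last_time - prev_time) / words_between
    return last_time + words_ahead * seconds_per_word
```

Each confirmed timecode refines the rate estimate, so the guesses should improve as the user works through the transcript.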
5.5 Use of These Techniques

We are accumulating a stock of heuristics and trying to infer a more general architecture for a browsing tool that will allow users to apply these techniques, singly or in combination, to find the material they need.
6 Status of Prototype

Initial demonstrations of the FRANK system to film and television researchers and librarians have met with an overwhelmingly positive response. Although the system was developed to experiment with applications of wide-area broadband networks, we have had a great deal of interest in its being used across local area networks for internal use within archives. We are now commencing a technology trial of FRANK in several archives.
7 Summary

We have shown how alternate representations are critical in our architecture to provide the quality of video archive navigation demanded by our users. We analysed the work habits of our user group and discovered that alternate representations are commonly used as aids in current work processes. We have developed an architecture, and a prototype based on that design, for remote video browsing. In our design, the navigation aids are stored separately from the video. This allows small service providers to offer a high-quality service with the help of a shared resource, namely the large media servers providing the video streams. This in turn reduces the entry costs for a media service provider. We have identified some approaches we will be investigating for generating timecodes in the absence of time-based alternate representations.
8 Acknowledgements

We acknowledge the support provided by the Cooperative Research Centre for Research Data Network established under the Australian Government's Cooperative Research Centres Program. We also thank Film Australia and the Australian Broadcasting Corporation for supplying material and film researching/archiving expertise to the project.
References

1. K. Yap and B. Simpson-Young. The BIOS Trials on the Telstra Experimental Broadband Network. In Australian Telecommunication Networks and Applications Conference 1995, Sydney, December 1995.
2. B. Simpson-Young and K. Yap. Work Processes of Film and Television Researchers. CSIRO Division of Information Technology Technical Report 95/11, March 1995. See http://www.syd.dit.csiro.au/projects/dimmis/bios/tr95_11.ps
3. J. Griffioen, R. Yavatkar, and R. Adams. An Object-oriented Model for Semantic Interpretation of Multimedia Data. In ACM Multimedia '95 Proceedings, CA, 1995.
4. H.D. Wactlar, T. Kanade, M.A. Smith, and S.M. Stevens. Intelligent Access to Digital Video: The Informedia Project. IEEE Computer 29(5), May 1996.
5. H.J. Zhang and S.W. Smoliar. Developing Power Tools for Video Indexing and Retrieval. In Proceedings IS&T/SPIE Symposium on Electronic Imaging Science and Technology: Storage and Retrieval for Image and Video Databases II, 140-149, San Jose, CA, 1994.
6. A. Nagasaka and Y. Tanaka. Automatic Video Indexing and Full-Video Search for Object Appearances. In Visual Database Systems II, A-7, 113-119. Elsevier Science Publishers B.V. (North-Holland), IFIP, 1992.
This article was processed using the LaTeX macro package with LLNCS style.
Fig. 2. Appearance of the FRANK interface