Software Tools for Extraction of Video Frames for Use in Mosaicking

Making Use of Available Modern Technologies To Reduce the Need for Intensive Manual Labor

By Yuri Rzhanov
Associate Professor
University of New Hampshire
Durham, New Hampshire
Imagery mosaicking typically relies on one of two sources: a sequence of still images or a video stream. The difference between these sources is marginal; a sequence of images can be combined into a single file, or a stream can be separated into a set of frames. Each source type has its advantages, however. Still images usually have higher resolution, and video makes use of interframe compression, which allows one to save frames at a high rate and guarantee larger sequential overlap.

Reliable mosaicking requires at least 80 percent overlap between sequential frames for two reasons. First, the higher the overlap, the more robust the pairwise image registration will be. Large camera translations (less overlap) make 3D features in the scene appear differently in different frames and cause changes in shadowing, as light sources are often attached to the camera platform. Second, the final mosaic looks more consistent when there are no dramatic changes between consecutive images and each image contributes a relatively small patch to the mosaic. On the other hand, the use of all available imagery (with overlap larger than 80 percent) significantly increases processing time, wasting computational resources.

When still images are used, successive frames may have too small an overlap for automatic registration to succeed, or even no overlap at all.
A video stream with a frame rate higher than 24 frames per second guarantees that sequential overlap never drops below 95 to 98 percent, so it is always possible to choose a frame that has a specified overlap with the previously chosen one. The amount of collected data is an additional factor in favor of video: still images are typically saved on flash drives with limited storage capacity, while video streams are recorded on media that provide more storage space and are cheaper.

Frame selection is directly related to the total processing (mosaic construction) time, the robustness of registration and the quality of the final product. Frame selection can be done at the registration stage by an iterative trial-and-error procedure (bisection, for instance) if the container for the imagery allows reliable seeking (i.e., a movie file, an image sequence, etc.). If, however, the collected imagery is supplemented by time-tagged positional and attitude information for the camera, this selection can be done prior to the actual frame extraction (which then may be done directly from a magnetic tape playing in real time). The selection process produces a list of time codes uniquely identifying the frames essential for the construction of a high-quality mosaic.
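When such navigation data are available, the selection itself is straightforward. The following C++ sketch is purely illustrative (the article does not prescribe a particular algorithm, and all names and parameters here are assumptions): it walks a time-tagged navigation log and keeps a frame time whenever the estimated along-track displacement since the last pick exceeds what the chosen overlap allows.

// Hypothetical sketch: choose frame time codes from time-tagged navigation
// so that consecutive selected frames keep approximately a target overlap.
#include <cmath>
#include <vector>

struct NavFix {
    double t;        // time of the video frame, seconds
    double x, y;     // camera position in a local metric frame, meters
    double altitude; // camera height above the seafloor, meters
};

// Returns the times of frames to extract, chosen so that the displacement
// between picks does not exceed (1 - overlap) of the along-track footprint.
std::vector<double> SelectFrameTimes(const std::vector<NavFix>& nav,
                                     double fovDeg,   // along-track field of view
                                     double overlap)  // e.g. 0.8 for 80 percent
{
    const double kPi = 3.14159265358979323846;
    std::vector<double> picks;
    if (nav.empty()) return picks;

    picks.push_back(nav.front().t);
    double lastX = nav.front().x, lastY = nav.front().y;

    for (const NavFix& f : nav) {
        // Along-track footprint of one frame at this altitude.
        double footprint = 2.0 * f.altitude * std::tan(0.5 * fovDeg * kPi / 180.0);
        double maxStep   = (1.0 - overlap) * footprint;

        double dist = std::hypot(f.x - lastX, f.y - lastY);
        if (dist >= maxStep) {
            picks.push_back(f.t);
            lastX = f.x;
            lastY = f.y;
        }
    }
    return picks;
}

In practice the threshold could equally be driven by heading or attitude changes, but the displacement test captures the idea.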
Frame Extraction Software Tools

This article describes a set of software tools developed at the University of New Hampshire's Center for Coastal and Ocean Mapping (CCOM) to help with the frame-extraction process. These tools are dialog-based applications and/or command-line utilities running on Microsoft Windows™ and utilizing the Microsoft DirectShow™ framework. DirectShow is a software architecture for streaming media: each application has at least one graph consisting of interconnected filters, with upstream filters processing media samples and passing them to downstream filters for further processing. Many filters for various tasks have been developed by Microsoft and other software companies; some filters for specific tasks have been developed at CCOM.

All the graphs used in these applications have the same structure. The imagery is read from a source filter and converted to a standard output format, an uncompressed movie in audio video interleave (AVI) format. An intermediate filter passes some frames to the output and blocks the rest, depending on additional information provided to the application.
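Each of these applications therefore builds and runs a DirectShow filter graph along the following lines. This fragment is a generic sketch, not code from the CCOM tools: the three filter pointers stand for whatever source, frame-gating and AVI-writing filters a given application uses, COM is assumed to be initialized already, and error handling is kept to a minimum.

// Minimal sketch of the common graph pattern: a source filter, a
// frame-gating filter and an AVI writer, connected by the capture
// graph builder and then run.
#include <dshow.h>
#pragma comment(lib, "strmiids.lib")

HRESULT BuildAndRunGraph(IBaseFilter* pSource,  // capture device or file source
                         IBaseFilter* pGate,    // frame-selection filter
                         IBaseFilter* pWriter)  // AVI mux / file writer
{
    IGraphBuilder*         pGraph   = NULL;
    ICaptureGraphBuilder2* pBuilder = NULL;
    IMediaControl*         pControl = NULL;

    HRESULT hr = CoCreateInstance(CLSID_FilterGraph, NULL, CLSCTX_INPROC_SERVER,
                                  IID_IGraphBuilder, (void**)&pGraph);
    if (FAILED(hr)) return hr;
    hr = CoCreateInstance(CLSID_CaptureGraphBuilder2, NULL, CLSCTX_INPROC_SERVER,
                          IID_ICaptureGraphBuilder2, (void**)&pBuilder);
    if (FAILED(hr)) { pGraph->Release(); return hr; }

    pBuilder->SetFiltergraph(pGraph);
    pGraph->AddFilter(pSource, L"Source");
    pGraph->AddFilter(pGate,   L"Frame gate");
    pGraph->AddFilter(pWriter, L"AVI writer");

    // Let the builder connect the chain, inserting decoders and color-space
    // converters where the media types require them.
    hr = pBuilder->RenderStream(&PIN_CATEGORY_CAPTURE, &MEDIATYPE_Video,
                                pSource, pGate, pWriter);

    if (SUCCEEDED(hr) &&
        SUCCEEDED(pGraph->QueryInterface(IID_IMediaControl, (void**)&pControl))) {
        hr = pControl->Run();   // samples now flow source -> gate -> writer
        pControl->Release();
    }
    pBuilder->Release();
    pGraph->Release();
    return hr;
}

For a file source rather than a capture device, the pin category argument would normally be NULL; the graph builder still inserts the required intermediate filters automatically.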
DVPlayerAcquire. In this graphical user interface (GUI) dialog-based application, the source of the imagery is a digital video (DV)-compliant tape player (miniDV, Hi8, Digital8, DVCAM, etc.). As always with a mechanical tape device, such a player does not support seeking well: frequent seek operations damage the tape. The type of tape device can be selected from a drop-down list of the devices currently connected to the computer.

Frames can be selected via one of two mechanisms. The first assumes that the imagery is not accompanied by additional information, so the frame selection rule is defined simply by a decimation factor. This functionality is provided by the TripShot filter, which accepts a decimation factor N and passes through only every Nth frame. The second mechanism expects a user-provided text file containing the list of time codes that correspond to the required video frames. This is also handled by the TripShot filter: the application constantly queries the capture filter (the interface with the tape player) for the current time code, and when that time code immediately precedes one listed in the file, the application sets a semaphore on the TripShot filter so that the next media sample (containing the requested frame) is allowed to pass through. Other downstream filters perform the necessary color space conversion and write uncompressed video in AVI format onto a hard drive.
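The pass/block decision itself reduces to a few lines. The class below is a simplified stand-in for that logic, not the actual TripShot source: it supports decimation by a factor N and a one-shot semaphore that the host application raises when the current time code immediately precedes a requested one.

// Simplified stand-in for TripShot-style gating: pass every Nth frame, or
// pass a single frame when the application has raised a semaphore after
// matching the current time code against a user-supplied list.
#include <atomic>

class FrameGate {
public:
    explicit FrameGate(int decimation = 0)
        : decimation_(decimation), frameCount_(0), passNext_(false) {}

    // Called by the application when the current time code immediately
    // precedes one of the requested time codes.
    void RaiseSemaphore() { passNext_ = true; }

    // Called once per incoming media sample; returns true if the sample
    // should be delivered downstream, false if it should be dropped.
    bool ShouldPass() {
        if (passNext_.exchange(false))     // one-shot: consume the semaphore
            return true;
        if (decimation_ > 0)
            return (frameCount_++ % decimation_) == 0;   // every Nth frame
        return false;                      // time-code mode, no request pending
    }

private:
    int               decimation_;   // N; 0 disables decimation mode
    long long         frameCount_;
    std::atomic<bool> passNext_;     // set from the application thread
};

Inside a real DirectShow transform filter, a check of this kind would sit in the sample-processing routine, which drops a frame simply by not delivering the sample downstream.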
HDPlayerAcquire. This is a GUI dialog-based application for imagery extraction from tapes recorded in the high-definition (HD) standard (on a deck such as the Sony HVR-M15U). The frame size for this particular format is 1,440 by 1,080 pixels, and the tape contains MPEG-2-encoded multiplexed streams. Saving uncompressed frames to a hard drive in real time is not feasible: the limiting factors (the processing power of the computer and the recording speed of the hard drive) become critical when the frequency of frame extraction is too high. To guarantee robust operation, the process has to be split into two stages. In the first stage, the MPEG-2 streams are recorded on a hard drive without any processing. At this stage, it is still possible to query time codes, and the recorded stream is tied to recording time through the time code of the start frame. The second stage consists of offline processing and usually takes longer than the first stage, due to the necessity of decoding and handling large frames. The prerecorded file is read from the hard drive and demultiplexed, and the video stream is decoded. Knowledge of the time code of the first frame in the stream allows for time tagging of all the subsequent frames. The frame selection mechanisms described above for the DVPlayerAcquire application are also implemented here, and, as in DVPlayerAcquire, the imagery of the selected frames is converted to AVI format and saved.
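The time tagging in the second stage is simple time code arithmetic. A minimal sketch, assuming a constant, non-drop-frame rate (drop-frame counting, used with NTSC-derived rates, needs the standard correction and is omitted here):

// Hypothetical helper: derive the time code of the k-th decoded frame from
// the time code of the first frame, assuming a constant non-drop-frame rate.
struct TimeCode {
    int hours, minutes, seconds, frames;
};

TimeCode TagFrame(const TimeCode& start, long long frameIndex, int fps)
{
    long long total = ((start.hours * 60LL + start.minutes) * 60LL
                       + start.seconds) * fps + start.frames + frameIndex;

    TimeCode tc;
    tc.frames  = static_cast<int>(total % fps);  total /= fps;
    tc.seconds = static_cast<int>(total % 60);   total /= 60;
    tc.minutes = static_cast<int>(total % 60);   total /= 60;
    tc.hours   = static_cast<int>(total % 24);   // wrap at 24 hours
    return tc;
}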
VOBToAVI_TL. In some video acquisition setups, the imagery is recorded directly onto a DVD using native MPEG-2 compression. DVD video data, recorded in files with the extension VOB (video object), does not support time codes; therefore, to be able to relate individual frames to the time when they were taken, time codes are recorded on the audio channel using the longitudinal time code (LTC) standard established by the Society of Motion Picture and Television Engineers (SMPTE). This approach, for example, has been adopted by the team managing the deep-submergence remotely operated vehicle Jason II.

An SMPTE LTC frame consists of 80 one-bit cells. Decoding the audio signal requires a search for the “sync word” (the last 16 bit cells of each frame) and interpretation of the 64 bit cells that follow it. If the waveform of the audio channel of the DVD is displayed as a graph, the synchronization words can be clearly distinguished visually in the sequence of samples.
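Once the audio has been demodulated into a stream of bits (the biphase-mark demodulation step is assumed to have already been done), locating the sync word and unpacking the binary-coded decimal time fields can be sketched as follows. This illustrates the standard LTC bit layout; it is not the TimecodeListener implementation.

// Sketch of decoding one SMPTE LTC time code from demodulated bits.
// The 16-bit sync word is located first; the 64 bit cells that follow it
// are interpreted as the data bits of the next LTC frame (BCD digits
// interleaved with user-bit groups).
#include <cstdint>
#include <vector>

struct LtcTime { int hours, minutes, seconds, frames; };

// Standard LTC sync word, received in time order as 0011 1111 1111 1101.
static const uint16_t kSyncWord = 0x3FFD;

// Extract 'count' bits starting at 'pos' (least significant bit first).
static int Field(const std::vector<int>& bits, size_t pos, int count)
{
    int v = 0;
    for (int i = 0; i < count; ++i)
        v |= bits[pos + i] << i;
    return v;
}

// Returns true and fills 'tc' if a complete frame follows a sync word
// found at or after 'start'.
bool DecodeNextTimeCode(const std::vector<int>& bits, size_t start, LtcTime* tc)
{
    uint16_t window = 0;
    for (size_t i = start; i < bits.size(); ++i) {
        window = static_cast<uint16_t>((window << 1) | bits[i]);
        if (window != kSyncWord || i + 64 >= bits.size())
            continue;

        size_t d = i + 1;                    // first data bit of the next frame
        tc->frames  = Field(bits, d + 0, 4)  + 10 * Field(bits, d + 8, 2);
        tc->seconds = Field(bits, d + 16, 4) + 10 * Field(bits, d + 24, 3);
        tc->minutes = Field(bits, d + 32, 4) + 10 * Field(bits, d + 40, 3);
        tc->hours   = Field(bits, d + 48, 4) + 10 * Field(bits, d + 56, 2);
        return true;
    }
    return false;
}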
To handle these data, CCOM has developed a specialized DirectShow filter called TimecodeListener. The processing logic is best demonstrated visually as a DirectShow graph in one of the GUI applications developed specifically as a test bed for DirectShow filters. DirectShow graphs are easy to visualize in a window of the program GraphEditPlus, an enhanced version of the original DirectShow visualization tool GraphEdit. The top left dark-blue rectangle represents the source: the VOB file, containing video and audio data and possibly other information. These two interleaved streams are demultiplexed and then split into video (top) and audio (bottom) streams. The video stream is MPEG-2-decoded and then converted to RGB color space. The audio stream is decoded from MPEG-1 to pulse-code modulation and passed to the TimecodeListener filter. The latter receives audio data on its input, decodes the elements of the time structure and allows the VOBToAVI_TL application to access this structure. The application is responsible for setting a semaphore on the TripShot filter when the approaching media sample contains one of the frames selected for extraction. This frame is then passed to the AVI multiplexer and saved on the hard drive. A similar approach can be used to decode frequency shift keying (FSK) data, which may also be recorded on an audio channel.

VOBToAVI_OP. Another popular technique for time tagging the imagery is to show the time, date and position information as an overlay on the video display. Although this approach substantially reduces the area of the frame useful for postprocessing (assembling individual frames into a mosaic, shape extraction, target recognition, etc.), it is nevertheless favored by many underwater explorers due to its convenience for observers and event annotators. The specialized filter OverlayParser has been developed at CCOM for the extraction of textual information (including time codes) from image frames. In essence, it performs a lightweight optical character recognition (OCR) procedure. Converting images of letters and digits overlaid on an image into text that can be interpreted by a processing program is significantly easier than a generic OCR task, which often involves a variety of fonts, rotation by unknown angles and other complications. For initialization, the OverlayParser filter requires a binarized font for the digits and a set of pixel locations where these digits are expected to appear. These data are easy to prepare prior to frame extraction, either from the description of the DSL video overlay processor used or manually from several examples of extracted frames. Other data can be parsed from the overlay in the same manner.

The only factor complicating the optical recognition of the digits is related to the video compression of the frames. In the original raw frames, the overlaid characters are white (or another color selected by the operator), but MPEG-2 compression causes “color leakage,” so the pixels representing characters may have different colors, typically a mixture of the original color and that of the background. Sometimes, high compression rates make digits almost unreadable. The correct digits can be determined by template matching: a score is calculated for each of the 10 digits, with the lowest score indicating the winner.
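A bare-bones version of that matching step might look like the following sketch, with a simple sum of absolute differences standing in for whatever score the OverlayParser filter actually computes; the single-channel image layout and the identifiers are assumptions for the example.

// Sketch of the template-matching step: each known digit template
// (binarized in advance) is scored against the pixel block at a known
// overlay position; the digit with the lowest score wins.
#include <cstdint>
#include <cstdlib>
#include <vector>

struct DigitTemplate {
    int w, h;
    std::vector<uint8_t> pixels;   // binarized: 0 or 255
};

// 'frame' is an 8-bit single-channel image with row length 'stride';
// (x0, y0) is the top-left corner of the character cell taken from the
// overlay description.
int RecognizeDigit(const uint8_t* frame, int stride, int x0, int y0,
                   const DigitTemplate templates[10])
{
    long bestScore = -1;
    int  bestDigit = -1;
    for (int d = 0; d < 10; ++d) {
        const DigitTemplate& t = templates[d];
        long score = 0;
        for (int y = 0; y < t.h; ++y)
            for (int x = 0; x < t.w; ++x)
                score += std::abs(int(frame[(y0 + y) * stride + (x0 + x)])
                                  - int(t.pixels[y * t.w + x]));
        if (bestScore < 0 || score < bestScore) {
            bestScore = score;
            bestDigit = d;
        }
    }
    return bestDigit;   // 0..9
}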
In fact, the regularity of the time code would allow detecting only a change in digits once the time code has been successfully recognized. This approach, however, fails if the recording has been stopped or frames have been dropped due to bandwidth limitations, breaking time code continuity. Hence, independent determination of all time code digits provides a more robust solution.

All of the programs described in this article are available for download on the CCOM Web site.

Acknowledgments

The author would like to thank Peter Dartnell and Seth Ackerman of the U.S. Geological Survey and Stace Beaulieu of the Woods Hole Oceanographic Institution for lending various equipment and providing data.

References

For a full list of references, please contact Yuri Rzhanov at [email protected].
Yuri Rzhanov received a Ph.D. in semiconductor physics from the Russian Academy of Sciences in 1983. He is an associate professor at the University of New Hampshire's Center for Coastal and Ocean Mapping, and his research interests include optical methods of seafloor mapping and sea-bottom structure reconstruction from multiple views.
[Fig. 1] Example of a SMPTE LTC sync tone waveform.
[Fig. 2] Typical waveform of the audio channel from a DVD recorded for Jason II.
[Fig. 3] Example of a video frame with overlay.
[Fig. 4] DirectShow graph shown in the window of the GraphEditPlus application.
[Fig. 5] Example of overlaid text that is difficult to process automatically.