Multimedia Processing for Enhanced Information Delivery on Mobile Devices

Lee Begeja, Bernard Renger
AT&T Labs-Research, 180 Park Ave, Florham Park, NJ 07932
{lee, renger}@research.att.com

David Gibbon, Zhu Liu, Behzad Shahraray
AT&T Labs-Research, 200 Laurel Ave S, Middletown, NJ 07748
{dcg, zliu, behzad}@research.att.com

Abstract – Handheld mobile devices have created new possibilities for accessing information. The limited power, storage capacity, communications bandwidth, and user interface capabilities of these devices, however, present challenges for the effective presentation of multimedia data. In this paper, we discuss how automated multimedia processing techniques can be used to address some of these challenges. Media conversion is used to generate presentations that fit the device and/or user capabilities. Content-based video sampling is used to generate a compact presentation of visual information using a small set of still images. Multimodal content processing is employed to extract relevant video clips based on a user profile of interests. The combination of these techniques creates personalized information delivery systems that reduce the storage, bandwidth, and processing power requirements, and simplify user interaction. A prototype of such a system for personalized delivery of video information on mobile devices is presented.
I. INTRODUCTION

The inherent small size and low weight of mobile information appliances impose constraints on their hardware and user interface (UI) capabilities. These constraints create challenges for the effective access and presentation of information on such devices. While improvements in device technology yield continuous increases in processing power, communications bandwidth, and storage capacity, such improvements are usually accompanied by increased power consumption, which necessitates larger or better batteries if run time is not to be compromised. Moreover, differences in media replay support between small mobile devices and larger information appliances can make it difficult, or even impossible, to deliver certain content on mobile devices. For example, some mobile devices (e.g., mobile phones) can display still frames but lack the ability to play motion video. In some cases, the conditions under which the device is being used rule out a particular modality: while driving, a user can receive audio information but cannot look at a display. Such limitations may also result from hearing or vision impairments on the part of the user.

Media conversion (e.g., from text to speech, or from motion video to still images) is an effective way to address these issues. By processing a single medium or multiple media, content can be "repurposed" for consumption in a different form. Such multimedia processing can also reduce the storage, processing, and communications requirements by creating a "simplified" or "condensed" version of the content with little or no loss of information. Even if a device is capable of receiving, storing, and rendering a more demanding presentation of the content, a simplified version may be advantageous, since it eases the storage and power limitations and reduces the communications demands and cost.

Another issue stems from the explosion of information, which complicates the task of identifying information of interest among large amounts of irrelevant material. The current solution to this problem is to search large repositories of online information from a PC using text-based or multimedia-based search engines. This usually involves a considerable amount of manual browsing by the user to discard irrelevant information, and calls for extensive interaction and data transfers. While a similar approach may be taken on some mobile devices, the limitations imposed by the UI and by data access make it an unpleasant and undesirable experience. An alternative approach is to automate the filtering process, thereby minimizing the effort required of the user. This approach uses a profile of interests, specified by the user, to identify interesting information. The interest profile is relatively static and need not be entered on the mobile device that will ultimately render the information; the user can therefore take advantage of other, more convenient devices to enter it. Multimedia processing algorithms are then used to identify content of interest and to isolate relevant segments within that content. This eliminates the need to store or retrieve the entire content, or to spend additional effort finding the relevant segments within it. Such a system was originally proposed in [1] for creating personalized broadband video delivery services. However, the main point of the approach, which was to minimize the level of interaction between the user and the system, is even more relevant when the information appliance is a mobile device.

This paper is concerned with how multimedia processing techniques can be used to repurpose, condense, and
personalize information for delivery to handheld mobile devices. It focuses on the delivery of information that is originally produced in video form; however, some of the content personalization techniques are applicable to radio content as well. The remainder of the paper is organized as follows. Section II presents background on video-based information delivery systems. Section III is concerned with media conversion. Section IV discusses the use of content-based video sampling techniques. Content repurposing is described in section V. Section VI is concerned with the application of multimedia processing to personalized content delivery. Concluding remarks are given in section VII.
II. BACKGROUND
Vast amounts of high-quality rich media content are produced daily in the form of broadcast and cable television programming. There is little variation in the basic capabilities of the endpoint devices for which this content is intended; those capabilities are a function of the data format, which is governed by well-established international standards (e.g., NTSC, PAL, and SECAM). Today's desktop PCs can easily display this content once it has been encoded with widely available codecs such as MPEG-4, Windows Media, Real, or QuickTime. Transcoding allows emerging mobile devices with sufficient capability to render it as well, although typically at somewhat reduced frame rate and spatial resolution. Table 1 summarizes typical media compression parameters for desktop and handheld mobile clients.
TABLE 1
MEDIA PARAMETERS

Item            | Video Parameters              | Audio Parameters      | Storage per Hour
Desktop Client  | 300Kb/s, 320x240 pixels, 30Hz | 64Kb/s, 16KHz, stereo | 163MB
Handheld Client | 128Kb/s, 224x168 pixels, 15Hz | 16Kb/s, 16KHz, mono   | 64.8MB
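The storage column follows directly from the bit rates. As a quick check, here is a minimal sketch of the arithmetic (decimal megabytes assumed):

```python
def storage_mb_per_hour(video_kbps: float, audio_kbps: float) -> float:
    """Storage for one hour of media at the combined bit rate.
    Uses decimal units (1 MB = 8000 Kb), matching Table 1."""
    return (video_kbps + audio_kbps) * 3600 / 8 / 1000

print(storage_mb_per_hour(300, 64))   # desktop client:  163.8 -> ~163 MB
print(storage_mb_per_hour(128, 16))   # handheld client: 64.8 MB
```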
Currently deployed wireless networks do not have sufficient bandwidth to deliver good-quality video. In the near term, "intermittent broadband" connectivity should be exploited where possible. This refers to mixed-mode 802.11x-capable phones as well as cradled scenarios where a temporary PPP USB connection is available and local storage on the device is utilized. We will use the following terminology to describe the networking capabilities of mobile devices:
1. Intermittent broadband: All content is stored and accessed locally on the device. Content is downloaded when the device is connected to a broadband network or to a PC that has network connectivity. For example, the Smartphone™ synchronizes easily with a desktop PC via USB and supports MMC memory cards. This case also includes downloading content to a memory card and inserting it into the device.
2. Narrowband: The bandwidth available today from 1G and 2G systems. We will assume a maximum bit rate of 14Kb/s.
3. Broadband: Bit rates greater than 14Kb/s that are available, or are becoming available, through deployment of 2.5G, 3G, 4G, or IEEE 802.11x hot spots.

To create a complete service, the critical issues of authentication, authorization, transcoding, adaptation, and deployment should be addressed by a mobile services platform [2][3]. Further, it is clear that advances in wireless video encoding and error resilience will benefit these services [4]. Digital rights management and usage monitoring are also required, and the service concepts presented here build on these existing components. Note, however, that typical usage logging can be greatly enriched when combined with the user interest profiles that are associated with content personalization.

III. MEDIA CONVERSION

Media conversion is the process of automatically generating content in a target medium given source material in a different medium or media. Transcription of television programs is an example of off-line, manual media conversion. Very large vocabulary continuous speech recognition (ASR) has the advantage of performing the conversion automatically and in real time, but at the cost of some loss of accuracy. Using closed captioning to represent the audio component of a television program is a third example that is real-time, but manual. In each case, a dramatic bandwidth reduction is obtained, far beyond that available through standard transcoding techniques. Importantly, media conversion allows content to reach a broader range of devices and usage scenarios. Text-to-speech (TTS) media conversion addresses usage scenarios in which it is not possible or desirable to read text. In this case, the bandwidth of the medium is increased, but the conversion can take place at or near the endpoint to keep communications costs low. ASR and TTS are examples of media conversion from one modality to another, but conversion can also take place within the same modality. In video production, a storyboard is used to represent the final video content. A storyboard is low cost, easily editable, and concisely captures the gist of the program. While there is no automatic way to convert a storyboard into full-motion video (although it is possible to render scene transitions such as dissolves), we can do the converse: create a set of still frames to represent video. This is discussed further in the next section. Again, these media conversions are associated with dramatic changes in bandwidth or storage requirements.
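To make the scale of these reductions concrete, caption text in our data runs at roughly 60-85 bits per second (see Table 2), while the compressed audio it can stand in for requires 16-64Kb/s (Table 1). A back-of-the-envelope sketch:

```python
CAPTION_BPS = 70  # representative caption rate from Table 2 (bits/s)

# Compressed audio rates from Table 1, in Kb/s
for client, audio_kbps in {"handheld": 16, "desktop": 64}.items():
    factor = audio_kbps * 1000 / CAPTION_BPS
    print(f"{client}: captions need ~{factor:.0f}x less bandwidth than audio")
# handheld: ~229x, desktop: ~914x, far beyond typical transcoding gains
```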
IV. CONTENT-BASED SAMPLING
The visual information in multimedia data is conveyed either by still images or in video form. While high-quality motion video is usually more pleasing and may carry more information, in some cases video replay is either not an option or is too costly in terms of storage and/or communications requirements. Moreover, when faced with limitations on these resources, pushing the size and/or quality of the video down to fit the available resources may produce marginal or unacceptable results. In such cases, it is necessary or advantageous to derive a different visual presentation of the video information. This can be done by selecting a small subset of video frames to convey the visual information. The individual images must be selected so that they eliminate the high level of redundancy among video frames without losing "significant" information. The criteria for determining "significant" information are somewhat subjective. However, our experience indicates that, for professionally produced video programs, a combination of the scene changes that result from video editing and those resulting from camera operations does a satisfactory job of dividing the video into segments with similar visual content. Once this segmentation is obtained, a single frame is selected from each segment to represent its visual content. The criterion used for selecting the representative frame is a function of the application; one can simply use the first stable frame of the segment, or one from its middle. We use the algorithm discussed in [5] to perform the content-based sampling. This algorithm detects abrupt and gradual transitions in the video sequence, and combines this with information extracted by motion analysis to detect camera-motion-induced changes in the visual content of the scene. The set of representative frames retained by the algorithm forms a compact representation of the video program. In some cases, it is not necessary to retain a representative frame for every video segment. Image and multimodal processing can be used to filter the frame set to reduce storage requirements even further. For example, anchorperson detection can identify all images containing the anchorperson in a television news program, and these can be represented by a single image. Image similarity metrics are valuable for locating repeated scenes, which are common in television programs. Finally, domain-specific rules (e.g., retain only one frame from an action sequence spanning several scenes) can be employed to reduce the frame set while preserving semantics as much as possible. This process not only enables devices that can display only still frames to convey information that was originally in video form, but also achieves high data reduction factors even when playing motion video is an option.
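For readers who want a concrete starting point, the sketch below illustrates the flavor of shot segmentation using a simple color-histogram difference test, keeping the first frame of each detected segment. It is only an illustration: unlike the algorithm of [5], it detects only abrupt transitions and performs no motion analysis.

```python
import cv2

def sample_video(path, threshold=0.4):
    """Crude content-based sampling: start a new segment when the
    color-histogram distance between consecutive frames exceeds a
    threshold; keep the first frame of each segment."""
    cap = cv2.VideoCapture(path)
    keyframes, prev_hist = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is None or cv2.compareHist(
                prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > threshold:
            keyframes.append(frame)  # representative frame for the new segment
        prev_hist = hist
    cap.release()
    return keyframes
```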
To give the reader an idea of this data reduction, we present an example based on real data from our large archive of searchable television programs [6]. The data used here are from a broadcast television news program. Encoding the half-hour video at a spatial resolution of 160x120 pixels and 160Kb/s results in a 36-megabyte file (the audio portion is excluded from this calculation). For the still-frame representation, we use images with the same spatial resolution as the encoded video and compress them with JPEG. The number of still images extracted by the content-based sampling algorithm and the size of the individual JPEG-compressed images are data dependent, so for these estimates we use one month of data to compute average values. The average number of still frames for the half-hour news program computed in this way is 550 (1100 per hour), and the average JPEG image size is 4 kilobytes (at 320x240 resolution, the image file size increases to around 11 kilobytes). This yields an average size of 2.2 megabytes for the still-frame representation, a data reduction factor of 16. It should be noted that the number of content-based sampled images is highly dependent on the type of programming. For soap opera content, the average value is about 1400 images per hour, while for coverage of congressional sessions it may be as low as 140 per hour. The estimates for the half-hour news program include almost 10 minutes of commercials, which in general have very frequent scene changes. Using multimedia processing techniques to identify and eliminate the commercials would therefore considerably reduce the number of images needed to convey the visual information in the news program. Table 2 summarizes the results of processing many hours of various content sources.
TABLE 2
MEDIA CONVERSION RESULTS

Genre                  | Hours Processed | Average Seconds per Key Frame | Average Caption bits/sec | Commercials Included
News, ½ hour           | 32              | 3.02                          | 74.9                     | yes
News, 2 hour           | 26              | 4.04                          | 73.2                     | yes
Financial News         | 44              | 4.30                          | 67.7                     | yes
Soap Opera             | 20              | 2.52                          | 63.8                     | yes
Congressional Coverage | 36              | 30.05                         | 59.2                     | no
News, Public TV        | 17              | 6.90                          | 84.4                     | no
V. REPURPOSING CONTENT
The images obtained using the algorithm described above are combined with text and/or audio extracted or derived from the video program to create repurposed content appropriate for the capabilities and connection bandwidth of the intended target mobile devices. For example, a narrowband text-only device such as a Blackberry™ could receive the closed captioning text extracted from a video program. If the device has the additional capability of displaying color images, the captions can be overlaid on an appropriately selected set of still frames. Moving further along the device capabilities axis, the addition of good-quality audio playback yields a surprisingly compelling user experience [7]. There are a large number of combinations of device capabilities, and in some cases the presence of a capability is debatable; for example, what is the minimum spatial resolution and color depth required for displaying a key frame? A selected set of capability classes is shown in Table 3. For the bandwidth and storage columns, we include figures for both commercial (C) content (mean 3.5 seconds per key frame) and non-commercial (NC) content (7 seconds per key frame). We assume 160x120 spatial resolution, 16Kb/s audio, and 128Kb/s video. The figures for the text modality refer to uncompressed ASCII data; these could be reduced by 60% using Lempel-Ziv compression.
TABLE 3
MEDIA REPURPOSING

Media                       | Bandwidth Kb/s (C / NC) | Storage MB/h (C / NC) | Example Device
Text                        | 0.069 / 0.071           | 0.031 / 0.032         | Blackberry
Text & Images               | 9.211 / 4.64            | 4.14 / 2.0            | Picture Phone
Text, Images & Audio        | 25.2 / 20.6             | 11.34 / 9.27          | Smartphone
Text, Images, Audio & Video | 144 / 144               | 64.8 / 64.8           | Tablet PC
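The bandwidth rows of Table 3 can be reproduced from these assumptions; the sketch below aggregates the per-modality rates (figures are approximate):

```python
KBPS = {"text": 0.07, "audio": 16.0, "video": 128.0}  # from Tables 1 and 3
IMAGE_KBITS = 4 * 8  # 4KB JPEG key frame, per the measurements in section IV

def bundle_kbps(modalities, sec_per_keyframe):
    """Aggregate bit rate of a repurposed bundle of media modalities."""
    rate = sum(KBPS[m] for m in modalities if m in KBPS)
    if "images" in modalities:
        rate += IMAGE_KBITS / sec_per_keyframe  # key-frame refresh cost
    return rate

print(bundle_kbps({"text", "images"}, 3.5))         # commercial:     ~9.2 Kb/s
print(bundle_kbps({"text", "images", "audio"}, 7))  # non-commercial: ~20.6 Kb/s
```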
The requirements of conserving memory on the device for downloaded content, and of minimizing download time over narrowband connections, are similar, and suggest using the filtering methods mentioned above to select a subset of key frames. However, the filtering parameters can be optimized for each case; e.g., more images may be included when rendering content for PDAs with large memories. Further, the device can request that the server render views that fully utilize the available memory. This concept may be most useful for devices capable of video or audio playback, since the available memory scales directly with the number of minutes of playback time.
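A minimal sketch of such a server-side filter, assuming the device reports a memory budget and the filter simply thins the key-frame set evenly (in practice the anchorperson and image-similarity cues of section IV can be applied first):

```python
def fit_to_budget(keyframes, budget_kb, avg_frame_kb=4.0):
    """Keep an evenly spaced subset of key frames whose total size
    fits within the memory budget reported by the device."""
    max_frames = max(1, int(budget_kb / avg_frame_kb))
    if len(keyframes) <= max_frames:
        return keyframes
    step = len(keyframes) / max_frames
    return [keyframes[int(i * step)] for i in range(max_frames)]
```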
VI. CUSTOMIZED CONTENT
Another method for efficiently using the available resources of a mobile device is customized content. When devices are restricted in their presentation, interface, and bandwidth capabilities, it becomes more important to ensure that the content reaching them is targeted as closely as possible to the needs and desires of the user. In [1], we describe eClips, a personalized multimedia delivery service. eClips uses a personal profile for each user to search multiple content feeds for relevant video clips. The details of the process are described in that paper; the main point is that multimedia processing of audio, video, and text can be used to automate the segmentation of video content into individual story clips.
Considerable media processing work underlies the capabilities of eClips. The eClips service prototype is built on top of a system that provides the data acquisition, processing, storage, and delivery capabilities for the service (see [6] and [8] for more details). Once the video is captured, it is segmented and classified so that it can be intelligently searched. On the server, we apply personal profiles and multimodal story segmentation techniques to generate short video clips of interest to the user. We use the closed-captioned text to find the appropriate clips to extract, and then assemble these clips into a single session. In this way, processing on the server reduces the requirements placed on the mobile device. eClips gives users the ability to see only the specific portions of the videos that they desire. Users do not have to undertake the arduous task of manually finding desired video segments, nor select the videos one at a time; eClips gathers all of the desired content automatically. Although in [9] we described delivering eClips over broadband communication links, the fact that we clip exactly what the user is interested in makes it possible to deliver downloadable versions of the content to portable devices efficiently as well. The content can include video clips, or can be limited to some combination of audio, still frames, and text if bandwidth or storage does not permit full-motion video with audio. Multimedia analysis techniques can be used to determine whether stories from different video sources are about the same topic or contain the same video material.

A. Media Processing

There are many systems for searching video archives to retrieve content of interest [10]. However, for the applications considered here, it is also necessary to segment the video on topic or story boundaries to obtain manageably sized content units ("clips"). We briefly summarize a multimodal processing approach to this problem (see [1] for more details). Video segmentation is the process of partitioning the video program into segments (scenes) with similar visual content. As described above, content-based sampling is used to segment the video into individual shots. This information is later combined with information extracted from the other components of the video program to enable the extraction of individual stories [11]. The presence of an anchorperson is a valuable cue to the location of a topic or story boundary. We use a multimodal approach to anchorperson detection, one component of which is face detection [12]. With knowledge of which key frames contain anchorpersons, we can render user interfaces without redundancy by selecting only a single image of the anchorperson. Speaker segmentation [13], the task of finding speaker boundaries within an audio stream, is an important part of multimodal story segmentation. In several genres, including broadcast news, speaker boundaries provide landmarks for detecting content boundaries, so it is important to identify speaker segments during automatic content-based indexing [14]. Large-vocabulary speech recognition has been used for indexing spoken documents [15], and the 1-best transcription can be used for topic segmentation using natural language processing (NLP). In some cases, the accuracy of the 1-best ASR output is good enough to be presented to users to provide context [16]. While the accuracy of automatically generated transcripts is below that of closed captions, they provide a reasonable alternative for identifying clips of interest with reduced, but acceptable, accuracy. After recognition, a parallel text alignment is performed to align the timing information from the automatic speech transcription with the more accurate transcription of the closed captions. Alternatively, a parallel text alignment algorithm can be used to import high-quality off-line transcripts of the program when they become available [17].
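The sketch below shows the idea of such a timing transfer (it is a simplification, not the algorithm of [17]): word-level time stamps from the ASR output are copied onto matching runs of caption words found by a standard sequence alignment.

```python
from difflib import SequenceMatcher

def align_timestamps(asr_words, caption_words):
    """asr_words: list of (word, start_time_sec) pairs from the recognizer.
    caption_words: list of words from the more accurate caption transcript.
    Returns (caption_word, start_time_sec) pairs for every matched word."""
    asr_tokens = [w.lower() for w, _ in asr_words]
    cap_tokens = [w.lower() for w in caption_words]
    matcher = SequenceMatcher(None, asr_tokens, cap_tokens, autojunk=False)
    aligned = []
    for a, b, size in matcher.get_matching_blocks():
        for k in range(size):  # copy ASR timing onto each matched caption word
            aligned.append((caption_words[b + k], asr_words[a + k][1]))
    return aligned
```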
B. User Interface for Customized Content

Although the default user interface on a desktop client lets the user simply view the entire selection of video as one program with no interaction, the user also has the option of full interactivity and can easily customize which content is desired and how it is played. The profile contains a list of keywords and video sources that are used to extract the desired video clips. The profile is primarily configured on a desktop client, which reduces the complexity of interaction on a mobile device with its limited screen size and input capabilities. The user can easily jump across topics and segments, navigating the individual multimedia segments with a touch screen and stylus on PDAs or the keypad highlight-and-click paradigm on phones. In addition, the user can dynamically enter search terms at any time, without using a profile, to find the desired video content. Examples of the mobile device interface are shown in the figures. The device in Figure 1 is a Motorola MPX200 Smartphone, an example of the intermittent broadband case, where the content is downloaded while the device is cradled. Figure 2 shows an iPAQ 3600; content can also be streamed to this device over 802.11b using an expansion sleeve. We attempted to maintain a consistent user interface across the intermittent broadband and broadband networking cases. However, since dynamic search of the media archive is not possible for downloaded content, we suppress the search user interface elements in that case. Figure 3 shows the top-level topic selection page on the left and an example of the display for a selected topic on the right for a Smartphone client. On the left of Figure 4, the metadata for a particular clip are shown, indicating the program from which the clip was extracted, the time the clip aired, and the duration of the clip. Also shown are a representative frame and a segment of the closed caption text to provide context. These help the user decide whether or not to initiate streaming or playback of the clip (shown at the right of Figure 4). We can define a simple interface for customized content because we assume that the user is interested in most of it. To help ensure that interest, we provide a simple "thumbs up" or "thumbs down" on individual clips so that the search parameters can be refined using relevance feedback [18]. Customized content thus helps with the user interface, bandwidth, storage, and power consumption (through reduced display time): all areas where mobile devices are currently challenged.
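A minimal sketch of profile-driven clip selection, with the thumbs up/down signal folded in as per-keyword weights. The names here are illustrative, and the deployed system uses support-vector-machine relevance feedback [18] rather than this simple weighting scheme.

```python
class InterestProfile:
    """User profile: keyword weights nudged by thumbs up/down feedback."""

    def __init__(self, keywords):
        self.weights = {kw.lower(): 1.0 for kw in keywords}

    def score(self, caption_text):
        words = set(caption_text.lower().split())
        return sum(w for kw, w in self.weights.items() if kw in words)

    def feedback(self, caption_text, thumbs_up):
        """Reinforce or penalize the keywords that matched this clip."""
        delta = 0.1 if thumbs_up else -0.1
        words = set(caption_text.lower().split())
        for kw in self.weights:
            if kw in words:
                self.weights[kw] = max(0.0, self.weights[kw] + delta)

def select_clips(profile, clips, top_n=10):
    """clips: (clip_id, caption_text) pairs; return the best matches."""
    ranked = sorted(clips, key=lambda c: profile.score(c[1]), reverse=True)
    return [c for c in ranked if profile.score(c[1]) > 0][:top_n]
```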
Figure 1. Smartphone client interface.
Figure 2. PDA client interface.
Figure 3. Topic selection (left) and selected topic (right).

Figure 4. Clip metadata details (left) and retrieved video (right).

VII. CONCLUSIONS
The use of multimedia processing techniques to enable and enhance the delivery of video content on mobile devices was discussed. In some cases, the processing is intended to convert the media to enable its use on devices that are incapable of replaying video. However, the quantitative results based on real data presented here indicate major reductions in bandwidth and storage requirements that would benefit even video-capable devices. Such alternative presentations of video content would also help accommodate users with disabilities. In other cases, multimodal processing algorithms are applied to automatically segment the video program into story clips. Such clips can then be selectively assembled to create personalized presentations that minimize the level of interaction with the device and the need to transfer or store unwanted content. The combination of these repurposing and clip segmentation techniques, coupled with mechanisms for detecting device capabilities and user preferences, can be used to create useful content delivery services that dynamically adjust to the users and devices in use. The quality of such automatically repurposed content may not rival that of content originally authored for a specific device. Nevertheless, the automatic nature of these algorithms and the existence of large amounts of video programming are a strong incentive for the application of such techniques to create content for mobile devices.
REFERENCES

[1] D. Gibbon, L. Begeja, Z. Liu, B. Renger, and B. Shahraray, "Creating personalized video presentations using multimodal processing," Handbook of Video Databases: Design and Applications, Edited by Borko Furht and Oge Marques, New York: CRC Press, 2004, pp. 1107-1131.
[2] Y. Chen et al., "Personalized multimedia services using a mobile service platform," Proc. of the IEEE Wireless Communications and Networking Conference, 2002.
[3] J. Arreymbi and M. Dastbaz, "Issues in delivering multimedia content to mobile devices," Proc. of the Sixth International Conference on Information Visualisation, pp. 622-626, July 2002.
[4] S. John, R. Jana, V. Vaishampayan, and A. Reibman, "iVideo – A video proxy for the mobile internet," Proc. of the IEEE 11th International Packet Video Workshop, 2001.
[5] B. Shahraray, "Scene change detection and content-based sampling of video sequences," Digital Video Compression: Algorithms and Technologies 1995, Edited by Robert J. Safranek and Arturo A. Rodriguez, Proc. SPIE 2419, February 1995.
[6] R. Cox, B. Haskell, Y. LeCun, B. Shahraray, and L. Rabiner, "On the application of multimedia processing to telecommunications," Proc. of the IEEE, vol. 86, no. 5, pp. 755-824, May 1998.
[7] B. Shahraray and D. Gibbon, "Pictorial transcripts: multimedia processing applied to digital library creation," IEEE First Workshop on Multimedia Signal Processing, Princeton, NJ, pp. 581-586, 1997.
[8] B. Shahraray, "Multimedia information retrieval using pictorial transcripts," Handbook of Multimedia Computing, Edited by Borko Furht, CRC Press, 1998.
[9] L. Begeja et al., "eClips: A new personalized multimedia delivery service," Journal of the IBTE, vol. 2, part 2, April 2001.
[10] A. Hauptmann et al., "Video retrieval with the Informedia digital video library system," NIST Special Publication 500-250: The Tenth Text Retrieval Conference, p. 94, 2001.
[11] Q. Huang, Z. Liu, A. Rosenberg, D. Gibbon, and B. Shahraray, "Automated generation of news content hierarchy by integrating audio, video, and text information," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'99), pp. 3025-3028, May 1999.
[12] R. Chellappa, C. Wilson, and S. Sirohey, "Human and machine recognition of faces: a survey," Proc. of the IEEE, vol. 83, no. 5, pp. 705-741, May 1995.
[13] M. Siegler, U. Jain, B. Raj, and R. Stern, "Automatic segmentation, classification and clustering of broadcast news audio," Proc. of the DARPA Speech Recognition Workshop, Chantilly, VA, pp. 97-99, February 1997.
[14] Z. Liu and Q. Huang, "Content-based indexing and retrieval-by-example in audio," ICME-2000, July 2000.
[15] L. Begeja et al., "A system for searching and browsing spoken communications," HLT/NAACL Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval, Boston, MA, May 2004, in press.
[16] L. Stark, S. Whittaker, and J. Hirschberg, "ASR satisficing: the effects of ASR accuracy on speech retrieval," Proc. of the International Conference on Spoken Language Processing, 2000.
[17] D. Gibbon, "Generating hypermedia documents from transcriptions of television programs using parallel text alignment," Handbook of Internet and Multimedia Systems and Applications, Edited by Borko Furht, New York: CRC Press, 1998, pp. 201-215.
[18] H. Drucker, B. Shahraray, and D. Gibbon, "Support vector machines: relevance feedback and information retrieval," Information Processing and Management, Elsevier Science Ltd., May 2001.