
HIGH RESOLUTION MULTIMEDIA SLIDE SHOW COMPOSITION FOR VIDEO CD AND DVD RENDERING

Zhaohui Sun, Jonathan Riek, and Alexander C. Loui

Electric Imaging Products, Research & Development, Eastman Kodak Company, Rochester, NY 14650-1816
fsun,riek,[email protected]

ABSTRACT

With the advance of digital and information technology, there is an urgent need to make image taking and sharing easy, fun, and cost and time efficient. In this paper, we present a multimedia system to compose visual and audio information into a high resolution multimedia slide show, a Video Compact Disc (VCD) compliant bitstream readable by electronic devices such as Video CD/DVD players, for TV-centered multimedia rendering, sharing and entertainment. By using the technologies of multimedia composition, multimedia coding, and a sophisticated bit allocation scheme, we can achieve the same spatial resolution as DVD (without full motion). Special efforts are devoted to the multimedia experience, image quality improvement, and automation in the system.

1. INTRODUCTION

With the advance of digital and information technology, increasing sales of digital cameras, deeper penetration of VCD/DVD players, and an exponentially increasing number of available digital images and video clips, there is an urgent need to make image taking and sharing easy, fun, and cost and time efficient. The biggest challenge is that it still requires expertise, expensive devices (computers, printers, CD writers, etc.), fast changing software, and a tremendous amount of time to make everything work. In recent years, there has been increasing interest in TV-centered (rather than computer-centered) digital image sharing and entertainment systems, as most users do not have the resources and time to do everything by themselves on a computer. In a TV-centered system, pictures are stored on a CD/DVD disc, played back on a VCD/DVD player, and displayed on a TV. Picture sharing and enjoyment is as easy as watching a movie on TV. The only difference is that it is a low-budget movie featuring family, friends, and personal experience rather than a Hollywood title. The same content can also be played back on computers.

In this paper, we present a multimedia system to compose visual and audio information into a high resolution multimedia slide show, a Video Compact Disc (VCD) compliant bitstream readable by electronic devices such as VCD/DVD players, for TV-centered multimedia rendering, sharing and entertainment. Using the technologies of multimedia composition and coding, and a sophisticated bit allocation scheme, we can achieve the same spatial resolution as DVD (without full motion) with cheaper media (compact disc vs DVD), devices (CD writer vs DVD writer), and technology (MPEG-1 vs MPEG-2).

Following the overview of the multimedia system in Section 2, visual coding, audio coding, and high resolution multimedia slide show composition are presented in Sections 3, 4, and 5, respectively. The paper is concluded in Section 6 with results and discussion.

J. Riek is now with VirtualScopics, Pittsford, NY 14534.

0-7803-7965-9/03/$17.00 ©2003 IEEE (ICME 2003)

2. SYSTEM

The goal of the multimedia system is to compose visual and audio information into a high resolution multimedia slide show, which can be played back on a TV-centered multimedia sharing and entertainment system, and to make the experience easy, fun, and cost and time efficient. Fig. 1 shows the system block diagram. The input is on the left side, including visual information of images $\{I_i\}_{i=1}^N$ and audio information of music clips and speech annotation. The output on the right is a bitstream consisting of a high resolution still image, a silent slide show displaying a number of still images sequentially, or a multimedia slide show with audio.

Fig. 1. System diagram of multimedia composition. (Blocks: Image Enhancement and Composition; Audio Composition; Audio Coding. Outputs: high resolution still image, silent slide show, multimedia slide show.)

In Fig. 1, still images are read in, reoriented if necessary, enhanced to improve rendering quality, composed with other texture and text annotation, and downsampled to intra pictures with normal and high resolutions, which are coded as sectors and interleaved as a single mixed resolution bitstream. The mixed resolution still pictures are assembled together as a silent high resolution slide show subject to buffer management and time scheduling. Meanwhile, the background music and the speech annotation are automatically composed together and coded as a single normalized audio stream. Timing parameters are calculated to synchronize the A/V streams, which are further multiplexed as a single multimedia slide show.

The normal spatial resolution of VCD (352 × 240 for NTSC or 352 × 288 for PAL) is limited by the low bit rate (no more than 1.39 Mbits/second), which accommodates single-speed CD-ROM drives and the capacity of a compact disc. When images are downsampled to normal resolution and displayed sequentially as a slide show on TV, distortion such as blocking artifacts is pronounced. One solution to improve the spatial resolution is provided by the Video CD specification [1] as the High Resolution Still Picture (704 × 480 for NTSC or 704 × 576 for PAL). The high resolution slide show is technically possible because it encodes I-frames only, without the B-frames and P-frames required by a full-motion video sequence. The bit rate used to encode temporal moving objects is essentially traded to encode a single I-frame at much higher spatial resolution. By using this technology, we can encode and render still pictures sequentially at four times the resolution (in pixel count) of normal resolution full-motion VCD, without upgrading to the more expensive DVD technology.
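The factor of four quoted above can be checked directly from the frame sizes in the text. The following is a minimal sketch, not part of the original system; the function names are hypothetical, and `transfer_seconds` simply divides a coded picture size (an assumed input) by the 1.39 Mbits/second ceiling to show how long one still picture occupies the channel.

```python
# Frame sizes (width, height) quoted in Section 2.
NORMAL = {"NTSC": (352, 240), "PAL": (352, 288)}
HIGH = {"NTSC": (704, 480), "PAL": (704, 576)}
MAX_BIT_RATE = 1.39e6  # bits per second, the VCD ceiling quoted in the text


def pixel_ratio(standard):
    """Ratio of high resolution to normal resolution pixel counts."""
    nw, nh = NORMAL[standard]
    hw, hh = HIGH[standard]
    return (hw * hh) / (nw * nh)


def transfer_seconds(coded_bytes, bit_rate=MAX_BIT_RATE):
    """Time needed to deliver a coded picture of the given size (hypothetical input)."""
    return coded_bytes * 8 / bit_rate


if __name__ == "__main__":
    for std in ("NTSC", "PAL"):
        print(std, pixel_ratio(std))
```

Because a slide show changes pictures only every few seconds, a large I-frame can be delivered over several seconds without exceeding the channel rate, which is the trade of temporal for spatial resolution described above.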

3. VISUAL COMPOSITION AND CODING

3.1. Image Enhancement and Composition

Digital pictures stored in various devices and formats are first read in through the image I/O library. They are reoriented if necessary, either by user interaction or by automated schemes. A variety of image enhancement algorithms are applied to improve the image quality, such as color balance, noise reduction, edge sharpening, red eye detection and correction, and tone scale adjustment. One example is image sharpening by unsharp masking, which subtracts a blurred version of the image from the original. Another example is noise reduction by the enhanced black printing algorithm, i.e., adaptively filtering the image based on the noise factor indicated by the variance of high-frequency channels. Image enhancement not only explores the visual content, but also gives the images a uniform look.

Other textures can be overlaid to form composite images, such as background texture, frame decoration, company logo, and text annotation (e.g., title and subject). The image is resized twice, to a normal resolution image and a high resolution image with the same content, which are encoded separately later.

3.2. High Resolution Still Picture Coding

Picture coding follows the MPEG I-frame coding scheme with emphasis on data compression [2, 3, 4]. The color space is transformed to YUV to decorrelate the channels. The chrominance channels (U and V components) are subsampled by a factor of 2 to exploit visual redundancy. Each channel is divided into 8 × 8 blocks and coded by the 2-D Discrete Cosine Transform (DCT) to remove spatial correlation. Symbol redundancy is further removed by quantization and Huffman coding. The sequence of 0s and 1s is assembled as a video bitstream with a hierarchical syntax structure, including sequence header, group of pictures (GOP) header, picture header, slice header, and macroblock header [2].

The bitstream is packed into fixed-length sectors (2324 bytes) under bit rate constraints and buffer management. This is analogous to putting final products into labeled boxes on an assembly line. After delivery, decoders can retrieve these boxes/sectors, open them by following the instructions on the label, and interpret the bits inside accordingly. The normal and high resolution sectors are interleaved as a single mixed resolution sequence, with a sequence header, a GOP header, one intra picture, and a sequence end code. The intra picture contains a normal resolution picture followed by a high resolution picture with the same image content.
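The first steps of the intra coding chain described in Section 3.2 (color transform, chroma subsampling, 8 × 8 DCT, quantization) can be sketched in a few lines. This is an illustrative sketch only, not the authors' encoder: the function names are hypothetical, the YUV weights are the common BT.601 ones, subsampling by 2 is assumed to apply in both directions, and a single uniform quantizer step stands in for the MPEG quantization matrix. Huffman coding and the header syntax are omitted.

```python
import math

N = 8  # block size used for the DCT


def rgb_to_yuv(r, g, b):
    """Convert one RGB pixel to YUV using BT.601 luma weights (assumed)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    return y, 0.492 * (b - y), 0.877 * (r - y)


def subsample_2x2(channel):
    """Halve a chroma channel in both directions by averaging 2x2 neighborhoods."""
    h, w = len(channel), len(channel[0])
    return [[(channel[i][j] + channel[i][j + 1]
              + channel[i + 1][j] + channel[i + 1][j + 1]) / 4.0
             for j in range(0, w - 1, 2)]
            for i in range(0, h - 1, 2)]


def dct_8x8(block):
    """Orthonormal 2-D DCT-II of an 8x8 block (direct, unoptimized form)."""
    def scale(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            acc = 0.0
            for x in range(N):
                for y in range(N):
                    acc += (block[x][y]
                            * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                            * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = scale(u) * scale(v) * acc
    return out


def quantize(coeffs, step=16):
    """Uniform quantization; the step size is a hypothetical placeholder."""
    return [[int(round(value / step)) for value in row] for row in coeffs]
```

For a flat block, all energy lands in the DC coefficient and every AC coefficient quantizes to zero, which is why smooth image regions compress so well under this scheme.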

3.3. Silent High Resolution Slide Show

The video bitstreams of the N still pictures $\{I_i\}_{i=1}^N$ are assembled as a single high resolution slide show without audio based on the timing parameters $\{t_i\}_{i=1}^N$. A high resolution slide show consists of a system header, a number of mixed resolution sequences, and a sequence end code. Each mixed resolution sequence further consists of normal resolution packs, high resolution packs, and padding packs. The key issues are synchronization, decoding time stamp, and buffer management.

Synchronization is achieved by insertion of stuffing bytes 0xFF to maintain the constant bit rate, such that the picture stays on display long enough before switching to the next one. The number of padding packs $n_i^p$ needed for picture $i$ is

$$n_i^p = 75\, t_i - (n_i^n + n_i^h), \qquad (1)$$

where $n_i^n$ and $n_i^h$ are the numbers of sectors for the normal and high resolution pictures, respectively. The decoding time stamp is recommended to point at the end of the mixed resolution still picture as

$$DTS_i = \Big( 75 \sum_{j=1}^{i-1} t_j + n_i^n + n_i^h \Big) \times 1200. \qquad (2)$$

Here 75 is the number of sectors delivered per second, and 1200 = 90000/75 is the number of 90 kHz system clock ticks per sector.

Buffer management is intended to make sure that no overflow or underflow happens at the decoding buffer. It is relatively simple for a silent slide show because (1) there are no motion estimated and compensated frames (B-frames and P-frames); (2) the bit rate is constant; and (3) the buffer size is increased for high resolution still pictures.

4. AUDIO COMPOSITION AND CODING

4.1. Audio Composition

Digital audio composition and coding [5] follows a thread almost parallel to image composition and coding. It composes multiple audio streams (speech annotation and background music), in different formats and specifications (sampling rate, sample precision, and number of channels), and from various sources (audio files on a hard disk, sound tracks on a CD-ROM, or speech from a microphone), into a single PCM coded, 16 bits/sample, 44.1 kHz, stereo audio stream, which is compressed as layer II (mp2) audio. The following five steps are involved:

Audio extraction extracts audio signals from various media, e.g., reading files from disks, ripping music from a sound track on CD-ROM, or recording speech annotations from a microphone.

Audio normalization converts all audio streams to the same normalized specification of sample precision, sampling rate, and number of channels.

Audio alignment registers multiple audio streams on the same time line and automatically decides the timing parameters.

Audio enhancement filters the audio signals to remove noise, enhance certain frequencies, balance channel volume, and apply special effects.

Audio combination merges all audio signals as a single stream by combining the samples at the same time instance with emphasis on the foreground speech annotation.

4.2. Time Scheduling

The problem of time scheduling is to decide the time intervals of images along the temporal dimension, given the lengths of the speech annotations and music clips, subject to certain constraints. An image should show up when its speech annotation starts to play, and it should stay long enough until the narration ends. Meanwhile, the time durations for all images should be distributed as uniformly as possible. Let the time duration, default life time, and minimum life time of an image be $t_i$, $t_d$, and $t_m$, respectively. The length of the speech annotation is $\hat{t}_i$, and the length of the music is $t_s$. Duration $t_i$ is automatically computed to synchronize the A/V bitstreams as follows.

No annotation and no music. $t_i = t_d$. Each image stays for $t_d$ seconds.

Annotation only. $t_i = t_d$ if $\hat{t}_i = 0$, $t_i = t_m$ if $\hat{t}_i < t_m$, and $t_i = \hat{t}_i$ otherwise. An image stays on display until its speech annotation is finished.

Music only. $t_i = t_s/N$. All images stay for the same amount of time while the music is playing.

Annotation and short music, $\sum_i \hat{t}_i > t_s$. It is similar to the annotation only case with a zero padded music clip, $t_i = t_m$ or $t_i = \hat{t}_i$.

Annotation and long music, $\sum_i \hat{t}_i < t_s$ and $t_d N < t_s$. It is similar to the music only case with $t_i = t_s/N$.

Annotation and music with some annotations longer than the average time duration, $\exists\, i,\ \hat{t}_i > t_s/N$. We use a recursive scheme to assign the time durations one by one. After $t_i$ is assigned, it is subtracted from the total available time and the computation moves on to the next image $t_{i+1}$.
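The scheduling cases above can be sketched as one function. This is one possible reading of the rules, not the authors' implementation: the function name, the default values of `t_default` ($t_d$) and `t_min` ($t_m$), and the exact form of the one-by-one assignment (each image takes the larger of its annotation length and an equal share of the remaining music time) are all assumptions. In particular, this in-order assignment can overrun the music when a long annotation comes late in the sequence, which the original text does not address.

```python
def schedule_durations(speech, music_len, t_default=5.0, t_min=3.0):
    """Return display durations t_i (seconds) for each image.

    speech    -- annotation lengths t_hat_i, 0 meaning "no annotation"
    music_len -- music length t_s, 0 meaning "no music"
    t_default, t_min -- hypothetical values for t_d and t_m
    """
    n = len(speech)
    if n == 0:
        return []

    def annotation_only(t_hat):
        # t_d without annotation, otherwise at least t_m.
        return t_default if t_hat == 0 else max(t_hat, t_min)

    total_speech = sum(speech)
    if music_len == 0:
        # Covers "no annotation and no music" and "annotation only".
        return [annotation_only(t_hat) for t_hat in speech]
    if total_speech == 0:
        # Music only: uniform split.
        return [music_len / n] * n
    if total_speech > music_len:
        # Short music: t_m or t_hat_i, the music being zero padded.
        return [max(t_hat, t_min) for t_hat in speech]

    # Long music: assign one image at a time from the remaining time.
    remaining = music_len
    durations = []
    for i, t_hat in enumerate(speech):
        share = remaining / (n - i)
        t_i = max(t_hat, share)
        durations.append(t_i)
        remaining -= t_i
    return durations


if __name__ == "__main__":
    print(schedule_durations([8, 0, 0], 12))
```

When no annotation exceeds the average share, the loop reduces to the uniform split $t_s/N$ of the long music case; when one does, the remaining images share what is left.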

5. MULTIMEDIA SLIDE SHOW

The silent slide show is further multiplexed with the digital audio as a multimedia slide show subject to bit rate constraints and buffer management. The system bitstream consists of a system header, interleaved mixed resolution still picture packs, audio packs, padding packs, and a sequence end code. Buffer management becomes more complicated, as three buffers are involved: a 4 KB audio buffer, a 46 KB normal resolution video buffer, and a variable bit rate high resolution buffer with a maximum size of 224 KB.

Digital audio is maintained at 224 Kbits/second. If the audio buffer overflows, a portion of the audio signal will be lost. If the audio buffer underflows, the audio will be cut off and delayed, yielding discontinuous audio. To maintain the audio buffer, we keep filling it at 224 Kbits/second. Audio packs are inserted into the silent slide show bitstream. After a couple of video sectors, we need to add a few audio sectors and remove the same number of padding packs to maintain the system bit rate. In addition, the decoding time stamps need to be updated accordingly.
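The interleaving described above can be illustrated with a small simulation of the audio buffer over one picture's display interval. This is a hedged sketch rather than the authors' multiplexer: the function name and the policy (insert an audio sector whenever the buffer would otherwise run dry before the next sector) are assumptions, and each audio sector is assumed to carry the full 2324 bytes, ignoring pack headers. The sector rate of 75 per second is taken from Eq. (1).

```python
SECTOR_BYTES = 2324                      # sector length quoted in Section 3.2
SECTORS_PER_SECOND = 75                  # sector rate used in Eq. (1)
AUDIO_BYTES_PER_SECOND = 224_000 // 8    # 224 Kbits/second audio
AUDIO_BUFFER_BYTES = 4 * 1024            # 4 KB audio buffer


def interleave(video_sectors, duration_s, audio_payload=SECTOR_BYTES):
    """Lay out one display interval as a string of 'A', 'V', and 'P' sectors."""
    total_slots = round(duration_s * SECTORS_PER_SECOND)
    drain = AUDIO_BYTES_PER_SECOND / SECTORS_PER_SECOND  # bytes played per slot
    level = 0.0
    video_left = video_sectors
    layout = []
    for _ in range(total_slots):
        if level < drain:
            # Buffer would underflow during this slot: send audio now.
            level += audio_payload
            if level > AUDIO_BUFFER_BYTES:
                raise ValueError("audio buffer overflow")
            layout.append("A")
        elif video_left > 0:
            layout.append("V")
            video_left -= 1
        else:
            # Nothing left to send: keep the bit rate constant with padding.
            layout.append("P")
        level -= drain
    if video_left > 0:
        raise ValueError("display time too short for the coded picture")
    return "".join(layout)


if __name__ == "__main__":
    layout = interleave(video_sectors=100, duration_s=4.0)
    print(layout.count("A"), layout.count("V"), layout.count("P"))
```

Every audio sector placed this way takes the slot of what would have been a padding pack in the silent slide show, so the total sector count, and hence the system bit rate, is unchanged.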

6. RESULTS AND CONCLUSION

To this end, we have implemented a system to compose a multimedia high resolution slide show, which can be played back on VCD/DVD players and TV, so that people can share and enjoy their family pictures with narration and music in the living room. The system is released as a window-based application and a package of libraries. A snapshot of the user interface is shown in Fig. 2, where creation of a multimedia slide show is only a few clicks away. An application programming interface is also provided for easy integration into other systems in three simple steps: (1) prepare a/v data in data structure MediaCotent; (2) call audio composition; and (3) call multimedia coding.

Fig. 2. User interface of the system.

Because of the constant bit rate, the size of a high resolution multimedia slide show depends only on the display time (about 10 MB per minute). The time complexity of the high resolution encoding is slightly higher than that of the normal resolution encoding. It takes about a second to encode a single high resolution picture on a 600 MHz PC.

Special efforts have been devoted to the multimedia experience, image quality improvement, and automation in the system. (1) By wrapping family pictures, texture templates, narration, music, and text annotation together as a single bitstream, we can turn a photo album into a self-contained multimedia presentation to share stories and memories. (2) Image quality is improved by rendering pictures at high resolution and applying image enhancement algorithms in image composition. Under the bit rate constraint for compact disc, the bits reserved for P and B frames in full motion video are essentially traded to increase the spatial resolution of the slide show. In fact, the resolution is four times higher than the normal resolution, so we can achieve a spatial resolution similar to DVD (without full motion) using a compact disc. (3) Automated schemes are employed to minimize user interaction in the personal authoring process and maximize the functionalities. All the processing steps, including image composition and enhancement, audio composition, time scheduling, and multimedia encoding, are fully automatic. More complicated settings are stored in theme templates for special occasions, such as birthdays and Christmas.

In the future, we plan to continue our effort to make image sharing easy and fun.
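The figure of about 10 MB per minute quoted in this section follows from the constant sector rate: 75 sectors per second (the rate in Eq. (1)) times 2324 bytes per sector (Section 3.2). A minimal sketch, with a hypothetical function name:

```python
SECTOR_BYTES = 2324        # sector length quoted in Section 3.2
SECTORS_PER_SECOND = 75    # sector rate used in Eq. (1)


def slide_show_bytes(display_seconds):
    """Bitstream size for a given total display time at constant bit rate."""
    return round(display_seconds * SECTORS_PER_SECOND) * SECTOR_BYTES


if __name__ == "__main__":
    size = slide_show_bytes(60)
    print(size, "bytes per minute, about", round(size / 1e6, 2), "MB")
```

The size is independent of picture content: whatever the pictures do not use is filled with padding packs.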

Acknowledgment. We thank Phoury Lei and Phil Masiewicz for technical contributions.

7. REFERENCES

[1] "Video CD specification version 2.0," Philips Consumer Electronics B.V., April 1995.

[2] "Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbits/s," in ISO/IEC 11172, November 1991.

[3] Barry G. Haskell, Atul Puri, and Arun N. Netravali, Digital video: an introduction to MPEG-2, Chapman & Hall, 1997.

[4] A. M. Tekalp, Digital Video Processing, Prentice Hall, 1995.

[5] L. R. Rabiner and R. W. Schafer, Digital processing of speech signals, Prentice-Hall, 1978.
