[standards in a NUTSHELL]

Houari Sabirin and Munchurl Kim

The MPEG Musical Slide Show Application Format: Enriching the MP3 Experience

The ISO/IEC 23000-4 MPEG-A Part 4: Musical Slide Show Application Format (MSS AF) standard is a storage format that specifies the synchronization of MP3 audio, Joint Photographic Experts Group (JPEG) images, and 3rd Generation Partnership Project (3GPP) timed text data in conjunction with descriptions of image rendering animation using MPEG-4 Lightweight Application Scene Representation (LASeR). The creation information for the MP3 audio and JPEG images, as well as color and texture descriptions for the JPEG images, can also be generated using MPEG-7 metadata for richer content description. By specifying such a storage format, the standard enables different multimedia contents (in this case audio, images, texts, and metadata) to be integrated into a single file in a structured way. In addition, the standard specifies protection and governance schemes that enable flexible rights management of the contents. Therefore, new business models can be created in which the governed contents are consumed in a secured and controlled manner according to their target applications. This article provides an overview of the technologies used in the standard as well as application examples that show its benefits and advantages.

BACKGROUND
The Moving Picture Experts Group (MPEG) has created a standard that combines specifications from already matured technologies inside and outside MPEG and is designated as ISO/IEC 23000 MPEG-A Multimedia Application Format (MAF). The purpose of the MAF standard is to support interoperability among different devices and services that use combinations of various multimedia components such as audio, video, images, metadata, and text. By doing so, applications that require multiple components to be packaged into unified files can be seamlessly developed and utilized for various devices and services. The MAF standard provides the mechanisms to arrange different multimedia components into single files and to govern how the components can be consumed over various devices. The MAF standard has been designed to support various kinds of applications that require harmonization among different multimedia contents, such as music collections, photo albums, broadcasting services, multimedia archival systems, and rich video applications.

MOTIVATION
The current MP3 audio format enables users to listen to their favorite tunes anytime and anywhere, thanks to the widely used MP3 players. Some MP3 players also support displaying a JPEG image embedded as part of the ID3 metadata in an MP3 file. However, currently, one MP3 file with ID3 metadata can support only a single image. On the one hand, if more images could be contained in an MP3 file, then the MP3 file could be enriched with much more informative image contents. For example, a storytelling application may require a single track of audio to which several images can be synchronized to illustrate the story told in the MP3 audio.


Furthermore, karaoke applications can easily be created in a unified and standardized file structure that allows rendering the lyrics of songs in an attractive slide show with animated image rendering. On the other hand, the ease of exchanging and sharing multimedia contents has led to illegal content distribution. Therefore, a standardized protection mechanism is needed to manage the contents in a controlled manner at the time of their consumption, taking into account the interoperability of the content protection mechanism.

OBJECTIVE
The MSS AF standard is Part 4 of MAF and is specifically aimed at enriching existing MP3 applications with images, text, metadata, and the rendering of images and texts. The MSS AF provides a mechanism to store audio, images, texts, and animation descriptions in a single file for a richer audio consumption experience. The MSS AF specifies a simple standardized file structure to store not only MP3 audio with one or more JPEG images but also timed texts that can be used to show the lyrics of a song or captions for the corresponding audio data. The storage format specification of the MSS AF is intended for lightweight applications on mobile or portable devices such as MP3 players and portable multimedia players (PMPs). The storage format of the MSS AF enables such devices to play only the components that they are capable of handling. The MSS AF also supports flexible protection schemes based on MPEG intellectual property management and protection (IPMP) technologies to ensure governed consumption of the contents.

TARGET APPLICATIONS
Target applications of the MSS AF standard include a karaoke machine that plays an audio stream of a voiceless song accompanied by synchronized lyrics and images; a storytelling application that provides the audio of a storyteller with an image slide show illustrating the story and caption text rendering; and a foreign-language exercise that shows conversations with dialogue scripts and pronunciation with illustrative images for voice production. The protection schemes in the MSS AF make the controlled distribution and consumption of MSS AF contents possible.

[FIG1] Parts (a)–(c) show examples of MSS AF applications in a mobile device.


For example, a karaoke or storytelling application can offer a service with different purchasing plans (for example, basic, normal, and advanced plans) for MSS AF files, depending on what contents can be consumed. The basic plan may contain only the song. The normal plan may contain the audio with its lyrics, which can be rendered by MSS AF players in synchronization with the audio. The advanced plan may contain the full content components of the MSS AF, with the images rendered in a slide show animation in addition to the features of the normal plan.


A content provider may generate the same MSS AF file (or content) for different plans, but the contents are protected such that access to them is possible only according to the purchased plan. For example, if a basic plan is purchased for an MSS AF file that contains the full content components of the MSS AF, access is allowed only to the song, whereas consumption of the other components is not authorized. However, all the components of the MSS AF file can be accessed by trusted MSS AF players for which users have purchased the advanced plan. Figure 1 shows screenshots of some examples of MSS AF applications running on a mobile device. Figure 1(a) shows a personal musical slide show with a song, images, and lyrics, with license information below the third image. Figure 1(b) shows a storytelling application with audio and illustrative images of the story as well as the story's subtitle. Figure 1(c) illustrates a slide show application with image transition animation (sliding, zooming, and fading) using MPEG-4 LASeR.

ISSUING BODY, SCHEDULE, AND STRUCTURE OF THE STANDARD
The MSS AF was developed as Part 4 of ISO/IEC 23000 MPEG-A MAF. The first committee draft was published in July 2006, and the final draft of international standard (FDIS) was completed in November 2007, followed by its amendment for reference software and conformance files in April 2009. To accommodate content protection and governance, the second edition of the MSS AF (initially designated the "Protected Musical Slide Show AF") was initiated in April 2007 and completed as an international standard in January 2009, followed by the amendment for reference software and conformance files in September 2009. The MSS AF standard comprises an overview of the other standards used in the MSS AF, the file structure of the MSS AF, the synchronization of multimedia contents, and the utilization of animation, timed text, and metadata. The MSS AF standard also provides a protection and governance mechanism for audio, images, timed texts, and animation in the standardized MSS AF file structure. Information about authoring and player software is available in Amendment 2 of the MSS AF standard as a quick reference for the development of real-world applications.

TECHNOLOGY

THE COMPONENTS OF MSS AF
The MSS AF file structure consists of one stream (or track) of audio, one or more images for a slide show with possible animation for image transitions, and timed texts for lyrics, subtitles, or captions. XML metadata instantiations are used for content descriptions of the images and audio as well as for protection and license descriptions. The components used in the MSS AF standard are specified in MPEG-1/2 Audio Layer III (MP3) for audio, the JPEG compression format for images, 3GPP for timed text, MPEG-4 LASeR for scene animation, and the MPEG-4 MP4 file format for the file structure.

[TABLE 1] COMPONENT STANDARDS OF MSS AF.

Types         Component Names             Specifications
Audio         MPEG-1/2 Layer III (MP3)    ISO/IEC 11172-3:1993
Image         JPEG                        ISO/IEC 10918-1:1994
Timed text    3GPP Timed Text             3GPP TS 26.245
Metadata      MPEG-4 LASeR                ISO/IEC 14496-14:2003
              MPEG-7 Visual               ISO/IEC 15938-3:2001
              MPEG-7 MDS                  ISO/IEC 15938-5:2003
              MPEG-21 DID                 ISO/IEC 21000-2:2005
              MPEG-21 IPMP                ISO/IEC 21000-4:2006
              MPEG-21 REL                 ISO/IEC 21000-5:2004
File format   MPEG-4 MP4 File Format      ISO/IEC 14496-14:2003


The standard also specifies the usage of some elements, with constraints, from the MPEG-7 visual descriptor metadata to describe content features such as the colors and texture of the objects in the JPEG images, as well as MPEG-7 multimedia description scheme (MDS) metadata to describe content features such as the song title, performing artist's name, and album name for the MP3 audio. The MPEG-7 MDS metadata is also used for content creation descriptions. The MSS AF standard specifies the synchronization among audio, images, texts, and animation rendering during playback based on the MPEG-4 MP4 file format. The metadata for protection and governance of the MSS AF contents is specified based on the MPEG-21 IPMP and MPEG-21 Rights Expression Language (REL) description tools. The MPEG-21 Digital Item Declaration (DID) description tool is used to encapsulate the description of the contents as digital items when protection and governance information is used. Table 1 summarizes the specifications of the component standards used in the MSS AF specification.

ARCHITECTURE
Conformant MSS AF files can be created according to the rules and definitions of the file storage structure specified by the MPEG-4 MP4 file format. The descriptions of the synchronization of audio, images, and texts are defined in the presentation metadata, and the animation description is generated as an XML instantiation. The MPEG-7 MDS descriptions of MP3 song information are also generated as an XML instantiation by mapping the ID3 metadata into the appropriate MPEG-7 MDS elements. If one or more components in an MSS AF file need to be protected, an XML instantiation is also generated as the protection and governance description. Accordingly, the contents to be protected are then encrypted using an encryption tool, which is not standardized by the MSS AF; content providers can in fact use any protection tool suitable for their applications.

Playback of an MSS AF file starts by parsing the list of content components stored in the file. If no protection information exists, the list of contents is obtained directly from the item location information described in the file. Otherwise, the MPEG-21 DID instantiation shall be parsed. Next, each of the audio, image, and timed text streams is decoded (after unprotection, if protection information exists). Based on their presentation metadata, the content components are then played in a synchronized manner. If the MPEG-4 LASeR description is not present, the "basic" playback mode is performed without rendering any animation for image transitions. Otherwise, the "enhanced" playback mode is performed for image transition animation. Figure 2 shows the system architecture for creating and playing MSS AF files.

[FIG2] System architecture for creating and playing an MSS AF file.
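To make this flow concrete, the following Python sketch outlines the two decisions a player makes: where to obtain the content list and which playback mode to use. It is purely illustrative and is not the reference software; the ParsedFile fields and the parse_did_for_components helper are hypothetical stand-ins for a real MP4 parser and MPEG-21 DID parser.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ParsedFile:
    mpeg21_did_xml: Optional[str] = None   # MPEG-21 DID instantiation, present only if contents are protected
    laser_xml: Optional[str] = None        # MPEG-4 LASeR animation description, if any
    item_locations: dict = field(default_factory=dict)  # component name -> (offset, length)

def parse_did_for_components(did_xml):
    # Placeholder: a real player would parse the DID/IPMP/REL descriptions here.
    return {}

def content_list(f):
    # Protected file: the access points start from the MPEG-21 DID instantiation.
    if f.mpeg21_did_xml is not None:
        return parse_did_for_components(f.mpeg21_did_xml)
    # Unprotected file: read the item location information directly.
    return f.item_locations

def playback_mode(f):
    # "enhanced" renders LASeR image-transition animation; "basic" does not.
    return "enhanced" if f.laser_xml is not None else "basic"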


FILE STRUCTURE
The structure for storing content components in an MSS AF file is based on the MP4 file format, where all data are contained in boxes, each of which is assigned a name type and is defined with a specific syntax according to its functionality. Figure 3 illustrates the structure of the boxes inside an MSS AF file, with three main boxes: the file type box, the movie presentation box, and the media data box. The file type box provides the identification of a file and implies what contents are stored within the file and what mechanism is used to extract and play back the contents. All the samples of the MSS AF audio, image, and timed text contents are physically stored in the media data box, and their presentation metadata (e.g., time, duration) are stored in the movie presentation box. In the movie presentation box, the presentation metadata of one set of samples is described in a track box, within which each sample has its own specific presentation information of time, duration, and location. The MSS AF specification requires the MP3 audio data to be stored as a set of MPEG-4 access units, for which each access unit has its own presentation metadata. A similar storage mechanism also applies to the JPEG images.
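As an illustration of the box concept (a minimal sketch rather than a full ISO base media file format parser; the file name is hypothetical, and 64-bit box sizes are not handled), the following Python code walks the top-level boxes of an MP4-based file and lists their four-character types and sizes, which is enough to locate the file type, movie presentation, and media data boxes.

import struct

def list_top_level_boxes(path):
    # Each box starts with a 32-bit size followed by a four-character
    # type code (e.g., 'ftyp', 'moov', 'mdat').
    boxes = []
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            size, box_type = struct.unpack(">I4s", header)
            boxes.append((box_type.decode("ascii", "replace"), size))
            if size < 8:          # size 0 (to end of file) or 1 (64-bit size) not handled here
                break
            f.seek(size - 8, 1)   # skip the box payload
    return boxes

# Example usage with a hypothetical file name:
# print(list_top_level_boxes("example.mssaf"))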






[FIG3] Illustration of an MSS AF file structure.

Since each JPEG image will be displayed at a specific time for a certain duration, presentation information is also assigned to each image, which is designated as one sample. The storage of timed texts follows the 3GPP timed text specification, which is basically the same as that of the MP4 file format. However, in addition to presentation time and duration information, the presentation metadata of timed text contains a description of how the text shall be rendered in terms of font styles, colors, sizes, positions, and highlighting, thus enriching the usage of the text data. For example, in karaoke applications, certain words in the text can be highlighted only at the time when the words are sung in the song. The MPEG-4 LASeR scene description metadata for image transition animation is stored inside an XML instantiation box designated to store XML instantiations. The MPEG-7 MDS metadata for the MP3 audio, and the MPEG-7 Visual and MDS metadata for the JPEG images, are stored in XML instantiation boxes in their respective tracks.
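The karaoke highlighting behavior can be pictured with a small Python sketch. This is not the 3GPP TS 26.245 box syntax; HighlightSpan and the sample data below are hypothetical, and the sketch only shows the kind of lookup a renderer performs: given the playback time, find the character range to highlight.

from dataclasses import dataclass

@dataclass
class HighlightSpan:
    start_s: float    # highlight start time, in seconds
    end_s: float      # highlight end time, in seconds
    first_char: int   # first character index of the highlighted word
    last_char: int    # last character index (exclusive)

def highlighted_range(spans, t):
    # Return the (first_char, last_char) range to highlight at playback time t,
    # or None if no word is being sung at that moment.
    for s in spans:
        if s.start_s <= t < s.end_s:
            return (s.first_char, s.last_char)
    return None

# Hypothetical lyric line "Happy birthday to you" with per-word timing:
spans = [HighlightSpan(0.0, 0.6, 0, 5), HighlightSpan(0.6, 1.4, 6, 14),
         HighlightSpan(1.4, 1.8, 15, 17), HighlightSpan(1.8, 2.5, 18, 21)]
print(highlighted_range(spans, 1.0))   # -> (6, 14), i.e., "birthday"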

The item location information in the movie presentation box provides the access points to the content components stored in MSS AF files. Any MSS AF-compliant player shall first parse this information to determine what content components are stored in the file and where. It describes the exact position (offset) of the first access unit of each sample, so that the content components can be extracted and then played according to the presentation metadata. If the content components are protected, the XML instantiation box may contain the XML instantiation of the MPEG-21 DID structure. The MPEG-21 DID provides the structure of the components as digital items and acts as a container for the protection and governance descriptions in MPEG-21 IPMP and MPEG-21 REL. The access points to the content components start from the MPEG-21 DID instantiation, if present. Also, the MPEG-4 LASeR description is encapsulated within the MPEG-21 DID instantiation.
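Assuming the item locations have already been parsed into (offset, length) pairs, a minimal extraction step could look like the Python sketch below; the item_locations dictionary, component names, offsets, and file name are hypothetical, and a real player would obtain them from the item location information rather than hard-coding them.

def extract_components(path, item_locations):
    # item_locations maps a component name to its (byte offset, byte length)
    # within the file, as described by the item location information.
    components = {}
    with open(path, "rb") as f:
        for name, (offset, length) in item_locations.items():
            f.seek(offset)
            components[name] = f.read(length)
    return components

# Hypothetical usage with made-up offsets:
# data = extract_components("example.mssaf",
#                           {"mp3_audio": (4096, 3_500_000),
#                            "jpeg_1": (3_504_096, 120_000)})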


The synchronization of the image slide show and timed text with audio playback is determined by defining the time and duration of the images and timed text. During audio playback, the images and timed text will be rendered based on this presentation information. Figure 4 shows an example of the synchronization of audio playback with image and text rendering. The appropriate synchronization between the media is entirely the responsibility of the MSS AF file creator; the MSS AF does not provide any mechanism to automatically determine when and what images and texts should be rendered at certain time instances of audio playback.

[FIG4] Synchronization of audio, animated image transitions, and timed text in MSS AF.
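A player's rendering loop therefore reduces to looking up, for the current audio timestamp, which image and text samples are active. The Python sketch below is illustrative only; the sample start times are made up in the spirit of the example in Figure 4, and each sample is assumed to remain on screen until the next one starts.

import bisect

# Illustrative start times (in seconds) of the JPEG and timed text samples.
jpeg_starts = [0, 10, 17, 22]        # JPEG samples 1-4
text_starts = [0, 12, 16, 20]        # timed text samples 1-4

def active_sample(start_times, t):
    # Index of the sample whose start time is the latest one not after t.
    return bisect.bisect_right(start_times, t) - 1

audio_t = 18  # current audio timestamp in seconds
print(active_sample(jpeg_starts, audio_t) + 1)  # -> 3 (JPEG sample 3)
print(active_sample(text_starts, audio_t) + 1)  # -> 3 (timed text sample 3)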



PROTECTION AND GOVERNANCE
The concept of protection and governance using MPEG-21 IPMP and REL is to signal which content components are protected and what protection algorithm has been used. The specific algorithm used to perform the protection (e.g., encryption) is not standardized. Therefore, the interoperability of the protected contents for distribution is achieved at the metadata level. The MSS AF uses this concept to allow various possible protection schemes for the contents. An MSS AF file creator can flexibly determine which content components are to be protected. The MSS AF specification also allows protecting certain durations of the audio and/or certain regions of the images, which enables a flexible mechanism for managing which parts of the audio or which regions of the images can be consumed by authorized users. These protection schemes can be structured simply by describing the content as a digital item in MPEG-21 DID and describing the protection information about the protected contents with the MPEG-21 IPMP description tools. The information about the tools used for content protection and the licenses granted to the contents is also described in the MPEG-21 DID instantiation. The protection tools can be located outside the MSS AF files or even outside the MSS AF-capable players. In this way, MSS AF contents with protected components can still be maintained in a simple file structure.
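As a purely illustrative sketch of this idea (none of the class or field names below come from the standard; a real implementation would read them from the MPEG-21 DID/IPMP/REL instantiations and use an actual decryption tool), the governance decision can be thought of as filtering the protected components against the rights the user's licenses grant:

from dataclasses import dataclass

@dataclass
class ProtectedComponent:
    name: str            # e.g., "mp3_audio", "jpeg_1", "timed_text" (hypothetical names)
    tool_id: str         # identifier of the (non-standardized) protection tool used
    required_right: str  # right that a license must grant for consumption

def consumable_components(components, granted_rights):
    # A trusted player may consume a protected component only if the user's
    # licenses grant the right that the component requires.
    return [c.name for c in components if c.required_right in granted_rights]

# Hypothetical "basic plan": the license grants playback of the song only.
components = [ProtectedComponent("mp3_audio", "toolA", "play-audio"),
              ProtectedComponent("jpeg_1", "toolA", "display-images"),
              ProtectedComponent("timed_text", "toolA", "display-text")]
print(consumable_components(components, {"play-audio"}))  # -> ['mp3_audio']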

THE FUTURE OF MSS AF
With the advance of handheld devices such as smartphones, netbooks, and tablets, the MSS AF can provide a file format specification for musical slide show applications using high-quality MP3 audio and high-resolution JPEG images for the slide show. Moreover, with the advance of Internet infrastructure that enables multimedia contents to be downloaded almost anywhere at remarkable speed, the MSS AF file format can be greatly simplified by not including the content components (audio, images, and texts). Such an MSS AF file would contain only the metadata, while the actual contents are accessed online, so that users' downloading and playback of the appropriate content can still be controlled by the protection and governance schemes.

RESOURCES

STANDARDS
■ Information technology—Multimedia application format (MPEG-A)—Part 4: Musical slide show application format, ISO/IEC 23000-4:2009 [Online]. Available: http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=50745
■ Information technology—Multimedia application format (MPEG-A)—Part 4: Musical slide show application format, Amendment 1: Conformance and reference software for musical slide show application format, ISO/IEC 23000-4:2009/Amd.1:2009 [Online]. Available: http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=51815


■ Information technology—Multimedia application format (MPEG-A)—Part 4: Musical slide show application format, Amendment 2: Conformance and reference software for protected musical slide show application format, ISO/IEC 23000-4:2009/Amd.2:2009 [Online]. Available: http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=52362

AUTHORS
Houari Sabirin ([email protected]) is a doctoral candidate in the Department of Information and Communications Engineering, Korea Advanced Institute of Science and Technology, in Daejeon. He was the coeditor of the MSS AF standard and several other parts of the MPEG-A MAF standards.

Munchurl Kim ([email protected]) is an associate professor in the Department of Electrical Engineering, Korea Advanced Institute of Science and Technology, in Daejeon. He has been active in MPEG as a coeditor of MPEG-7 standards and several parts of the MPEG-A MAF standards.

REFERENCE SOFTWARE AND CONFORMANCE FILES

(2011, Apr. 26). KAIST MCCB MPEG Workshop [Online]. Available: http://mpeg.kaist.ac.kr/mccb/?page_id=27

[SP]