Interactive Music Applications and Standards

Rebecca Stewart, Panos Kudumakis, and Mark Sandler
Queen Mary, University of London, London, UK
{rebecca.stewart,panos.kudumakis,mark.sandler}@eecs.qmul.ac.uk
http://www.elec.qmul.ac.uk/digitalmusic

Abstract. Music is now consumed in interactive applications that allow the user to directly influence the musical performance. These applications are distributed as games for gaming consoles and applications for mobile devices that currently use proprietary file formats, but standardization organizations have been working to develop an interchangeable format. This paper surveys the applications and their requirements. It then reviews the current standards that address these requirements, focusing on the MPEG Interactive Music Application Format. The paper closes by looking at additional standards that address similar applications and outlining the further requirements that need to be met.

Keywords: interactive music, standards.

1 Introduction

The advent of the Internet and the exploding popularity of file sharing web sites have challenged the music industry’s traditional supply model, which relied on the physical distribution of music recordings such as vinyl records, cassettes and CDs [5], [3]. In response, new interactive music services have emerged [1], [6], [7]. However, a standardized file format is required to provide interoperability between the various interactive music players and interactive music applications.

Video games and music consumption, once discrete markets, are now merging. Games for dedicated gaming consoles such as the Microsoft Xbox, Nintendo Wii and Sony PlayStation, and applications for smart phones using the Apple iPhone and Google Android platforms, are incorporating music creation and manipulation into applications which encourage users to purchase music. These games can even be centered around specific performers such as the Beatles [11] or T-Pain [14].

Many of these games follow a format inspired by karaoke. In its simplest case, audio processing for karaoke applications involves removing the lead vocals so that a live singer can perform with the backing tracks. This arrangement grew in complexity by including automatic lyric following as well. Karaoke performance used to be relegated to a setup involving a sound system with microphone and playback capabilities within a dedicated space such as a karaoke bar or living room, but it has found a revitalized market with mobile devices such as smart phones. Karaoke is no longer limited to a certain place or equipment, but can be performed with a group of friends with a gaming console in a home, or performed with a smart phone, recorded and uploaded online to share with others. A standard format is needed to allow the same musical content to be produced once and used with multiple applications.

We will look at the current commercial applications for interactive music and discuss what requirements need to be met. We will then look at three standards that address these requirements: the MPEG-A Interactive Music Application Format (IM AF), IEEE 1599 and interaction eXtensible Music Format (iXMF). We conclude by discussing what improvements still need to be made for these standards to meet the requirements of currently commercially available applications.

K. Jensen et al. (Eds.): CMMR 2010, LNCS 6684, pp. 20–30, 2011. © Springer-Verlag Berlin Heidelberg 2011

2 Applications

Karaoke-influenced video games have become popular as titles such as Guitar Hero and Rock Band have brought interactive music to a broad market [11]. The video games are centered around game controllers that emulate musical instruments such as the guitar and drum set. The players follow the music as they would in karaoke, but instead of following lyrics and singing, they follow colored symbols which indicate when to press the corresponding button. Rock Band also offers karaoke singing: backing tracks and lyrics are provided so that a player can sing along. Beyond traditional karaoke, real-time pitch tracking enhances the gameplay, as the player’s intonation and timing are scored.

The company Smule produces applications for the Apple iPhone, iPod Touch and iPad. One of its most popular applications for the platform is I Am T-Pain [14]. The application allows users to sing into their device and automatically processes their voice with the auto-tune effects that characterize the artist T-Pain’s vocals. The user can do this in a karaoke-style performance by purchasing and downloading files containing the backing music to a selection of T-Pain’s released tracks. The song’s lyrics then appear on the screen synchronized with the music as for karaoke, and the user’s voice is automatically processed with an auto-tune effect. The user can adjust the auto-tune settings to change the key and mode, or use a preset. The freestyle mode allows the user to record their voice with the auto-tuner but without music. All of the user’s performances can be recorded, uploaded online and easily shared on social networks.

Smule has built on the karaoke concept with the release of Glee Karaoke [13]. The application is branded by the US TV show Glee and features the music performed on the show. Like the I Am T-Pain application, Glee Karaoke lets users purchase and download music bundled with lyrics so that they can perform the vocal portion of the song themselves. Real-time pitch correction and automatic three-part harmony generation are available to enhance the performance. Users can also upload performances to share online, but unlike I Am T-Pain, Glee Karaoke lets users participate in a competitive game. As in the Guitar Hero and Rock Band games, users get points for completing songs and for singing on pitch.

3 Requirements

If the music industry continues to produce content for interactive music applications, a standard distribution format is needed. Content then will not need to be individually authored for each application. At the most basic level, a standard needs to make it possible to:

– separate tracks or groups of tracks;
– apply signal processing to those tracks or groups;
– mark up those tracks or stems to include time-based symbolic information.

Once tracks or groups of tracks are separated from the full mix of the song, additional processing or information can be included to enhance the interactivity with those tracks.

3.1 Symbolic Information

Karaoke-style applications involving singing require lyrical information as a bare minimum, and that information is expected to be time-aligned with the audio content. As seen in Rock Band, I Am T-Pain and Glee Karaoke, additional information regarding the correct pitch and timing is also needed. A standard for interactive music applications also needs to accommodate multiple parallel sequences of notes. This is especially important for multiple player games like Rock Band, where each player has a different instrument and stream of timings and pitches.
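As a minimal sketch of the symbolic information described above, the following Python fragment models time-aligned lyrics and parallel note streams, one per player. The class names (LyricEvent, NoteEvent, SymbolicTrack), the notes_between helper and the example data are hypothetical and are not taken from any of the standards discussed in this paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class LyricEvent:
    time_s: float      # onset relative to the start of the audio, in seconds
    text: str          # syllable or word to display


@dataclass(frozen=True)
class NoteEvent:
    time_s: float      # onset in seconds
    duration_s: float  # length of the note in seconds
    midi_pitch: int    # target pitch as a MIDI note number


@dataclass
class SymbolicTrack:
    """One player's stream of time-stamped events (hypothetical structure)."""
    name: str
    lyrics: List[LyricEvent] = field(default_factory=list)
    notes: List[NoteEvent] = field(default_factory=list)

    def notes_between(self, start_s: float, end_s: float) -> List[NoteEvent]:
        """Return the notes whose onset falls in [start_s, end_s)."""
        return [n for n in self.notes if start_s <= n.time_s < end_s]


def build_example_song() -> Dict[str, SymbolicTrack]:
    """Build two parallel streams, e.g. for a singer and a guitarist."""
    vocals = SymbolicTrack(
        name="vocals",
        lyrics=[LyricEvent(0.0, "Hel"), LyricEvent(0.5, "lo")],
        notes=[NoteEvent(0.0, 0.5, 60), NoteEvent(0.5, 0.5, 62)],
    )
    guitar = SymbolicTrack(
        name="guitar",
        notes=[NoteEvent(0.0, 0.25, 40), NoteEvent(0.25, 0.25, 43),
               NoteEvent(0.75, 0.25, 45)],
    )
    return {t.name: t for t in (vocals, guitar)}
```

A scoring routine could then compare a player's detected pitch and timing against the NoteEvent entries returned by notes_between for the current playback window.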

3.2 Audio Signal Processing

The simplest interactive model of multiple tracks requires basic mixing capabilities so that those tracks can be combined to create a single mix of the song. A traditional karaoke mix could easily be created within this model by muting the vocal track, but this model could also be extended. Including audio effects as in I Am T-Pain and Glee Karaoke allows users to add musical content (such as their singing voice) to the mix and better emulate the original performance. Spatial audio signal processing is also required for more advanced applications. This could be as simple as panning a track between the left and right channels of a stereo song, but could grow in complexity when considering applications for gaming consoles. Many games allow for surround sound playback, usually over a 5.1 loudspeaker setup, so the optimal standard would allow for flexible loudspeaker configurations. Mobile applications could take advantage of headphone playback and use binaural audio to create an immersive 3D space.
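The basic mixing model above can be sketched in a few lines of Python. This is only an illustration: the function names (pan_gains, mix_to_stereo) are hypothetical, tracks are assumed to be lists of mono samples, and the constant-power pan law is one common choice rather than something prescribed by the paper or by any of the standards.

```python
import math
from typing import List, Sequence, Tuple


def pan_gains(pan: float) -> Tuple[float, float]:
    """Constant-power pan law: pan = -1 is hard left, +1 is hard right."""
    angle = (pan + 1.0) * math.pi / 4.0   # maps [-1, 1] to [0, pi/2]
    return math.cos(angle), math.sin(angle)


def mix_to_stereo(
    tracks: Sequence[Sequence[float]],
    gains: Sequence[float],
    pans: Sequence[float],
) -> List[Tuple[float, float]]:
    """Sum mono tracks into stereo frames; a gain of 0.0 mutes a track."""
    if not (len(tracks) == len(gains) == len(pans)):
        raise ValueError("tracks, gains and pans must have the same length")
    n_frames = max((len(t) for t in tracks), default=0)
    out = [[0.0, 0.0] for _ in range(n_frames)]
    for samples, gain, pan in zip(tracks, gains, pans):
        left, right = pan_gains(pan)
        for i, s in enumerate(samples):
            out[i][0] += s * gain * left
            out[i][1] += s * gain * right
    return [(l, r) for l, r in out]


# Karaoke-style mix: mute the vocals (gain 0.0) and keep the backing tracks.
vocals = [0.5, 0.5, 0.5]
guitar = [0.2, -0.2, 0.2]
drums = [0.1, 0.1, -0.1]
karaoke = mix_to_stereo([vocals, guitar, drums],
                        gains=[0.0, 1.0, 1.0],
                        pans=[0.0, -1.0, 1.0])
```

Surround or binaural rendering would replace the two-channel pan_gains with a per-loudspeaker gain vector or a pair of head-related filters; the summation structure stays the same.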

4 MPEG-A IM AF

The MPEG-A Interactive Music Application Format (IM AF) standard structures the playback of songs that have multiple, unmixed audio tracks [8], [9], [10].


IM AF creates a container for the tracks, the associated metadata and symbolic data while also managing how the audio tracks are played. Creating an IM AF file involves formatting different types of media data, especially multiple audio tracks with interactivity data, and storing them in an ISO Base Media File Format file. An IM AF file is composed of:

– Multiple audio tracks representing the music (e.g. instruments and/or voices).
– Groups of audio tracks: a hierarchical structure of audio tracks (e.g. all guitars of a song can be gathered in the same group).
– Preset data: pre-defined mixing information on multiple audio tracks (e.g. karaoke and rhythmic version).
– User mixing data and interactivity rules: information related to user interaction (e.g. track/group selection, volume control).
– Metadata used to describe a song, music album, artist, etc.
– Additional media data that can be used to enrich the user’s interaction space (e.g. timed text synchronized with audio tracks which can represent the lyrics of a song, images related to the song, music album, artist, etc.).

4.1 Mixes

The multiple audio tracks are combined to produce a mix. The mix is defined by the playback level of tracks and may be determined by the music content creator or by the end-user. An interactive music player utilizing IM AF could allow users to re-mix music tracks by enabling them to select the number of instruments to be listened to and adjust the volume of individual tracks to their particular taste. IM AF also enables users to publish and exchange this re-mixing data, so that other users with IM AF players can experience their re-mixes. Preset mixes of tracks could also be available. In particular, IM AF supports two possible mix modes for interaction and playback: preset-mix mode and user-mix mode.

In the preset-mix mode, the user selects one preset among the presets stored in IM AF, and then the audio tracks are mixed using the preset parameters associated with the selected preset. Some preset examples are:

– General preset: composed of multiple audio tracks by the music producer.
– Karaoke preset: composed of multiple audio tracks except vocal tracks.
– A cappella preset: composed of vocal and chorus tracks.

Figure 1 shows an MPEG-A IM AF player.

Fig. 1. An interactive music application. The player on the left shows the song being played in a preset mix mode and the player on the right shows the user mix mode.

In user-mix mode, the user selects/deselects the audio tracks/groups and controls the volume of each of them. Thus, in user-mix mode, audio tracks are mixed according to the user’s control and taste; however, the mixes should comply with the interactivity rules stored in the IM AF file. User interaction should conform to certain rules defined by the music composers so that it fits their artistic creation. However, defining rules is optional and up to the music composer; rules are not imposed by the IM AF format. In general there are two categories of rules in IM AF: selection and mixing rules. The selection rules relate to the selection of the audio tracks and groups at rendering time, whereas the mixing rules relate to the audio mixing. Note that the interactivity rules allow the music producer to define the amount of freedom available in IM AF user mixes. The interactivity rules analyser in the player verifies whether the user interaction conforms to the music producer’s rules. Figure 2 depicts in a block diagram the logic for both the preset-mix and the user-mix usage modes.

Fig. 2. Logic for interactivity rules and mixes within IM AF

IM AF supports four types of selection rules, as follows:

– Min/max rule: specifies both the minimum and maximum number of tracks/groups of the group that might be in the active state.
– Exclusion rule: specifies that several tracks/groups of a song will never be in the active state at the same time.
– Not mute rule: defines a track/group that is always in the active state.
– Implication rule: specifies that the activation of a track/group implies the activation of another track/group.

IM AF also supports four types of mixing rules, as follows:

– Limits rule: specifies the minimum and maximum limits of the relative volume of each track/group.
– Equivalence rule: specifies an equivalence volume relationship between tracks/groups.
– Upper rule: specifies a superiority volume relationship between tracks/groups.
– Lower rule: specifies an inferiority volume relationship between tracks/groups.

Backwards compatibility with legacy non-interactive players is also supported by IM AF. For legacy music players or devices that are not capable of simultaneously decoding the multiple audio tracks, a special audio track stored in the IM AF file can still be played.
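The role of the interactivity rules analyser can be illustrated with a short Python sketch that checks a user mix against the four selection rules and four mixing rules listed above. Everything here is an assumption made for illustration: the class and function names (SelectionRules, MixingRules, check_mix), the in-memory layout, and the reading of the exclusion rule as "at most one of the listed tracks is active" and of the upper/lower rules as non-strict inequalities. It does not reflect the binary syntax defined in ISO/IEC 23000-12.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class SelectionRules:
    min_active: int = 0
    max_active: int = 10**6
    exclusion: List[Set[str]] = field(default_factory=list)   # assumed: at most one active per set
    not_mute: Set[str] = field(default_factory=set)           # tracks that must stay active
    implication: List[Tuple[str, str]] = field(default_factory=list)  # (a, b): a active implies b active


@dataclass
class MixingRules:
    limits: Dict[str, Tuple[float, float]] = field(default_factory=dict)  # track -> (min, max) volume
    equivalence: List[Tuple[str, str]] = field(default_factory=list)      # volume(a) == volume(b)
    upper: List[Tuple[str, str]] = field(default_factory=list)            # assumed: volume(a) >= volume(b)
    lower: List[Tuple[str, str]] = field(default_factory=list)            # assumed: volume(a) <= volume(b)


def check_mix(active: Set[str], volumes: Dict[str, float],
              sel: SelectionRules, mix: MixingRules) -> List[str]:
    """Return human-readable rule violations (empty list if the mix is allowed)."""
    errors: List[str] = []
    if not sel.min_active <= len(active) <= sel.max_active:
        errors.append(f"min/max: {len(active)} active tracks")
    for group in sel.exclusion:
        if len(group & active) > 1:
            errors.append(f"exclusion: {sorted(group & active)}")
    for track in sorted(sel.not_mute - active):
        errors.append(f"not mute: {track}")
    for a, b in sel.implication:
        if a in active and b not in active:
            errors.append(f"implication: {a} requires {b}")
    for track, (lo, hi) in mix.limits.items():
        if not lo <= volumes[track] <= hi:
            errors.append(f"limits: {track}")
    for a, b in mix.equivalence:
        if not math.isclose(volumes[a], volumes[b]):
            errors.append(f"equivalence: {a}, {b}")
    for a, b in mix.upper:
        if volumes[a] < volumes[b]:
            errors.append(f"upper: {a}, {b}")
    for a, b in mix.lower:
        if volumes[a] > volumes[b]:
            errors.append(f"lower: {a}, {b}")
    return errors


# Hypothetical song: the producer requires drums and ties the chorus to the vocals.
sel = SelectionRules(min_active=1, max_active=3,
                     exclusion=[{"vocals", "guide_vocals"}],
                     not_mute={"drums"},
                     implication=[("chorus", "vocals")])
mix = MixingRules(limits={"drums": (0.2, 1.0)}, upper=[("vocals", "chorus")])
volumes = {"vocals": 0.8, "guide_vocals": 0.0, "chorus": 0.5, "drums": 0.6}

karaoke_preset = {"drums", "chorus"}   # vocals muted
violations = check_mix(karaoke_preset, volumes, sel, mix)
```

In this example the karaoke-style selection is rejected because the chorus is active without the vocals, which shows how a producer's rules can constrain which mixes a user may create.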

4.2 File Structure

The file formats accepted within an IM AF file are described in Table 1. IM AF holds files describing images associated with the audio such as an album cover, timed text for lyrics, other metadata allowed in MPEG-7 and the audio content. IM AF also supports a number of brands according to application domain. These depend on the processing power of the device (e.g. mobile phone, laptop computer and high fidelity devices), which consequently defines the maximum number of audio tracks that can be decoded simultaneously in an IM AF player running on a particular device. IM AF brands are summarized in Table 2. In all IM AF brands, the associated data and metadata are supported.

The IM AF file format structure is derived from the ISO Base Media File Format standard. As such it facilitates interchange, management, editing and presentation of different types of media data and their associated metadata in a flexible and extensible way. The object-oriented nature of the ISO Base Media File Format, inherited in IM AF, enables simplicity in the file structure in terms of objects that have their own names, sizes and defined specifications according to their purpose.

Table 1. The file formats accepted within an IM AF file

Type         Component Name                          Specification
File Format  ISO Base Media File Format (ISO-BMFF)   ISO/IEC 14496-12:2008
Audio        MPEG-4 Audio AAC Profile                ISO/IEC 14496-3:2005
             MPEG-D SAOC                             ISO/IEC 23003-2:2010
             MPEG-1 Audio Layer III (MP3)            ISO/IEC 11172-3:1993
             PCM                                     -
Image        JPEG                                    ISO/IEC 10918-1:1994
Text         3GPP Timed Text                         3GPP TS 26.245:2004
Metadata     MPEG-7 MDS                              ISO/IEC 15938-5:2003

Table 2. Brands supported by IM AF. For im04 and im12, simultaneously decoded audio tracks consist of tracks related to SAOC, which are a downmix signal and SAOC bitstream. The downmix signal should be encoded using AAC or MP3. For all brands, the maximum channel number of each track is restricted to 2 (stereo).

Brand  AAC  MP3  SAOC  PCM  Max No. Tracks  Max Freq./bits  Profile/Level    Application
im01   X    X               4               48 kHz/16 bits  AAC/Level 2      Mobile
im02   X    X               6               48 kHz/16 bits  AAC/Level 2      Mobile
im03   X    X               8               48 kHz/16 bits  AAC/Level 2      Mobile
im04   X    X    X          2               48 kHz/16 bits  SAOC Baseline/2  Mobile
im11   X    X          X    16              48 kHz/16 bits  AAC/Level 2      Normal
im12   X         X          2               48 kHz/16 bits  SAOC Baseline/3  Normal
im21   X    X          X    32              96 kHz/24 bits  AAC/Level 5      High-end

Figure 3 illustrates the IM AF file format structure. It mainly consists of ftyp, moov and mdat type information objects/boxes. The ftyp box contains information on file type and compatibility. The moov box describes the presentation of the scene and usually includes more than one trak box. A trak box contains the presentation description for a specific media type. The media type in each trak box could be audio, image or text. The trak box supports time information for synchronization with media described in other trak boxes. The mdat box contains the media data described in the trak boxes. Instead of a native system file path, a trak box may include a URL to locate the media data. In this way the mdat box maintains a compact representation, enabling efficient exchange and sharing of IM AF files. Furthermore, some specific information is also included in the moov box: the group container box grco, the preset container box prco and the rules container box ruco, for storing group, preset and rules information, respectively. The grco box contains zero or more group boxes, designated as grup, describing the group hierarchy structure of audio tracks and/or groups. The prco box contains one or more prst boxes which describe the predefined mixing information in the absence of user interaction. The ruco box contains zero or more selection rules boxes rusc and/or mixing rules boxes rumx describing the interactivity rules related to selection and/or mixing of audio tracks.
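Because IM AF inherits the box structure of the ISO Base Media File Format, the ftyp/moov/mdat layout described above can be explored with a generic box walker. The following Python sketch assumes the usual ISO-BMFF box header (a 32-bit big-endian size followed by a four-character type, with size 1 signalling a 64-bit size and size 0 meaning "to the end of the enclosing span"). The function names iter_boxes and make_box are hypothetical, and the demo bytes are synthetic, with empty placeholder boxes; they are not a valid IM AF file.

```python
import struct
from typing import Iterator, Optional, Tuple


def iter_boxes(data: bytes, offset: int = 0,
               end: Optional[int] = None) -> Iterator[Tuple[str, int, int]]:
    """Yield (type, payload_start, payload_end) for each box in data[offset:end]."""
    end = len(data) if end is None else end
    while offset + 8 <= end:
        size, box_type = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:                      # 64-bit size follows the type field
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header = 16
        elif size == 0:                    # box runs to the end of the enclosing span
            size = end - offset
        if size < header or offset + size > end:
            raise ValueError(f"malformed box at offset {offset}")
        yield box_type.decode("ascii"), offset + header, offset + size
        offset += size


def make_box(box_type: str, payload: bytes = b"") -> bytes:
    """Serialise a box with a 32-bit size field (helper for the demo only)."""
    return struct.pack(">I4s", 8 + len(payload), box_type.encode("ascii")) + payload


# Synthetic demo data: ftyp carries a major brand, a minor version and one compatible brand.
ftyp = make_box("ftyp", b"im01" + struct.pack(">I", 0) + b"im01")
moov = make_box("moov", make_box("trak") + make_box("trak")
                + make_box("grco") + make_box("prco") + make_box("ruco"))
mdat = make_box("mdat", b"\x00" * 16)
demo = ftyp + moov + mdat

top_level = [t for t, _, _ in iter_boxes(demo)]
moov_start, moov_end = next((s, e) for t, s, e in iter_boxes(demo) if t == "moov")
children = [t for t, _, _ in iter_boxes(demo, moov_start, moov_end)]
```

Calling iter_boxes again on the payload range of a container box, as done for moov here, is how a player would descend into the grco, prco and ruco containers.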

Fig. 3. IM AF file format

5 Related Formats

While IM AF packages together the relevant metadata and content that an interactive music application would require, other file formats have also been developed as a means to organize and describe synchronized streams of information for different applications. The two that will be briefly reviewed here are IEEE 1599 [12] and iXMF [4].

5.1 IEEE 1599

IEEE 1599 is an XML-based format for synchronizing multiple streams of symbolic and non-symbolic data validated against a document type definition (DTD). It was proposed to IEEE Standards in 2001 and was previously referred to as MX (Musical Application Using XML). The standard emphasizes the readability of symbols by both humans and machines, hence the decision to represent all information that is not audio or video sample data within XML. The standard is developed primarily for applications that provide additional information surrounding a piece of music. Example applications include being able to easily navigate between a score, multiple recordings of performances of that score and images of the performers in the recordings [2].

The format consists of six layers that communicate with each other, but there can be multiple instances of the same layer type. Figure 4 illustrates how the layers interact. The layers are referred to as:

– General: holds metadata relevant to the entire document.
– Logic: logical description of score symbols.
– Structural: description of musical objects and their relationships.
– Notational: graphical representation of the score.
– Performance: computer-based descriptions of a musical performance.
– Audio: digital audio recording.

Fig. 4. The layers in IEEE 1599

5.2 iXMF

Another file format that performs a similar task, with a particular focus on video games, is iXMF (interaction eXtensible Music Format) [4]. The iXMF standard is targeted at interactive audio within game development. XMF is a meta file format that bundles multiple files together, and iXMF uses this same meta file format as its structure.


iXMF uses a structure in which a moment in time can trigger an event. The triggered event can encompass a wide array of activities such as the playing of an audio file or the execution of specific code. The overall structure is described in [4] as:

– An iXMF file is a collection of Cues.
– A Cue is a collection of Media Chunks and Scripts.
– A Media Chunk is a contiguous region in a playable media file.
– A Script is rules describing how a Media Chunk is played.

The format allows for both audio and symbolic information such as MIDI to be included. The Scripts then allow for real-time adaptive audio effects. iXMF has been developed to create interactive soundtracks for video game environments, so the audio can be generated in real-time based on a user’s actions and other external factors. There are a number of standard Scripts that perform basic tasks such as starting or stopping a Cue, but this set of Scripts can also be extended.
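The Cue / Media Chunk / Script hierarchy can be pictured with a small Python model. This is a conceptual sketch only: the class names mirror the terms used above, but the fields, the modelling of a Script as a Python function, the game-state dictionary and the file names are all hypothetical and do not represent the iXMF file layout or its scripting facilities.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class MediaChunk:
    """A contiguous region in a playable media file."""
    media_file: str
    start_s: float
    end_s: float


# A Script is modelled here as a function that, given the current game state,
# decides which of a Cue's Media Chunks are played.
Script = Callable[[Dict[str, float], List[MediaChunk]], List[MediaChunk]]


@dataclass
class Cue:
    """A collection of Media Chunks and Scripts."""
    name: str
    chunks: List[MediaChunk] = field(default_factory=list)
    scripts: List[Script] = field(default_factory=list)

    def trigger(self, state: Dict[str, float]) -> List[MediaChunk]:
        """Apply each Script in turn and return the chunks to play."""
        playlist = list(self.chunks)
        for script in self.scripts:
            playlist = script(state, playlist)
        return playlist


def pick_by_intensity(state: Dict[str, float],
                      chunks: List[MediaChunk]) -> List[MediaChunk]:
    """Example rule: keep only the chunk that matches the current intensity."""
    wanted = "battle.wav" if state.get("danger", 0.0) > 0.5 else "calm.wav"
    return [c for c in chunks if c.media_file == wanted]


cue = Cue("level_music",
          chunks=[MediaChunk("calm.wav", 0.0, 8.0),
                  MediaChunk("battle.wav", 0.0, 4.0)],
          scripts=[pick_by_intensity])
ixmf_file = [cue]   # an iXMF file is a collection of Cues
```

Triggering the same Cue with different game states yields different playlists, which is the adaptive behaviour the Scripts are meant to provide.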

6 Discussion

Current commercial applications built around interactive music require real-time playback and interaction with multiple audio tracks. Additionally, symbolic information, including text, is needed to accommodate the new karaoke-like games such as Guitar Hero. The IM AF standard fulfils most of the requirements, but not all. In particular, it lacks the ability to include symbolic information like MIDI note and instrument data. IEEE 1599 and iXMF can both accommodate MIDI data, though they lack some of the advantages of IM AF such as direct integration with a number of MPEG formats.

One of the strengths of iXMF is its Scripts, which can define time-varying audio effects. These kinds of effects are needed for applications such as I Am T-Pain and Glee Karaoke. IM AF is beginning to consider integrating effects such as equalization, but greater flexibility will be needed so that content creators can create and manipulate their own audio signal processing algorithms. The consumer will also need to be able to manually adjust the audio effects applied to the audio in order to build applications like the MXP4 Studio [7] with IM AF.

As interactive music applications may be used in a variety of settings, from dedicated gaming consoles to smart phones, any spatialization of the audio needs to be flexible and automatically adjust to the most appropriate format. This could range from stereo speakers to surround sound systems or binaural audio over headphones. IM AF is beginning to support SAOC (Spatial Audio Object Coding), which addresses this very problem and differentiates it from similar standards.

While there are a number of standard file formats that have been developed in parallel to address slightly differing application areas within interactive music, IM AF is increasingly the best choice for karaoke-style games. There are still underdeveloped or missing features, but by identifying the best practice put forth in similar standards, IM AF can become an interchangeable file format for creators to distribute their music to multiple applications. The question then remains: will the music industry embrace IM AF, enabling interoperability of interactive music services and applications for the benefit of end users, or will it try to lock them down in proprietary standards for the benefit of a few oligopolies?

Acknowledgments. This work was supported by UK EPSRC Grants: Platform Grant (EP/E045235/1) and Follow On Fund (EP/H008160/1).

References

1. Audizen, http://www.audizen.com (last viewed February 2011)
2. Ludovico, L.A.: The new standard IEEE 1599, introduction and examples. J. Multimedia 4(1), 3–8 (2009)
3. Goel, S., Miesing, P., Chandra, U.: The Impact of Illegal Peer-to-Peer File Sharing on the Media Industry. California Management Review 52(3) (Spring 2010)
4. IASIG Interactive XMF Workgroup: Interactive XMF specification: file format specification. Draft 0.9.1a (2008), http://www.iasig.org/pubs/ixmf_draft-v091a.pdf
5. IFPI Digital Music Report 2009: New Business Models for a Changing Environment. International Federation of the Phonographic Industry (January 2009)
6. iKlax Media, http://www.iklaxmusic.com (last viewed February 2011)
7. Interactive Music Studio by MXP4, Inc., http://www.mxp4.com/interactive-music-studio (last viewed February 2011)
8. ISO/IEC 23000-12, Information technology – Multimedia application format (MPEG-A) – Part 12: Interactive music application format (2010), http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=53644
9. ISO/IEC 23000-12/FDAM 1 IM AF Conformance and Reference Software, N11746, 95th MPEG Meeting, Daegu, S. Korea (2011)
10. Kudumakis, P., Jang, I., Sandler, M.: A new interactive MPEG format for the music industry. In: 7th Int. Symposium on Computer Music Modeling and Retrieval (CMMR 2010), Málaga, Spain (2010)
11. Kushner, D.: The Making of the Beatles: Rock Band. IEEE Spectrum 46(9), 30–35 (2009)
12. Ludovico, L.A.: IEEE 1599: a multi-layer approach to music description. J. Multimedia 4(1), 9–14 (2009)
13. Smule, Inc.: Glee Karaoke iPhone Application, http://glee.smule.com/ (last viewed February 2011)
14. Smule, Inc.: I Am T-Pain iPhone Application, http://iamtpain.smule.com/ (last viewed February 2011)