Audio Engineering Society
Convention Paper 8907

Presented at the 134th Convention, 2013 May 4–7, Rome, Italy

This Convention paper was selected based on a submitted abstract and 750-word précis that have been peer reviewed by at least two qualified anonymous reviewers. The complete manuscript was not peer reviewed. This convention paper has been reproduced from the author's advance manuscript without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.
Layout remapping tool for multichannel audio productions

Tim Schmele¹, David García-Garzón², Umut Sayin¹, Davide Scaini¹,² and Daniel Arteaga¹,²

¹ Fundació Barcelona Media. Diagonal 177, 08018 Barcelona (Spain)
² Department of Information and Communication Technologies, Universitat Pompeu Fabra. Tànger 122-140, 08018 Barcelona (Spain)

Correspondence should be addressed to Daniel Arteaga ([email protected])

ABSTRACT

Several multichannel audio formats are present in the recording industry, with reduced interoperability amongst them. This diversity of formats leaves the end user with limited access to content and/or audience. In addition, the preservation of recordings that are made for a particular format comes under threat should the format become obsolete. To tackle such issues, we present a layout-to-layout conversion tool which allows converting recordings designated for one particular layout to any other layout. This is done by decoding the existing recording to a layout-independent equivalent and then encoding it to the desired layout through different rendering methods. The tool has proven useful according to expert opinions. Simulations show that after several consecutive conversions the results exhibit a decrease in spatial accuracy and an increase in overall gain. This suggests that consecutive conversions should be avoided and only a single conversion from the originally rendered material should be made.
1. INTRODUCTION, GOALS AND MOTIVATION
Many audio recordings these days are presented in layout-dependent formats, where the final product is designated to be reproduced on a given number of channels and a given speaker layout. These formats are used frequently in well-known systems such as 5.1 or 7.1, as well as lesser-known elevated evolutions thereof, such as 11.1 [1] or 22.2 [2]. Layout-independent formats, such as Ambisonics or object-based methods, on the contrary, should be apt for any playback system, but they occupy a very small place in the market and are not well known in the recording industry. Even in these cases the initial layout-
independent format is often missing or not easily available, and only the final multichannel mix-down remains. This reduces the interoperability of existing systems and limits the diversity of recorded material that is available to the end user. In addition, with fast-moving tendencies, the lack of layout manipulation tools renders several systems obsolete within short periods of time and makes it impossible to transfer previously produced material to newer technologies.

Here, a simple layout-to-layout conversion tool is presented to increase the compatibility between different existing layouts, thus increasing the flexibility of content for the end user. This tool not only facilitates the conversion to well-known layouts but also creates the possibility of experimenting with new layouts to come. This increases the availability of recorded material to more end users and eases the task of preserving previously recorded multichannel content.

The tool itself was created for quick and easy conversion of in-house material from old speaker layouts to the current one. Despite being a naïve approach, it has proven very useful and maintains the spatial impression of the original audible image quite well, provided consecutive conversions are avoided, as is shown in Figure 1. Compared to other solutions in the literature (Section 2), the solution presented here is simple, general, practical, free from artifacts, and mostly focused on the applications of 3D audio production for special events and music, imposing no restrictions on either the source or the target speaker layout design.

Additionally, this tool can also be used to decode a multichannel audio format to a layout which does not exactly conform to the standard layout for that format (e.g., because of space constraints in a given room), including decoding of higher order Ambisonics to hard-to-decode irregular layouts.
Similarly, this tool can also be used to convert any multichannel content into a binaurally encoded recording by simulating the targeted speaker layout over headphones.

This paper is structured as follows. In Section 2 we briefly review the state of the art. In Section 3 the method is explained, along with its implementation. The tool's ability to maintain the original spatial image over consecutive conversions is evaluated in Section
4. Finally, we summarize and present conclusions in Section 5.

2. STATE OF THE ART

The conversion between speaker layouts has been an active field of research, especially since the emergence and popularization of multichannel formats such as 5.1. Gerzon [3] was one of the first to realize the usefulness of such a tool in multichannel audio productions. Both mono [4] and stereo up-mixing algorithms [5, 6] are well researched and in widespread use today. These methods, however, are motivated by different goals, as mono and stereo are naturally compatible with commercially available multichannel formats. Hence, they usually employ perceptually oriented techniques that extract, or invent, content that is then spread across the additional speakers to create a more immersive feeling.

Goodwin and Jot present in ref. [7] a surround format conversion method that first converts the input audio into either an intermediate mono or an oversampled and up-mixed multichannel version, which is then re-converted into the target layout. In both cases, the encoder appends time-frequency weighting coefficients to correctly distribute the audio over the target layout. The method is said to degrade the ambient parts of the content and is not artifact-free, due to the intermediate down-mix. The method is further improved in ref. [8] by introducing a principal component analysis to separate the primary, localized parts of the audio from the ambient content. It then treats the localization of primary components differently from that of ambient components, matching the former against the particle velocity vector and the latter against the energy vector.

A more general method is presented by Laborie et al. [9], in which the sound field produced by the original speaker layout is determined and encoded using the Fourier-Bessel expansion, similar to Ambisonics.
To correctly encode a given input scene, not only do the loudspeaker positions need to be known, but also the spatio-temporal response of the originally targeted loudspeakers. This poses problems when converting archived material, as such information is rarely available and would need to be guessed.
AES 134th Convention, Rome, Italy, 2013 May 4–7 Page 2 of 8
A solution similar to the one presented here is described in ref. [10], which also treats the original speakers' signals as virtual sound sources in the target speaker layout. It triangulates each virtual source between the three nearest target speakers available. The triangulation is not solved to preserve the sound energy, as is done in the well-known VBAP method [11], but instead solves the problem for the sound pressure. Hence, this method is expected to work better at low frequencies, which are known to be a weaker cue for sound localization. The tool presented in this paper makes use of the well-established VBAP panning method and is, therefore, optimized for high frequencies, as it solves the panning problem through the sound's energy.

Finally, let us mention that methods similar to the one presented here have been used to decode Ambisonics to irregular arrays [12, 13], essentially consisting of decoding to a virtual regular layout, which is then decoded to the real layout by means of VBAP.

3. METHOD AND IMPLEMENTATION

Similar to the methods described above, the algorithm is divided into two steps, in which the original content is first transformed into its layout-independent equivalent and then re-rendered into the target layout. The first step is done by treating each input channel as a virtual source. This object-based transformation places the input channels at their respective angular positions on the sphere. Hence, each input channel is treated as a point source with a static position, overlaying the original layout via point sources over the target layout. This intermediate layout-independent representation can then be converted into the target layout using any chosen rendering method. Currently vector base amplitude panning (VBAP) [11] is used, but other algorithms such as higher order Ambisonics (HOA) [14] could be used as well.
VBAP is chosen as the preferred method due to its high localization accuracy, using the fewest speakers necessary to render each virtual source in its desired position.
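As a concrete illustration of the two-stage process, the sketch below builds the remapping matrix with a deliberately naïve brute-force VBAP solver (energy-normalized gains over speaker triplets) and applies it frame by frame. This is our own simplified reconstruction under stated assumptions, not the tool's actual code; the cube layout and all function names are made up for the example:

```python
import numpy as np
from itertools import combinations

def vbap_gains(src, speakers):
    """Energy-normalized VBAP-style gains for unit vector `src` over the
    unit speaker vectors `speakers` (shape M x 3).  Brute-force search over
    all triplets; a real implementation would triangulate the layout."""
    src = np.asarray(src, float) / np.linalg.norm(src)
    speakers = np.asarray(speakers, float)
    for tri in combinations(range(len(speakers)), 3):
        L = speakers[list(tri)]                    # 3x3 speaker basis
        if abs(np.linalg.det(L)) < 1e-9:
            continue                               # degenerate triplet
        g = np.linalg.solve(L.T, src)              # src = g0*L0 + g1*L1 + g2*L2
        if np.all(g >= -1e-9):                     # source inside the triplet
            out = np.zeros(len(speakers))
            out[list(tri)] = np.clip(g, 0.0, None)
            return out / np.linalg.norm(out)       # enforce sum of g_i^2 == 1
    raise ValueError("source not enclosed by any speaker triplet")

def remap_matrix(in_dirs, out_dirs):
    """One VBAP gain column per input channel: the M x N conversion matrix."""
    return np.stack([vbap_gains(d, out_dirs) for d in in_dirs], axis=1)

def remap(audio, G, frame_len=4096):
    """Apply the M x N matrix frame by frame to (samples, N) audio."""
    out = np.empty((audio.shape[0], G.shape[0]))
    for s in range(0, audio.shape[0], frame_len):
        frame = audio[s:s + frame_len]
        out[s:s + frame.shape[0]] = frame @ G.T
    return out

# Toy example: remap a 2-channel signal whose channels sit exactly on two
# corners of a hypothetical cube layout (8 corners, unit vectors)
cube = np.array([[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)],
                dtype=float) / np.sqrt(3)
G = remap_matrix(cube[:2], cube)
audio = np.random.default_rng(0).standard_normal((8, 2))
converted = remap(audio, G, frame_len=3)
```

Since both inputs coincide with output speakers, each input maps to a single output channel with gain 1; for a source between speakers, the gains spread over a triplet while keeping the sum of squared gains equal to 1.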
Binaural decoding is also included as an output layout, so that layout-to-binaural conversion is also possible, by convolving the layout-independent intermediate recording with head-related transfer functions (HRTFs). Each original source channel is, again, treated as a virtual source that is binaurally placed at its location in the original speaker layout. This is equivalent to simulating the speaker display of the originally proposed concert situation over headphones.

The tool is implemented as a command line program in Python [15]. It relies on existing libraries such as NumPy [16] and libsndfile [17]. The widespread use of the language and the libraries across many platforms makes the tool quite portable. Internally, the script works in the two-stage process mentioned above: first the original layout is converted to an internal object-based representation, where each original channel becomes a point source, which is later rendered with VBAP on the new layout. The overall result of the process is an N-input by M-output matrix (with N and M being respectively the number of loudspeakers in the initial and final layouts), which is then applied frame by frame to the original multichannel audio.

As mentioned above, designing the tool as a two-stage process makes it easy to replace VBAP with some other rendering method. In particular, binaural conversion is also provided. Each virtual point source is convolved with the closest HRTF measurement. This is done using the CLAM [18] spatialization plugin [19], which provides the ability to choose one of the many public HRTF databases available [20–22].

4. TESTING AND EVALUATION

The layout-to-layout conversion has been successful in translating any given format to the desired equivalent. We have tested the tool with standard layouts such as 5.1, 7.1 and 10.1, and with non-standard layouts such as 23.1 and 16.1, which are converted successfully to one another and also to binaural recordings.
The conversion times were significantly shorter than the actual recording durations. Subjectively, the tool has been used to convert many in-house demos, of which only bounces for older speaker layouts exist, to the current one, which is an
AES 134th Convention, Rome, Italy, 2013 May 4–7 Page 3 of 8
Schmele et al.
Layout remapping tool for multichannel audio productions
Fig. 1: Trend of the energy vector magnitude r_E of an initially focused source after a given number of conversion iterations. Results obtained by averaging over 1000 trials. The shaded area corresponds to the region between the 5% and 95% percentiles, and the continuous line corresponds to the median (50% percentile). Plot (b) shows plot (a) restricted to the first 5 conversions.

irregular, non-standard 23.1 layout. The conversion quality was rated high by many expert users, including the audio engineers that created the original content. Direct A/B comparisons were unfeasible, because only one fixed speaker layout was available to us at the time. Statistically significant user tests remain work to be carried out in the future.

However, it can be readily appreciated that the conversion process will incur some performance degradation. To assess the amount of degradation, quantitative tests were conducted by running an initial point source consecutively through several speaker layouts. These speaker layouts were randomly generated on the fly to keep the results as general as possible. In particular, we wanted to measure how well the spatial image is preserved by looking at the energy vector magnitude r_E [23], also named the Gerzon localization vector [8]:
r_E = \frac{|\vec{I}|}{E} = \frac{\left|\sum_i \vec{I}_i\right|}{\sum_i E_i} = \frac{\left|\sum_i g_i^2\, \hat{u}_i\right|}{\sum_i g_i^2}    (1)
where \vec{I} and E are respectively the total sound intensity and sound energy, \vec{I}_i and E_i are respectively the intensity and energy radiated from each loudspeaker,
g_i is the gain factor applied to each loudspeaker, and \hat{u}_i is the unit vector pointing in the direction of each loudspeaker. The above equation assumes an incoherent summation of the sound sources. The energy vector magnitude equals 1 for a point source (i.e., one coming from a single speaker) and 0 for a completely distributed source coming equally from all directions.

We generated random uniform test layouts implementing a modified, spherical version of the algorithm presented in ref. [24]. We generated triangulations for several arbitrary speaker numbers and selected a number at random at each conversion iteration. Inspired by past, present and future channel-agnostic standards in the industry, we chose to convert randomly between layouts of 14, 23 and 64 speakers.

Figure 1 shows the average trend of r_E over 1000 trials. One can observe that the curve falls quite steeply within the first 20 consecutive iterations and converges around r_E = 0.1. As a reference, first-order Ambisonics yields a value around r_E = 0.5, which is already considered unacceptable for good spatial fidelity. Nevertheless, when zooming in on Figure 1 [Figure 1(b)], one can observe that the algorithm stays above r_E = 0.8 for the first few iterations.
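Eq. (1) is straightforward to evaluate from a set of speaker gains. The short check below (our own illustration, using a hypothetical cube layout) reproduces the two limiting cases just mentioned, a point source (r_E = 1) and a fully distributed source (r_E = 0):

```python
import numpy as np

def energy_vector_magnitude(gains, speaker_dirs):
    """r_E = |sum_i g_i^2 u_i| / sum_i g_i^2 for unit vectors u_i, eq. (1)."""
    g2 = np.asarray(gains, float) ** 2
    u = np.asarray(speaker_dirs, float)
    u = u / np.linalg.norm(u, axis=1, keepdims=True)
    return np.linalg.norm(g2 @ u) / g2.sum()

# Hypothetical cube layout: 8 corners of a cube around the listener
cube = np.array([[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)],
                dtype=float)

point = np.zeros(8); point[0] = 1.0          # all energy from one speaker
spread = np.ones(8)                          # equal energy from all directions
print(energy_vector_magnitude(point, cube),  # 1.0: perfectly focused
      energy_vector_magnitude(spread, cube)) # 0.0: fully distributed
```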
Fig. 2: Trend of the angular source size (per number of conversions). The same considerations as in Figure 1 apply.

To have a clearer appreciation of the degradation process, the energy vector magnitude r_E can be related to the angular source diameter δ by the following equation:

\delta = 4 \arccos\left(\sqrt{r_E}\right)    (2)

Details on eq. (2) may be found in the appendix (Section 6). Figure 2 shows the same source spread trend of r_E converted to angles. One can see that many consecutive conversions converge towards nearly 300°, which can be considered fully spherical. Also, after 5 consecutive conversions the angular source size grows to 120°, a third of the full angular range. With just one conversion the source size grows up to 40°, which turns out to be quite acceptable in most cases.

The decline in spatial accuracy in Figure 1 is quickly explained by considering the underlying mechanisms of VBAP [11]. A virtual point source is triangulated between (at most) three nearby speakers. For the original point source this already means some localization degradation: from a theoretical value of r_E = 1 we obtain realistic values between 0.85 and 1, with 0.95 being a typical value. Iteratively, these three speakers are again considered to be point sources themselves, and are re-distributed over the three nearest speakers in the new layout, spreading the original
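Eq. (2) is simply the inverse of eq. (3) derived in the appendix; a quick numeric round-trip check (our own illustration):

```python
import math

def source_size_deg(r_e):
    """Angular source diameter (degrees) from the energy vector magnitude,
    eq. (2): delta = 4 * arccos(sqrt(r_E))."""
    return math.degrees(4.0 * math.acos(math.sqrt(r_e)))

print(source_size_deg(1.0))   # 0.0: a true point source has no angular spread

# Round-trip against eq. (3), r_E = cos^2(delta/4), for the sizes quoted above
for delta in (40.0, 120.0, 300.0):
    r_e = math.cos(math.radians(delta) / 4.0) ** 2
    assert abs(source_size_deg(r_e) - delta) < 1e-6
```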
audio signal over up to 9 speakers. As this process progresses, the audio is eventually distributed over the whole listening sphere.

Another, initially unexpected, effect that has been observed is an overall increase in the gain of the distributed signal. Figure 3(a) shows that the average increase in gain is more than 3 dB per consecutive conversion. This is an alarming trend and further shows that this method cannot be applied to the same content for more than a few consecutive iterations. Figure 3(b), though, depicts the slow start of the process, with an acceptable gain increase of about 1 dB after the first re-conversion, confirming that this method may well be applied at least once without a large decrease in quality.

We explain this effect, again, through the internal workings of VBAP, which is designed to work with multiple incoherent point sources under the incoherent addition hypothesis. For a single sound source VBAP maintains \sum_i g_i^2 = 1, so in the first conversion there is no gain increase. However, after the first conversion, when a given loudspeaker receives the same signal coming from two different point sources, the electrical addition of these signals at the loudspeaker is necessarily coherent, a fact which contradicts the incoherent addition hypothesis of VBAP and leads to gain increases. The same gain increase would happen for distributed sound sources.
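This coherent-addition mechanism is easy to reproduce numerically. In the toy case below (illustrative gains, not taken from the paper's simulations), a source split over two virtual speakers is re-panned onto the same two target speakers and picks up almost 3 dB:

```python
import numpy as np

# First conversion: one source split over two virtual speakers (sum g^2 = 1)
g1 = np.array([1.0, 1.0]) / np.sqrt(2.0)

# Second conversion: each virtual speaker re-panned onto the SAME two
# target speakers; each row again satisfies sum g^2 = 1
G2 = np.array([[0.8, 0.6],
               [0.6, 0.8]])

# The two contributions to each target speaker carry the same signal,
# so their electrical addition is coherent: gains add before squaring
target = G2.T @ g1
energy = float(np.sum(target ** 2))
gain_db = 10.0 * np.log10(energy)
print(energy, gain_db)     # energy ~ 1.96, i.e. about +2.9 dB
```

Under the incoherent hypothesis the energy would have stayed at 1; the coherent sum exceeds it whenever the two gain rows overlap.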
Fig. 3: Gain increase in dB. The same considerations as in Figures 1 and 2 apply. Notice that the first conversion does not involve any gain increase, as expected.

Finally, when dealing with de-correlated sources, comb filtering effects are unavoidable, due to the naïve summing of virtual sources in the target layout. Judging by the subjective listening described above, these comb filtering effects have never been perceivable in conversions of our in-house productions, even to expert ears. Although formal listening tests in controlled environments remain future work, so far this effect has not been perceived in complex, fully produced scenes.
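Although no audible comb filtering was reported, the underlying effect is simple to exhibit synthetically: summing a signal with a delayed, correlated copy produces periodic spectral notches. The sketch below (our own illustration; the 1 ms inter-path delay is an assumption for the example) shows the notch at 500 Hz:

```python
import numpy as np

fs = 48000
delay = 48                                  # 1 ms inter-path delay, in samples
n = 4096

# Impulse response of "direct path + delayed correlated copy"
h = np.zeros(n)
h[0] = 1.0
h[delay] += 1.0

H = np.abs(np.fft.rfft(h))                  # magnitude response of the comb
freqs = np.fft.rfftfreq(n, 1.0 / fs)

# First null at fs / (2 * delay) = 500 Hz, then at odd multiples thereof
null_bin = int(np.argmin(np.abs(freqs - 500.0)))
print(H.max(), H[null_bin])                 # peak near 2, deep notch near 500 Hz
```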
5. SUMMARY AND CONCLUSIONS

This paper describes a practical tool for 3D speaker layout-to-layout conversion, which works in two steps: first converting to a layout-independent format, and then decoding this intermediate format to the new layout, currently using VBAP (or binaural rendering for headphones). Even though the tool is simple, and perhaps naïve to an extent, it has proven very successful in single conversions of many in-house productions. Although it has limitations, it has been shown to be easy, fast and free of artifacts, as compared to more sophisticated tools. Moreover, it is versatile, as it is specifically designed to convert between fully three-dimensional, channel-agnostic speaker layouts. In that sense, it was tested with random speaker layouts whose speakers are distributed over a full sphere.

Two primary factors considerably limit the number of consecutive conversions: a widening of the sources and an overall increase of the sound pressure. In fact, tests show that these two performance penalties become quite important after only a few iterations. Additionally, although not tested, we argued that comb filtering effects will also occur in the presence of de-correlated sound sources.

We conclude that if an original channel-independent representation is available, it is better to mix down from that original representation and avoid the tool presented here. If only a multichannel bounce is available, the tool is very useful, but it should not be used consecutively. We recommend safeguarding the original bounce, from which new conversions can be made each time. We consider this a last-resort tool that is able to deliver good results.

When using a tool like this to decode Ambisonics [12, 13], all the aforementioned drawbacks will be manifest: widening of the sources, increase of the sound pressure level, and possible comb filtering effects, giving rise to a suboptimal decoding. In our view, in the case of irregular layouts it is better to attempt direct Ambisonics decoding using a non-linear search method (see ref. [25] for references
and details), and leave approaches similar to the one presented here as a last-resort method.

As a side remark, let us highlight the gain increase for VBAP found in this paper, which will happen not only in the context of the tool presented here, but whenever multiple correlated or extended sound sources are decoded with VBAP.

Finally, let us mention that it remains for future work to use another decoding method instead of VBAP, such as HOA, and to repeat the same testing done here.

6. APPENDIX: CORRESPONDENCE BETWEEN THE ENERGY VECTOR MAGNITUDE AND SOURCE SIZE

In this appendix we demonstrate eq. (2) by showing the correspondence between a sound source with a certain energy vector magnitude, eq. (1), and a homogeneous sound source of a certain angular diameter δ. To this end, let us consider an extended sound source distributed homogeneously on a spherical cap of angular radius δ/2. The different components of the distributed sound source are assumed to add incoherently with each other. For ease of calculation, and without any loss of generality, let us assume that this sound source is located around the zenith. If the energy density of the source is ε, the total energy will be:

E = \int_{-\pi}^{\pi} da \int_{\pi/2-\delta/2}^{\pi/2} de \, \cos e \; \varepsilon = 2\pi\varepsilon \left[1 - \cos(\delta/2)\right].

Regarding the intensity vector, the x and y components vanish because of symmetry considerations, and the z component is given by:

I_z = \int_{-\pi}^{\pi} da \int_{\pi/2-\delta/2}^{\pi/2} de \, \cos e \; \varepsilon \sin e = \pi\varepsilon \sin^2(\delta/2),

where we have taken into account that the z component of the sound intensity of a differential area element of the sound source, located at a certain elevation e, is dI_z = \varepsilon \sin e.

Therefore, according to eq. (1),

r_E = \frac{|\vec{I}|}{E} = \frac{I_z}{E} = \frac{1}{2} \, \frac{\sin^2(\delta/2)}{1 - \cos(\delta/2)},

which can be simplified to:

r_E = \frac{1 + \cos(\delta/2)}{2} = \cos^2\frac{\delta}{4}.    (3)

Inverting eq. (3) leads to eq. (2).

7. REFERENCES

[1] B. Claypool, W. V. Baelen, and B. V. Daele, "Auro 11.1 versus object-based sound in 3D," tech. rep., Barco.

[2] K. Hamasaki, T. Nishiguchi, R. Okumura, Y. Nakayama, and A. Ando, "A 22.2 multichannel sound system for ultrahigh-definition TV (UHDTV)," SMPTE Motion Imaging Journal, vol. 117, no. 3, pp. 40–49, 2008.

[3] M. A. Gerzon, "Compatibility of and conversion between multispeaker systems," in Audio Engineering Society Convention 93, 1992.

[4] M. Lagrange, L. G. Martins, and G. Tzanetakis, "Semi-automatic mono to stereo up-mixing using sound source formation," in Audio Engineering Society Convention 122, 2007.

[5] C. Avendano and J.-M. Jot, "Frequency domain techniques for stereo to multichannel upmix," Watermark, vol. 1, 2012.

[6] M. Bai and G.-Y. Shih, "Upmixing and downmixing two-channel stereo audio for consumer electronics," IEEE Transactions on Consumer Electronics, vol. 53, pp. 1011–1019, Aug. 2007.

[7] M. M. Goodwin and J.-M. Jot, "Multichannel surround format conversion and generalized upmix," in Audio Engineering Society Conference: 30th International Conference: Intelligent Audio Environments, 2007.

[8] M. Goodwin and J.-M. Jot, "Spatial audio scene coding," in Audio Engineering Society Convention 125, 2008.

[9] A. Laborie, R. Bruno, and S. Montoya, "Reproducing multichannel sound on any speaker layout," in Audio Engineering Society Convention 118, 2005.
[10] A. Ando, "Conversion of multichannel sound signal maintaining physical properties of sound in reproduced sound field," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 6, pp. 1467–1475, 2011.

[11] V. Pulkki, "Virtual sound source positioning using vector base amplitude panning," J. Audio Eng. Soc., vol. 45, no. 6, pp. 456–466, 1997.

[12] J.-M. Batke and F. Keiler, "Using VBAP-derived panning functions for 3D Ambisonics decoding," in 2nd Ambisonics Symposium, Paris, 2010.

[13] J. Boehm, "Decoding for 3-D," in Audio Engineering Society Convention 130, 2011.

[14] J. Daniel and S. Moreau, "Further study of sound field coding with higher order Ambisonics," in Audio Engineering Society Convention 116, 2004.

[15] G. van Rossum and F. L. Drake, Jr., An Introduction to Python. Network Theory, 2011.

[16] D. Ascher, P. F. Dubois, K. Hinsen, J. Hugunin, and T. Oliphant, Numerical Python. Lawrence Livermore National Laboratory, Livermore, CA, ucrl-ma-128569 ed., 1999.

[17] "libsndfile." http://www.mega-nerd.com/libsndfile/.

[18] "CLAM: C++ library for audio and music." http://clam-project.org/.

[19] P. Arumi, N. Olaiz, and T. Mateos, "Remastering of movie soundtracks into immersive 3D audio," in Proceedings of Blender Conference 2009, 2009.

[20] "Listen HRTF Database." http://recherche.ircam.fr/equipes/salles/listen/.

[21] W. Kreuzer and Z. Chen, "A fast multipole boundary element method for calculating HRTFs," in AES Convention, 2007.

[22] B. Gardner and K. Martin, "HRTF measurements of a KEMAR," J. Acoust. Soc. Am., vol. 97, no. 6, pp. 3907–3908, 1995.

[23] M. Gerzon, "General metatheory of auditory localisation," in Audio Engineering Society Convention 92, 1992.

[24] H. Vogel, "A better way to construct the sunflower head," Mathematical Biosciences, vol. 45, no. 3-4, pp. 179–189, 1979.

[25] D. Arteaga, "An Ambisonics decoder for irregular 3D loudspeaker arrays," in Audio Engineering Society Convention 134, 2013.