Next-Generation Audio Networking Engineering for Professional Applications
Abderrahmane SMIMITE 1,2
Ken CHEN 2
Azeddine BEGHDADI 2
1: Digital Media Solutions, 45, Grande Allée du 12 Février 1934, 77186, Noisiel, France
2: Université Paris 13, Sorbonne Paris Cité, Laboratoire de Traitement et Transport de l'Information (L2TI, EA 3043), F-93430, Villetaneuse, France
Keywords: Audio Transport, Audio networking architecture, Interoperability, Multichannel audio, Spatial sound, Synchronisation
Abstract: This paper presents an overview of the present and future of audio networking and its requirements, and focuses on a lightweight, generic architecture that we have tested on multichannel spatial sound streamed to multiple recipients in the context of a 3D multimedia environment, and which should fit all current and future requirements of professional audio applications.
I. Introduction
Audio transport is one of the most fundamental functional components of multimedia systems. Since the first technologies handling audio and video, their transport has been a major and sensitive concern, and it is becoming even more so with upcoming applications that bring newer and higher requirements, such as 3D immersive multimedia environments. Considering that almost all audio content is migrating to digital today, with a growing need for transmission over long distances, sound transport has leaned toward networked digital solutions.
The major advantage of digital audio networks over legacy analog point-to-point connections is flexibility: analog signal distribution requires a separate physical channel for every signal, whereas an audio signal transported over a network is available almost everywhere in the network, with no rerouting or re-plugging of cables. A further advantage of networks is their scalability: if more bandwidth is needed, the network can easily be extended by adding more switches and connections.
This paper brings to light the current state and the future of audio networking by presenting the evolution of its needs, with a focus on the engineering features that a transport technology has to provide in order to meet all the requirements. To that end, a generic architecture is suggested which should guarantee a high level of interoperability.
The layout of this paper is as follows: Section 2 gives background on the current and upcoming requirements for sound transmission, particularly multichannel audio streams. Section 3 gives an overview of existing technologies through a comparison of a Layer 2 and a Layer 3 solution.
In Section 4, a generic architecture for multichannel audio transport is presented through an application to a 3D multimedia environment, which should suit most current and foreseeable audio applications. Finally, Section 5 presents potential future developments in audio networking.

II. Requirements
Media network requirements are generally application-specific. This section does not describe requirements down to the last detail but rather gives a general scope, with attention to the common basic components and a focus on the new requirements of future applications. A professional audio network should basically support synchronous transport of a high number of channels, with all the timing constraints involved, as well as remote control and monitoring of the devices handling those streams. We list here the main features that, in our opinion, a network engineer should take into consideration when developing an audio network:

A. Number of channels
The most obvious example of an application requiring a high number of audio channels is public address: deployments of up to 10,000 nodes are found in this area, and the same goes for the mixing console and conferencing markets. Nowadays, new technologies offer an audio experience in which listeners can enjoy a fully immersive 3D audio scene. The most promising techniques are Higher-Order Ambisonics (HOA) and Wave Field Synthesis (WFS), the most advanced audio spatialisation techniques, but they require a high number of audio channels that sometimes reaches a few hundred (TU Berlin is equipped with an 832-channel WFS system) [1][4].

B. Timing & Synchronisation
Audio streams, like any sampled digital signals, are by nature time-dependent, as sampling is the first step of digitisation. The timing aspect is even more critical for multichannel audio streams, where multiple channels are time-correlated in order to form auditory images (3D audio landscapes). Timing constraints are less obvious but more fundamental in spatial sound reproduction. In order to preserve the auditory image quality (i.e. how well the impression of spatialisation is perceived), particularly when channels are separated and streamed to different, spatially distant clients such as loudspeakers, the transport mechanism must maintain the phase relationship between the channels. The synchronisation drift must not exceed a few microseconds in order not to be perceived [6]. More precisely, AES11 states that the outputs of all equipment in an isochronous system must lie within +/- 5% of the reference phase; for 48 kHz this represents about +/- 1 µs. Latency is also a well-recognised issue when dealing with a network. Optimally, delay should be as low as possible to ensure network transparency, and preferably deterministic. For broadcast and live performance, the maximum tolerable latency is around 2 ms, while for general consumer use up to 50 ms is acceptable.

C. Audio Quality & Bandwidth
An audio network interface should handle all the sampling rates used on the market: 44.1 and 48 kHz for general purposes, 88.2 and 96 kHz for studios, and even 192 kHz for probable future use. The same goes for bit depth: we should be able to manage audio signals coded on 16 bits as well as 24 or even 32 bits. Consequently, the bandwidth required for a single channel of uncompressed audio varies between 705.6 kbit/s and 6.144 Mbit/s, so it is only wise to consider the upper limit when building an audio transport technology intended to match all needs.

D. Reliability
Occasional loss or corruption of media packets, if not significant, can be tolerated in the general consumer market, whereas media networks involved in professional applications have high reliability requirements. Applications like conferencing and communication systems depend on the correct functioning of the networks carrying the audio-visual data. From a listener's point of view, subjective tests have revealed that audio packet loss begins to be bothersome when it exceeds 5% [7]. As mentioned in [1], professional media networks might occasionally be extended to life-safety applications (distress signals, for instance); higher constraints are then imposed and may even involve an additional security layer. The same goes for business conferencing, to ensure that such critical applications are safe from tampering.
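The per-channel figures quoted in subsection C follow directly from sample rate x bit depth; a quick sanity check in Python (illustrative only, not part of the described system):

```python
# Raw bit rate for one uncompressed PCM channel: sample_rate x bit_depth.
def channel_bitrate(sample_rate_hz: int, bit_depth: int) -> int:
    """Return the raw bit rate in bit/s for a single audio channel."""
    return sample_rate_hz * bit_depth

# Lower bound: general-purpose audio (44.1 kHz, 16 bit)
low = channel_bitrate(44_100, 16)    # 705,600 bit/s = 705.6 kbit/s
# Upper bound: future-proof audio (192 kHz, 32 bit)
high = channel_bitrate(192_000, 32)  # 6,144,000 bit/s = 6.144 Mbit/s

print(f"{low / 1e3:.1f} kbit/s .. {high / 1e6:.3f} Mbit/s")
```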
Fault tolerance is another aspect that a media network has to guarantee. The basic solution is usually redundancy, even if it is not the most cost-effective one: it doubles the physical resources and requires an additional mechanism to handle both streams, which may introduce additional delay. Depending on the application, redundancy can be avoided by using an audio-tailored Forward Error Correction (FEC) scheme, as suggested in [5], thereby maximising network performance.

E. Scalability & Manageability
Professional media networks evolve continuously and generally involve devices from different manufacturers. It should be possible to add new devices without compatibility issues, and the devices must at least be able to handle audio streams and exchange basic information. The main interest of a networked solution, as stated above, is the ability to route any signal from any device to any other. In addition, the network should allow device monitoring, advanced control capabilities and Quality of Service management.
Supporting various proprietary technologies is, and will remain, a substantial issue. A comprehensive media network should be able to interconnect and support devices with vendor-specific features, but this will not always be feasible, for the simple reason that manufacturers will naturally try to keep their secrets from being revealed.

III. Existing Technologies
When we speak of a layer-specific technology, we refer to the seven-layer OSI model. Besides Layer 1 (L1) proprietary audio transport solutions, many Layer 2 (L2) and Layer 3 (L3) solutions have been introduced to the market. We focus here on the latter two, with particular interest in AVB (L2) and RAVENNA (L3), as they represent the most advanced and open technologies of this class.

a. AVB
AVB stands for Audio Video Bridging, a technology for real-time-sensitive multimedia content. It is a set of approved IEEE standards developed by the 802.1 standards committee. The protocol stack is as follows (Figure 1):
Figure 1: Protocol Stack of an AVB endpoint
The AVB standard consists of the following components:
- IEEE 1722, a transport protocol that enables interoperable streaming by defining (1) media formats and encapsulations, (2) media synchronisation mechanisms and (3) multicast address assignment [8].
- IEEE 802.1AS, which ensures timing and synchronisation through a profile of IEEE 1588 (gPTP). An overview of PTP is given later in this paper.
- IEEE 802.1Qat, which defines the Stream Reservation Protocol (SRP), an enhancement to Ethernet implementing an admission protocol and a mechanism for end-to-end management of streams to guarantee Quality of Service.
- IEEE 802.1Qav for traffic shaping.
- IEEE 802.1BA for system specification.

b. RAVENNA
RAVENNA is an open solution based on the Internet Protocol standard and can thus run on most existing managed networks. As an IP-based solution, it relies on protocols at or above Layer 3. RTP (Real-time Transport Protocol), widely used in numerous time-sensitive applications, is the protocol used for streaming the media content. It is used jointly with RTCP (Real-time Transport Control Protocol), which provides statistics and control information for RTP flows. RAVENNA also includes the RTSP/SDP protocols for communication control and session management, and supports both DNS-SD and the ZeroConf mechanism for device configuration. It relies on PTPv2 (IEEE 1588-2008) for node synchronisation, and uses DiffServ as a QoS mechanism.
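Both AVB (through gPTP) and RAVENNA (through PTPv2) build on the IEEE 1588 clock offset estimation, which can be sketched from the four timestamps of the Sync/Delay_Req exchange. A simplified two-step model in Python, assuming a symmetric path (real PTP must also handle asymmetry and clock drift):

```python
# Two-step PTP offset/delay estimation from the four timestamps:
#   t1: Sync sent by master (master clock)
#   t2: Sync received by slave (slave clock)
#   t3: Delay_Req sent by slave (slave clock)
#   t4: Delay_Req received by master (master clock)
def ptp_estimate(t1: float, t2: float, t3: float, t4: float):
    """Return (offset, delay): slave clock offset from master, one-way path delay."""
    offset = ((t2 - t1) - (t4 - t3)) / 2.0
    delay = ((t2 - t1) + (t4 - t3)) / 2.0
    return offset, delay

# Example: slave clock runs 250 us ahead of master, symmetric 40 us path delay.
true_offset, path = 250e-6, 40e-6
t1 = 1.000000
t2 = t1 + path + true_offset         # slave timestamp of Sync arrival
t3 = t2 + 0.001                      # slave sends Delay_Req 1 ms later
t4 = (t3 - true_offset) + path       # master timestamp of Delay_Req arrival
offset, delay = ptp_estimate(t1, t2, t3, t4)
print(offset, delay)  # recovers the 250 us offset and 40 us delay
```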
c. Comparison
The main difference between RAVENNA and AVB is that they are respectively Layer 3 and Layer 2 solutions: RAVENNA describes an Audio-over-IP (AoIP) technology, while AVB defines an Audio-over-Ethernet (AoE) standard, independent of IP. The following table states the main differences; more details can be found in [3].

Feature                   | AVB                                                     | RAVENNA
Clock synchronisation     | IEEE 802.1AS gPTP (subset of IEEE 1588-2008)            | IEEE PTP v2 (IEEE 1588-2008)
Latency                   | Guaranteed: Class A: 2 ms, Class B: 50 ms (max. 7 hops) | Configurable and network-dependent (min. ~1 ms)
Media streaming           | IEEE 1722                                               | RTP
Fault tolerance           | Not covered                                             | Full redundancy over 2 network interfaces supported (not mandatory)
Configuration and control | IEEE 1722.1 (Plug & Play support)                       | Variable (static, ZeroConf, RTSP, ...)

The main issue with AVB is that it requires particular AVB-compliant switches, which makes it unusable on existing networks. It is also important to note that AVB is not yet a ready-to-use solution but rather a set of standards (a first and promising implementation of AVB has been introduced by XMOS recently, but it is still in a beta phase). RAVENNA, on the other hand, relies on DiffServ for QoS, which is not a bandwidth allocation scheme, so no guarantee can be given that streams will always have the bandwidth they need for uninterrupted streaming. Therefore an engineered network, thoroughly designed and maintained, is vital for a RAVENNA system. This is not the case for unmanaged consumer plug-and-play networks and might be bothersome for some professional applications (a capability handled by AVB) [3].

IV. Application to a 3D Immersive Environment
Depending on their specific application, users may choose between wired and wireless solutions. To date, wireless solutions do not meet all the technical specifications for professional purposes, particularly bandwidth and synchronisation. A first ample wireless solution based on IEEE 802.11 is presented in [5], but the bandwidth required for transmitting a high number of high-quality audio channels is still an open issue. That is why we chose to work on a wired solution for our application, also in view of the electromagnetic interference that may be encountered in some environments. Another aspect we kept in mind is a minimal-changes philosophy: we use minimalistic, lightweight protocols that are compliant with existing standards, so that timing constraints are easier to respect and interoperability with other systems is more feasible. Our application consists of one multichannel audio source and multiple sinks, as shown below:
Figure 2: Concept of our Application
Only the main aspects of the developed solution are presented in this section; more details will be available in a further paper.

a. Using Ethernet
We can simplify the task of building an audio network by designing it around one of the many existing communication standards used in IT networks. Ethernet emerges as a natural choice since it provides the best balance between high bandwidth (in its Gigabit version) and cost-effectiveness, compared to other technologies such as MADI. To use Ethernet for transporting real-time audio, it is necessary either to eliminate the causes of unpredictable behaviour or to mitigate them with buffering and retransmission strategies on a well-known and mastered time base. One major characteristic to deal with when transporting audio over Ethernet is packetisation: any scheme for moving digital audio over a packet-based network must pack audio data into a frame, transmit it, and then unpack it into its original form. However, a packetisation strategy involves a number of trade-offs. To optimise bandwidth use, we want to maximise the ratio of payload to header data by using the largest possible payload of 1500 bytes; but a single audio channel coded on 32 bits packed into such a frame would contain about 8 milliseconds of material. Given the inevitability of buffering, this introduces a granularity that, possibly at multiple points in the transmission chain, would impose a significant delay on the audio path (many tens of milliseconds), which is not acceptable for live broadcast, for example [2]. That is why a specific packetisation scheme has been proposed.

b. Streaming
Since our system is built on UDP/IP for compatibility purposes, we chose to handle streaming with an RTP-like header, keeping mainly its critical elements: the timestamp, to guarantee time alignment at the receiver, and the sequence ID, to play packets in the correct order and detect missing ones.
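A minimal sketch of such a timestamp-plus-sequence header in Python (the field widths and layout here are our illustrative choice, not the actual wire format of the described system):

```python
import struct

# Minimal RTP-like header: a 16-bit sequence number for ordering and loss
# detection, and a 32-bit media timestamp for time alignment at the receiver.
HEADER = struct.Struct("!HI")  # network byte order: uint16 seq, uint32 timestamp

def pack_packet(seq: int, timestamp: int, payload: bytes) -> bytes:
    """Prepend the header to an audio payload, wrapping fields to their width."""
    return HEADER.pack(seq & 0xFFFF, timestamp & 0xFFFFFFFF) + payload

def unpack_packet(datagram: bytes):
    """Split a received datagram back into (seq, timestamp, payload)."""
    seq, timestamp = HEADER.unpack_from(datagram)
    return seq, timestamp, datagram[HEADER.size:]

pkt = pack_packet(seq=7, timestamp=48_000, payload=b"\x00\x01" * 4)
seq, ts, audio = unpack_packet(pkt)
# The receiver detects a missing packet when seq skips a value (modulo 2**16).
```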
RTCP is used on a secondary port to gather network statistics and manage control information.

c. Clock synchronisation
The major issue when using standard Ethernet networks, which are asynchronous by nature, for the distribution of digital audio signals is distributing the corresponding media clock across the network.
With the Precision Time Protocol (PTP) described in IEEE 1588, it is possible to synchronise distributed clocks with an accuracy of better than 1 µs, which is sufficient for our application. The synchronisation process, illustrated in Figure 3, is achieved through the exchange of specific messages based on a master-slave principle: the master Sync message, the master Delay Response message and the slave Delay Request message. In addition, the Best Master Clock (BMC) algorithm allows multiple masters to negotiate the best clock for the network. Depending on the implementation, synchronisation accuracy ranges from 10-100 µs down to 10-100 ns. During our tests, a software implementation (PTPd, an open-source daemon) was used and gave good results. More details can be found in the IEEE 1588-2008 standard.

Figure 3: Principle of PTP

d. Network monitoring and management
Device management is done using a message-exchange technique over the RTCP channel. We are leaning toward an XFN-based device description to ensure greater interoperability (XFN is an IP-based peer-to-peer protocol for control, configuration, monitoring and connection management of networked devices) [9].

e. Ensuring reliability
Today's networks are highly reliable but cannot yet guarantee totally fail-free operation. We have been working on an audio-specific FEC mechanism to increase network performance. In case this system is used in a latency-sensitive application, we suggest the use of redundant streams with identical time stamping, so that the receiver can handle the samples properly.

f. Deployment
Network deployment in a 3D multimedia environment, as in many other professional applications, can be a tricky business. The ring and daisy-chain topologies, even though they use the least cable, are not always the safest choice. The tree topology then comes as a natural choice, for obvious reasons.
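The audio-tailored FEC mentioned in subsection e is detailed in [5]; as a generic illustration of the underlying idea only, a simple XOR-parity scheme that repairs one lost packet per group might look like this (our own sketch, not the scheme actually used):

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length packets."""
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(group: list) -> bytes:
    """One parity packet protecting a group of equal-sized payload packets."""
    return reduce(xor_bytes, group)

def recover(received: list, parity: bytes) -> list:
    """Rebuild the single missing packet (marked None) from the rest + parity."""
    missing = [i for i, p in enumerate(received) if p is None]
    assert len(missing) == 1, "XOR parity can repair exactly one loss per group"
    present = [p for p in received if p is not None] + [parity]
    out = list(received)
    out[missing[0]] = reduce(xor_bytes, present)
    return out

group = [b"\x01\x02", b"\x03\x04", b"\x05\x06", b"\x07\x08"]
parity = make_parity(group)
lost = [group[0], None, group[2], group[3]]   # packet 1 dropped in transit
repaired = recover(lost, parity)              # repaired[1] == b"\x03\x04"
```

The trade-off is one extra packet of bandwidth per group against the extra delay a full retransmission (or dual redundant streams) would cost.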
We therefore worked on a specific algorithm for network component placement that minimises cable length and simplifies installation. Using an L3 solution also allows the use of existing networks and coexistence with other IT traffic.

g. Emulation
We developed a full software emulator, using exclusively standardised protocols for interoperability, with the following architecture:
- a Layer 3 solution based on UDP/IP;
- an additional header added to the audio packets (inspired by RTP, compatible but lighter) to ensure correct packet ordering and detect packet loss;
- a PTP daemon to handle streamer/receiver synchronisation;
- an application layer that comprises:
  o a multichannel audio core handler,
  o an audio-specific FEC mechanism,
  o monitoring, configuration and control messages exchanged via RTCP.
Tests have been conducted using several configurations of multichannel streams, and the results have been validated using a spatial extension of the PEAQ measurement method.

V. Conclusion & Perspectives
We are already witnessing a preview of next-generation audio applications through the emerging 3D multimedia technologies, with newer and higher requirements: more channels with higher audio quality to provide a better and richer multimedia experience. On the networking side, this inevitably involves the use of more bandwidth and new transmission protocols. The main downside of the existing technologies is their incompatibility with each other: different architectures (L1, L2 or L3) with more or less proprietary protocols; some solutions even require specific switches to function properly. The simplest way to ensure interoperability is for all manufacturers to agree on a common basis while keeping customised additional information as optional extensions that can be ignored or overridden, in order to guarantee a minimal level of compatibility for safe-mode functioning. For instance, a first step is to perform network management and device control using XFN. Network performance will continue to increase, and soon even highly reliable wireless multichannel audio transport will be possible (using 802.11ac or Super Wi-Fi). Coexistence with IT traffic, plug-and-play capability at a higher level and Internet bridging for transmedia applications are potential enhancement tracks that have yet to be investigated thoroughly.

VI.
References
[1] Jeff Berryman, "Technical Criteria for Professional Media Networks", 44th AES Conference, San Diego, 2011.
[2] Patrick Warrington, "Digital Audio Networking", Broadcast Engineering Magazine, 2003.
[3] Axel Holzinger and Andreas Hildebrand, "Real-time Linear Audio Distribution over Networks: A Comparison of Layer 2 and 3 Solutions Using the Example of Ethernet AVB and RAVENNA", 44th AES Conference, San Diego, 2011.
[4] Frank Melchior and Sascha Spors, "Spatial Audio Reproduction: From Theory to Production", 128th Convention of the AES, London, 2010.
[5] Seppo Nikkilä, "Introducing Wireless Organic Digital Audio: A Multichannel Streaming Audio Network Based on IEEE 802.11 Standards", 44th AES Conference, San Diego, 2011.
[6] M. Rautiainen, H. Aska, T. Ojala, M. Hosio, A. Mäkivirta and N. Haatainen, "Swarm Synchronization for Multi-Recipient Multimedia Streaming", IEEE ICME, 2009.
[7] Gillian M. Wilson and M. Angela Sasse, "Investigating the Impact of Audio Degradations on Users: Subjective vs. Objective Assessment Methods", Proc. OZCHI, Sydney, 2000.
[8] Robert Boatright, "Understanding IEEE 1722: AVB Transport Protocol (AVBTP)", IEEE 802.1 Plenary, March 2009.
[9] Universal Media Access Networks GmbH (UMAN), "XFN Specification Version 1.0", August 2009.