Communication Architectures and Algorithms for Media Mixing in Multimedia Conferences P. Venkat Rangan, Harrick M. Vin and Srinivas Ramanathan Multimedia Laboratory Department of Computer Science and Engineering University of California at San Diego La Jolla, CA 92093-0114 E-mail: fvenkat, vin,
[email protected], Phone: (619) 534-5419, Fax: (619) 534-7029
Abstract
Advances in computer and communication technologies have stimulated the integration of digital video and audio with computing, leading to the development of computer-assisted multimedia conferencing. We address the problem of media mixing which arises in tele-conferencing applications such as tele-orchestra. We present a mixing algorithm which minimizes the difference between generation times of the media packets that are being mixed together in the absence of globally synchronized clocks, but in the presence of jitter in communication delays on packet switched networks. The algorithm is shown to be complete: given that there are no other message exchanges except media data between mixers and media sources, there cannot be any other algorithm that succeeds when our algorithm fails. Mixing can be accomplished by several different communication architectures. In order to support applications such as tele-orchestra, which involve a large number of participants, we propose hierarchical mixing architectures, and show that they are an order of magnitude more scalable compared to purely centralized or distributed architectures. Furthermore, we present mechanisms for minimizing the delays incurred by mixing in various communication architectures. We have implemented the mixing algorithms on a network of workstations connected by Ethernets, and have experimentally evaluated the performance of various mixing architectures. These experiments revealed interesting results such as the maximum number of participants that can be supported in a conference.
To appear in IEEE/ACM Transactions on Networking, Vol. 1, No. 1, February, 1993
1 Introduction
1.1 Motivation
Until recently, voice, video, and data communications have been handled by different communication networks. The primary carrier of voice has been the public switched telephone network, video has been transmitted by cable TV, and data has been handled by specialized computer networks (LANs and WANs). This historical separation
can be attributed to fundamental differences in characteristics of voice, video and data traffic. Voice and video characteristics such as sensitivity to delay, high bandwidth requirement, and the ability to tolerate relatively high error rates are in complete contrast with requirements of data. Hence, voice and video transmission use circuit switching of analog signals, whereas data transmission typically uses digital packet-switched networks. Interest in ‘integrating’ these different media has been stimulated by advances in communication and computer technologies. Advances in communication technology have made available large bandwidth at modest cost, whereas advances in computer technology have resulted in the development of high-performance workstations with audio and video capabilities. These advances have led to the feasibility of supporting many multimedia applications on computer systems. One such class of applications is multimedia conferencing, which permits individuals to carry out collaborative activities by exchanging audio, video, and text information through multimedia workstations. There are various kinds of conferences, among which the most demanding ones (in terms of performance) are those in which video and voice streams are being continuously transmitted by individuals in the conference, and the individual video images and voice streams have to be mixed together to obtain a composite image and audio stream. An example of such an application is a tele-orchestra, in which a large number of performers are continuously playing their respective musical instruments, each at his or her own multimedia workstation. The audio streams being received from all the performers need to be continuously mixed so as to yield a coherent musical composition, and then broadcast to the entire audience (which may also consist of individuals receiving audio at their respective workstations). 
The techniques for mixing depend upon the type of the media as well as the application. In the case of audio, mixing multiple streams involves digitally summing the audio samples and then normalizing the result. In order to eliminate one’s own feedback, each participant, on receiving a mixed audio packet, removes his own contribution before playing it back. In the case of video, whereas it is possible to simply juxtapose the display of multiple video streams without having to mix them, applications such as tele-virtual conferencing may benefit from the mixing of individual video streams to synthesize the image of a virtual meeting room. The need to continuously mix media streams from several sources, coupled with the real-time nature of the media poses special requirements for media transmission protocols. The design of algorithms, protocols, and communication architectures for supporting media mixing in multimedia conferences is the subject matter of this paper.
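As a rough illustration of the audio case described above, the following Python sketch sums the samples of several sources, normalizes the result, and lets a participant subtract its own contribution from a mixed packet. The function names (`mix_audio`, `remove_own`) and the choice of dividing by the number of sources as the normalization are assumptions made for illustration; the text does not prescribe a particular normalization.

```python
def mix_audio(packets):
    """Mix equal-length packets by summing samples and dividing by the source count."""
    m = len(packets)
    return [sum(samples) / m for samples in zip(*packets)]

def remove_own(mixed, own, m):
    """Recover the average of the other m - 1 sources from a mixed packet."""
    return [(x * m - o) / (m - 1) for x, o in zip(mixed, own)]

# Hypothetical sample values for three participants.
a = [0.25, 0.5, -0.5]
b = [0.75, 0.0, 0.5]
c = [0.5, 1.0, 0.0]
mixed = mix_audio([a, b, c])
others = remove_own(mixed, a, 3)   # what participant `a` would play back
```

Under this normalization, removing one's own contribution requires knowing how many sources were mixed, which is why `remove_own` takes `m` as a parameter.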
1.2 Related Work
In recent years, there have been many efforts towards integrating multimedia conferencing into computer systems. Lantz [3] proposes a text/graphics conferencing architecture where the basic goal is to fit into existing systems with minimal impact on the rest of the system. Sarin and Greif [6] have studied conferencing architectures for text and graphics, but their system does not include audio. Aguilar et al. at SRI [1] have proposed architectures for person-to-person voice conferencing. Ziegler et al. [8] present an analytical performance evaluation of voice conferencing on broadcast and ring networks. Ahuja et al. at Bell Laboratories [2], and Swinehart et al. at Xerox PARC [4] have proposed architectures for video conferencing. However, the emphasis of most of these efforts has been person-to-person telephony-type conferences, and high performance applications such as tele-orchestra, which require continuous time-critical mixing of media streams from several sources, are not supported. Algorithms, protocols, and communication architectures for supporting media mixing have not received much attention.
1.3 Research Contributions of This Paper
In this paper, we address the problem of mixing media streams transmitted by multiple sources in a multimedia conference on a packet-switched network. Owing to the real-time nature of the media being mixed, packets from different sources generated at about the same time are required to be mixed together. More precisely, for coherent reproduction of audio and video in applications such as tele-orchestra, or for maintaining consistent ordering of user-triggered events in distributed real-time games, the difference between the generation times of packets being mixed must be minimized. Reduction in generation time differences also reduces the waiting time of packets during mixing, leading to lower buffering requirements. We present a mixing algorithm which minimizes the difference between generation times of the media packets that are being mixed together in the absence of globally synchronized clocks (see footnote 1), but in the presence of jitter in transmission delays. The algorithm is shown to be complete: given that there are no other message exchanges except media data between a mixer and the media sources, there cannot be any other algorithm that succeeds when our algorithm fails. Mixing can be accomplished by several different communication architectures. In order to support applications such as tele-orchestra, which involve a large number of participants, we propose hierarchical mixing architectures, and show that they are an order of magnitude more scalable compared to purely centralized or distributed architectures. Furthermore, live performances such as tele-orchestra require that each performer receives the audio played by other performers almost instantaneously. Towards this end, we present mechanisms to minimize the real-time delays incurred by mixing in various communication architectures.
We have implemented the mixing algorithms on a network of workstations connected by Ethernets, and have experimentally evaluated the performance of various mixing architectures. These experiments revealed interesting results such as the maximum number of participants that can be supported in a conference.
Footnote 1: In future integrated networks, media capture and display sites that digitize and packetize media data may be connected directly (as opposed to via a host computer) to the network, e.g., digital HDTV, ISDN Telephone, etc. Such devices may not be capable of running sophisticated time synchronization protocols.
The rest of the paper is organized as follows: In Section 2, we formulate the problem of mixing. In Section 3, we present algorithms for mixing two media sources, and in Section 4, we extend them to more than two sources. In Section 5, we analyze various communication architectures for mixing, and in Section 6, we present mechanisms for minimizing real-time delays due to mixing. Section 7 presents our experience with implementing the mixing algorithms, and finally, Section 8 concludes the paper.
2 Problem of Media Mixing in Conferencing
In its simplest form, a multimedia conference is established by setting up logical media communication channels among a group of participants. Once a conference is established, media information can be exchanged among its participants. In a multi-party conference, each participant must receive the composite media stream obtained by mixing the media streams transmitted by all the other participants. In a packet switched network environment, an important problem is to determine the set of packets from different sources that are to be mixed to form a composite media packet. We call such a set of packets the Fusion Set. If each source generates media packets at a constant rate with a period of $p$, then the smallest time window during which two independent sources are guaranteed to generate packets is $p/2$. In general, when there are $m$ sources, the smallest time window during which all the $m$ sources are guaranteed to generate packets is $p - \frac{p}{2^{(m-1)}}$. (The proof is by a simple induction on the number of sources.) If $m$ is unbounded, the smallest window during which we are guaranteed to have packets from each of the sources is $p$. This is formalized by the following mixing rule for a multi-party conference:
Mixing Rule: Packets $n_1$ and $n_2$ can belong to a Fusion Set (i.e., can be mixed together) if and only if
$$|g(n_1) - g(n_2)| \le p \qquad (1)$$
where $p$ is the period of generation of media packets at a source, and $g(n_1)$ and $g(n_2)$ are the generation times of packets $n_1$ and $n_2$ at their respective sources.
The fusion set can be determined in a straightforward manner when the clocks of the generating sources are globally synchronized. However, clock synchronization requires sophisticated protocols for negotiation and agreement among media sources. We assume that the media sources can digitize, packetize, and transmit data on the network, but may not have the sophistication to exchange control messages and run complex protocols. Hence, any protocol that requires the exchange of messages other than media data is out of the question. Furthermore, the overhead of clock synchronization protocols may be excessive compared to simpler protocols that are sufficient for determining a fusion set. Hence, determination of a fusion set in the absence of globally synchronized clocks is an important problem that needs to be solved in multimedia conferencing.
3 Algorithms for Media Mixing
Consider a multimedia conference in which there are $m$ sources $S_1, S_2, ..., S_m$ (see Figure 1). Each source generates media packets at a constant rate with period $p$, and transmits them to a mixer $M$. Let the communication delay from a source to a mixer, which includes transmission times on the network and queuing delays at intermediate nodes, be bounded between $\Delta_{min}$ and $\Delta_{max}$. If $\tau_s$ is the arrival time at the mixer of a packet numbered $n_s$ generated by a source $S_s$, then the earliest and the latest possible generation times (denoted by $g_s^e(n_s)$ and $g_s^l(n_s)$, respectively) of packet $n_s$ can be derived as:
$$g_s^e(n_s) = \tau_s - \Delta_{max} \qquad (2)$$
$$g_s^l(n_s) = \tau_s - \Delta_{min} \qquad (3)$$
Measured on the mixer's clock, the generation time $g_s(n_s)$ of a media packet $n_s$ at source $S_s$ belongs to the interval $[g_s^e(n_s), g_s^l(n_s)]$, which is referred to as the generation interval of packet $n_s$.
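Equations (2) and (3) translate directly into code. The sketch below is only an illustration: the function name `generation_interval` and the numeric values (times in milliseconds) are assumptions, not part of the paper.

```python
def generation_interval(tau, d_min, d_max):
    """Earliest and latest generation times on the mixer's clock (Equations (2) and (3))."""
    return (tau - d_max, tau - d_min)

# Hypothetical numbers: a packet arrives at t = 100 and the delay is bounded by [5, 15].
earliest, latest = generation_interval(tau=100, d_min=5, d_max=15)
```

The length of the resulting interval always equals the jitter `d_max - d_min`, independent of the arrival time.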
[Figure 1: Mixing in a multimedia conference. Mixer M mixes packets from sources S1, ..., Sm; packet ns from source Ss arrives at M at time τs.]
A straightforward mechanism for the mixer $M$ to carry out mixing is to buffer the media packets it receives from sources $S_1, S_2, ..., S_m$, until at least one packet is received from each of them, at which time $M$ can combine all the packets and transmit the mixed packet. By the Mixing Rule, the separation in the generation times of packets (of different sources) being mixed together can be at most $p$. In the presence of communication delay jitter, two packets generated $p$ apart (at two different sources) may suffer delays of $\Delta_{max}$ and $\Delta_{min}$, respectively, and arrive at the mixer $p + (\Delta_{max} - \Delta_{min})$ apart, which then is the maximum waiting time of a packet in the mixer's buffers before being mixed. However, such a “buffer and mix” scheme has several disadvantages: Due to a quirk of the effect of jitter, two packets generated as far apart as $p + 2(\Delta_{max} - \Delta_{min})$ may also arrive at the mixer within a window of $p + (\Delta_{max} - \Delta_{min})$, thereby misleading the mixer to combine those two packets in violation of the mixing rule. The higher the jitter (as in wide area networks), the greater the extent of the violation. This problem gets further compounded in the presence of packet losses. For instance, while mixing packets from three sources $S_1$, $S_2$, and $S_3$, packet loss may cause the mixer to derive fusion sets in a pairwise manner for
$\{S_1, S_2\}$ and $\{S_2, S_3\}$, each of which may contain packets maximally separated in their generation times (i.e., by $p + 2(\Delta_{max} - \Delta_{min})$). In order to mix packets from all three together, if the mixer takes a union of these pairwise fusion sets, packets separated as far apart as $2(p + 2(\Delta_{max} - \Delta_{min}))$ may get mixed together, and this separation may increase linearly with the number of sources. On the other hand, if the mixer decides to derive a new fusion set containing more closely generated packets from all the three sources, it may be flouting the requirement of maintaining consistent relative ordering of packets in successive fusion sets.
We now propose a mixing algorithm which guarantees that the packets being mixed together always satisfy the Mixing Rule, first for the case when there are only two media sources (i.e., binary mixing), and then extend it to the case when there are multiple media sources (i.e., M-ary mixing). The algorithm is robust towards packet losses on the network. Hence, even in the absence of arrival of a current packet from a source, the mixer can still derive a fusion set with the help of any earlier packet received from that source. Thus, the algorithm effectively decouples the determination of fusion sets from live mixing of packets.
An extreme variation of this situation is the determination of fusion sets from packets transmitted serially (rather than concurrently) by the various sources, which finds an important application in live performances. In particular, a live performance generally has an initial sound-check period preceding its commencement, during which a mixer can sequence through the various performers/performing instruments, prompting each of them to transmit an audio packet, one at a time. At the end of the sound-check, the mixer computes the fusion set (in addition to, of course, carrying out routine adjustments such as volume level, tone frequency, etc.), and keeps it for later use during the actual performance.
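The jitter-induced violation attributed to the “buffer and mix” scheme earlier in this section can be checked with concrete numbers. The sketch below uses hypothetical values (a period of 20 ms and delays between 5 ms and 15 ms) and a hypothetical helper `arrival_gap`; it only illustrates the arithmetic and is not part of the proposed algorithm.

```python
def arrival_gap(gen_a, delay_a, gen_b, delay_b):
    """Separation between the arrival times of two packets at the mixer."""
    return abs((gen_b + delay_b) - (gen_a + delay_a))

p, d_min, d_max = 20, 5, 15          # hypothetical values, in ms
jitter = d_max - d_min

# Packets generated p apart; the later one suffers the longer delay.
legal_gap = arrival_gap(0, d_min, p, d_max)

# Packets generated p + 2 * jitter apart; the earlier one suffers the longer delay.
illegal_gap = arrival_gap(0, d_max, p + 2 * jitter, d_min)
```

Both gaps come out as p + (Δmax − Δmin), so a mixer that looks only at arrival separations cannot distinguish the pair that satisfies the Mixing Rule from the pair that violates it.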
3.1 Binary Mixing Algorithm
Let the arrival times at a mixer of packets $n_1$ and $n_2$ from two sources $S_1$ and $S_2$ be $\tau_1$ and $\tau_2$. The earliest and the latest possible generation times of packets $n_1$ and $n_2$ are:
$$g_1^e(n_1) = \tau_1 - \Delta_{max}, \qquad g_1^l(n_1) = \tau_1 - \Delta_{min}$$
$$g_2^e(n_2) = \tau_2 - \Delta_{max}, \qquad g_2^l(n_2) = \tau_2 - \Delta_{min}$$
Since the generation times of successive packets from the same source are displaced by $p$, the generation interval of packet $n_s + k$ from source $S_s$ can be estimated as:
$$g_s^e(n_s + k) = g_s^e(n_s) + k \cdot p \qquad (4)$$
$$g_s^l(n_s + k) = g_s^l(n_s) + k \cdot p \qquad (5)$$
Since the exact generation times of packets are unknown at a mixer, determination of a binary fusion set will have to be based on the above estimates of the earliest and the latest packet generation times. If $g^l_{max} = \max(g_1^l(n_1), g_2^l(n_2))$ and $g^e_{min} = \min(g_1^e(n_1), g_2^e(n_2))$, then $g^e_{min} \le g_1(n_1), g_2(n_2) \le g^l_{max}$ and $|g_2(n_2) - g_1(n_1)| \le g^l_{max} - g^e_{min}$. Consequently, if
$$g^l_{max} - g^e_{min} \le p \qquad (6)$$
then the generation times of packets $n_1$ and $n_2$ are guaranteed to satisfy the Mixing Rule, and hence, packets $n_1$ and $n_2$ belong to a binary fusion set. Equation (6) defines a stronger condition than the Mixing Rule for determining a fusion set. Nevertheless, since the exact generation times of packets are unknown at a mixer, determination of a fusion set will have to be based on Equation (6) rather than on the Mixing Rule. Hence, in the remainder of this paper, we use Equation (6), referred to as the Strong Mixing Rule, as the condition for determining a fusion set.
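A minimal sketch of the Strong Mixing Rule, together with the interval shift of Equations (4) and (5), is given below. The helper names (`shift`, `satisfies_strong_rule`) and the example intervals are assumptions introduced for illustration.

```python
def shift(interval, k, p):
    """Generation interval of packet n_s + k, from Equations (4) and (5)."""
    earliest, latest = interval
    return (earliest + k * p, latest + k * p)

def satisfies_strong_rule(intervals, p):
    """Equation (6): the span covered by all generation intervals is at most p."""
    g_min_e = min(e for e, _ in intervals)
    g_max_l = max(l for _, l in intervals)
    return g_max_l - g_min_e <= p

p = 20
i1 = (110, 120)   # hypothetical generation interval of n1
i2 = (85, 95)     # hypothetical generation interval of n2
direct = satisfies_strong_rule([i1, i2], p)
shifted = satisfies_strong_rule([i1, shift(i2, 1, p)], p)
```

Here `n1` and `n2` fail the rule (the span is 35), while `n1` and `n2 + 1` pass it (the span is 15), which previews why the algorithm searches over successor packets.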
The maximum uncertainty, $g^l_{max} - g^e_{min}$, in estimated generation times is at least as large as the generation intervals of $n_1$ and $n_2$, each of which equals $\Delta_{max} - \Delta_{min}$ in duration. Hence,
$$g^l_{max} - g^e_{min} \ge \Delta_{max} - \Delta_{min}$$
Clearly, if the jitter in communication latency, $\Delta_{max} - \Delta_{min}$, exceeds $p$, then $(g^l_{max} - g^e_{min})$ will also exceed $p$, and the generation intervals of $n_1$ and $n_2$ cannot satisfy the Strong Mixing Rule (Equation (6)). In fact, since the generation interval of every packet equals the transmission jitter in duration, if this jitter exceeds $p$, no two packets from $S_1$ and $S_2$ can satisfy the Strong Mixing Rule. Hence, $\Delta_{max} - \Delta_{min} \le p$ is a necessary condition for determining a binary fusion set. This is the first result of importance, and is formally stated as Proposition 1 in Table 1.
On the other hand, suppose that the jitter $\Delta_{max} - \Delta_{min} \le p$. If it is also the case that $g^l_{max} - g^e_{min} \le p$, then packets $n_1$ and $n_2$ form a binary fusion set. However, if $\Delta_{max} - \Delta_{min} \le p$, but $g^l_{max} - g^e_{min} > p$, then it may seem likely that a binary fusion set cannot be determined. Surprisingly, this is not true; even when $g^l_{max} - g^e_{min} > p$, there are situations in which a mixer can determine a binary fusion set. This is because, even though the generation intervals of $n_1$ and $n_2$ do not satisfy the Strong Mixing Rule, those of two other packets $n_1'$ and $n_2'$ may do so. However, notice that if packets $n_1'$ and $n_2'$ satisfy the Strong Mixing Rule, so will packets $n_1$ and $n_2' + (n_1 - n_1')$ (because the generation intervals of successive packets from the same source are displaced by $p$). Hence, in order to determine a binary fusion set, it suffices to search for a packet from $S_2$ that can form a binary fusion set with $n_1$. We shall now derive exactly those cases in which a mixer can determine a binary fusion set even when $g^l_{max} - g^e_{min} > p$.
Without loss of generality, let us assume that $g_1^l(n_1) > g_2^l(n_2)$ and $g_1^e(n_1) > g_2^e(n_2)$. Depending on the relative positions of the generation intervals of $n_1$ and $n_2$, two possible scenarios arise:
The generation intervals of $n_1$ and $n_2$ overlap (see Figure 2): In such a case, no matter which packet $(n_2 + i)$, $i \in [-\infty, \infty]$, from $S_2$ is considered, by comparing the generation intervals of $n_2 + i$ (derived by substituting $s = 2$ and $k = i$ in Equations (4) and (5)) and $n_1$, it can be shown that $n_2 + i$ and $n_1$ do not satisfy the Strong Mixing Rule. This is the second result of importance, and is formalized as Proposition 2 in Table 1.
[Figure 2: Packets with overlapped generation intervals]
1. If $\Delta_{max} - \Delta_{min} > p$, then a binary fusion set cannot be determined (Proposition 1).
2. If $\Delta_{max} - \Delta_{min} \le p$, compute the generation intervals of packets $n_1$ and $n_2$ to be $[g_1^e(n_1), g_1^l(n_1)]$ and $[g_2^e(n_2), g_2^l(n_2)]$, respectively, where $g_s^e(n_s) = \tau_s - \Delta_{max}$, $g_s^l(n_s) = \tau_s - \Delta_{min}$, and $s \in [1, 2]$. Without loss of generality, let $g^e_{min} = \min(g_1^e(n_1), g_2^e(n_2)) = g_2^e(n_2)$ and $g^l_{max} = \max(g_1^l(n_1), g_2^l(n_2)) = g_1^l(n_1)$.
   (a) If $g_1^l(n_1) - g_2^e(n_2) \le p$, then the binary fusion set is $\{n_1, n_2\}$ (by Equation (6)).
   (b) If the generation intervals of packets $n_1$ and $n_2$ overlap, and $g_1^l(n_1) - g_2^e(n_2) > p$, then it is impossible for the mixer to determine a binary fusion set (by Proposition 2).
   (c) If the generation intervals of packets $n_1$ and $n_2$ are disjoint, and there exists an integer $k \ge 1$ such that $(k - 1) \cdot p \le g_1^e(n_1) - g_2^l(n_2) < k \cdot p$, then:
      i. If $(k - 1) \cdot p \le g_1^l(n_1) - g_2^e(n_2) < k \cdot p$, then two binary fusion sets are possible: $\{n_1, n_2 + k - 1\}$, which is the lower binary fusion set, and $\{n_1, n_2 + k\}$, which is the upper binary fusion set (by Proposition 3).
      ii. If $k \cdot p \le g_1^l(n_1) - g_2^e(n_2) < (k + 1) \cdot p$, then the binary fusion set is $\{n_1, n_2 + k\}$ (by Proposition 4).
      iii. If $(k + 1) \cdot p \le g_1^l(n_1) - g_2^e(n_2) < (k + 2) \cdot p$, then it is impossible to determine a binary fusion set (by Proposition 5).
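The case analysis above can be sketched in Python as follows. This is an illustrative reading of the listed steps, not the authors' implementation: the function name `binary_fusion_offsets`, the convention of returning packet-number offsets for source 2, and the treatment of exactly touching intervals as disjoint are all assumptions. The caller is assumed to label the sources so that `tau1 >= tau2`, mirroring the "without loss of generality" step.

```python
def binary_fusion_offsets(tau1, tau2, p, d_min, d_max):
    """Return every offset k for which {n1, n2 + k} is a binary fusion set.

    Assumes tau1 >= tau2; relabel the two sources otherwise.
    An empty list means that no binary fusion set can be determined.
    """
    if d_max - d_min > p:                 # step 1 (Proposition 1)
        return []
    g1e, g1l = tau1 - d_max, tau1 - d_min
    g2e, g2l = tau2 - d_max, tau2 - d_min
    span = g1l - g2e                      # g_max^l - g_min^e
    gap = g1e - g2l                       # negative when the intervals overlap
    if span <= p:                         # step 2(a), Equation (6)
        return [0]
    if gap < 0:                           # step 2(b), Proposition 2
        return []
    k = int(gap // p) + 1                 # (k - 1) p <= gap < k p, with k >= 1
    if span < k * p:                      # step 2(c)i, Proposition 3
        return [k - 1, k]                 # lower and upper fusion sets
    if span < (k + 1) * p:                # step 2(c)ii, Proposition 4
        return [k]
    return []                             # step 2(c)iii, Proposition 5
```

For instance, with a hypothetical period of 20 and delays in [5, 7], arrival times 130 and 100 give the offsets `[1, 2]`, i.e. both a lower and an upper binary fusion set.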
3.2 Completeness of Binary Mixing Algorithm
The effectiveness (i.e., the success) of the above algorithm in determining a binary fusion set critically depends on the relative values of the transmission jitter (i.e., $\Delta_{max} - \Delta_{min}$), and the packet duration $p$. This is quantified in the following Theorem.
Theorem 1 When a mixer executes the binary mixing algorithm on two packets $n_1$ and $n_2$ received at times $\tau_1$ and $\tau_2$, respectively,
(1) If $\Delta_{max} - \Delta_{min} \le p/2$, the algorithm always produces a binary fusion set.
(2) If $p/2 < \Delta_{max} - \Delta_{min} \le p$, the algorithm may succeed or fail in determining a binary fusion set.
(3) If $\Delta_{max} - \Delta_{min} > p$, the algorithm fails to determine a binary fusion set.
(4) The algorithm is complete: given that there are no other message exchanges except media data between a mixer and the media sources, there cannot be any other algorithm that succeeds when our algorithm fails.
Proof: Let the earliest and latest generation times of packets $n_1$ and $n_2$ be computed as: $g_s^e(n_s) = \tau_s - \Delta_{max}$, and $g_s^l(n_s) = \tau_s - \Delta_{min}$, where $s \in [1, 2]$. Again, without loss of generality, let $g_2^e(n_2) \le g_1^e(n_1)$, and $g_2^l(n_2) \le g_1^l(n_1)$. Hence, if the generation intervals of $n_1$ and $n_2$ overlap, then $g_2^e(n_2) < g_1^e(n_1) < g_2^l(n_2) < g_1^l(n_1)$. Similarly, if the intervals are disjoint, then $g_2^e(n_2) < g_2^l(n_2) < g_1^e(n_1) < g_1^l(n_1)$. The four assertions of the above theorem can be proved as follows:

1. $\Delta_{max} - \Delta_{min} \le p/2$:
It may be observed that the lengths of the generation intervals (of packets $n_1$ and $n_2$), $g_1^l(n_1) - g_1^e(n_1)$ and $g_2^l(n_2) - g_2^e(n_2)$, are each equal to $\Delta_{max} - \Delta_{min}$, and hence neither of them exceeds $p/2$. Therefore,
$$(g_1^l(n_1) - g_1^e(n_1)) + (g_2^l(n_2) - g_2^e(n_2)) \le p \qquad (8)$$
The generation intervals can be either overlapped or disjoint. If the generation intervals overlap, we obtain that
$$g^l_{max} - g^e_{min} \le (g_1^l(n_1) - g_1^e(n_1)) + (g_2^l(n_2) - g_2^e(n_2)) \le p \qquad (9)$$
Thus, a binary fusion set can be derived using Equation (6). Suppose that, on the other hand, the generation intervals of packets $n_1$ and $n_2$ are disjoint; that is,
$$g_1^l(n_1) - g_2^e(n_2) = (g_1^l(n_1) - g_1^e(n_1)) + (g_1^e(n_1) - g_2^l(n_2)) + (g_2^l(n_2) - g_2^e(n_2)) \qquad (10)$$
Let $k$ be an integer such that:
$$(k - 1) \cdot p \le g_1^e(n_1) - g_2^l(n_2) < k \cdot p \qquad (11)$$
Using Equations (8), (10), and (11) we obtain that
$$(k - 1) \cdot p \le g_1^l(n_1) - g_2^e(n_2) < (k + 1) \cdot p \qquad (12)$$
Hence conditions of either Proposition 3 or Proposition 4 are satisfied, leading to the determination of a binary fusion set.

2. $p/2 \le \Delta_{max} - \Delta_{min} < p$:
The lengths of the generation intervals of packets $n_1$ and $n_2$ will each lie between $p/2$ and $p$, and the generation intervals can be overlapped or disjoint. If they are overlapped, $g_1^l(n_1) - g_2^e(n_2)$ will not exceed $2p$, but can lie either between $0$ and $p$, in which case Equation (6) yields the binary fusion set, or between $p$ and $2p$, in which case Proposition 2 shows that it is impossible to determine a binary fusion set. Suppose that, on the other hand, the generation intervals of packets $n_1$ and $n_2$ are disjoint, that is,
$$g_1^l(n_1) - g_2^e(n_2) = (g_1^l(n_1) - g_1^e(n_1)) + (g_1^e(n_1) - g_2^l(n_2)) + (g_2^l(n_2) - g_2^e(n_2)) \qquad (13)$$
Let $k \ge 1$ be an integer such that
$$(k - 1) \cdot p < g_1^e(n_1) - g_2^l(n_2) \le k \cdot p \qquad (14)$$
Given that the lengths of generation intervals are no less than $p/2$ but are definitely less than $p$, Equations (13) and (14) yield:
$$k \cdot p \le g_1^l(n_1) - g_2^e(n_2) < (k + 2) \cdot p$$
Hence, either Proposition 4 is satisfied, in which case a binary fusion set can be determined, or Proposition 5 is satisfied, in which case it is impossible to determine a binary fusion set. Thus, the algorithm may succeed or fail in determining a binary fusion set.

3. $\Delta_{max} - \Delta_{min} > p$:
By Proposition 1, it is impossible to derive a binary fusion set.

4. Completeness: The algorithm fails to determine a binary fusion set only when
(a) $\Delta_{max} - \Delta_{min} > p$, or
(b) $p/2 \le \Delta_{max} - \Delta_{min} < p$ and, either the generation intervals are overlapped and $g^l_{max} - g^e_{min} > p$, or they are disjoint but $(k + 1) \cdot p \le g_1^l(n_1) - g_2^e(n_2) < (k + 2) \cdot p$, where $k \ge 1$ is an integer such that $(k - 1) \cdot p < g_1^e(n_1) - g_2^l(n_2) \le k \cdot p$.
In case (a), Proposition 1 shows that it is impossible to determine a binary fusion set, and in case (b), Propositions 2 and 5 show the impossibility. Hence, there cannot be any other algorithm that succeeds in such cases, which goes to show that our binary mixing algorithm is complete under the networking assumptions stated in Section 2. □
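The first three assertions of Theorem 1 can be spot-checked numerically. The sketch below condenses the case analysis of Section 3.1 into a hypothetical predicate `succeeds` and sweeps over integer arrival-time differences; the parameter values are assumptions, and an exhaustive sweep over a finite range is only a sanity check, not a proof.

```python
def succeeds(tau1, tau2, p, d_min, d_max):
    """True if the case analysis of Section 3.1 yields a binary fusion set (tau1 >= tau2)."""
    if d_max - d_min > p:
        return False                        # Proposition 1
    span = (tau1 - d_min) - (tau2 - d_max)  # g_max^l - g_min^e
    gap = (tau1 - d_max) - (tau2 - d_min)   # negative when the intervals overlap
    if span <= p:
        return True                         # Equation (6)
    if gap < 0:
        return False                        # Proposition 2
    k = int(gap // p) + 1
    return span < (k + 1) * p               # Propositions 3 and 4; otherwise Proposition 5

p = 20
diffs = range(0, 200)                       # tau1 - tau2, in hypothetical time units
always = all(succeeds(100 + d, 100, p, 0, 10) for d in diffs)     # jitter = p/2
never = not any(succeeds(100 + d, 100, p, 0, 21) for d in diffs)  # jitter > p
outcomes = {succeeds(100 + d, 100, p, 0, 15) for d in diffs}      # p/2 < jitter <= p
```

In this sweep, a jitter of p/2 always succeeds, a jitter above p never does, and an intermediate jitter produces both outcomes depending on the arrival times.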
In the above theorem, the limitation to the effectiveness of the binary mixing algorithm arises due to the uncertainty in generation intervals, which is at most the jitter in communication delay. In practice, however, a mixer may be able to reduce the uncertainty (so as to be much below the jitter) by iteratively refining its estimates of the generation intervals, every time a new packet arrives. To see how, suppose that $I = [g^e(n_s + i), g^l(n_s + i)]$, $\forall i \in [-\infty, \infty]$, denote the generation intervals of packets $n_s + i$ from source $S_s$, predicted from that of $n_s$ using Equations (4) and (5). Arrival of a subsequent media packet $n_s'$ from $S_s$ enables the mixer to recompute the generation intervals of $n_s + i$ from that of $n_s'$ to be $I' = [g^e(n_s' + i'), g^l(n_s' + i')]$, where $i' = (n_s - n_s' + i)$, again using Equations (4) and (5). The actual generation instant of packet $(n_s + i)$ is guaranteed to be in both $I$ and $I'$, and hence, in $I \cap I'$. Therefore, $I \cap I'$ represents a revised and possibly more precise estimate of the generation interval for packet $(n_s + i)$. Surprisingly, the intersection is small when media packets (on which estimates are based) undergo widely different communication delays. In the limit, the exact generation time can be determined if the intersection reduces to a single point, which can occur when two media packets transmitted by a source suffer network delays of $\Delta_{max}$ and $\Delta_{min}$, respectively.
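The refinement by intersection can be sketched as below. The function name `refine`, the dictionary-based input, and the numeric values are assumptions for illustration; the sketch tracks the generation interval of packet number 0 of a single source.

```python
def refine(arrivals, p, d_min, d_max):
    """Intersect the generation intervals predicted for packet 0 of one source.

    `arrivals` maps a packet number to its arrival time at the mixer.
    """
    earliest, latest = float("-inf"), float("inf")
    for n, tau in arrivals.items():
        # Interval of packet n, shifted back by n periods (Equations (4) and (5)).
        earliest = max(earliest, tau - d_max - n * p)
        latest = min(latest, tau - d_min - n * p)
    return earliest, latest

# Hypothetical run: packet 0 is generated at t = 0 and delayed by d_max,
# packet 1 is generated at t = 20 and delayed by d_min.
p, d_min, d_max = 20, 5, 15
single = refine({0: 0 + d_max}, p, d_min, d_max)
both = refine({0: 0 + d_max, 1: 20 + d_min}, p, d_min, d_max)
```

With only the first packet the uncertainty equals the jitter; once the second packet arrives with the opposite extreme delay, the intersection collapses to the true generation time.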
4 M-ary Mixing
In multimedia applications such as a tele-orchestra, there may, in general, be more than two media sources. In this section, we extend the binary algorithm to handle mixing of media packets received from more than two sources. The resulting M-ary mixing algorithm yields fusion sets such that the entire range of generation times of all the packets in a fusion set is bounded within a window of $p$ (which is the shortest window possible, as shown in Section 2).
Let us suppose that a mixer has to determine a fusion set of packets $n_1, n_2, ..., n_m$ being received from sources $S_1, S_2, ..., S_m$. In a straightforward extension of the binary algorithm, the mixer can determine the fusion set for each of the pairs $(S_1, S_2), (S_1, S_3), ..., (S_1, S_m)$, and take the union of all of those $(m - 1)$ binary fusion sets. If the maximum of the latest generation times, and the minimum of the earliest generation times of all the packets in the union are within a window of $p$, then the union constitutes an M-ary fusion set. However, recall that in the binary algorithm, there are scenarios when two binary fusion sets (lower and upper) are possible. It is likely that choosing one of the two binary fusion sets can lead to the formation of an M-ary fusion set, whereas choosing the other binary fusion set does not. In the worst case, given that there are two alternatives for each binary fusion set, there are $2^{m-1}$ alternatives for the unions of $m - 1$ binary fusion sets that may need to be computed in the determination of an M-ary fusion set, yielding exponential execution times. We will now present an algorithm that uses a window sliding technique to obtain an M-ary fusion set in a linear number of executions of the binary algorithm.
4.1 Algorithm for M-ary Mixing
Let $[g_1^e, g_1^l], [g_2^e, g_2^l], ..., [g_m^e, g_m^l]$ be the generation intervals of packets $n_1, n_2, ..., n_m$ generated by $m$ sources, respectively. Let
$$g^e_{min} = \min_{s \in [1, m]} g_s^e$$
$$g^l_{max} = \max_{s \in [1, m]} g_s^l$$
Without loss of generality, let the generation interval of packet $n_1$ be disjoint from those of all the other packets (we can always replace packet $n_1$ by $n_1'$, where $(n_1' - n_1)$ is an integer sufficiently large such that the earliest generation time of packet $n_1'$ exceeds the maximum latest generation time of packets $n_2, n_3, ..., n_m$). Using the binary mixing algorithm, compute the binary fusion sets of packets from pairs of sources $(S_1, S_2), (S_1, S_3), ..., (S_1, S_m)$. Clearly, there must exist binary fusion sets for each of the above pairs of sources, if there exists an M-ary fusion set for the $m$ media streams. Since the generation interval of $n_1'$ is disjoint from those of all others, the determination of each of the binary fusion sets would require the use of Propositions 3 or 4. Let $F = \{n_1', n_2', n_3', ..., n_m'\}$ be the set obtained as the union of binary fusion sets corresponding to packet $n_1'$ from source $S_1$ using the propositions of Section 3, with the lower binary fusion set always being chosen in case of choice. If set $F$ does not turn out to be an M-ary fusion set, it is only because the choice of lower binary fusion set for some packet $n_s'$ in $F$ is incorrect; it must be set right by replacing packet $n_s'$ with $(n_s' + 1)$. We refer to this correction technique as window sliding. There can be at most $(m - 1)$ incorrect choices in the set $F$. The window sliding operation starts by considering the packet with the lowest generation interval, and terminates when either (1) an M-ary fusion set is formed, or (2) a packet for which there is no choice of binary fusion sets is encountered, or (3) all the $(m - 1)$ choices have been corrected. The exact algorithm is as follows:
M-ary Mixing Algorithm:
1. Let a mixer receive packets $\{n_1, n_2, ..., n_m\}$ from $m$ media sources. Let:
$$\tilde{g}^l_{max} = \max_{s \in [2..m]} g_s^l(n_s)$$
   Let $k_1$ be the smallest positive integer such that
$$g_1^e(n_1) + k_1 \cdot p > \tilde{g}^l_{max}$$
   Replace packet $n_1$ by packet $n_1' = n_1 + k_1$. Mark packet $n_1'$ as FIXED.
2. $\forall s \in [2..m]$ do:
   (a) Using the binary mixing algorithm, compute a binary fusion set for sources $S_1$ and $S_s$, choosing the lower binary fusion set in case of a choice. If a binary fusion set cannot be determined, terminate the algorithm and report failure.
   (b) Let the binary fusion set be $\{n_1', n_s'\}$. Mark $n_s'$ as CHOOSABLE if there had been a choice of lower binary fusion set, else mark it FIXED.
3. (a) Compute $g^l_{max}$ and $g^e_{min}$ as follows:
$$g^l_{max} = \max_{s \in [1..m]} g_s^l(n_s')$$
$$g^e_{min} = \min_{s \in [1..m]} g_s^e(n_s')$$
   (b) If $g^l_{max} - g^e_{min} \le p$, then terminate the algorithm successfully and report the union $\{n_1', n_2', ..., n_m'\}$ as an M-ary fusion set of the $m$ media streams.
   (c) If $g^l_{max} - g^e_{min} > p$, then perform the following window sliding operation:
      i. Without loss of generality, let $g^e_{min} = g_s^e(n_s')$. If packet $n_s'$ is marked FIXED, then terminate the algorithm and report failure.
      ii. If packet $n_s'$ is marked CHOOSABLE, then replace it by packet $(n_s' + 1)$ in the union, and mark it as FIXED.
      iii. Recompute $g^e_{min}$ and $g^l_{max}$ for the union. If $g^l_{max} - g^e_{min} \le p$, then terminate the algorithm successfully and report the union as the M-ary fusion set.
      iv. Otherwise, go to step (i).
14
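Step 3 of the M-ary Mixing Algorithm (the window sliding loop) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the names `Candidate` and `window_slide` are hypothetical, the binary mixing algorithm of Section 3 is assumed to have already produced one candidate packet per source together with its FIXED/CHOOSABLE mark, and sliding from packet $n'_s$ to $(n'_s + 1)$ is assumed to shift both generation-time estimates by one packet duration $p$.

```python
from dataclasses import dataclass, replace


@dataclass
class Candidate:
    """One source's entry in the union of binary fusion sets (hypothetical type)."""
    packet: int       # packet number n'_s
    g_e: float        # earliest generation time estimate of that packet
    g_l: float        # latest generation time estimate of that packet
    choosable: bool   # True if the lower binary fusion set was one of two choices


def window_slide(candidates, p):
    """Return the packet numbers of an M-ary fusion set, or None on failure."""
    union = [replace(c) for c in candidates]  # work on copies, leave input intact
    while True:
        g_l_max = max(c.g_l for c in union)
        lowest = min(union, key=lambda c: c.g_e)  # packet that defines g_e_min
        if g_l_max - lowest.g_e <= p:
            return [c.packet for c in union]      # mixing rule satisfied
        if not lowest.choosable:
            return None                           # a FIXED packet cannot slide
        # Slide: replace n'_s by n'_s + 1, which is generated p later, and fix it.
        lowest.packet += 1
        lowest.g_e += p
        lowest.g_l += p
        lowest.choosable = False
```

Each pass either terminates or turns one CHOOSABLE packet into a FIXED one, so the loop performs at most $(m - 1)$ slides, matching the bound stated in the text.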
4.2 Completeness of M-ary Algorithm

The correctness and completeness of the above algorithm in determining an M-ary fusion set are established by the following theorem:

Theorem 2 Suppose that a mixer receives packets $n_1, n_2, \ldots, n_m$ from $m$ media sources. The M-ary Mixing Algorithm produces a fusion set that satisfies the mixing rule (i.e., it is correct). Given that there are no other message exchanges between media sources and the mixer, there cannot be any other algorithm that produces an M-ary fusion set when our algorithm fails (i.e., it is complete).

Proof:
1. Correctness: This is self-evident from the computation of the M-ary fusion set, in which the earliest and latest estimates of generation times of packets are guaranteed to satisfy $g^l_{max} - g^e_{min} \le p$, and hence satisfy Equation (6).

2. Completeness: Suppose that there is an algorithm that determines a fusion set $\{n'_1, n''_2, \ldots, n''_m\}$ containing packet $n'_1$. We will now show that our M-ary algorithm will produce the above fusion set. Since the new algorithm produces a fusion set, it can also produce binary fusion sets for pairs of sources $(S_1, S_2), (S_1, S_3), \ldots, (S_1, S_m)$. By the completeness property of our binary mixing algorithm, our algorithm will also succeed in determining binary fusion sets. Let the union of all these binary fusion sets as determined by our algorithm be $F = \{n'_1, n'_2, \ldots, n'_m\}$. Since our binary algorithm always chooses lower binary fusion sets, we obtain that, $\forall s \in [2, m]$, either $n'_s = n''_s$ or $n'_s = n''_s - 1$.

[Figure 4: Window sliding in the M-ary algorithm. Labels: $g^e_s(n''_s)$, $p$, $g^e_1(n'_1)$, $g^e_s(n'_s)$, union of binary fusion sets, M-ary fusion set.]

If $n'_s = n''_s - 1$, then it must be the case that $n'_s$ forms a lower binary fusion set with $n'_1$ and is marked CHOOSABLE in our M-ary algorithm (on the other hand, $n''_s$ must form an upper binary fusion set with $n'_1$). Since their binary fusion sets are all lower, the earliest generation time of all such $n'_s$ is guaranteed to occur below that of $n'_1$, which is marked FIXED. Every window slide will replace an $n'_s$ by $n''_s$, and when the window slide operation reaches $g^e_1(n'_1)$, which is marked FIXED, the algorithm terminates after having (1) replaced all packets $n'_s$ that are marked CHOOSABLE, and (2) performed at most $m - 1$ window slides, yielding the set $\{n'_1, n''_2, \ldots, n''_m\}$. Hence, if there is any algorithm that produces an M-ary fusion set, so will our M-ary mixing algorithm, which goes to show that it is complete. $\Box$
Theorem 2 can be used to determine the media packet size $p$ such that the mixing algorithm is effective in a given network environment:

$$p \ge 2 \cdot (\Delta_{max} - \Delta_{min})$$

However, in practice, $p$ cannot be chosen to be very large, because feedback delays in applications such as tele-orchestra are required to be small, and are bounded by the human response time.
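As a small worked example, reading the packet size condition as $p \ge 2 \cdot (\Delta_{max} - \Delta_{min})$, the lower bound on $p$ can be computed directly from the delay bounds. The function name `min_packet_duration` is hypothetical, and the value used for $\Delta_{min}$ below is an assumed figure for illustration only; the paper does not report it.

```python
def min_packet_duration(delta_max_ms, delta_min_ms):
    """Smallest packet duration p (ms) satisfying p >= 2 * (delta_max - delta_min)."""
    if delta_min_ms < 0 or delta_max_ms < delta_min_ms:
        raise ValueError("expected 0 <= delta_min <= delta_max")
    return 2.0 * (delta_max_ms - delta_min_ms)


# Assumed delays: at most 10 ms and at least 2 ms per packet.
bound = min_packet_duration(10.0, 2.0)
p = 66.67                      # packet duration used in the experiments (ms)
print(bound, p >= bound)
```

With these assumed delays the bound is far below the 66.67 ms packet duration, so in such an environment the constraint on $p$ comes from feedback delay rather than from jitter.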
5 Communication Architectures for Media Mixing

The process of mixing passes through a sequence of two phases: (1) a transient phase, during which a fusion set is determined, and (2) a steady phase, in which media packets are mixed using the fusion set. Typically, the process of mixing enters the transient phase whenever a new source joins a conference, requiring that the fusion set that had been computed earlier for the older sources be changed to accommodate the new source. If the fusion set determined during a transient phase contains packets $n_1$ and $n_2$ from sources $S_1$ and $S_2$, respectively, then during the steady phase, $\forall k \ge 0$, packets $(n_1 + k)$ and $(n_2 + k)$ are mixed together.
The architecture for communication among participants of a conference during the steady phase can be centralized at one end of the spectrum, or fully distributed at the other end of the spectrum. The centralized architecture requires that each participant in a conference transmit media information to a central mixer. The mixer receives packets from all the participants, creates a composite packet by mixing the received packets, and then transmits it to all the participants. Each participant, on receiving the composite media packet, may have to perform some media dependent processing of the composite packet (such as removing his own contribution in the case of audio) before scheduling it for playback. At the other end of the spectrum is the distributed architecture, which requires that each participant in a conference transmit media information to each of the other participants. Mixing is performed by each participant independently. However, it is possible that the fusion sets derived by the participants may be different. To resolve this conflict, one of the participants is designated as the master mixer, and the fusion set computed by this master is propagated to all the participants. Whereas the centralized architecture is simple to implement but inflexible (i.e., does not provide features such as autonomous volume control of each media stream), the distributed architecture is flexible but incurs duplication of mixing computation and bandwidth usage. Neither architecture scales well (with either the number of participants or the geographical separation between participants) if the network, the network interface, or the processing power at the mixer is the bottleneck. By clustering together closely situated participants, and using a hierarchical mixing architecture (see Figure 5), we can bound the bandwidth and processing requirements at the mixers [7]. In a mixing hierarchy, participants constitute the leaf nodes, and the mixers constitute non-leaf nodes. 
During the transient phase, the root mixer computes the fusion set (since it is the only node that receives packet information from all the participants), and propagates it to each of the intermediate mixers. During the steady phase, each mixer receives media packets from its children, mixes them, and sends the composite packet to its parent. The mixer that is at the root of the hierarchy forwards the final mixed packet to each of the leaf nodes. The bandwidth required for packet reception at each mixer is proportional to the number of its children, whereas the bandwidth for packet transmission is that of sending to just one parent. (Even though the root mixer has to send a mixed packet to each of the participants, since the mixed packet is common to all the participants, the root mixer needs to make only one packet transmission by using multicasting.) Thus, by increasing only the height of the hierarchy while bounding the number of children of each mixer, the hierarchical architectures can be made highly scalable.

[Figure 5: A hierarchical architecture for mixing. Labels: root mixer M; mixers M1 and M2; participants P1 to P5; the root mixer multicasts to all participants.]

A special case of a hierarchical architecture is a directed ring, which can be thought of as a mixing tree in which each node has exactly one child. Such a configuration is appropriate for token ring based networks, and is analyzed by Ziegler et al. [8].

A generalization of the hierarchical architecture yields a graph-structured mixing architecture. In a non-hierarchical graph, there may be multiple paths between a participant and a mixer. Hence, a mixer may receive multiple mixed packets containing the same participant’s packet. To eliminate the duplication, the participant’s packet may have to be transmitted in addition to the mixed packet, leading to wastage of bandwidth. Since graph-structured architectures do not afford any special advantages over hierarchical ones, they are not very interesting for mixing.
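The bandwidth argument for the centralized, distributed, and hierarchical architectures can be made concrete by counting packet transmissions per packet period. The sketch below is illustrative only: the function names are hypothetical, every participant is assumed to generate exactly one packet per period, and the hierarchy is assumed to be a balanced tree with a bounded number of children per mixer.

```python
def centralized_transmissions(n, multicast):
    """n packets sent to the central mixer, plus the mixer's return traffic."""
    return n + (1 if multicast else n)


def distributed_transmissions(n, multicast):
    """Every participant sends its packet to each of the other n - 1 participants."""
    return n if multicast else n * (n - 1)


def min_hierarchy_height(n, fan_out):
    """Smallest height H such that a tree with at most fan_out children per mixer has n leaves."""
    if n < 1 or fan_out < 2:
        raise ValueError("expected n >= 1 and fan_out >= 2")
    height, capacity = 0, 1
    while capacity < n:
        capacity *= fan_out
        height += 1
    return height


for n in (5, 20, 400):
    print(n, centralized_transmissions(n, False),
          distributed_transmissions(n, False), min_hierarchy_height(n, 20))
```

Without multicasting, the distributed count grows as the square of the conference size, whereas in a hierarchy each mixer receives at most `fan_out` packets per period no matter how many participants join; only the height grows.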
6 Real-Time Performance of Mixing

The interactive and real-time nature of collaborative applications requires that the end-to-end delay experienced by media packets be bounded. When a mixer receives the first packet from one of the sources that goes to form a mixed packet, it has to delay the completion of the mixing process until it receives all other packets that constitute the mixed packet. However, if the network is unreliable, some of the packets that go to form a mixed packet may not arrive at the mixer. Hence, an important question is: how long should a mixer wait for packets from sources before deciding to transmit a partially mixed packet (which is not fully mixed because of the unavailability of packets from some of the sources)? A simple solution is for the mixer to transmit a partially mixed packet when a media packet that goes to form a subsequent mixed packet is received from one of the sources. However, this causes mixing delays of the order of the packet duration ($p$) at the mixer. Extension of this solution to a hierarchical mixing architecture of height $H$ results in a mixing delay of $p$ at each level, leading to an overall mixing delay of $H \cdot p$, and an end-to-end delay of $H \cdot p + (H + 1) \cdot \Delta_{max}$ ($H \cdot \Delta_{max}$ for the transmission delay from a leaf to the root up the hierarchy, and an additional $\Delta_{max}$ for the multicast of the final mixed packet from the root back to the leaf). The packet duration $p$ cannot be chosen to be very small (typical values for voice packets on Ethernet are 20 to 150 ms), mainly to keep the packet transmission overhead low. Hence, even in small mixing hierarchies, the mixing delay will turn out to be unacceptably large for supporting interactive and real-time multimedia applications.

We now present a simple algorithm that removes the proportional dependence of mixing delay on $p$ in a
general mixing hierarchy (of which other mixing architectures are special cases). In this algorithm, each mixer maintains information about the expected generation times of packets at its children. When a new source $S_s$ joins the conference, it sends a probe packet $n_s$ up the hierarchy to enable each intermediate mixer to compute the earliest and latest generation times for the packet. If the packet reaches a mixer at height $h$ at time $\tau(n_s)$, the mixer computes its earliest and latest generation times as follows:

$$g^e_s(n_s) = \tau(n_s) - h \cdot \Delta_{max}$$

$$g^l_s(n_s) = \tau(n_s) - h \cdot \Delta_{min}$$

Consider the process of forming the $k$th mixed packet at the mixer. Since packets are generated regularly at an interval of $p$, the mixer can estimate the earliest and the latest generation times of packet $(n_s + k)$ from source $S_s$ as follows:

$$g^e_s(n_s + k) = g^e_s(n_s) + k \cdot p$$

$$g^l_s(n_s + k) = g^l_s(n_s) + k \cdot p$$

The minimum earliest and maximum latest generation times of packets constituting the $k$th mixed packet are given by:

$$g^e_{min}(k) = \min_{s \in \{1, 2, \ldots, m\}} g^e_s(n_s + k)$$

$$g^l_{max}(k) = \max_{s \in \{1, 2, \ldots, m\}} g^l_s(n_s + k)$$

The earliest and latest arrival times of packets constituting the $k$th mixed packet can be pre-computed as follows:

$$\tau^e_{min}(k) = h \cdot \Delta_{min} + g^e_{min}(k)$$

$$\tau^l_{max}(k) = h \cdot \Delta_{max} + g^l_{max}(k)$$

Hence, all packets constituting the $k$th mixed packet must arrive at a mixer at height $h$ within the interval $[\tau^e_{min}(k), \tau^l_{max}(k)]$. At the root mixer, $h = H$ (height of the entire tree), and $\tau^l_{max}(k) = g^l_{max}(k) + H \cdot \Delta_{max}$. The maximum aggregate end-to-end delay suffered by a packet from a leaf to the root is given by $\tau^l_{max}(k) - g^e_{min}(k)$. Adding $\Delta_{max}$ for the transmission of the final mixed packet from the root to participants, we obtain the maximum aggregate end-to-end delay of a packet as:

$$\text{Aggregate End-to-End Delay} = ((g^l_{max}(k) + H \cdot \Delta_{max}) - g^e_{min}(k)) + \Delta_{max} \le p + (H + 1) \cdot \Delta_{max}$$
The above derivation of mixing delay has ignored the computational overhead of mixing at each mixer, which is assumed to be small compared to the communication and waiting delays.
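The arrival-window computation and the two delay expressions of this section can be sketched as follows. This is an illustrative sketch rather than the implemented system: all function names are hypothetical, times are in milliseconds, and every source is assumed to sit the same number of levels $h$ below the mixer in question.

```python
def generation_bounds(arrival, h, delta_min, delta_max):
    """(earliest, latest) generation time of a probe packet that arrived at height h."""
    return arrival - h * delta_max, arrival - h * delta_min


def arrival_window(probe_bounds, k, p, h, delta_min, delta_max):
    """Interval in which all packets of the k-th mixed packet reach a mixer at height h.

    probe_bounds holds one (earliest, latest) pair per source, as returned by
    generation_bounds for that source's probe packet.
    """
    g_e_min = min(g_e + k * p for g_e, _ in probe_bounds)
    g_l_max = max(g_l + k * p for _, g_l in probe_bounds)
    return h * delta_min + g_e_min, h * delta_max + g_l_max


def naive_delay(H, p, delta_max):
    """End-to-end delay when each level waits one packet duration: H*p + (H+1)*delta_max."""
    return H * p + (H + 1) * delta_max


def bounded_delay(H, p, delta_max):
    """Upper bound with pre-computed arrival windows: p + (H+1)*delta_max."""
    return p + (H + 1) * delta_max


# Packet duration and maximum delay reported for the experiments in Section 7.
print(naive_delay(2, 66.67, 10.0), bounded_delay(2, 66.67, 10.0))
```

For a hierarchy of height 2 with these values, the naive scheme exceeds 160 ms, while the bound with pre-computed windows stays under 100 ms, since $p$ is paid once rather than once per level.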
7 Implementation and Experience

We have implemented the mixing algorithms on a network of multimedia stations, each consisting of a computing workstation, a PC-AT, a video camera, and a TV monitor (see Figure 6). The workstations and PC-ATs are connected via Ethernets. The PC-ATs are equipped with digital video processing hardware that can digitize and compress motion video at real-time rates with a resolution of 480x200 pixels and 12 bits of color information per pixel, and audio hardware that can digitize voice at 8 KBytes/sec.

[Figure 6: Hardware configuration. Labels: multimedia stations, each with a video monitor, workstation, camera, and PC-AT; Ethernets; gateway.]

We carried out several experiments to evaluate the performance limits of mixing in audio conferencing applications. Audio samples are packetized and transmitted on the Ethernet. In order to strike a balance between network transmission overhead (which favors large packet sizes) and the packetization delay (which favors small packet sizes), the audio packet size was chosen to be 512 samples, yielding $p = 66.67$ ms. The maximum communication delay per packet is within 10 ms.

We observed the performance of centralized and distributed mixing architectures, and experimentally measured the maximum number of participants that they can support. Figure 7 shows the variation of the fraction of packets reaching a mixer in a centralized architecture with increase in the size of a conference. When that fraction goes below 98%, there is a rapid deterioration of voice quality, and the mixing architecture breaks down. The breakdown point yields a maximum conference size of 20 in the presence of multicasting, and 12 in its absence, for the centralized architecture. In the presence of multicasting, distributed and centralized architectures behave in a similar fashion. However, in the absence of multicasting, the performance of distributed architectures will be poorer due to the growth of bandwidth consumption as the square of the number of participants. For hierarchical architectures, given a maximum allowable end-to-end delay of about 100 ms, the analysis in Section 6 yields $H = 2$, showing that they can support conferences with up to $20^2 = 400$ participants.
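The figures quoted for hierarchical architectures can be reproduced from the Section 6 bound $p + (H + 1) \cdot \Delta_{max}$. The sketch below is a worked calculation under stated assumptions: the function names are hypothetical, and the per-mixer limit of 20 children is taken from the measured maximum conference size of a centralized mixer with multicasting.

```python
import math


def max_hierarchy_height(delay_budget_ms, p_ms, delta_max_ms):
    """Largest H satisfying p + (H + 1) * delta_max <= delay_budget."""
    height = math.floor((delay_budget_ms - p_ms) / delta_max_ms) - 1
    if height < 1:
        raise ValueError("no mixing hierarchy fits within the delay budget")
    return height


def max_participants(children_per_mixer, height):
    """Number of leaves in a full tree of the given height and fan-out."""
    return children_per_mixer ** height


H = max_hierarchy_height(100.0, 66.67, 10.0)
print(H, max_participants(20, H))
```

With a 100 ms budget, $p = 66.67$ ms, and $\Delta_{max} = 10$ ms, a height of 2 gives a bound of 96.67 ms, while a height of 3 would give 106.67 ms and exceed the budget.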
[Figure 7: Performance of audio mixing with increase in conference size. Y-axis: packets received at the mixer (%), 0 to 100; x-axis: conference size (number of participants), 0 to 30; curves: with multicasting, without multicasting.]
8 Concluding Remarks

We have presented algorithms for mixing media data transmitted by multiple sources in multimedia conferencing applications such as tele-orchestra, carried out on packet-switched networks. The algorithms minimize both differences between generation times of media packets being mixed, and end-to-end delays of mixed packets. The algorithms are complete, i.e., there cannot be a more effective algorithm for mixing in the absence of any control message exchanges between participants and mixers. We have implemented the mixing algorithms on a network of workstations equipped with digital multimedia hardware. Experimental evaluations demonstrate that centralized and distributed mixing architectures are limited in the number of participants that they can support. In order to overcome their limitations, we have proposed hierarchical mixing architectures, which can significantly reduce bandwidth consumption, making them suitable for scalable multimedia conferences.
REFERENCES

[1] L. Aguilar, J.J. Garcia-Luna-Aceves, D. Moran, E.J. Craighill, and R. Brungardt. Architecture for A MultiMedia Tele-Conferencing System. In Proceedings of the SIGCOMM’86 Symposium on Communications Architectures and Protocols, Stowe, VT, pages 126–136, August 5-7, 1986.

[2] S.R. Ahuja, J. Ensor, and D. Horn. The Rapport Multimedia Conferencing System. In Proceedings of COIS’88 Conference on Office Information Systems, Palo Alto, CA, pages 1–8, March 23-25, 1988.

[3] K.A. Lantz. An Experiment in Integrated Multimedia Conferencing. In Proceedings of CSCW’86, pages 267–275, December 1986.

[4] P. Venkat Rangan and D. C. Swinehart. Software Architecture for Integration of Video Services in the Etherphone Environment. IEEE Journal on Selected Areas in Communications, 9(9):1395–1404, December 1991.

[5] P. Venkat Rangan, Harrick M. Vin, and Srinivas Ramanathan. Communication Architectures and Algorithms for Media Mixing in Multimedia Conferencing. To appear in IEEE/ACM Transactions on Networking, 1(1), February 1993.

[6] S. Sarin and I. Greif. Computer-Based Real-Time Conferences. IEEE Computer, 18(10):33–45, October 1985.

[7] Harrick M. Vin, P. Venkat Rangan, and Srinivas Ramanathan. Hierarchical Conferencing Architectures for Inter-Group Multimedia Collaboration. In Proceedings of the Conference on Organizational Computing Systems (COCS’91), SIGOIS Bulletin, Vol. 12, No. 2-3, pages 43–55, November 1991.

[8] C. Ziegler, G. Weiss, and E. Friedman. Implementation Mechanisms for Packet Switched Voice Conferencing. IEEE Journal on Selected Areas in Communications, 7(5):698–706, June 1989.