IEEE COMMUNICATIONS LETTERS, VOL. 6, NO. 5, MAY 2002
Measurement-Based Multi-Call Voice Frame Grouping in Internet Telephony
Hyogon Kim, Member, IEEE, Inhye Kang, Member, IEEE, and Eenjun Hwang, Member, IEEE
Abstract—Grouping voice frames from multiple calls into a single packet as a means to reduce bandwidth requirement for IP voice calls is investigated. It is demonstrated that TCP-style Internet delay prediction can be instrumental to regulate the number of frames in each packet in a manner to minimize late losses of the frames under fast undulating Internet delay. The performance of the adaptive scheme is compared against those of fixed grouping cases with varying degrees of aggressiveness.
Index Terms—Frame grouping, Internet telephony, TCP delay estimation.

I. INTRODUCTION

ONE OF the advantages of Internet telephony over traditional public switched telephone network (PSTN) technology is that it can use low bit rate codecs to reduce the bandwidth requirement, and thus carry more calls over a given bandwidth. However, this holds only for the nominal codec rate. The standard RTP/UDP/IP encapsulation used in Internet telephony incurs 40 bytes of header overhead for voice frames whose payloads are typically small, such as 10, 20 [9], or 24 bytes [8]. When the header overhead is taken into account, an 8-kb/s G.729 codec, for instance, can require as much as 40 kb/s. This is a pressing problem for an Internet telephony service provider, who at times has to carry hundreds or thousands of such calls across the Internet.

There are two approaches to reducing the header overhead in Internet telephony. The first is to compress the header itself, as in [6]. However, current compression schemes per se cannot be used on a path that spans multiple links [5]. The second approach is frame grouping, i.e., grouping multiple voice frames into a single IP packet [1]–[3]. In our previous work [11], we demonstrated that a TCP-style delay prediction algorithm can be used to adaptively determine the degree of frame grouping for a single call. Single-call frame grouping can be used in the Web (or IP Phone)-to-Phone scenario, where only one endpoint is connected to the PSTN, as well as in the Phone-to-Phone scenario. Unfortunately, single-call frame grouping requires a sufficient delay budget for gathering successively generated frames to be shipped in a single packet. For instance, the G.723.1 encoder would incur 33 ms of additional delay for each piggybacked frame. Considering that the mouth-to-ear delay requirement for Internet telephony calls can be as short as 175 ms [4], a large bandwidth reduction is hard to expect. This problem is exacerbated by the numerous other delay components to be accounted for in Internet telephony [5].

Manuscript received October 5, 2001. The associate editor coordinating the review of this letter and approving it for publication was Prof. C. Douligeris. This work was supported by Korea Science and Engineering Foundation under Interdisciplinary Research Grant R01-2001-00339. H. Kim and E. Hwang are with the School of Information and Communication, Ajou University, Suwon 442-749, Korea (e-mail: hyogon@madang.ajou.ac.kr). I. Kang is with Samsung SECUi.com, Seoul 100-130, Korea. Publisher Item Identifier S 1089-7798(02)01931-2.

II. FRAME GROUPING WITH MULTIPLE CALLS
The inherent problem of single-call frame grouping disappears if we can multiplex frames from different calls into the same packet. This is generally not feasible in the Web-to-Phone scenario because of the lack of a large-scale call aggregation point. However, it becomes easy in the Phone-to-Phone scenario, where a call originates in the PSTN, crosses the Internet, and then terminates in the PSTN. Multiple calls are then placed between a pair of Internet telephony gateways, and grouping can be applied to the calls that have the same origin–destination (o–d) gateway pair. In such a case, a gateway can multiplex voice streams that share the same destination gateway address and UDP port number. (That is, unless we allow a separate port to be used for each call, which is more complex to implement.) Each frame still keeps a separate RTP header after the grouping, since the RTP specification strongly discourages RTP-level multiplexing for numerous technical reasons [3]. Compared with the single voice stream case, multiplexing frames from disparate sources in a single packet can greatly reduce the bandwidth usage with far smaller delay jitter added to individual calls.

Assume that the unit of time over which the grouping is applied is $\Delta$, and the activity ratio for each call (the talk spurt duration divided by the sum of the silence and talk spurt durations) is $\rho$. Also, let us assume that the number of calls between a particular o–d gateway pair is $N$. Then the average number of frames arriving at the origin gateway for the destination gateway during $\Delta$ is $n = \rho N r \Delta$, where $r$ is the frame rate. Given the header sizes of 12, 8, and 20 bytes for RTP/UDP/IP encapsulation, the bandwidth reduction for $n \ge 1$ can be easily calculated as

$$R = 1 - \frac{28 + n(12 + F)}{n(40 + F)} = \frac{28}{40 + F}\left(1 - \frac{1}{n}\right)$$
where $F$ is the frame size. Fig. 1 shows the bandwidth reduction $R$ for G.723.1 as a function of the grouping interval $\Delta$ and the number of calls $N$, with the activity ratio $\rho$ set as specified by ITU P.59 [7] for the artificial human speech model. We notice that $R$ converges fast to its theoretical maximum of $28/(40 + F)$, about 44% for the 24-byte G.723.1 frame, as $N$ increases, especially as $\Delta$ grows. In particular, even for $\Delta$ as small as 20 ms, less than a single frame time, $R$ comes close to this maximum when $N$ is large. A similar argument can be made for other low bit rate codecs such as G.729.

Fig. 1. Bandwidth reduction as a function of $\Delta$ and $N$, G.723.1.
Fig. 2. Breakdown of delay components.
Fig. 3. Successfully piggybacked frames, noon trace.
1089-7798/02$17.00 © 2002 IEEE

Since $\rho$, the frame rate $r$, and $N$ are rather given, not adjustable, $\Delta$ is the only handle that we have for controlling $R$. And as in the single-call grouping [11], $\Delta$ can only be estimated from the measured mouth-to-ear delays, i.e., from the following condition:

$$\Delta \le D - d_o - d_I \tag{1}$$

where $D$ is the required delay bound, $d_I$ is the Internet delay between the gateways, and $d_o$ is the sum of all delay components except the delay for the IP leg of the call (Fig. 2). But since $\Delta$ is a possibly fast and wildly varying value due to the Internet delay $d_I$ between gateways, its prediction should be safe, so as not to cause delay bound violations and consequent late losses. On the other hand, the prediction should not excessively overestimate $d_I$, so as to maximize $\Delta$ for large bandwidth saving (see Fig. 1). So a delay prediction mechanism must be designed to overestimate, but only by a minimal amount.

Fortunately, the delay estimation algorithm in TCP achieves this very objective. For TCP, it is crucial to prevent both spurious timeouts due to connection delay underestimation and slow packet loss detection due to overestimation. Therefore, it uses a retransmission timeout (RTO) calculation algorithm [10] that overestimates the network delay by just enough to cover the delay envelope. The effectiveness of this algorithm has been validated in the real Internet environment for more than a decade, and it exactly meets our need. To capture the delay envelope, the TCP RTO calculation algorithm first computes a smoothed mean delay, and then additionally takes the variance into account. In other words, it first captures the low-frequency component of the delay fluctuation using a low-pass filter, and then adds to it a mean deviation scaled by a factor $K$ (with $K = 4$ and filter gains $\alpha = 1/8$, $\beta = 1/4$ in [10]), i.e.,

$$v \leftarrow (1 - \beta)\,v + \beta\,|a - d_I|, \qquad a \leftarrow (1 - \alpha)\,a + \alpha\,d_I, \qquad \hat{d}_I = a + K v \tag{2}$$
where $\hat{d}_I$ is the predicted delay (so it predicts the future from our perspective) and $d_I$ is the measured delay. $a$ is the smoothed mean delay and $v$ is the mean deviation. We adopt the method of (2) for adaptive frame grouping and denote it by VAR. To put its performance in perspective, we compare VAR with a static method that operates with a fixed delay estimate in place of the prediction of (2). We denote this method by FIX. This is to see how nonadaptive grouping methods perform under varying degrees of aggressiveness.

III. PERFORMANCE EVALUATION

For our experiments, we obtained real Internet delay traces against four gateways from each of three free Internet PC-to-phone service providers in Korea. Each measured delay time series is 30 s long, and one was obtained roughly once an hour over a 3-wk period in August 2000. In this letter, we show the simulated results of multicall frame grouping against three of these 30-s delay traces, obtained at 9:45 am, 12:20 pm, and 8:08 pm on August 21, 2000, respectively. The “noon trace” (12:20 pm) has the highest average delay and delay variance [5], and the “night trace” (8:08 pm) is the most quiescent. The “morning trace” (9:45 am) lies in between. In the experiments, we vary the delay bound $D$ between 50 and 150 ms. For simplicity, we set $d_o$, the sum of the delay components outside the IP leg, to 0. And we set the number of calls to $N = 500$. We artificially generated the voice traffic for the 500 calls according to the average talk spurt and silence durations specified in P.59 [7], with exponentially distributed talk spurt and silence durations. We counted the number of delay violations and the number of successfully piggybacked frames.

Fig. 3 shows the number of successfully (i.e., non-late-loss) piggybacked frames for the noon trace under the two schemes. In the most favorable setting, among the 197 532 frames in the voice traffic stream, almost 190 000 frames, i.e., roughly 96%, were transported without UDP/IP headers. This shows how efficient multiple-call frame grouping can be. Note that the FIX schemes only start piggybacking once the delay bound $D$ is large enough. The figure bears out our observation that the bandwidth reduction fast approaches the maximum as $D$ grows, since a larger $D$ permits a larger grouping interval $\Delta$ under the same delay conditions. The figure also shows that FIX manages to transport a slightly larger number of frames without headers.
Fig. 4. Late losses for piggybacked frames, noon trace.
However, this comes at the cost of significantly higher late loss rates [5]. Owing to space constraints, we only show the late loss rates for the noon trace (Fig. 4). Notice that for most values of the delay bound, the late loss rate of VAR is zero, and where it is not, it is bounded by 0.2%. For the morning trace, the simulation yields a much stronger result [5]: FIX at different settings leads to late loss rates that are orders of magnitude larger (as large as 10%) than those of VAR. From these results, we can conclude that FIX is more aggressive and thus allows more “free rides,” but it is proportionally more risky. Although VAR slightly lags behind FIX in terms of successfully piggybacked frames, it is much safer than FIX.
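The trade-off between an adaptive, TCP-style delay prediction and a fixed delay estimate can be illustrated with a minimal sketch. It is not the simulator used for Figs. 3 and 4. It assumes the RTO update rules and constants of RFC 2988 [10] (gains 1/8 and 1/4, scaling factor 4, first-sample initialization), ignores the clock-granularity and minimum-RTO clauses of that RFC, and treats a grouped packet as late whenever the grouping interval plus the actual Internet delay exceeds the delay bound. The names `DelayPredictor`, `grouping_interval`, and `simulate`, as well as the synthetic delay trace, are hypothetical.

```python
import random

# Constants from RFC 2988; whether the letter uses exactly these is an assumption.
ALPHA, BETA, K = 1 / 8, 1 / 4, 4


class DelayPredictor:
    """TCP-style estimator: smoothed mean plus a scaled mean deviation."""

    def __init__(self, first_sample_ms):
        # RFC 2988 initialization on the first measurement.
        self.a = first_sample_ms        # smoothed mean delay
        self.v = first_sample_ms / 2    # mean deviation

    def update(self, sample_ms):
        # The deviation is updated first, using the old smoothed mean.
        self.v = (1 - BETA) * self.v + BETA * abs(self.a - sample_ms)
        self.a = (1 - ALPHA) * self.a + ALPHA * sample_ms

    def predict(self):
        return self.a + K * self.v


def grouping_interval(bound_ms, other_ms, predicted_ms, cap_ms=None):
    """Delay budget left for grouping, optionally capped (e.g. at 20 ms)."""
    budget = max(0.0, bound_ms - other_ms - predicted_ms)
    return budget if cap_ms is None else min(budget, cap_ms)


def simulate(delays_ms, bound_ms, fixed_estimate_ms=None, cap_ms=20.0, other_ms=0.0):
    """Return (grouped, late) packet counts over a delay trace.

    fixed_estimate_ms=None selects the adaptive predictor; a number selects
    a fixed delay estimate instead.
    """
    predictor = DelayPredictor(delays_ms[0])
    grouped = late = 0
    for actual in delays_ms[1:]:
        if fixed_estimate_ms is None:
            estimate = predictor.predict()
        else:
            estimate = fixed_estimate_ms
        delta = grouping_interval(bound_ms, other_ms, estimate, cap_ms)
        if delta > 0:
            grouped += 1
            # Assumed lateness model: total delay exceeds the bound.
            if other_ms + delta + actual > bound_ms:
                late += 1
        predictor.update(actual)
    return grouped, late


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic trace: 60 ms base delay plus exponential jitter (illustrative only).
    trace = [60 + rng.expovariate(1 / 10) for _ in range(1000)]
    for label, fixed in (("adaptive", None), ("fixed 60 ms", 60.0), ("fixed 90 ms", 90.0)):
        grouped, late = simulate(trace, bound_ms=150, fixed_estimate_ms=fixed)
        print(label, "grouped:", grouped, "late:", late)
```

Passing `cap_ms=None` removes the limit on the grouping interval, which gives a rough counterpart of an unlimited adaptive scheme.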
IV. LIMITATION OF TCP-STYLE ESTIMATION

One caveat of the TCP-style delay estimation is that it can track the real delay too closely and lead to delay violations. (This amounts to spurious timeouts in TCP.) The reason is that the granularity of the timer used in measuring the delay is expected to be rather coarse in TCP. For instance, a measurement error as large as 100 ms due to the coarse timer can occur in TCP, and this is still regarded as “fine” timer granularity [10]. In contrast, our delay measurement has 0.1-ms granularity, which results in the delay prediction being too close to the real delay. If a small bump in the delay follows, a delay violation occurs. We find that if the delay samples do not fluctuate much and the measurement error is small, dilating or arbitrarily inflating the terms of (2) is ineffective in curbing the overly aggressive grouping [5]. Fortunately, limiting the grouping interval $\Delta$ is completely justified by the observation in Fig. 1, so without breaking the algorithm we can modify Inequality (1) to

$$\Delta \le \min\left(\Delta_{\max},\; D - d_o - d_I\right)$$

where $D$ is the delay bound, $d_o$ is the sum of the delay components outside the IP leg, $d_I$ is the Internet delay, and $\Delta_{\max} = 20$ ms in our experiments shown in Figs. 3 and 4, in which we also limited $\Delta$ in the FIX schemes to 20 ms for fairness. Fig. 5 shows the impact of the modification.

Fig. 5. Unlimited VAR versus VAR with $\Delta \le 20$ ms, night trace.

V. CONCLUSION

In multicall frame grouping, the additional delay need not be large to achieve significant bandwidth saving. Adaptive frame grouping using TCP-style delay prediction generally outperforms fixed schemes, with comparable bandwidth savings and far fewer late losses. A caveat is that TCP’s RTO estimation algorithm per se is overly aggressive when the measured delay has fine (e.g., submillisecond) granularity and the delay fluctuation is mild. Fortunately, this can be effectively controlled by limiting the delay budget to a small value. Doing so also helps to limit the delay and delay jitter of the individual calls, bounding the quality degradation due to frame grouping.
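The observation that a small grouping interval already captures most of the attainable saving can be checked with a short calculation. The sketch below assumes that every grouped frame keeps its own 12-byte RTP header while one 28-byte UDP/IP header is shared per packet. The function names are hypothetical, and the activity ratio of 0.4 is an illustrative assumption, not a value taken from P.59. The G.723.1 figures (24-byte frames every 30 ms) and the G.729 figures (10-byte frames every 10 ms) are the nominal codec parameters.

```python
RTP_HDR, UDP_HDR, IP_HDR = 12, 8, 20  # header sizes in bytes


def mean_frames(calls, activity, frame_rate_hz, interval_s):
    """Average number of frames arriving during one grouping interval."""
    return activity * calls * frame_rate_hz * interval_s


def bandwidth_reduction(frames_per_packet, frame_bytes):
    """Fractional saving when n frames share one UDP/IP header."""
    n = frames_per_packet
    if n <= 1:
        return 0.0  # nothing to share with a single frame
    separate = n * (RTP_HDR + UDP_HDR + IP_HDR + frame_bytes)
    grouped = UDP_HDR + IP_HDR + n * (RTP_HDR + frame_bytes)
    return 1.0 - grouped / separate


def max_reduction(frame_bytes):
    """Limit of the saving as the number of grouped frames grows."""
    return (UDP_HDR + IP_HDR) / (RTP_HDR + UDP_HDR + IP_HDR + frame_bytes)


def ungrouped_rate_kbps(frame_bytes, frame_ms):
    """Per-call rate with one frame per packet (bits per ms equals kb/s)."""
    return (RTP_HDR + UDP_HDR + IP_HDR + frame_bytes) * 8 / frame_ms


if __name__ == "__main__":
    # G.729 with one 10-byte frame per packet every 10 ms.
    print("G.729 ungrouped rate (kb/s):", ungrouped_rate_kbps(10, 10))
    # G.723.1, 500 calls, assumed activity ratio 0.4, 20-ms grouping interval.
    n = mean_frames(calls=500, activity=0.4, frame_rate_hz=1 / 0.030, interval_s=0.020)
    print("frames per packet:", round(n, 1))
    print("reduction:", round(bandwidth_reduction(n, 24), 4))
    print("maximum:", round(max_reduction(24), 4))
```

Under these assumptions the saving at a 20-ms interval with 500 calls is already within a fraction of a percentage point of the limit, which is why capping the grouping interval costs little bandwidth.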
REFERENCES
[1] M. Baldi and F. Risso, “Efficiency of packet voice with deterministic delay,” IEEE Commun. Mag., pp. 170–177, May 2000.
[2] Application of the E-model: A planning guide, ITU-T Recommendation G.108, Sept. 1999.
[3] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, “RTP: A transport protocol for real-time applications,” RFC 1889.
[4] D. De Vleeschauwer, J. Janssen, and G. H. Petit, “Delay bounds for low bit rate voice transport over IP networks,” in SPIE Conf. on Performance and Control of Network Systems III, Sept. 1999, pp. 40–48.
[5] H. Kim et al. (2001, Aug.) The methods and the feasibility of frame grouping in Internet telephony. Ajou Univ., Internet Lab Tech. Rep. ILAB-TR-01-03. [Online]. Available: http://wins.ajou.ac.kr/~hkim/ILAB-TR-01-03.pdf
[6] S. Casner and V. Jacobson, “Compressing IP/UDP/RTP headers for low-speed serial links,” RFC 2508.
[7] Telephone transmission quality objective measuring apparatus: Artificial conversational speech, ITU Recommendation P.59, Mar. 1993.
[8] Dual rate speech coder for multimedia communications transmitting at 5.3 and 6.3 kbit/s, ITU-T Recommendation G.723.1, Mar. 1996.
[9] Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited linear-prediction, ITU-T Recommendation G.729, Mar. 1996.
[10] V. Paxson and M. Allman, “Computing TCP’s retransmission timer,” RFC 2988.
[11] H. Kim and I. Kang, “Measurement-based frame grouping for Internet telephony,” Electron. Lett., vol. 37, no. 1, pp. 71–72, Jan. 2001.