Quality Enhancement of Packet Audio with Time-Scale Modification Fang Liu, JongWon Kim* and C.-C. Jay Kuo Integrated Media Systems Center and Department of Electrical Engineering-Systems University of Southern California, Los Angeles, CA 90089-2564 *Department of Information & Communication, K-JIST (Kwang-Ju Institute of Science & Technology) GwangJu, 500-712, KOREA E-mail: {fliu,cckuo}@sipi.usc.edu,
[email protected]

ABSTRACT

In traditional packet voice and the emerging 2.5G and 3G wireless data services, smooth and timely delivery of audio is an essential requirement in Quality of Service (QoS) provisioning. It has been shown in our previous work that, by applying time-scale modification to audio signals, an adaptive playout algorithm can be designed to minimize packet dropping at the receiver end. By stretching the audio frame duration up and down, the proposed algorithm can adapt quickly to accommodate fluctuating delays, including delay spikes. In this paper, we address packet audio QoS with emphasis on end-to-end delay, packet loss, and delay jitter. The characteristics of delay and loss are discussed. Adaptive playback enhances the audio quality by adapting to transmission delay jitter and delay spikes. Coupled with Forward Error Correction (FEC) schemes, the proposed delay and loss concealment algorithm achieves a lower overall application loss rate without sacrificing the average end-to-end delay. The optimal solution of such algorithms is also discussed. In addition, we investigate the effect of stretching-ratio transitions on perceived audio quality by measuring the objective Perceptual Evaluation of Speech Quality (PESQ) Mean Opinion Score (MOS).

Keywords: Time-scale modification, SOLA, adaptive playout, delay jitter concealment, Forward Error Correction, and packetized audio.
1. INTRODUCTION

There are many factors that could affect the Quality of Service (QoS) of a packet audio streaming session. Among them, end-to-end delay, delay jitter and packet loss are the major issues addressed in this work. The end-to-end delay of a two-way communication has to be kept within a certain range for the session to remain interactive. If the connection is established via the best-effort Internet, the delay variation of transmitted packets is called delay jitter. Usually, a receiver buffer of limited capacity is utilized to smooth out delay jitter. The packets that cannot be accommodated by the buffer or the receiver playout scheduling algorithm become receiver drop losses and count toward the total application packet loss. Speech codecs such as G.723/G.729 use inter-frame coding, so the loss of audio packets not only results in the loss of sound intervals but also affects the decoding of subsequent frames. Research efforts on Internet voice/audio streaming have focused on delay concealment in the presence of delay jitter [1], [2], in which silence intervals between talk-spurts are utilized to adapt to the delay variation. Each talk-spurt adapts to the estimated delay obtained from the statistics of previous talk-spurts. By adjusting (either expanding or contracting) the silence length at the receiver according to the recent network situation, more late-arriving packets can be salvaged instead of being thrown away. However, in the work of Ramjee et al. [1] and Moon et al. [2], only the first packet in a talk-spurt can be used to adapt to the delay, while the playout time of all other packets in the talk-spurt has to follow the original schedule. In other words, no adaptation is allowed within a talk-spurt. In the case of a large delay variation (such as a delay spike) starting in the middle of one talk-spurt, the algorithm has to wait until the next talk-spurt. The basic idea of the adaptive playout mechanism with time-scale modification was recently introduced independently by Liu et al. [3] and Liang et al. [4]. Time-scale modification, including both expansion and contraction, modifies the time duration of an acoustic signal without changing its acoustic attributes, such as pitch, timbre, and so on. By applying a varying degree of stretching to each packet (although it is important to maintain the average stretching factor within a talk-spurt), every packet can contribute to the concealment of network delay jitter/spikes
as well as packet loss. Experimental results given in [3] and [4] show that, by extending per-talk-spurt adaptation to per-packet adaptation, we can achieve a lower packet dropping rate without compromising the average end-to-end delay at the receiver. To compensate for Internet packet loss, network error control schemes such as FEC, automatic repeat request (ARQ), and hybrids of the two are commonly employed to enhance end-to-end audio quality. For a delay-stringent application, reactive protection via ARQ is impractical due to its latency. Only proactive protection based on packet-level FEC, which sends redundant packets along with the original source packets, provides a viable solution. There are two types of FEC, depending on whether or not they are media specific [8]. Media-independent FEC schemes such as parity and Reed-Solomon (RS) codes are popular. Recently, Rosenberg et al. [5] proposed a new scheme, in which FEC based on RS codes was integrated with the silence-based adaptive playout scheme to provide combined compensation for loss and delay jitter. They proposed several virtual extensions of the adaptive playout algorithms in [1], [2] to couple loss/delay and control the target application loss rate. In contrast with the silence-based FEC extension, we use the packet-based time-scale modification technique to deal with fluctuating network packet delay and loss. Compared with silence-based adaptive playout [1] and its FEC extension [5], our scheme coordinates variable time-scale modification with FEC for packet delay/loss control. The decision on the time-scale modification factor (or the stretching factor) is made on a packet-by-packet basis by considering the timing information of the current stage only. This packet-oriented approach might result in some performance degradation. However, with the help of the proposed content-aware variable stretching technique, a good tradeoff can be achieved, and the overall performance can be kept reasonable. The rest of this paper is organized as follows. Audio QoS and its measures are discussed in Section 2, including packet loss measurements, Internet delay jitter simulation, and objective quality measurement in terms of the PESQ MOS. Our proposed packet-based SOLA algorithm is explained in Section 3, with emphasis on the content classification in Section 3.1. In Section 4, the design issues of a good adaptive playout algorithm are discussed and analyzed. Finally, the conclusion and future work are given in Section 5.
2. AUDIO QOS

2.1. Loss

For delay-stringent Internet voice applications, proactive protection with packet-level FEC provides a good solution to error handling. It encompasses several schemes as discussed in [8]. Of the two types of FEC, media-independent and media-specific, we adopt media-independent FEC for its generic applicability. Note that the effectiveness of FEC is affected by the type of network loss (e.g., its strength and burstiness). Also, the error correction latency and efficiency vary with the FEC type; for example, they are different for parity and RS codes. A basic FEC scheme (n, k) transmits k original data packets (called the data block) together with n − k FEC packets (called the FEC block) encoded from the k data packets. The original data can be reconstructed if any k of these n packets are received. One particular way of sending out the parity block is "piggybacking", i.e., the n − k packets of the FEC block are piggybacked onto the first n − k data packets of the next group. The network delay of any packet is then the minimum of its own arrival time and the FEC recovery time (if recovery is applicable). Thus, the FEC scheme not only works for packet loss recovery, but also reshapes the delay distribution of packets to a certain degree. To model the network loss, we assume that audio packets experience network loss according to a Bernoulli process with loss probability p. Note that similar results can also be obtained with a Gilbert process. Even when waiting indefinitely for FEC recovery, there may still remain a percentage of packets that cannot be recovered (due to the limitation of FEC recovery). This percentage is referred to as the residue network loss probability and denoted by PR. Given a certain p, PR with (n + 1, n) parity recovery can be derived as
P_R(\%)\big|_{(n+1,\,n)} = 1 - \left[\, (1-p) + p\,(1-p)^{n} \,\right].   (1)
The value of PR is depicted in Fig. 1 (a), where PR's for grouping numbers n = 2, 3, 4, 5, 6, 7, 8 are shown. These curves can be used as a guideline to choose an appropriate grouping number n for a given estimated network loss probability p. For example, PR is about 12% at p = 20% and n = 4.
If the RS code is used for FEC recovery with a grouping scheme (n + m, n), the residue network loss can be derived as

P_R(\%)\big|_{(n+m,\,n)} = 1 - \left[\, (1-p) + p \sum_{i=n}^{n+m-1} \binom{n+m-1}{i} (1-p)^{i}\, p^{\,n+m-1-i} \,\right].   (2)
Curves of the PR value are shown in Fig. 1 (b) for network loss rate p = 10% with varying n and m numbers (only m ≤ n is considered). Then, the performance metric of the proposed adaptive playout, which is called the application loss probability and denoted by PA, can be defined. It is the overall loss percentage measured after adaptive playout. Due to the receiver dropping loss caused by delay jitter, PA cannot be lower than PR. Different adaptive playout algorithms strive to minimize the difference between PA and PR while keeping the average playout delay as low as possible.
Figure 1. The residue network loss probability PR with (a) parity FEC (n + 1, n) for p = 1%–20%, and (b) RS FEC (n + m, n) for p = 10%.
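To make the loss formulas above concrete, the short Python sketch below evaluates Eqs. (1) and (2) numerically; the function names parity_residue_loss and rs_residue_loss are our own illustrative choices, not part of the paper.

```python
from math import comb

def parity_residue_loss(p: float, n: int) -> float:
    """Residue loss probability P_R for an (n+1, n) parity FEC scheme, Eq. (1).

    A lost packet is recovered only if the remaining n packets of its group
    (n-1 data packets plus the parity packet) all arrive.
    """
    return 1.0 - ((1.0 - p) + p * (1.0 - p) ** n)

def rs_residue_loss(p: float, n: int, m: int) -> float:
    """Residue loss probability P_R for an (n+m, n) Reed-Solomon scheme, Eq. (2).

    A lost packet is recovered if at least n of the other n+m-1 packets
    in its group arrive.
    """
    recover = sum(comb(n + m - 1, i) * (1.0 - p) ** i * p ** (n + m - 1 - i)
                  for i in range(n, n + m))
    return 1.0 - ((1.0 - p) + p * recover)

if __name__ == "__main__":
    # Reproduces the example in the text: P_R is about 12% at p = 20%, n = 4.
    print(f"parity (5,4), p=20%: {100 * parity_residue_loss(0.20, 4):.1f}%")
    # RS curves in the spirit of Fig. 1 (b): p = 10%, n = 4, m = 1..4.
    for m in range(1, 5):
        print(f"RS ({4 + m},4), p=10%: {100 * rs_residue_loss(0.10, 4, m):.2f}%")
```

Note that rs_residue_loss with m = 1 reduces to the parity case of Eq. (1), which is a quick consistency check on both formulas.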
2.2. Internet Delay Jitter

Internet packet delay and loss behavior were studied by Bolot in [7]. He gathered several Internet traces by establishing round-trip data connections with probe packets. The round-trip time (RTT) of each packet was logged, and the series of RTT values were plotted. By carefully examining these data, he proposed a first-in-first-out (FIFO) queuing model to characterize Internet packet audio delay behavior as shown in Fig. 2. This queuing system is a statistical multiplexer of packet audio and data traffic. In the analysis, it was also found that, when the Internet traffic is light, i.e. when the number of Internet packets in the buffer is small, the time a probe audio packet waits for service is almost constant throughout the session. However, if the Internet traffic is heavy, probe audio packets can be queued behind large bursts of Internet traffic and will be delayed consecutively.
Figure 2. The queuing model used to characterize packet audio delay.

In this work, we adopt the same queuing model, shown in Fig. 2, to simulate the delay jitter and loss behavior of audio packets under a certain Internet traffic load. With this model, one can easily manipulate the delay jitter variation and the packet-loss rate for performance evaluation of the proposed adaptive playout algorithm. Generally speaking, the audio traffic of interest is modeled as packets of a fixed size (say, 10k bits) arriving at regular intervals (say, once per second). The other Internet traffic has two components: bursty traffic of a larger size and interactive traffic of a small size.
A more detailed description of the model depicted in Fig. 2 is provided below; a simulation sketch based on this model is given at the end of this subsection.

• FIFO Queue: The queue buffer can hold up to Qsize units of data with a servicing rate of p units of data per time slot. The D + G/D/1/Qsize queue, where D represents a deterministic distribution, is adopted.

• Audio traffic input: Audio packets of fixed size A, including the payload of one audio packet and the packet header, arrive every T time slots.

• Internet traffic input: Internet traffic of variable size arrives every time slot. Internet packets have independent and identically distributed sizes drawn from a general multimodal distribution. We use a combination of several geometrically distributed random numbers with different parameters.

With the above model, we can generate the time series of audio packets experiencing Internet transmission delay. Examples are shown in Fig. 3, where (a) and (b) are for light Internet traffic while (c) and (d) are for heavy Internet traffic. In Figs. 3 (a) and (c), we plot delay vi as a function of packet index i. Figs. 3 (b) and (d) are phase plots. In a phase plot, we plot the delay vi of packet i on the X-axis and the delay vi+1 of packet i + 1 on the Y-axis to form a pair (vi, vi+1). By varying the value of i, we obtain a large number of dots in the plot. The phase plot is revealing in distinguishing heavy from light Internet traffic conditions. As shown in Fig. 3 (b), the dots are evenly distributed along the diagonal line y = x for the light Internet traffic case. However, when the traffic becomes heavy, the phase plot demonstrates a correlation between vi and vi+1. That is, when vi is high, vi+1 is likely to be high, too.
Figure 3. The simulated delay jitter plots and their associated phase plots: (a) and (b) are for the light traffic case while (c) and (d) are for the heavy traffic case.

Traditionally, Internet packet loss is treated as a separate issue from delay jitter modeling. To model packet loss, it is common to use a random dropper to emulate the loss percentage. This is done in the last stage of our channel
simulation model as shown in Fig. 2. It is also observed that the number of consecutive lost packets follows the geometric distribution. We apply the simulated delay jitter and loss effect to packetized voice, and use the adaptive playout algorithm to control the stretching ratio for each packet. The stretching ratio is bounded by two factors: the end-to-end delay bound and the content-based stretching allowance, which will be detailed in Section 3.1.
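As an illustration of how such a trace can be generated, the Python sketch below simulates a discrete-time FIFO queue fed by periodic audio packets and random background Internet traffic, followed by a Bernoulli dropper. All parameter values (queue size, service rate, traffic mix) are hypothetical placeholders that mimic the qualitative behavior described above, not the exact settings used in our experiments.

```python
import random

def geometric(rng, p):
    """Sample a geometrically distributed packet size (number of data units)."""
    k = 1
    while rng.random() > p:
        k += 1
    return k

def simulate_delay_trace(num_audio=2000, T=8, A=10, service=12, q_size=400,
                         loss_prob=0.05, burst_prob=0.05, seed=0):
    """Simulate per-packet audio delay (in time slots) through the queue of Fig. 2.

    Each slot, the queue drains 'service' units and receives background traffic:
    a small interactive packet every slot plus an occasional large burst, both
    with geometrically distributed sizes. An audio packet of size A arrives every
    T slots; its delay is the backlog it finds divided by the service rate.
    A Bernoulli dropper at the output models random network loss.
    """
    rng = random.Random(seed)
    backlog = 0.0
    delays = []
    slot = 0
    for i in range(num_audio):
        while slot < i * T:                           # advance time to the next audio arrival
            backlog = max(0.0, backlog - service)
            traffic = geometric(rng, 0.3)              # interactive traffic (small)
            if rng.random() < burst_prob:
                traffic += 10 * geometric(rng, 0.1)    # bursty traffic (large)
            backlog = min(q_size, backlog + traffic)
            slot += 1
        if backlog + A > q_size or rng.random() < loss_prob:
            delays.append(None)                        # overflow drop or random dropper loss
        else:
            backlog += A
            delays.append(backlog / service)           # queuing delay of audio packet i
    return delays

# Consecutive pairs (delays[i], delays[i+1]) give the phase plots of Fig. 3 (b)/(d).
```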
2.3. Objective Quality Measurement

Determining the subjective quality of transmitted speech data has always been an expensive and laborious process. Today, a tool described in ITU-T Rec. P.862, known as PESQ, can provide rapid and repeatable results. PESQ is an objective measurement tool that predicts the results of subjective listening tests on telephony systems. PESQ uses a sensory model to compare the original, unprocessed signal with the degraded signal from the network or the network element. The resulting quality score is analogous to the subjective Mean Opinion Score (MOS) measured using panel tests according to ITU-T P.800. PESQ scores are calibrated using a large database of subjective tests. The ITU-T selection process that resulted in the standardization of PESQ covered a wide range of conditions, with demanding correlation requirements set to ensure that PESQ performs well in assessing speech quality in conventional fixed and mobile networks as well as packet-based transmission systems. PESQ takes into account coding distortions, errors, packet loss, delay and variable delay, and filtering in analogue network components.

The DSLA (Digital Speech Level Analyzer) and its user interface have been designed to provide simple access to this powerful algorithm, either directly from the analogue connection or from speech files recorded elsewhere. The performance of a network or a network element can be fully characterized using DSLA and PESQ. While it is possible to use phonetically balanced sentences and other test patterns, accurate and repeatable measurements of the active speech level, activity, delay, echo, noise and speech quality can be obtained quickly using the artificial speech test stimulus in different languages. A graphical mapping of errors provides a useful insight into how the signal has been degraded.

According to the ITU-T P.862 document, PESQ has demonstrated acceptable accuracy in the case of time warping of audio signals. Here, time warping means delay changes during silence intervals (the result of dynamic buffer resizing or silence-based adaptive playout), or large changes in packet delay leading to delay changes during a talk-spurt (the result of a delay spike). However, PESQ appears to be more sensitive to front-end and back-end temporal clipping (the result of voice activity detection errors). Conversely, PESQ may be less sensitive than human subjects to regular, short time clippings that occur during speech (the result of replacing short sections of speech by silence).

When performing the PESQ test for packet-based SOLA, we observe two signal-dependent time-scale constraints to be considered for variable-rate time-scale modification: stretching elasticity and stretching dynamic. Stretching elasticity describes how flexible the time-scale modification ratio can be when applied to an individual audio segment, given that audio quality is not sacrificed. Stretching dynamic describes how much the time-scale modification ratio can change from one segment to the next. Given a sequence of audio frames, when the frame sizes are small (around 20 ms in duration), even a large stretching dynamic (for example, a stretching ratio alternating between the two levels 150% and 50%) is not noticeable. However, the audio quality is noticeably affected when either large time-expansion (around 150% to 200%) or large time-compression (around 50%) is continuously applied to a span of consecutive frames.
We conducted the PESQ test on packet-based SOLA. The whole test sequence was packetized and stretched with the same stretching ratio, which was varied from 60% to 150% in different rounds of our experiment. The PESQ score is shown in Fig. 4. It exhibits a monotonic relationship between the stretching ratio and the objective measurement: the more the signal is stretched, the worse the quality. However, PESQ does not give a good measurement of the stretching elasticity and dynamic in our experiment. For one of the test sequences in the ITU-T PESQ code release bundle, we tried a stretching ratio pattern alternating between 150% and 50%, namely, "... 1.5 0.5 1.5 0.5 1.5 0.5 ...". The subjective listening test shows a very good hearing result. This means that, if the stretching dynamic varies at a high frequency with an average stretching ratio of 100% over a span of packets, our ear is not sensitive to the high-frequency variations. However, PESQ gives an average score of 3.2 to this case, which shows a limitation of PESQ. To address this problem, instead of applying PESQ to the stretched audio directly, we calculate the averaged stretching ratio over a small window and then apply the PESQ test to this averaged value.
In our experiment, the averaging is performed over a sliding window of 10 frames. In the above example, the averaged stretching factor for the sequence "... 1.5 0.5 1.5 0.5 1.5 0.5 ..." is 1.
Figure 4. PESQ score for packet-based SOLA.
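A minimal sketch of the windowed averaging described above is shown below; the 10-frame window follows the text, while the function name windowed_stretch_ratio is our own.

```python
def windowed_stretch_ratio(ratios, window=10):
    """Average per-packet stretching ratios over a sliding window.

    This is the smoothing step described above: rapid alternation such as
    1.5, 0.5, 1.5, 0.5, ... averages out to 1.0, matching the good subjective
    quality observed for that pattern.
    """
    averaged = []
    for i in range(len(ratios)):
        start = max(0, i - window + 1)
        segment = ratios[start:i + 1]
        averaged.append(sum(segment) / len(segment))
    return averaged

# Example: the alternating pattern from the text.
pattern = [1.5, 0.5] * 10
print(windowed_stretch_ratio(pattern)[-1])   # -> 1.0 once the window is full
```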
3. PACKET-BASED SYNCHRONIZED OVERLAP-AND-ADD

3.1. Content Classification

Speech classification was performed in [18], [19] using many speech signal properties, such as the short-time energy function and the short-time zero-crossing rate (ZCR). In delay-sensitive applications such as VoIP or audio streaming, the processing unit is a packet, which typically contains 20-30 ms of audio. On one hand, we cannot go back and forth to compare these speech features for clustering and classification; the detection of silence intervals has to be done on the fly. On the other hand, silence detection does not have to be very accurate, since we only have to detect silence between adjacent talk-spurts. Short pauses between words need not necessarily be treated as silence. As a result, if the energy level of the audio signal stays below a threshold continuously for some period, the segment is declared silence. Silence intervals may or may not be transmitted, depending on the particular coding algorithm used.
Figure 5. Illustration of high-frequency content not represented by ZCR: (a) the audio waveform with high-frequency content, (b) illustration of contour crossing.

The ZCR value does not always provide a good measure of the frequency content of an audio signal. For example, in Fig. 5 (a), the second half of the signal has high-frequency content that is not revealed by the ZCR values. The signal in Fig. 5 (b) has a smooth low-frequency contour with some high-frequency components superimposed on top of it. Motivated by this observation, we define the Contour Crossing Rate (CCR), which measures how often the signal crosses the contour of the underlying waveform, where the contour is calculated by local averaging. The spectrogram of the sample signal in Fig. 6 (a) is shown in Fig. 6 (b), and the corresponding CCR and ZCR values are shown in Fig. 6 (c). From this example, we see that the CCR value is more accurate than the ZCR value in revealing high-frequency content.
Figure 6. Comparison of the ZCR and CCR: (a) the sample waveform, (b) its spectrogram, and (c) the corresponding windowed CCR and ZCR values.

In our experiment, we calculate the contour z(m) of signal x(m) via

z(m) = \beta \cdot x(m) + (1 - \beta) \cdot z(m - 1),   (3)

with coefficient \beta = 0.5. The short-time CCR value becomes

C_n = \sum_{m} \bigl( \mathrm{sgn}\left[ x(m) - z(m) \right] \ast \mathrm{sgn}\left[ x(m-1) - z(m-1) \right] \bigr)\, w(n - m),   (4)

where

\mathrm{sgn}\left[ x(n) \right] = \begin{cases} 1, & x(n) \geq 0 \\ 0, & x(n) < 0 \end{cases}   (5)

and

w(n) = \begin{cases} 1, & 0 \leq n \leq N - 1 \\ 0, & \text{otherwise.} \end{cases}   (6)

The short-time energy function is calculated via

E_n = \frac{1}{N} \sum_{m} \left[ x^{2}(m)\, w(n - m) \right],   (7)
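The following Python sketch shows one way Eqs. (3)-(7) could be computed for a single analysis frame. The function names are our own illustrative choices, and we interpret the combination of the two sgn terms in Eq. (4) as a crossing indicator, i.e. a sample counts when the sign of x(m) − z(m) differs from that of x(m−1) − z(m−1); this interpretation is an assumption on our part.

```python
import numpy as np

def sgn(x):
    """Eq. (5): 1 where x >= 0, 0 where x < 0."""
    return (np.asarray(x) >= 0).astype(int)

def contour(x, beta=0.5):
    """Eq. (3): first-order recursive local average z(m) of the signal x(m)."""
    z = np.zeros(len(x))
    for m in range(1, len(x)):
        z[m] = beta * x[m] + (1 - beta) * z[m - 1]
    return z

def short_time_ccr(x, z):
    """Eq. (4) for one frame with the rectangular window w of Eq. (6):
    count the samples where the signal crosses its contour."""
    d = sgn(np.asarray(x) - z)
    return int(np.sum(d[1:] != d[:-1]))

def short_time_energy(x):
    """Eq. (7): average energy of the frame."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(x ** 2) / len(x))

# Example on one 20 ms frame at 8 kHz (160 samples) of a synthetic signal.
t = np.arange(160) / 8000.0
frame = np.sin(2 * np.pi * 200 * t) + 0.2 * np.sin(2 * np.pi * 3000 * t)
z = contour(frame)
print(short_time_ccr(frame, z), short_time_energy(frame))
```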
where w(n) is given by (6). A uniform stretching ratio applied to all audio packets may distort regions of the signal with sharp transitions, which we call edges. To avoid artifacts due to time-scale modification at edges, it is desirable to preserve edges in the signal during the stretching process. Thus, a variable-rate time-scale modification scheme driven by the audio content is necessary. In this work, we employ the energy level and the short-time CCR values to classify voice content into three categories: silence, edge and non-edge parts. The transient part is usually associated with a high energy level, e.g. fricatives and attacks of sounds. Extending an interval of high sound pressure tends to produce a reverberant effect when SOLA is applied to a sound's edges; therefore, the duration of edges should be preserved faithfully. The silence segment has the lowest energy level and can be stretched with virtually any ratio. The relative energy level is a well-known tool to detect silence. However, since low-amplitude fricatives and silence periods may have similar energy levels but different frequency attributes, the energy level alone is not sufficient for silence detection. A more elaborate mechanism including CCR (or ZCR) is employed to examine the frequency content.
The proposed content classification algorithm to discriminate silence, edge, and non-edge voice is given in Fig. 7 in the form of pseudo-code. The classification is performed on each voice frame. A fixed threshold on the silence energy level is not appropriate for all recording environments. Here, we use a silence level register SPL to average the energy levels of all frames classified as silence. The value of SPL is initially set to 0.001 times the maximum sample value. For example, if the voice is recorded with 16 bits/sample, the initial value of SPL is 0.001 × 65535 = 65.535. If the energy En(i) of a frame is lower than 1.2 SPL, the frame is classified as silence. The value of SPL is only updated when a new frame is classified as silence. The value of γ is selected to be 0.6 in our experiment. Edges are defined as frames that show a dramatic change of energy within the frame. Stretching edge frames results in unpleasant sound and should be avoided.
Initialize
Start classification on frame i:
    Calculate En(i)
    Calculate Cn(i)
    If En(i) < 1.2 * SPL or En(i) < 1.0
        Frame i is silence
        SPL = γ * SPL + (1 − γ) * En(i)
    Else if En(i) > 2
        If En(i) > 1.8 * En(i − 1) and Cn(i) > 100.0
            Frame i is edge
        Else if En(i) < En(i − 1) / 1.8
            Frame i is edge
        Else
            Frame i is non-edge talk-spurt
    Else
        Frame i is non-edge talk-spurt
    Take next frame: i = i + 1, go back to "Start classification"
Figure 7. The proposed audio content classification algorithm.

Fig. 8 shows the result of the proposed audio content classification algorithm applied to a test sequence, the voice segment of a female speaker speaking a sequence of digits: "one nine one one four four six". The x-axis is the time index in units of seconds and the y-axis is the corresponding waveform. Detected silence and edge intervals are shown in the figure. The solid line indicates the frames belonging to talk-spurts while the dotted line indicates the edge frames. Our proposed content classification algorithm works well in this test.
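A runnable Python rendering of the classification logic of Fig. 7 is sketched below. The thresholds (1.2·SPL, 1.0, 2, 1.8, 100.0) and γ = 0.6 come from the text, the frame energies and CCR values are assumed to be computed as in Eqs. (4) and (7), and the initial SPL value assumes 16-bit samples as in the example above; the function name is our own.

```python
def classify_frames(energies, ccrs, gamma=0.6, spl_init=0.001 * 65535):
    """Classify each frame as 'silence', 'edge', or 'talkspurt' (Fig. 7).

    energies[i] and ccrs[i] are the short-time energy En(i) and contour
    crossing rate Cn(i) of frame i. SPL is a running average of the energy
    of frames already classified as silence.
    """
    spl = spl_init
    labels = []
    prev_energy = None
    for e, c in zip(energies, ccrs):
        if e < 1.2 * spl or e < 1.0:
            labels.append("silence")
            spl = gamma * spl + (1 - gamma) * e          # update silence level register
        elif e > 2 and prev_energy is not None and (
                (e > 1.8 * prev_energy and c > 100.0) or e < prev_energy / 1.8):
            labels.append("edge")                        # dramatic energy change -> edge
        else:
            labels.append("talkspurt")                   # non-edge talk-spurt
        prev_energy = e
    return labels
```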
3.2. Packet-based Time-Scale Modification with SOLA

Figs. 9 (a) and (b) illustrate the packet version of time-scale modification with SOLA. Let us first consider the time-scale expansion given in Fig. 9 (a). The input packet is duplicated into two copies, which are overlapped to create a new packet of longer length. Let α̂ > 1 be the target stretching ratio. Then, the overlap start position is set in the neighborhood of δ = (α̂ − 1) · N. The calculation of the synchronization point Km follows the SOLA procedure over a tolerance range −kmax ≤ Km < kmax. Once Km is determined, a ramp function is used as the weighting function for the two copies, as shown in Fig. 9 (a). The length of the output is stretched from N to N + δ + Km, where Km is much smaller than δ. The time-scale contraction operation can be done in a similar way. Let α̂ < 1 be the target stretching ratio. Then, the overlap start position is set in the neighborhood of δ = (1 − α̂) · N.
Figure 8. Results of the audio content classification algorithm.

The local adjustment Km of the synchronization point can be found as in the expansion case. A ramp function is again used as the weighting function, but in a different way, as shown in Fig. 9 (b). The length of the contracted packet is shrunk from N to N − δ + Km.
Figure 9. Packet-based time-scale modification with SOLA: (a) time-scale expansion and (b) time-scale contraction.
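The sketch below illustrates the packet-based SOLA expansion of Fig. 9 (a) in Python: the packet is duplicated, the synchronization offset Km is found within ±k_max around the nominal overlap position δ = (α̂ − 1)·N, and the two copies are merged with a linear ramp. The normalized cross-correlation search and the specific window handling are our own simplifications of the general SOLA procedure, not the paper's exact implementation.

```python
import numpy as np

def sola_expand(packet, alpha, k_max=40):
    """Time-scale expand one audio packet by factor alpha (> 1) with SOLA.

    The packet is overlapped with a delayed copy of itself. The delay is
    delta + Km, where delta = (alpha - 1) * N and Km in [-k_max, k_max)
    maximizes the correlation of the overlapping regions.
    """
    x = np.asarray(packet, dtype=float)
    n = len(x)
    delta = int(round((alpha - 1.0) * n))

    # Search for the synchronization point Km around the nominal position.
    best_km, best_corr = 0, -np.inf
    for km in range(-k_max, k_max):
        shift = delta + km
        if shift <= 0 or shift >= n:
            continue
        a, b = x[shift:], x[:n - shift]                  # overlapping segments
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        corr = float(np.dot(a, b) / denom) if denom > 0 else 0.0
        if corr > best_corr:
            best_corr, best_km = corr, km

    shift = delta + best_km
    ov = n - shift                                       # overlap length
    ramp = np.linspace(0.0, 1.0, ov)                     # linear weighting function
    out = np.empty(n + shift)
    out[:shift] = x[:shift]                              # head of the first copy
    out[shift:n] = (1.0 - ramp) * x[shift:] + ramp * x[:ov]   # cross-faded overlap
    out[n:] = x[ov:]                                     # tail of the second copy
    return out                                           # length N + delta + Km
```

For contraction (α̂ < 1), the same synchronization search is applied around δ = (1 − α̂)·N with the weighting reversed, producing a packet of length N − δ + Km as in Fig. 9 (b).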
4. ON THE DESIGN AND ANALYSIS OF ADAPTIVE PLAYOUT ALGORITHMS

In this section, we briefly comment on issues related to the design and analysis of adaptive playout algorithms. Three tools are to be considered in the design of adaptive playout algorithms that conceal audio/speech delay at the receiver:

• a good delay estimation scheme;

• a fast and flexible scheme in response to delay;

• time-scale modification of the received audio for delay concealment, while keeping the audio quality distortion to a minimum.

For delay estimation, several methods have been proposed. For example, the silence-based adaptive playout algorithm was studied in [1], [2], in which several algorithms were used to predict the mean and the variance of the end-to-end delay from the mean and the variance of the delay measured on received packets of the previous talk-spurt. It was shown that the delay histogram-based approach gave the best result. For fast and flexible adaptation, various low-complexity FEC schemes and time-scale modification methods can be tried. For quality preservation, time-scale modification based on audio content appears to provide a satisfactory solution.

The optimal solution to any silence-based delay adaptive algorithm was analyzed in [2]: under a certain loss rate, the minimal average end-to-end delay is achieved by discarding the packets with the highest delays, if all packet delays are known off-line. This assumption is difficult to satisfy in a real-world streaming environment. Furthermore, the analysis of our proposed adaptive playout algorithm is more complicated. Due to the use of per-packet stretching, it is not necessary to discard the packets with the highest delays in order to lower the average end-to-end delay. As shown in Fig. 10, we present two possible scheduling algorithms, indicated by the solid and the dotted lines, respectively. In terms of delay, packet (i + 1) has a longer network transmission delay than packet (i + 7). We see that packet (i + 1) is discarded in one scheduling algorithm while packet (i + 7) is discarded in the other. However, the two scheduling algorithms may have the same average delay. In this case, the optimal solution may not be unique, and can be achieved by several different adaptation strategies.
Figure 10. Comparison of two playout scheduling schemes using time-scale modification.
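To make the per-packet adaptation concrete, the hypothetical sketch below shows one possible way to pick a stretching factor for each arriving packet: the playout buffer is tracked against a histogram/percentile-based delay estimate (the kind of estimate found best in [1], [2]), and the resulting factor is clamped by content-dependent elasticity bounds from Section 3.1. This is not the paper's exact decision rule, only an illustration of the ingredients listed above; all class and parameter names are our own.

```python
from collections import deque

class PerPacketScheduler:
    """Illustrative per-packet stretching decision (hypothetical, not the paper's rule)."""

    def __init__(self, frame_ms=20.0, percentile=0.95, history=200):
        self.frame_ms = frame_ms
        self.percentile = percentile
        self.recent = deque(maxlen=history)   # most recent network delays (ms)
        self.buffered_ms = 0.0                # audio currently waiting in the playout buffer

    def _delay_estimate(self):
        """Percentile-style estimate of the delay jitter to be absorbed."""
        if not self.recent:
            return self.frame_ms
        ordered = sorted(self.recent)
        idx = min(len(ordered) - 1, int(self.percentile * len(ordered)))
        return ordered[idx]

    def stretch_for(self, network_delay_ms, content):
        """Return the stretching factor for the packet just received.

        Assumes one packet is consumed per frame interval, so the buffer
        occupancy changes by (alpha - 1) * frame_ms per packet.
        """
        self.recent.append(network_delay_ms)
        target_buffer = self._delay_estimate()             # enough audio to ride out jitter
        error = target_buffer - self.buffered_ms
        alpha = 1.0 + error / (4.0 * self.frame_ms)         # spread the correction over ~4 packets

        # Content-dependent elasticity bounds: edges are not stretched at all.
        low, high = {"silence": (0.3, 3.0),
                     "edge": (1.0, 1.0),
                     "talkspurt": (0.5, 1.5)}[content]
        alpha = max(low, min(high, alpha))
        self.buffered_ms += (alpha - 1.0) * self.frame_ms
        return alpha
```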
5. CONCLUSION AND FUTURE WORK

Audio QoS is determined by many factors, such as the average end-to-end delay, the application loss rate, and the received audio quality. We proposed an adaptive playout algorithm with time-scale modification to lower the drop loss rate without increasing the average end-to-end delay, in comparison with the traditional playout and the silence-based playout algorithms. When integrated with FEC schemes, the proposed method can further lower the packet loss seen by the application to meet a tighter QoS requirement. We also examined objective speech quality measurement with PESQ. It turns out that PESQ does not provide a good objective quality measure for packet-based time-scale modified speech signals, and further refinement of the PESQ measure is needed.
ACKNOWLEDGMENTS

This research was funded by the Integrated Media Systems Center, a National Science Foundation Engineering Research Center, under Cooperative Agreement No. EEC-9529152.
REFERENCES

1. R. Ramjee, J. Kurose, D. Towsley, and H. Schulzrinne, "Adaptive playout mechanisms for packetized audio applications in wide-area networks", in Proc. IEEE INFOCOM, 1994.
2. S. B. Moon, J. Kurose, and D. Towsley, "Packet audio playout delay adjustment: performance bounds and algorithms", ACM/Springer Multimedia Systems, vol. 5, pp. 17-28, Jan. 1998.
3. F. Liu, J. Kim, and C.-C. J. Kuo, "Adaptive delay concealment for Internet voice applications with packet-based time-scale modification", in Proc. IEEE ICASSP 2001, May 2001.
4. Y. J. Liang, N. Färber, and B. Girod, "Adaptive playout scheduling using time-scale modification in packet voice communications", in Proc. IEEE ICASSP 2001, May 2001.
5. J. Rosenberg, L. Qiu, and H. Schulzrinne, "Integrating packet FEC into adaptive voice playout buffer algorithms on the Internet", in Proc. IEEE INFOCOM 2000, Mar. 2000.
6. C. Perkins and O. Hodson, "Options for repair of streaming media", Request for Comments (Informational) 2354, Internet Engineering Task Force, June 1998.
7. J.-C. Bolot, "Characterizing end-to-end packet delay and loss in the Internet", Journal of High-Speed Networks, pp. 305-323, Dec. 1993.
8. C. Perkins, O. Hodson, and V. Hardman, "A survey of packet loss recovery techniques for streaming audio", IEEE Network Magazine, pp. 40-48, Sept./Oct. 1998.
9. J.-C. Bolot and A. Vega-Garcia, "The case for FEC-based error control for packet audio in the Internet", ACM Multimedia Systems, 1993.
10. M. G. Podolsky and S. McCanne, "Soft ARQ for layered streaming media", Journal of VLSI Signal Processing, vol. 27, pp. 81-97, 2001.
11. H. Sanneck, A. Stenger, K. B. Younes, and B. Girod, "A new technique for audio packet loss concealment", in Proc. IEEE GLOBECOM, pp. 48-52, 1996.
12. A. Stenger, K. B. Younes, R. Reng, and B. Girod, "A new error concealment technique for audio transmission with packet loss", in Proc. EUSIPCO, 1996.
13. F. Liu, J. Kim, and C.-C. J. Kuo, "Interactive low bit-rate speech/audio streaming over Internet via MPEG-4", SPIE Photonics East '99, Sept. 1999.
14. M. R. Portnoff, "Time-scale modification of speech based on short-time Fourier analysis", IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 29, no. 3, pp. 374-390, June 1981.
15. S. Lee, H. D. Kim, and H. S. Kim, "Variable time-scale modification of speech using transient information", in Proc. IEEE ICASSP 1997, pp. 1319-1322, 1997.
16. E. Hardam, "High quality time scale modification of speech signals using fast synchronized-overlap-add algorithms", in Proc. IEEE ICASSP 1990, pp. 409-412, 1990.
17. W. Jiang and H. Schulzrinne, "Modeling of packet loss and delay and their effect on real-time multimedia service quality", in Proc. NOSSDAV 2002.
18. L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, ISBN 0-13-213603-1, Sept. 1978.
19. J. A. Marks, "Real time speech classification and pitch detection", in Proc. IEEE COMSIG, 1988.