Tanta University Faculty of Engineering Department of Electronics and Electrical Communication Engineering
Packet Voice Transmission over Data Networks
A Thesis Submitted for the Degree of Master of Science in
Electronics and Electrical Communication Engineering By
Eng. Sameh Atef Napoleon Anis
Engineer at the Department of Electronics and Electrical Communication Engineering, Faculty of Engineering, Tanta University
Supervised by
Prof. Dr. Mohamed El-Said Nasr
Head of the Department of Electronics and Electrical Communication Engineering, Faculty of Engineering - Tanta University
Assoc. Prof. Salah Eldein Khamis
Associate Professor of Electrical Engineering, Department of Electronics and Electrical Communication Engineering, Faculty of Engineering - Tanta University
Tanta 2006
Abstract
Nowadays, voice transmission over data networks has become widespread. However, data networks always suffer from packet loss, delay, and delay jitter, which greatly affect the perceived voice quality. Delay jitter can be cured by using playout buffers, but delay can cause further loss of voice packets. Many techniques have been developed to recover lost voice packets; they are broadly divided into sender-based and receiver-based recovery techniques. Sender-based techniques require more transmission bandwidth and add delay, which is undesirable. Receiver-based techniques depend only on the received packets and on voice properties, which makes them attractive because they avoid the problems of sender-based recovery. Examples are Silence Substitution, Packet Repetition, Waveform Substitution, Sample Interpolation, Pitch Waveform Replication, and Double-Sided Pitch Waveform Replication. It is noted that the recovery quality increases as the complexity of the recovery technique increases. Two enhanced recovery techniques are developed in this thesis. The Switched Recovery Technique (SRT) is a compromise between complexity and quality: it selects a recovery technique according to an instantaneous measure of the loss rate. The other new technique is the Parallel Recovery Technique (PRT), which takes into account the effect of loss on the decoded voice after the loss by copying the last received packet to the input of the decoder, while the last decoded waveform before the loss is used to recover the voice waveform during the loss. In this way the decoder history is kept as close as possible to that at the coder, which minimizes the impact of loss. The proposed techniques were tested with different codecs, namely LD-CELP, GSM FR, GSM HR, and FS CELP. The loss distributions were generated using the two-state Markov model, and the codec performance was measured with the ITU-T P.862 (PESQ) standard. The SRT was found to reduce the average complexity while giving reasonable recovery quality, while the PRT led to improved quality. The LD-CELP coder has a special performance and needs neither SRT nor PRT.
Acknowledgement
I would like to express my deepest gratitude to both my supervisors, Prof. Dr. Mohamed El-Said Nasr, Head of the Department of Electronics and Electrical Communication Engineering, Faculty of Engineering, Tanta University, and Assoc. Prof. Salah Eldein Khamis, Associate Professor of Electrical Engineering, Department of Electronics and Electrical Communication Engineering, Faculty of Engineering, Tanta University, for their great support and constructive guidance. I would also like to thank my family and friends for their encouragement and motivation.
Publications
[1] Mohamed E. Nasr, Sameh A. Napoleon, "On Improving Voice Quality Degraded by Packet Loss in Data Networks," The 22nd National Radio Science Conference (NRSC'2005), March 15-17, 2005. Also in: IEEE AFRICON 2004, September 15-17, 2004.
[2] Salah Khamis, Sameh A. Napoleon, "Enhanced Recovery Technique for Improving Voice Quality Degraded by Packet Loss in Data Networks," submitted for publication in the Alexandria Engineering Journal.
Contents
Abstract ................................................................................................................................i Acknowledgement...............................................................................................................ii Publications ....................................................................................................................... iii Contents..............................................................................................................................iv List of Abbreviations....................................................................................................... vii List of Figures .....................................................................................................................x List of Tables................................................................................................................... xiii List of Symbols.................................................................................................................xiv Chapter 1 Introduction.....................................................................................................1 Chapter 2 Data Networks Overview................................................................................5 2.1 IP Networks ................................................................................................................5 2.1.1 Protocol Architecture...........................................................................................5 2.1.2 Real-Time Transport Protocol (RTP)..................................................................6 2.2 ATM Networks...........................................................................................................8 2.2.1 ATM protocol architecture ..................................................................................9 2.2.2 ATM Adaptation Layer (AAL) .........................................................................10 2.3 Frame Relay..............................................................................................................11 2.3.1 Control Plane .....................................................................................................12 2.3.2 User Plane..........................................................................................................12 2.3.3 User data transfer...............................................................................................13 2.4 Modeling network packet loss..................................................................................15 2.4.1 Modeling loss in IP networks ............................................................................15 2.4.2 Modeling loss in ATM networks.......................................................................17 iv
Chapter 3 Voice Compression Techniques ...................................................................20 3.1 Analysis by Synthesis (AbS) ....................................................................................22 3.1.1 The Short-term Predictor...................................................................................24 3.1.2 Autocorrelation Method ....................................................................................25 3.1.3 Long-Term Predictor (LTP) ..............................................................................26 3.2 Standard Voice Compression Techniques................................................................28 3.2.1 GSM Full Rate (RPE-LTP) coder .....................................................................28 3.2.2 Federal Standard CELP (FS1016) coder ...........................................................29 3.2.3 GSM Half Rate coder (GSM HR) .....................................................................30 3.2.4 LD-CELP...........................................................................................................31 Chapter 4 Packet Loss and Recovery Techniques .......................................................34 4.1 Packet loss in data networks.....................................................................................34 4.2 Packet loss recovery techniques ...............................................................................35 4.3 Sender-Based Packet Loss Recovery .......................................................................36 4.3.1 Automatic Repeat reQuest (ARQ) ....................................................................36 4.3.2 Media-independent FEC....................................................................................37 4.3.3 Media-specific FEC...........................................................................................38 4.3.4 Interleaving........................................................................................................40 4.4 Receiver-based Recovery Techniques......................................................................42 4.4.1 Insertion .............................................................................................................42 4.4.2 Interpolation ......................................................................................................44 4.4.3 Regeneration Based Recovery...........................................................................56 4.4.4 Model-Based Recovery .....................................................................................56 Chapter 5 Enhanced Recovery Techniques ..................................................................58 5.1 Switching Recovery Technique (SRT).....................................................................58 5.2 Parallel Recovery Technique (PRT).........................................................................60 Chapter 6 Performance Evaluation and Results..........................................................63 6.1 Test components.......................................................................................................63 v
6.1.1 Coder and decoder.............................................................................................63 6.1.2 Recovery techniques..........................................................................................64 6.1.3 Network .............................................................................................................64 6.1.4 Perceptual quality evaluation ............................................................................64 6.2 Simulation system ....................................................................................................65 6.3 Waveform repair for ordinary recovery techniques .................................................65 6.3.1 WS and SI..........................................................................................................66 6.3.2 PWR and DSPWR.............................................................................................68 6.4 Packet Loss vs. coding algorithm.............................................................................71 6.5 Switching Recovery Technique................................................................................74 6.6 Parallel Recovery Technique (PRT).........................................................................77 Chapter 7 Conclusions ....................................................................................................85 7.1 Conclusions ..............................................................................................................85 7.2 Future Work..............................................................................................................86 Appendix A Source Code ...............................................................................................87 References........................................................................................................................108 اﻟﻤﻠﺨﺺ اﻟﻌﺮﺑﻰ........................................................................................................................111
List of Abbreviations
AAL
ATM adaptation layer
AbS
Analysis by Synthesis
APLP
average of the packet loss periods
ARQ
Automatic Repeat reQuest
ATM
Asynchronous Transfer Mode
BU
Both packets Unvoiced
BV
Both Voiced
BWAA
Backward Amplitude Adjustment
CBR
Constant Bit Rate
CELP
Code Excited Linear Prediction
CLP
Conditional Loss Probability
DSPWR
Double Sided Pitch Waveform Replication
FEC
Forward Error Control
FP
Following Packet
FS CELP
Federal Standard Code Excited Linear Prediction
FV
Following packet Voiced
FWAA
Forward Amplitude Adjustment
GSM HR
GSM Half Rate
IETF
Internet Engineering Task Force
IP
Internet Protocols
LAN
Local Area Networks
LARs
Log-Area Ratios
LD-CELP
Low-Delay Code Excited Linear prediction
LP
Linear Prediction
LPC10
Linear Predictive Coder (Federal Standard 1015)
LSB
Least Significant Bits
LSFs
Line Spectrum Frequencies
LTP
Long-Term Predictor
MAN
Metropolitan Area Networks
MSB
Most Significant Bits
NS
Noise Substitution
PCM
Pulse Code Modulation
PESQ
Perceptual Evaluation of Speech Quality
PMP
Phase Matching using Pitch difference
PP
Preceding Packet
PR
Packet Repetition
PRT
Parallel Recovery Technique
PSA
Pitch Segment Adjustment
PV
Preceding packet Voiced
PWR
Pitch Waveform Replication
QL
Quality Loss
QoS
Quality-of-Service
RPE-LTP
Regular Pulse Excitation with Long-Term Prediction
RTCP
RTP Control Protocol
RTP
Real-time Transport Protocol
RUDP
Reliable User Data Protocol
SI
Sample Interpolation
SIP
Session Initiation Protocol
SMDS
Switched Multimegabit Data Service
SRT
Switched Recovery Technique
SS
Silence Substitution
TCP
Transmission Control Protocol
TPLP
transient packet loss period
UDP
User Datagram Protocol
ULP
Uneven Level Protection
VBR
Variable Bit Rate
VoIP
Voice over the IP network
VoP
Voice over Packet
VSELP
Vector Sum Excited Linear Prediction
WAN
Wide Area Networks
WS
Waveform Substitution
XOR
eXclusive OR
List of Figures
Figure 2.1: VoIP protocol architecture.................................................................................6 Figure 2.2: Header for IP packet carrying real-time application..........................................7 Figure 2.3: ATM reference model........................................................................................9 Figure 2.4: User-Network interface protocol architecture .................................................12 Figure 2.5: LAPF core format ............................................................................................14 Figure 2.6: Gilbert Model...................................................................................................16 Figure 2.7: Statistical multiplexing of voice sources. ........................................................18 Figure 3.1: Basic model for Analysis by Synthesis............................................................22 Figure 3.2: Coder and Decoder for the GSM FR ...............................................................29 Figure 3.3: The FS CELP coder block diagram .................................................................30 Figure 3.4: GSM HR coder block diagram ........................................................................31 Figure 3.5: LD-CELP coder ...............................................................................................32 Figure 4.1: Recovery Techniques.......................................................................................36 Figure 4.2: Media-independent FEC ..................................................................................38 Figure 4.3: Position of redundant data in case of, (a) low and medium network loads, and (b) heavy network load. ......................................................................................................39 Figure 4.4: Interleaving of units across multiple units.......................................................41 Figure 4.5: Positive peak detector algorithm flowchart .....................................................46 Figure 4.7: Procedure PSA .................................................................................................50 Figure 4.8: Illustration of procedure PMP..........................................................................52 Figure 4.9: Illustration of the procedure FWAA and BWAA, where the dotted lines represent the waveforms after reconstruction.....................................................................53 Figure 5.1: SRT block diagram
..........................................................................60
Figure 5.2: Block diagram for PRT
..........................................................................62
Figure 6.1: Simulation system used for testing recovery techniques .................................65 Figure 6.2: Waveform Substitution and Sample Interpolation...........................................66 Figure 6.3: Peaks of the voice segment as detected by the PWR peak detector ................68 Figure 6.4: PWR, (a) Original voice segment, (b) Voice segment with lost part, (c) Voice segment with lost part recovewred using PWR..................................................................69 Figure 6.5: DSPWR, (a) Original voice segment, (b) Voice segment with lost part, (c) Voice segment with lost part recovered using DSPWR.....................................................70 Figure 6.6: PESQ vs. loss rate for LD-CELP.....................................................................71 Figure 6.7: PESQ vs. loss rate for the GSM FR coder.......................................................72 Figure 6.8: PESQ vs. loss rate for the GSM HR coder ......................................................72 Figure 6.9: PESQ vs. loss rate for the FS CELP (FS1016) coder ......................................73 Figure 6.10: SRT applied for GSM FR, GSM HR, FS CELP............................................76 Figure 6.11: Enhancement for the WS technique by PRT-WS technique for GSM FR ...77 Figure 6.12: Enhancement for SI technique by PRT-SI technique for GSM FR coder....78 Figure 6.13: Enhancement for the PWR technique by the PRT-PWR technique for the GSM FR coder....................................................................................................................78 Figure 6.14: Enhancement for the DSPWR technique by the PRT-DSPWR technique for GSM FR coder....................................................................................................................79 Figure 6.15: Enhancement for WS technique by the PRT-WS technique for GSM HR coder ...................................................................................................................................79 Figure 6.16: Enhancement for the SI technique by the PRT-SI technique for GSM HR coder ...................................................................................................................................80 Figure 6.17: Enhancement for the PWR technique by the PRT-PWR for GSM HR coder ............................................................................................................................................80 Figure 6.18: Enhancement for the DSPWR technique by the PRT-DSPWR for the GSM HR coder.............................................................................................................................81
Figure 6.19: Enhancement for the WS technique by the PRT-WS technique for FS CELP coder ...................................................................................................................................81 Figure 6.20: Enhancement for the SI technique by the PRT-SI technique for FS CELP coder ...................................................................................................................................82 Figure 6.21: Enhancement for the PWR technique by the PRT-PWR technique for FS CELP coder.........................................................................................................................82 Figure 6.22: Enhancement for the DSPWR technique by the PRT-DSPWR for FS CELP coder ...................................................................................................................................83
List of Tables
Table 2.1: The functions of each layer for the ATM reference model...............................10 Table 5.1: Complexity of recovery techniques normalized to SS......................................58 Table 6.1: Distinct features of codecs used in test ............................................................63 Table 6.2: PESQ vs. loss rate for the different codec types .............................................73 Table 6.3: The SRT applied for GSM FR, GSM HR, FS CELP........................................76 Table 6.4: PESQ enhancement by the PRT technique .......................................................83
List of Symbols
ak : Short-term predictor coefficients
C : Link capacity in bps
E : Short-term average prediction error
e(j) : Prediction error
ew(j) : Perceptually weighted prediction error
f(k) : Probability of a loss run of length k
Gk : Long-term predictor gain
H(z) : The all-pole digital filter
L : Update frame length for the short-term predictor
La : Analysis frame length for the short-term predictor
LARi : Log area ratios
M : Total number of voice sources
Ma : Number of active voice sources
Ms : Number of voice sources that saturates the link
N : Analysis frame length for the long-term predictor
n : Number of elements in the binary time series {x_i}, i = 1, 2, ...
n0 : Number of zeros in the time series {x_i}
n1 : Number of ones in the time series {x_i}
n01 : Number of times that a 1 follows a 0 in {x_i}
n10 : Number of times that a 0 follows a 1 in {x_i}
p : Transition probability from 0 to 1 in the two-state Markov model
p̂ : Estimated value of p from {x_i}
Pl : Long-term predictor
Ps : Short-term predictor
q : Transition probability from 1 to 0 in the two-state Markov model
q̂ : Estimated value of q from {x_i}
r : Loss rate (or ratio)
r̂ : Estimated loss rate (or ratio)
R(j) : Autocorrelation coefficients
s(j) : Speech sample
s̃(j) : Predicted speech sample
u(j) : Excitation signal after the long-term predictor
v(j) : Excitation signal before the long-term predictor
w(j) : Window function
xi : Element of the binary time series {x_i}; takes the value 1 or 0
α : Long-term predictor delay
β : Short-term predictor order
π0 : The state-0 probability in the two-state Markov model
π1 : The state-1 probability in the two-state Markov model
ρ : Residual after the short-term predictor
Chapter 1
Introduction
Voice over packet data networks means transmitting voice over packet-switching data networks by converting it into packets, while keeping reliability and voice quality comparable to circuit-switched telephone networks and gaining cost savings [1]. In fact, the convergence of voice and data networks is rapidly taking place across the globe. Data networks can be classified according to their transfer protocols as IP (Internet Protocol) networks, ATM (Asynchronous Transfer Mode) networks, FR (Frame Relay) networks [2, 3], etc. The IP protocol is the most widely used protocol for transporting data, and ATM was designed to carry multimedia as well as data; hence, IP and ATM networks are adopted in this thesis. Packet-switched networks can also be classified by their coverage area into Wide Area Networks (WAN), Metropolitan Area Networks (MAN), and Local Area Networks (LAN) [4]. Packet-switched networks always suffer from packet loss, delivery delay, and delay jitter [1]. Packet loss occurs because packets are discarded during congestion periods, as in IP networks, or on buffer overflow, as in ATM switches [5, 6, 7], or because packets are dropped at the gateway or terminal due to late arrival or misdelivery caused by errors in the packet header. Delay is the consequence of long routes and queuing in routing nodes. Delay jitter is the variation in delay between received packets; it results from each packet taking a different route and hence experiencing different queuing delays. For multimedia applications such as voice, delay can cause packet loss due to discarding, so the first two impairments can be merged into packet loss. Delay jitter can be cured using a playout buffer, which adds delay to early-arriving packets to regularize the packet arrival times seen by the voice decoder. The impact of packet loss on perceived voice quality depends on several factors, including the loss pattern, the codec
type, and the packet size. Packet losses can be modeled using the Bernoulli model and the Gilbert model (two-state Markov model) for IP networks [8, 9, 10], and the random model or the burst model for ATM [11]. The growth in demand for voice over packet networks leads to excess demand for bandwidth, so using low bit-rate voice coders became inevitable [12]. Various standardized coders can be used, such as Low-Delay Code Excited Linear Prediction, LD-CELP (ITU-T G.728), at 16 kbps; Regular Pulse Excitation with Long-Term Prediction, RPE-LTP (GSM full-rate standard), at 13 kbps [13, 14, 15]; Vector Sum Excited Linear Prediction, VSELP (GSM half-rate standard), at 5.6 kbps; and Code Excited Linear Prediction, CELP FS1016 (Federal Standard), at 4.8 kbps [16]. For data packets and non-real-time applications, there is always time to repair the loss exactly via retransmissions or Forward Error Control (FEC). This is not applicable to real-time applications such as voice and telephony over packet networks or video conferencing, because delay in packet delivery causes packet discarding at the receiver end [18]. Hence the need for real-time recovery methods, which always rely on the characteristics of the voice signal. Recovery techniques used with packet voice can be categorized into two main categories: sender-based recovery and receiver-based recovery. In sender-based recovery, the sender is responsible for the recovery; it requires more bandwidth than the original packet voice stream and adds more network delay. In receiver-based recovery, the recovery process is initiated by the receiver and requires no extra transmission bandwidth, which is why the methods in this category are used for recovery in this thesis. This category includes Silence Substitution (SS), Packet Repetition (PR), Waveform Substitution (WS), Sample Interpolation (SI) [19, 20, 21, 22], Pitch Waveform Replication (PWR), and Double-Sided Pitch Waveform Replication (DSPWR), listed in ascending order of computational load (i.e., complexity) [22]. This thesis aims to enhance known receiver-based recovery techniques and to develop new receiver-based techniques. To enhance the recovery techniques, a performance
evaluation of the recovery techniques mentioned above is carried out with the proposed voice coders. The performance is measured using the ITU-T P.862 Perceptual Evaluation of Speech Quality (PESQ). It is found that the best recovery quality is obtained with DSPWR, then PWR, then SI, while WS and PR give nearly the same performance for all coders except LD-CELP, which achieves its best recovery performance with PR. SS is the worst recovery method for all coders. It is noted that as the complexity of the recovery technique increases, the recovered voice quality increases, with the exception of LD-CELP. A recovery technique can be enhanced in two ways: by reducing its complexity or by increasing its performance. The Switched Recovery Technique (SRT) was developed to compromise between complexity and quality. At low loss rates, all receiver-based recovery methods tend to improve the recovered voice quality by nearly the same amount, so it is logical to choose the least complex method. As the loss rate increases, the SRT switches to the recovery technique that improves the quality of the perceived voice with the lowest possible complexity. The average complexity of the SRT is found to be:

C_SRT = P(r_i < 3)·C_repetition + P(3 ≤ r_i < 7)·C_interpolation + P(7 ≤ r_i < 10)·C_PWR + P(r_i ≥ 10)·C_DSPWR

where r_i is the instantaneous loss rate with probability P(r_i), and C_SRT, C_repetition, C_interpolation, C_PWR, and C_DSPWR are the complexities of the SRT, the packet repetition method, the interpolation method, the PWR method, and the DSPWR method, respectively. The numerical values appearing in the equation are the loss rates at which the SRT switches; they are determined as described in chapter 5. Clearly, C_SRT is less than C_DSPWR. Hence voice packets are recovered over a range of loss rates with a reduced average complexity compared with using DSPWR alone, which gives the best recovery performance but at the highest complexity.
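To illustrate the switching rule and the averaging above, a minimal Python sketch is given below. The loss-rate thresholds (3, 7 and 10 percent) follow the formula; the normalized complexity weights and the probabilities of each loss-rate band are hypothetical placeholder values, not measurements from this thesis.

```python
# Sketch of the SRT switching rule and its average complexity.
# Thresholds follow the formula above; the complexity weights and the
# loss-rate band probabilities below are illustrative placeholders only.

def srt_choose(loss_rate_percent):
    """Pick the least complex recovery method adequate for the current loss rate."""
    if loss_rate_percent < 3:
        return "repetition"
    elif loss_rate_percent < 7:
        return "interpolation"
    elif loss_rate_percent < 10:
        return "PWR"
    return "DSPWR"

def srt_average_complexity(prob, complexity):
    """C_SRT = sum over methods of P(method selected) * C(method)."""
    return sum(prob[m] * complexity[m] for m in prob)

if __name__ == "__main__":
    # Hypothetical complexities normalized to Silence Substitution = 1.
    complexity = {"repetition": 2.0, "interpolation": 15.0, "PWR": 40.0, "DSPWR": 80.0}
    # Hypothetical probabilities of the instantaneous loss rate falling in each band.
    prob = {"repetition": 0.55, "interpolation": 0.25, "PWR": 0.12, "DSPWR": 0.08}
    print(srt_choose(5))                             # -> interpolation
    print(srt_average_complexity(prob, complexity))  # always <= complexity["DSPWR"]
```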
Quality enhancement is gained through the Parallel Recovery Technique (PRT), in which two recovery techniques run in parallel, one of them being the Packet Repetition method. LPC-based (differential) coders build the current waveform from the currently received packet and the decoder history, so a loss also affects the after-loss waveform, causing extra quality degradation even when a sophisticated recovery technique is used. The PRT is developed to enhance speech quality by repairing the loss location while also minimizing the effect of the loss on the after-loss stream. This is done by combining PR with any other recovery technique. Three voice compression methods are used to test the PRT: the GSM Full Rate, GSM Half Rate, and FS CELP (FS1016) coders. The results show an improvement in perceived voice quality with the addition of negligible complexity. The thesis is organized as follows. Chapter 2 reviews data networks (IP and ATM) and the modeling of loss. Chapter 3 is an overview of low bit-rate voice coding techniques. Chapter 4 gives a detailed description of the different recovery techniques. Chapter 5 describes the SRT and PRT techniques. The results of testing the ordinary and developed recovery techniques are presented in chapter 6. Finally, the thesis conclusions and future work are given in chapter 7.
Chapter 2
Data Networks Overview
Data networks can be classified according to coverage area or by the protocol used to transfer data over them. By the first classification, data networks can be Wide Area Networks (WANs), which cover a large area and distant computers; Metropolitan Area Networks (MANs), which cover a city, for example; or Local Area Networks (LANs), which connect workstations within a building, for example. When networks are classified by the transfer protocol used, they may be IP networks, i.e. networks that transfer data using the Internet Protocol, ATM networks for the ATM protocol, FR networks, and so on. In this thesis IP and ATM networks are investigated, because the IP protocol is the most common and widely used protocol for transferring data across the Internet, and ATM is designed to support data transfer as well as multimedia and real-time applications.
2.1 IP Networks

An IP network is based on the "best effort" principle, which means that the network makes no guarantees about packet loss rates, delays, or jitter. For voice traffic, the perceived voice quality will suffer from these impairments (loss, jitter, and delay). We will focus on transferring voice over the IP network, abbreviated VoIP.
2.1.1 Protocol Architecture

Voice over IP (VoIP) is the transmission of voice over a network using the Internet Protocol. Here we briefly introduce the VoIP protocol architecture, which is illustrated in figure 2.1. The protocols that provide basic transport (RTP), call-setup signaling (H.323, SIP), and QoS feedback (RTCP) [1] are shown.
Figure 2.1: VoIP protocol architecture
In this thesis, we focus on voice transmission and the signaling part is not considered.
2.1.2 Real-Time Transport Protocol (RTP)

Two main types of traffic ride upon the Internet Protocol (IP): User Datagram Protocol (UDP) and Transmission Control Protocol (TCP). In general, TCP is used when a reliable but not delay-sensitive connection is needed, and UDP when simplicity matters and reliability is not a main concern. Due to the time-sensitive nature of voice traffic, UDP/IP is the logical choice to carry voice. More information is needed on a packet-by-packet basis than UDP offers, however. So, for real-time or delay-sensitive traffic, the Internet Engineering Task Force (IETF) adopted the Real-time Transport Protocol (RTP). VoIP rides on top of RTP, which rides on top of UDP; therefore, VoIP is carried with an RTP/UDP/IP packet header. RTP is the standard for transmitting delay-sensitive traffic across packet-based networks and gives receiving stations information that is not in the connectionless UDP/IP streams. As shown in figure 2.2, two important pieces of information are the sequence number and the timestamp. RTP uses the sequence information to determine whether packets are arriving in order and whether a packet is lost, and it uses the timestamp information to determine the interarrival packet time (jitter).
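As a small illustration of how a receiver can exploit the RTP timestamps, the sketch below applies the running interarrival-jitter estimator defined in RFC 3550; it assumes arrival times and RTP timestamps are already expressed in the same clock units, and the packet samples are invented example values.

```python
# Sketch: interarrival-jitter estimate from RTP timestamps (RFC 3550 estimator).
# Assumes arrival times and RTP timestamps are in the same clock units.

def update_jitter(jitter, arrival_prev, ts_prev, arrival_cur, ts_cur):
    """One update of the running jitter estimate: J <- J + (|D| - J)/16."""
    d = (arrival_cur - arrival_prev) - (ts_cur - ts_prev)
    return jitter + (abs(d) - jitter) / 16.0

jitter = 0.0
packets = [(0, 0), (21, 20), (39, 40), (65, 60)]  # (arrival, timestamp) example samples
for (a0, t0), (a1, t1) in zip(packets, packets[1:]):
    jitter = update_jitter(jitter, a0, t0, a1, t1)
print(jitter)
```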
Figure 2.2: Header for IP packet carrying real-time application
RTP can be used for media on demand, as well as for interactive services such as Internet telephony. RTP (refer to figure 2.2) consists of a data part and a control part, the latter called RTP Control Protocol (RTCP). The data part of RTP is a thin protocol that provides support for applications with real-time properties, such as continuous media (for example, audio and video), including timing reconstruction, loss detection, and content identification. RTCP provides support for real-time conferencing of groups of any size within an Internet. This support includes source identification and support for gateways, such as audio and video bridges as well as multicast-to-unicast translators. It also offers QoS feedback from receivers to the multicast group, as well as support for the synchronization of different media streams. Using RTP is important for real-time traffic, but a few drawbacks exist. The IP/RTP/UDP headers are 20, 8, and 12 bytes, respectively. This adds up to a 40-byte header, which is big compared to the payload for packetized voice. Large RTP header can be compressed to 2 or 4 bytes by using RTP Header Compression (CRTP).
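To make the overhead figures concrete, the short calculation below compares the 40-byte IP/UDP/RTP header against a compressed 2- or 4-byte header; the 80-byte payload is only an assumed example (roughly 10 ms of 64 kbit/s PCM), not a value taken from this thesis.

```python
# Header overhead for packetized voice: 40-byte IP/UDP/RTP header vs. cRTP.
# The 80-byte payload is only an example (about 10 ms of 64 kbit/s PCM).
payload = 80
for header in (40, 4, 2):
    overhead = header / (header + payload)
    print(f"header {header:2d} bytes -> {overhead:.1%} of each packet is overhead")
```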
2.1.2.1 Reliable User Data Protocol

The Reliable User Data Protocol (RUDP) builds some reliability into the connectionless UDP protocol, without the need for a connection-based protocol such as TCP. The basic method of RUDP is to send multiple copies of the same packet and let the receiving station discard the unnecessary or redundant copies. This mechanism makes it more probable that one of the packets will make the journey from sender to receiver. This is also known as forward error correction (FEC). Few implementations of FEC exist due to bandwidth considerations (a doubling or tripling of the amount of bandwidth used). Customers that have almost unlimited bandwidth, however, consider FEC a worthwhile mechanism to enhance reliability and voice quality.
2.2 ATM Networks

ATM stands for Asynchronous Transfer Mode. ATM involves the transfer of data in discrete chunks and, like packet switching, allows multiple logical connections to be multiplexed over a single physical interface. In the case of ATM, the information flow on each logical connection is organized into fixed-size packets, called cells, with a length of 53 bytes, of which 5 bytes form the cell header [2]. ATM is a streamlined protocol with minimal error and flow control capabilities; this reduces the overhead of processing ATM cells and the number of overhead bits required with each cell, enabling ATM to operate at high data rates. Further, unlike IP, ATM uses fixed-size cells, which simplifies the processing required at each ATM node, again supporting the use of ATM at high data rates. Unlike IP networks, ATM is designed for real-time multimedia and data transport, not for data only [2]. ATM is also connection-oriented: in order for a sender to transmit data to a receiver, a connection has to be established first. The connection is established during the call set-up phase and is torn down when the transfer of data is completed. ATM, unlike IP networks, has built-in mechanisms for providing different quality-of-service (QoS) levels to different types of traffic [3]. As with IP networks, we will focus on voice traffic over ATM.
2.2.1 ATM protocol architecture

The reference model for the ATM protocol is depicted in figure 2.3. It consists of a user plane, a control plane, and a management plane (more details about these planes are given in [4, 5, 6]). Within the user and control planes is a hierarchical set of layers. The user plane defines a set of functions for the transfer of user information between communication end-points; the control plane defines control functions such as call establishment, call maintenance, and call release; and the management plane defines the operations necessary to control information flow between planes and layers and to maintain accurate and fault-tolerant network operation.
Figure 2.3: ATM reference model
Within the user and control planes there are three layers: the physical layer, the ATM layer, and the ATM adaptation layer (AAL). Table 2.1 summarizes the functions of each layer [7]. The physical layer performs primarily bit-level functions, the ATM layer is primarily responsible for the switching of ATM cells, and the ATM adaptation layer is responsible for the conversion of higher-layer protocol forms into ATM cells. The functions that the physical, ATM, and adaptation layers perform are described in more detail in the following:
Table 2.1: The functions of each layer of the ATM reference model

Higher layers: higher-layer functions.
AAL (PHY-independent):
• CS: convergence.
• SAR: segmentation and reassembly.
ATM layer (PHY-independent): generic flow control; cell-header generation/extraction; cell VPI/VCI translation; cell multiplexing and demultiplexing.
Physical layer (PHY-dependent):
• TC: cell-rate decoupling; HEC header-sequence generation/verification; cell delineation; transmission-frame adaptation; transmission-frame generation/recovery.
• PM: bit timing; physical medium.
Layer management spans all of the above layers.

AAL: ATM Adaptation Layer. CS: Convergence Sublayer. SAR: Segmentation And Reassembly. VPI: Virtual Path Identifier. VCI: Virtual Channel Identifier. HEC: Header Error Control. TC: Transmission Convergence. PM: Physical Medium.
2.2.2 ATM Adaptation Layer (AAL)

AAL Type 1: This service is used by applications that require a Constant Bit Rate (CBR), such as uncompressed voice and video, usually referred to as isochronous. This type of application is extremely time-sensitive, so end-to-end timing is paramount and must be supported. Isochronous traffic is assigned service class A.

AAL Type 2: This service is again used for compressed voice and video (packetized isochronous traffic); however, it is primarily developed for multimedia applications. The compression allows a Variable Bit Rate (VBR) service without losing voice and video quality. The compression of voice and video (class B), however, does not negate the need for end-to-end timing; timing is still important, and this type is assigned a service class just below that of AAL Type 1.

AAL Type 3/4: This adaptation layer supports both connection-oriented and connectionless transfer, and provides compatibility with IEEE 802.6, which is used by the Switched Multimegabit Data Service (SMDS). Connection-oriented AAL Type 3 and AAL Type 4 payloads are given service class C, while connectionless AAL Type 3 and AAL Type 4 payloads are
assigned the service class D. The support for IEEE 802.6 significantly increases cell overhead for data transfer when compared with AAL Type 5. AAL Type 5: For data transport, AAL Type 5 is the preferred AAL to be used by applications. Its connection-oriented mode guarantees delivery of data by the servicing applications and doesn’t add any cell overhead.
2.3 Frame Relay

Frame relay is a data link control facility designed to provide a streamlined capability for use over high-speed packet-switched networks [4]. In the early days of data transfer over packet-switched networks, the X.25 protocol was developed to transfer data and to resist data loss. The X.25 approach results in considerable overhead [4]. All of this overhead may be justified when there is a significant probability of error on any of the links in the network, but it is not the most appropriate approach for modern digital communication facilities. Today's networks employ reliable digital transmission technology over high-quality, reliable transmission links, many of which are optical fiber. In addition, with the use of optical fiber and digital transmission, high data rates can be achieved. In this environment, the overhead of X.25 is not only unnecessary but degrades the effective utilization of the available high data rates. Frame relay is designed to eliminate much of the overhead that X.25 imposes on end-user systems and on the packet-switching network. Figure 2.4 depicts the protocol architecture that supports the frame-mode bearer service. We need to consider two separate planes of operation: a control (C) plane, which is involved in the establishment and termination of logical connections, and a user (U) plane, which is responsible for the transfer of user data between subscribers. Thus, C-plane protocols are between a subscriber and the network, while U-plane protocols provide end-to-end functionality.
Figure 2.4: User-Network interface protocol architecture
2.3.1 Control Plane The control plane for frame-mode bearer services is similar to that for common-channel signaling in circuit-switching services, in that a separate logical channel is used for control information. At the data link layer, LAPD (Q.921) is used to provide a reliable data link control service, with error control and flow control, between user (TE) and network (NT) over the D channel. This data link service is used for the exchange of Q.933 control-signaling messages.
2.3.2 User Plane For the actual transfer of information between end users, the user-plane protocol is LAPF (Link Access Procedure for Frame-Mode Bearer Services), which is defined in Q.922. Q.922 is an enhanced version of LAPD (Q.921). Only the core functions of LAPF are used for frame relay: • Frame delimiting, alignment, and transparency • Frame multiplexing/demultiplexing using the address field
12
Chapter 2: Data Networks Overview
• Inspection of the frame to ensure that it consists of an integral number of octets prior to zero-bit insertion or following zero-bit extraction • Inspection of the frame to ensure that it is neither too long nor too short • Detection of transmission errors • Congestion control functions The last function listed above is new to LAPF, and is discussed in a later section. The remaining functions listed above are also functions of LAPD. The core functions of LAPF in the user plane constitute a sub layer of the data link layer; this provides the bare service of transferring data link frames from one subscriber to another, with no flow control or error control. Above this, the user may choose to select additional data link or network-layer end-to-end functions. These are not part of the frame-relay service. Based on the core functions, a network offers frame relaying as a connection-oriented link layer service with the following properties: • Preservation of the order of frame transfer from one edge of the network to the other • A small probability of frame loss
2.3.3 User data transfer

The operation of frame relay for user data transfer is best explained by beginning with the frame format, illustrated in figure 2.5a. This is the format defined for the minimum-function LAPF protocol (known as the LAPF core protocol). The format is similar to that of LAPD and LAPB with one obvious omission: there is no control field. This has the following implications:
• There is only one frame type, used for carrying user data. There are no control frames.
• It is not possible to use inband signaling; a logical connection can only carry user data.
• It is not possible to perform flow control and error control, as there are no sequence numbers.
Figure 2.5: LAPF core format
The flag and frame check sequence (FCS) fields function as in LAPD and LAPB. The information field carries higher-layer data. If the user selects to implement additional data link control functions end-to-end, then a data link frame can be carried in this field. Specifically, a common selection will be to use the full LAPF protocol (known as the LAPF control protocol) in order to perform functions above the LAPF core functions. Note that the protocol implemented in this fashion is strictly between the end subscribers and is transparent to ISDN. The address field has a default length of 2 octets and may be extended to 3 or 4 octets. It carries a data link connection identifier (DLCI) of 10, 17, or 24 bits. The DLCI serves the same function as the virtual circuit number in X.25: it allows multiple logical frame relay connections to be multiplexed over a single channel. As in X.25, the connection identifier
has only local significance; each end of the logical connection assigns its own DLCI from the pool of locally unused numbers, and the network must map from one to the other. The alternative, using the same DLCI on both ends, would require some sort of global management of DLCI values. The length of the address field, and hence of the DLCI, is determined by the address field extension (EA) bits. The C/R bit is application-specific and is not used by the standard frame relay protocol.
2.4 Modeling network packet loss

Packet loss in a data network is caused by several factors: buffer overflow in switches is common, long network delays can force a real-time application to discard late-delivered packets and consider them lost, and misrouting can occur due to header errors. The recovery techniques must be tested over a range of loss rates; hence the need to simulate network loss using a loss generator. For accurate and more realistic results, the loss in a network has to be modeled so that the model can be used inside the loss generator. Before discussing loss models, define a binary time series {x_i} in which x_i takes the value 0 if the i-th packet arrived successfully and the value 1 if it was lost. This is a discrete-valued time series which takes on values in the set {0, 1}.
2.4.1 Modeling loss in IP networks

2.4.1.1 Bernoulli Loss Model

In the Bernoulli loss model [1], also called the memoryless packet loss model [8], the sequence of random variables {x_i} is independent and identically distributed. That is, the probability of x_i being either 0 or 1 is independent of all other values of the time series, and the probabilities are the same irrespective of i. Thus this model is characterized by a single parameter, the loss rate r, whose estimated value r̂ can be calculated as [9]:

r̂ = n1 / n      (2-1)
where n1 is the number of ones in the time series of length n. The model can be implemented by picking a random number for each packet and deciding whether it is lost based on the value of that number. For a sequence of n packets, the number of lost packets tends to n·r for large values of n [8].

2.4.1.2 The Gilbert Model

The Gilbert model is also called the two-state Markov model; it can capture temporal loss dependency. In figure 2.6, p is the probability that the next packet is lost provided the previous one has arrived, and q is the opposite. 1 − q is the conditional loss probability (CLP). Normally p + q < 1. If (1 − q) > p, then a packet is more likely to be lost when the previous packet was lost than when the previous packet was successfully received. On the other hand, p > (1 − q) indicates that loss is more likely if the previous packet was not lost.

2.4.1.3 Markov Chain Model of k-th order

A k-th-order Markov chain model is a more general model for capturing dependencies among events. The next event is assumed to depend on the last k events, so the model needs 2^k states. Let x_i denote the binary event for the i-th packet, 1 for loss and 0 for non-loss. The parameters to be determined in a k-th-order Markov model are P(x_i | x_{i-1}, x_{i-2}, ..., x_{i-k}) for all combinations of x_i, x_{i-1}, ..., x_{i-k} [10]. Note that the Bernoulli model and the Gilbert model are special cases of the k-th-order Markov model with k = 0 and k = 1, respectively. In [9], end-to-end packet loss was measured for 128 hours, with 2 hours for each trace, and it was found that the Gilbert model can represent the packet loss of most traces. Therefore, the Gilbert model was used in simulating the network loss.
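A minimal sketch of a two-state Markov (Gilbert) loss generator, together with estimation of p̂ and q̂ from an observed binary trace, is given below; the transition counts follow the n01/n0 and n10/n1 convention of the symbol list, and the parameter values in the example run are arbitrary.

```python
# Sketch: two-state Markov (Gilbert) loss generator and parameter estimation.
# x_i = 1 marks a lost packet, x_i = 0 a received packet.
import random

def gilbert_losses(n, p, q, seed=0):
    """Generate n loss indicators; p = P(loss | previous received), 1-q = P(loss | previous lost)."""
    rng = random.Random(seed)
    state, trace = 0, []
    for _ in range(n):
        if state == 0:
            state = 1 if rng.random() < p else 0
        else:
            state = 0 if rng.random() < q else 1
        trace.append(state)
    return trace

def estimate_pq(trace):
    """p_hat = n01 / n0 and q_hat = n10 / n1, from transition counts in the trace."""
    n0 = trace[:-1].count(0)
    n1 = trace[:-1].count(1)
    n01 = sum(1 for a, b in zip(trace, trace[1:]) if a == 0 and b == 1)
    n10 = sum(1 for a, b in zip(trace, trace[1:]) if a == 1 and b == 0)
    return n01 / max(n0, 1), n10 / max(n1, 1)

trace = gilbert_losses(100000, p=0.05, q=0.5)
print("loss rate:", sum(trace) / len(trace))   # approximately p / (p + q)
print("p_hat, q_hat:", estimate_pq(trace))
```

The Bernoulli model is recovered as the special case in which the loss probability does not depend on the previous state, i.e. p = 1 − q.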
2.4.2 Modeling loss in ATM networks

Two simple ATM loss models are used in testing recovery techniques [11]: the random model and the burst model.

2.4.2.1 Random model

The random model assumes that cell losses in ATM are distributed randomly. This model generates losses (ones in the time series {x_i}) at the desired cell loss rate using a random number generator. This is a straightforward approach, as the characteristics of an ATM network are not directly involved in the computation of random cell losses [11]. The shortcomings of this model are that it does not take into consideration the behavior of voice traffic in an ATM network, which is burst-like in
its behavior [11], and that it is not affected by network parameters such as the network load and the link capacity [11].

2.4.2.2 Burst Model

In contrast to the random model, this model takes into account the nature of a voice source, which alternates between talk-spurts and silence periods. The voice source is considered active and generates cells only during talk-spurts, while during silence periods it is considered inactive [11]. The voice cells generated by the different voice sources are then fed to a common queue and transmitted over the link on a first-come, first-served basis. This is shown in figure 2.7.
Figure 2.7: Statistical multiplexing of voice sources.
When the queue reaches the congestion threshold, cells are discarded according to a priority-based cell discarding algorithm [11]. The receiver considers cells that have not been received within the delay budget of the playout buffer to be missing. The burst model is a mathematical model that can be used to characterize cell losses for voice traffic in ATM networks; it produces the distribution of equilibrium cell loss rates for a given set of ATM network parameters. Let r(Ma) be the equilibrium cell loss rate; then:

r(Ma) = 0                      if Ma ≤ Ms
r(Ma) = (Ma − Ms) / Ma         if Ma > Ms      (2-5)
where Ma is the number of active voice sources and Ms is the number of voice sources that saturates a link of capacity C bps. If Ma > Ms, the link saturates and (Ma − Ms) cells will be lost [11]. The probability of having Ma active voice sources among M voice sources can be obtained using the binomial distribution [11]. Since loss is more frequent in IP networks than in ATM networks, and since a loss generator based on the Markov model produces single-packet as well as burst losses depending on the loss rate, it is reasonable to use a Markov-model-based loss generator to simulate losses when testing the different recovery techniques discussed in chapter 4, because the aim of a loss model is to reproduce the loss frequency as well as the loss distribution observed in the packet stream.
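The short sketch below evaluates equation (2-5) and averages it over the binomial distribution of the number of active sources, as described above; the total number of sources, the saturation point, and the per-source activity probability used in the example are assumed illustrative values.

```python
# Sketch: burst-model cell loss, equation (2-5), averaged over the binomial
# distribution of the number of active voice sources. The source counts and
# the per-source activity probability below are assumed example values.
from math import comb

def cell_loss_rate(m_active, m_saturate):
    """Equation (2-5): r(Ma) = 0 if Ma <= Ms, else (Ma - Ms)/Ma."""
    if m_active <= m_saturate:
        return 0.0
    return (m_active - m_saturate) / m_active

def average_loss(total_sources, m_saturate, p_active):
    """Average r over Ma ~ Binomial(M, p_active)."""
    avg = 0.0
    for m in range(total_sources + 1):
        p_m = comb(total_sources, m) * p_active**m * (1 - p_active)**(total_sources - m)
        avg += p_m * cell_loss_rate(m, m_saturate)
    return avg

print(average_loss(total_sources=48, m_saturate=30, p_active=0.4))
```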
Chapter 3
Voice Compression Techniques
As digital communication and secure communication have become increasingly important, the theory and practice of data compression have received increased attention. While it is true that in many systems bandwidth is relatively inexpensive, e.g. over optical fiber, in most systems the growing amount of information that users wish to communicate or store necessitates some form of compression for efficient, secure, and reliable use of the communication or storage medium. An example where compression is required results from the fact that if speech is digitized using a simple PCM system, consisting of a sampler followed by a scalar quantizer, the resulting signal will no longer have a small enough bandwidth to fit on ordinary telephone channels; that is, digitization causes bandwidth expansion. Hence data compression is required if the original communication channel is to be used [12]. Voice coders are mainly classified into three types: waveform coders, vocoders, and hybrid coders.

Waveform coders

Waveform coders reproduce the analog waveform as accurately as possible, including background noise, and operate at high bit rates. G.711 is the waveform coder that represents speech as 8-bit compressed pulse code modulation (PCM) samples at a sampling rate of 8000 Hz. The standard has two forms, A-law and μ-law. The A-law G.711 PCM encoder converts 13-bit linear PCM samples into 8-bit compressed PCM samples, and the decoder performs the reverse conversion. The μ-law G.711 PCM encoder converts 16-bit linear PCM samples into 8-bit compressed PCM samples. G.726, based on the Adaptive Differential Pulse Code Modulation (ADPCM) technique, converts the 64 kbit/s A-law or μ-law pulse code
modulation (PCM) channel, or a 128 kbit/s linear PCM channel, to and from a 40, 32, 24 or 16 kbit/s channel. The ADPCM technique applies to all waveforms, high-quality audio, modem data, etc.

Vocoders

Vocoders do not reproduce the original waveform. The encoder derives a set of parameters, which are sent to the receiver and used to drive a speech production model. Linear Prediction Coding (LPC), for example, is used to derive the parameters of an adaptive (time-varying) digital filter that models the output of the speaker's vocal tract. The quality of vocoders is not good enough for use in telephony systems.

Hybrid coders

The prevalent speech coder for VoIP is the hybrid coder, which melds the attractive features of waveform coders and vocoders. It is also attractive because it operates at low bit rates, as low as 4-16 kbps. Hybrid coders use analysis-by-synthesis (AbS) techniques: an excitation signal is derived from the input speech signal in such a manner that the difference between the input and the synthesized speech is quite small. An enhancement to this operation is to use a pre-stored codebook of optimized parameters (a vector of elements) to encode a representative vector of the input speech signal; this technique is known as vector quantization (VQ). In this chapter, the discussion is focused on low bit-rate standard codecs because:
• their algorithms require segmentation of the speech stream, which suits packet networks that readily require such segmentation;
• they save capacity for data networks facing the growth of demand;
• most of the speech codecs used in data networks are of the analysis-by-synthesis type, i.e. low bit-rate codecs.
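Before turning to the AbS coders, a minimal sketch of μ-law companding is given below for illustration. It applies only the continuous μ-law compression curve to normalized samples; it is not the exact segmented 8-bit G.711 quantizer, merely the companding law that the quantizer is built on.

```python
# Sketch: continuous mu-law companding of normalized samples in [-1, 1].
# This illustrates the companding law only; the actual G.711 encoder uses a
# segmented 8-bit quantizer built on top of this curve.
import numpy as np

def mu_law_compress(x, mu=255.0):
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_expand(y, mu=255.0):
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu

x = np.array([-0.5, -0.01, 0.0, 0.01, 0.5])
y = mu_law_compress(x)
print(y)
print(mu_law_expand(y))   # recovers x (up to floating-point error)
```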
3.1 Analysis by Synthesis (AbS)

The basic model for analysis-by-synthesis predictive coding of speech is shown in figure 3.1. The model consists of three main parts. The first part is the synthesis filter, an all-pole time-varying filter that models the short-time spectral envelope of the speech waveform. It is often called the short-term correlation filter because its coefficients are computed by predicting a speech sample from a few previous samples (usually the previous 8-16 samples, hence the name short-term). The synthesis filter may also include a long-term correlation filter cascaded with the short-term correlation filter; the long-term predictor models the fine structure of the speech spectrum [13].
Figure 3.1: Basic model for Analysis by Synthesis (encoder: excitation generator, long-term predictor 1/(1 − Pl(z)) with Pl(z) = G·z^(−α), short-term predictor 1/(1 − Ps(z)) with Ps(z) = Σ_{k=1..β} ak·z^(−k), perceptual error weighting filter and error minimization loop; decoder: the same synthesis filters driven by the optimized excitation)

The second part of
the model is the excitation generator. This generator produces the excitation sequence which is to be fed to the synthesis filter to produce the reconstructed speech at the
Chapter 3: Voice Compression Techniques
receiver. The excitation is optimized by minimizing the perceptually weighted error between the original and synthesized speech. As shown in figure 3.1, a local decoder is present inside the encoder; the analysis method for optimizing the excitation uses the difference between the original and synthesized speech as an error criterion and chooses the excitation sequence which minimizes the weighted error. The efficiency of this analysis-by-synthesis method comes from the closed-loop optimization procedure, which allows the prediction residual to be represented at a very low bit rate while maintaining high speech quality. The key point in the closed-loop structure is that the prediction residual is quantized by minimizing the perceptually weighted error between the original and reconstructed speech, rather than minimizing the error between the residual and its quantized version as in open-loop structures. The third part of this model is the criterion used in the error minimization. The most common error minimization criterion is the mean squared error (mse). In this model, a subjectively meaningful error minimization criterion is used, where the error e(j) is passed through a perceptual weighting filter which shapes the noise spectrum so that its power is concentrated at the formant frequencies of the speech spectrum and the noise is masked by the speech signal. The encoding procedure includes two steps: first, the synthesis filter parameters are determined from the speech samples (10-30 ms of speech) outside the optimization loop; second, the optimum excitation sequence for this filter is determined by minimizing the weighted error criterion. The excitation optimization interval is usually in the range of 4-7.5 ms, which is shorter than the LPC parameter update frame. The speech frame is therefore divided into sub-blocks, or sub-frames, where the excitation is determined individually for each sub-frame. The quantized filter parameters and the quantized excitation are sent to the receiver. The decoding procedure is performed by passing the decoded excitation signal through the synthesis filters to produce the reconstructed speech.
In the following subsections, we discuss the LPC synthesis and pitch synthesis filters and the computation of their parameters, as well as the error weighting filter and the selection of the error criterion. Each excitation method will be discussed in a separate section.
3.1.1 The Short-term Predictor
The short-term predictor models the short-time spectral envelope of the speech. The spectral envelope of a speech segment of length L samples can be approximated by the transfer function of an all-pole digital filter of the form [13]

H(z) = \frac{1}{1 - P_s(z)} = \frac{1}{1 - \sum_{k=1}^{\beta} a_k z^{-k}}     (3-1)

where

P_s(z) = \sum_{k=1}^{\beta} a_k z^{-k}     (3-2)

is the short-term predictor. The coefficients {a_k} are computed using the method of Linear Prediction (LP). The set of coefficients {a_k} is called the LPC parameters or the predictor coefficients. The number of coefficients \beta is called the predictor order. The basic idea behind linear predictive analysis is that a speech sample can be approximated as a linear combination of past speech samples (8-16 samples), i.e.

\tilde{s}(j) = \sum_{k=1}^{\beta} a_k s(j-k)     (3-3)

where s(j) is the speech sample and \tilde{s}(j) is the predicted speech sample at sampling instant j. The prediction error e(j) is defined as

e(j) = s(j) - \tilde{s}(j) = s(j) - \sum_{k=1}^{\beta} a_k s(j-k)     (3-4)

The short-term average prediction error is defined as

E = \sum_j e^2(j) = \sum_j \left[ s(j) - \sum_{k=1}^{\beta} a_k s(j-k) \right]^2     (3-5)
To find the values of {ak} that minimize E, we set ∂E / ∂ai = 0 for i=1,…, β. Then after differentiating and rearranging
\sum_{k=1}^{\beta} a_k \phi(i,k) = \phi(i,0), \quad i = 1, \ldots, \beta     (3-6)

where

\phi(i,k) = \sum_j s(j-i)\, s(j-k), \quad i = 1, \ldots, \beta     (3-7)

Equation (3-6) defines a set of β equations in β unknowns, {a_k}. The most common and widely used method for solving this set of equations is the autocorrelation method.
3.1.2 Autocorrelation Method
In this approach, we assume that the error in Equation (3-5) is computed over the infinite duration -∞ < j < ∞. Since this cannot be done in practice, it is assumed that the waveform segment is identically zero outside the interval 0 ≤ j ≤ L_a - 1, where L_a is the LPC analysis frame length [13, 14]. This is equivalent to multiplying the input speech by a finite-length window w(j) that is identically zero outside the interval 0 ≤ j ≤ L_a - 1. Using a rectangular window causes sharp truncation of the speech segment, which increases the prediction error, so it is plausible to use a tapered window such as the Hamming window given below [13]:

w(j) = 0.54 - 0.46 \cos\left( \frac{2\pi j}{L_a - 1} \right), \quad 0 \le j \le L_a - 1     (3-8)

Considering Equation (3-5), e(j) is nonzero only in the interval 0 ≤ j ≤ L_a + β - 1. Thus

\phi(i,k) = \sum_{j=0}^{L_a + \beta - 1} s(j-i)\, s(j-k), \quad i = 1, \ldots, \beta, \quad k = 0, \ldots, \beta     (3-9)

Setting m = j - i, equation (3-9) can be rewritten as

\phi(i,k) = \sum_{m=0}^{L_a - 1 - (i-k)} s(m)\, s(m + i - k)     (3-10)

So \phi(i,k) is the short-term autocorrelation of s(m) evaluated at (i - k). That is,

\phi(i,k) = R(i - k)     (3-11)

where

R(\lambda) = \sum_{j=\lambda}^{L_a - 1} s(j)\, s(j - \lambda)     (3-12)

Using (3-11), the set of equations (3-6) can be expressed in matrix form as
\begin{pmatrix}
R(0) & R(1) & R(2) & \cdots & R(\beta-1) \\
R(1) & R(0) & R(1) & \cdots & R(\beta-2) \\
R(2) & R(1) & R(0) & \cdots & R(\beta-3) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
R(\beta-1) & R(\beta-2) & R(\beta-3) & \cdots & R(0)
\end{pmatrix}
\begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_\beta \end{pmatrix}
=
\begin{pmatrix} R(1) \\ R(2) \\ R(3) \\ \vdots \\ R(\beta) \end{pmatrix}     (3-13)
The most efficient algorithm for finding the a_i from this matrix equation is Durbin's recursion, which is as follows:

E(0) = R(0)
For i = 1 to \beta do
    k_i = \left[ R(i) - \sum_{j=1}^{i-1} a_j^{(i-1)} R(i-j) \right] / E(i-1)
    a_i^{(i)} = k_i
    For j = 1 to i-1 do
        a_j^{(i)} = a_j^{(i-1)} - k_i\, a_{i-j}^{(i-1)}
    E(i) = (1 - k_i^2)\, E(i-1)

The final solution is given as a_j = a_j^{(\beta)}. The k_i are known as the reflection coefficients and range from -1 to 1. A sufficient condition for a stable predictor is

|k_i| \le 1
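A compact Python transcription of this recursion is given below for illustration (the thesis implementation was in MATLAB). The autocorrelation values R(0)..R(β) are assumed to come from a Hamming-windowed analysis frame, as in equations (3-8) and (3-12).

```python
import numpy as np

def autocorrelation(frame, order):
    """R(0)..R(order) of a Hamming-windowed analysis frame, eqs. (3-8)/(3-12)."""
    La = len(frame)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(La) / (La - 1))
    s = frame * w
    return np.array([np.dot(s[lag:], s[:La - lag]) for lag in range(order + 1)])

def durbin(R, order):
    """Durbin's recursion: solve (3-13) for the LPC coefficients a_1..a_beta."""
    a = np.zeros(order + 1)          # a[j] holds a_j; a[0] is unused
    E = R[0]                         # E(0) = R(0)
    k = np.zeros(order + 1)          # reflection coefficients k_1..k_beta
    for i in range(1, order + 1):
        k[i] = (R[i] - np.dot(a[1:i], R[i - 1:0:-1])) / E
        a_prev = a.copy()
        a[i] = k[i]
        for j in range(1, i):
            a[j] = a_prev[j] - k[i] * a_prev[i - j]
        E = (1.0 - k[i] ** 2) * E
    return a[1:], k[1:], E           # predictor coeffs, reflection coeffs, residual energy
```

For a 10th-order predictor, durbin(autocorrelation(frame, 10), 10) returns the coefficients a_1..a_10, the reflection coefficients k_i, and the final prediction error energy for one analysis frame.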
The k_i are then converted to log-area ratios (LARs), which are suitable for transmission as they are less sensitive to channel errors. The LARs are computed as

LAR_i = \log\left( \frac{1 - k_i}{1 + k_i} \right)     (3-14)

Parameters that are even less sensitive to errors in the transmission channel are the Line Spectrum Frequencies (LSFs). More details about calculating the LSFs from the a_i are found in [14, 15].
3.1.3 Long-Term Predictor (LTP)
While the short-term predictor models the spectral envelope of the speech segment being analyzed, the long-term predictor (LTP), or the pitch predictor, is used to model the
fine structure of that envelope [13, 14]. Inverse filtering of the speech input removes some of the redundancy in the speech by subtracting from each speech sample its predicted value computed from the past β samples (usually β = 10). The short-term prediction residual, however, still exhibits some periodicity (or redundancy) related to the pitch period of the original speech when it is voiced. This periodicity is on the order of 20-160 samples (50-400 Hz pitch frequencies). Adding the pitch predictor to the inverse filter further removes the redundancy in the residual signal and turns it into a noise-like process. It is called a pitch predictor since it removes the pitch periodicity, or a long-term predictor since the predictor delay is between 20 and 160 samples. The long-term predictor is essential in low bit rate speech coders, as in CELP, where the excitation signal is modeled by a Gaussian process; long-term prediction is therefore necessary to ensure that the prediction residual is very close to a random Gaussian noise process. The general form of the LTP is [13]

\frac{1}{1 - P_l(z)} = \frac{1}{1 - \sum_{k=-m_1}^{m_2} G_k z^{-(\alpha + k)}}     (3-15)
Taking m_1 = m_2 = 0 is common, and in this case the LTP is called a one-tap predictor with P_l(z) = G z^{-\alpha}, where \alpha and G are calculated by minimizing the mean squared residual error after the short-term and long-term predictors. The order of this filter is \alpha, which ranges from 20 to 160. The long-term prediction error is

e(j) = \rho(j) - G\, \rho(j - \alpha)     (3-16)

where \rho(j) is the residual after the short-term predictor. Doing so, the gain G and the lag \alpha can be calculated from the following equations [13]:

G = \frac{\sum_{j=0}^{N-1} \rho(j)\, \rho(j-\alpha)}{\sum_{j=0}^{N-1} [\rho(j-\alpha)]^2}     (3-17)

\alpha = \arg\max_{\alpha} \frac{\left[ \sum_{j=0}^{N-1} \rho(j)\, \rho(j-\alpha) \right]^2}{\sum_{j=0}^{N-1} [\rho(j-\alpha)]^2}     (3-18)

where N is the number of samples over which the long-term prediction is calculated.
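A minimal Python sketch of this open-loop search is shown below for illustration; for simplicity the correlation sums are restricted to lags available inside the analysed residual frame, whereas a real coder also uses residual samples from previous frames.

```python
import numpy as np

def open_loop_ltp(rho, lag_min=20, lag_max=160):
    """One-tap open-loop LTP search over the short-term residual rho(j):
    pick the lag alpha maximizing the criterion of (3-18), then compute
    the gain G of (3-17) for that lag."""
    N = len(rho)
    best_alpha, best_gain, best_metric = lag_min, 0.0, -1.0
    for alpha in range(lag_min, min(lag_max, N - 1) + 1):
        num = np.dot(rho[alpha:N], rho[0:N - alpha])      # sum rho(j) * rho(j - alpha)
        den = np.dot(rho[0:N - alpha], rho[0:N - alpha])  # sum rho(j - alpha)^2
        if den > 0.0 and num * num / den > best_metric:
            best_metric = num * num / den
            best_alpha = alpha
            best_gain = num / den
    return best_alpha, best_gain
```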
3.1.3.1 LTP using the adaptive codebook method
The above calculation for the LTP is an open-loop solution. However, a significant improvement in results is achieved by using a closed-loop solution, i.e. searching inside the AbS loop, which is described in detail in [13].
3.2 Standard Voice Compression Techniques
3.2.1 GSM Full Rate (RPE-LTP) coder
The GSM Full Rate coder (GSM FR for short) is also called the Regular Pulse Excited coder with Long Term Predictor (RPE-LTP). This coder has an STP of order 10 and a one-tap LTP. Its excitation signal is a set of ten regularly spaced pulses per subframe, each subframe having 40 samples. The frame size of this coder is 160 samples, i.e. each frame has four subframes. The excitation pulse grid can take one of four positions. The coder output frame contains the 10 STP filter coefficients represented by LSFs (Line Spectral Frequencies), the LTP gain and delay, and the ten pulse amplitudes and their grid position. Figure 3.2 shows the encoder and decoder for this codec. The bit rate of this codec is 13 kbps [15].
Figure 3.2: Coder and Decoder for the GSM FR
3.2.2 Federal Standard CELP (FS1016) coder Federal Standard CELP (or FS CELP) stands for the Federal Standard Code Excited Linear Prediction. The excitation signal is chosen from a stochastic codebook. The STP order is 10 and the LTP uses the adaptive codebook approach. The bit rate for this codec is 4.8 Kbps. The coder is depicted in figure 3.3 [16].
Figure 3.3: The FS CELP coder block diagram
3.2.3 GSM Half Rate coder (GSM HR)
GSM HR stands for the GSM Half Rate coder, also called VSELP (Vector Sum Excited Linear Prediction). Its bit rate is 5.6 kbps. Its excitation is drawn from two stochastic codebooks, each shorter than that of FS CELP, to reduce the codebook index search complexity. An LTP with the adaptive codebook approach is used, and the STP order is 10. Its coder is depicted in figure 3.4 [16].
Figure 3.4: GSM HR coder block diagram
3.2.4 LD-CELP
LD-CELP, the ITU-T G.728 standard, stands for Low Delay CELP. The general block diagram of the LD-CELP codec is given in figure 3.5 [17].
Figure 3.5: LD-CELP coder
At bit rates of around 16 kbit/s and lower, the voice quality of waveform codecs falls rapidly. Thus at these rates CELP codecs and their derivatives tend to be used. However, because of the forward adaptive determination of the short-term filter coefficients used in most of these codecs, they tend to have high delays. The delay of a voice codec is defined as the time from when a voice sample arrives at the input of its encoder to when the corresponding sample is produced at the output of its decoder, assuming the bit stream from the encoder is fed directly to the decoder. For a typical hybrid voice codec this delay will be of the order of 50 to 100 ms, and such a high delay can cause problems. Therefore the ITU released a set of requirements for a new 16 kbit/s standard, the chief requirements being that the codec should have voice quality comparable to the 32 kbit/s waveform codec (G.721) in both error-free conditions and over noisy channels, and should have a delay of less than 5 ms and ideally less than 2 ms. All the ITU requirements were met by a backward adaptive CELP codec, which was developed at AT&T Bell Labs and standardized as G.728. This codec uses backward adaptation to calculate the short-term filter coefficients, which means that rather than buffering 20 ms or so of the input speech to calculate the filter coefficients, they are found from the past reconstructed speech. This means that the codec can use a much shorter frame length than traditional CELP codecs, and G.728 uses a frame length of only 5 samples, giving it a total delay of less than 2 ms. A high-order (β=50) short-term predictor is used, and this eliminates the need for any long-term predictor. Thus all ten bits, which are available for each five-sample vector at 16 kbit/s, are used to represent the fixed codebook excitation. Of these ten bits, seven are used to transmit the fixed codebook index and the other three are used to represent the excitation gain. Backward gain adaptation is used to aid the quantization of the excitation gain, and at the decoder a post filter is used to improve the perceptual quality of the reconstructed voice [17].
Chapter 4
Packet Loss and Recovery Techniques
Data transmitted across packet-switched networks is normally subject to delay, delay jitter, resequencing of packets, and loss of packets [18]. Recently, voice over packet (VoP) applications have gained increasing interest since they can be realized using inexpensive network services. However, due to the unreliable nature of packet delivery, the quality of the received stream is adversely affected by packet loss or delay [19].
4.1 Packet loss in data networks
Packet loss occurs due to the discarding of packets in congestion periods, as in IP networks, and buffer overflow, as in ATM switches, or by dropping packets at the gateway/terminal due to late arrival or misdelivery caused by errors in the packet header. Delay and delay variation (jitter) are the main network impairments that affect voice quality [1]. The end-to-end delay is the time elapsed between sending and receiving a packet. It mainly consists of the following components [1]:
• Propagation delay: depends only on the physical distance of the communications path and the communication medium. When transmitted over fiber, coax or twisted wire pairs, packets incur a one-way delay of 5 μs/km.
• Transmission delay: the time it takes the network interface to send out the packet.
• Queuing delay: the time a packet has to spend in the queues at the input and output ports before it can be processed. It is mainly caused by network congestion.
• Codec processing delay: including the codec's algorithmic delay and lookahead delay.
• Packetization/de-packetization delay: the time needed to build data packets at the sender, as well as to strip off packet headers at the receiver.
• Playout buffer delay: the time a packet waits in the playout buffer at the receiving terminal.
The ITU has recommended one-way delays no greater than 150 ms for most applications, with a limit of 400 ms for acceptable voice communications [1]. For multimedia applications such as voice, delay can cause packet loss because packets that arrive too late are discarded, so loss due to delay can be merged into packet loss [10]. Delay jitter can be cured using a playout buffer that adds delay to early-arriving packets to regulate the packet arrival time as seen by the voice decoder.
4.2 Packet loss recovery techniques
Recovery techniques may be divided into two classes: sender-based and receiver-based. In sender-based recovery, the sender participates in the loss recovery. The sender-based recovery techniques can be subdivided into active techniques, such as retransmission, and passive techniques, such as interleaving and Forward Error Correction (FEC). FEC can be media-independent or media-specific [19]. The receiver-based recovery techniques enable destinations to recover lost packets independently of the sender, making them more flexible. They can be classified as insertion repair, interpolation repair or regeneration repair [19]. Figure 4.1 shows a more detailed classification of the different recovery techniques.
Figure 4.1: Recovery Techniques
4.3 Sender-Based Packet Loss Recovery
The basic sender-based mechanisms available to recover from packet loss are: Automatic Repeat reQuest (ARQ), media-independent/specific Forward Error Correction (FEC), FEC with uneven level protection (ULP), and interleaving.
4.3.1 Automatic Repeat reQuest (ARQ)
ARQ is considered an active recovery technique. Using ARQ [18], a lost packet is retransmitted to the receiver by the sender. ARQ-based schemes consist of three parts:
• Lost data detection: by the receiver or by the sender (timeout).
• Acknowledgment strategy: the receiver sends acknowledgments that indicate which data are received or which data are missing.
• Retransmission strategy: it determines which data are retransmitted by the sender.
Despite its robustness against burst losses, ARQ cannot be used in real-time applications, such as VoP, because of the large delay and bandwidth overhead.
4.3.2 Media-independent FEC
In FEC, lost data can be recovered at the receiver without further reference to the sender. Both the original data and redundant information are transmitted to the receiver. There are two kinds of redundant information: that which is independent of the media stream and that which depends on it. Media-independent FEC does not need to know the original data type (speech, audio, or video). In media-independent FEC, the original data together with some redundant data, called parities, are transmitted to the receiver. The redundant data are derived from the original data either using the exclusive-OR (XOR) operation, where one parity packet is generated for the given original packets, or using Reed-Solomon codes, where multiple independent parities can be computed for the same set of packets. Reed-Solomon codes achieve optimal loss protection but lead to higher processing costs than schemes based on the XOR operation. Even though the XOR operation results in sub-optimal protection, it is preferred for practical implementations since several parity packets can be computed with lower processing cost than with the Reed-Solomon method. The FEC transmits k original data packets (D) and h additional redundant parity packets (P). Figure 4.2 shows an example for k=3 and h=2. The FEC encoder produces two redundant packets (P1, P2) from three data packets. If one data packet (D3) and one parity packet (P1) are dropped, the receiver can recover the data packet (D3) by using the successfully received packets D1, D2, and P2.
Figure 4.2: Media-independent FEC (the FEC encoder produces parity packets P1 and P2 from data packets D1-D3; D3 and P1 are lost in the network, and the FEC decoder recovers D3 from D1, D2 and P2)
FEC is effective even for a small h/k ratio. At the FEC decoder, losses of consecutive packets can be corrected for large values of k, since the decoder uses not only the parity bits but also the data bits. However, if k increases, the reconstruction delay at the receiver also increases. There are several advantages of media-independent FEC schemes: they are source independent, since the operation of FEC does not depend on the contents of the original data and the repair is an exact replacement for a lost packet; and the original data packets can be used by receivers which are not capable of FEC, since the redundant data are usually sent as a separate stream. The main disadvantage is that FEC coding requires additional delay or bandwidth for efficient encoding and decoding: increasing k adds delay, while increasing h adds bandwidth.
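The XOR-based variant described above can be sketched as follows (illustrative Python, not part of the thesis implementation): one parity packet is generated over a block of k equal-length data packets, and any single lost packet in the block can be rebuilt from the survivors.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length packets."""
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(data_packets):
    """XOR parity over a block of k equal-length data packets."""
    parity = bytes(len(data_packets[0]))
    for p in data_packets:
        parity = xor_bytes(parity, p)
    return parity

def recover_single_loss(received, parity):
    """received: the block with exactly one entry set to None (the lost packet).
    The lost packet equals the XOR of the parity with all surviving packets."""
    missing = parity
    for p in received:
        if p is not None:
            missing = xor_bytes(missing, p)
    return missing

# Example: k = 3 data packets plus one XOR parity packet; D3 is lost in the network.
D = [b'\x01\x02', b'\x03\x04', b'\x05\x06']
P = make_parity(D)
print(recover_single_loss([D[0], D[1], None], P))  # -> b'\x05\x06'
```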
4.3.3 Media-specific FEC
4.3.3.1 Embedded Speech Coding technique
The embedded speech coding technique is introduced in [20]. This technique implements voice recovery by adding redundancies to the audio packets sent by the source. In order to implement this technique without incurring excessive bandwidth, toll and non-toll quality voice coding algorithms are used for the primary and redundant transmissions respectively. Therefore, the redundant voice segment is of lower quality than the primary voice data. The use of different coding algorithms is necessary, as better quality voice coding algorithms demand higher network bandwidth and more processing
power. In this way, the output speech waveform at the receiver will consist of periods of toll quality speech interspersed with periods of synthetic quality speech. The synthetic quality speech coding algorithm, Linear Predictive Coding (LPC), is used for the redundant voice encoding. The placement of the redundancies depends on the network load condition. Under light and intermediate loads, losses are essentially nonconsecutive for an audio stream; under heavy loads the behavior is similar, but consecutive losses are more prevalent. Hence, [20] proposed that placing the redundancy immediately in the following packet works well under light and medium network loads, whereas placing the redundancy a number of packets later is more suitable for heavy loads. Figure 4.3 is a pictorial description of the packet structure. This technique requires low delay to synthesize the quantized signal at the decoder, since only a single-packet delay is usually added. This makes it suitable for interactive applications, such as VoP.

Figure 4.3: Position of redundant data in the case of (a) low and medium network loads, and (b) heavy network load (each packet carries a primary speech coding block and a secondary coding block)
4.3.3.2 Uneven Level Protection
The idea of Uneven Level Protection (ULP) arises from the fact that different portions of data (of speech data, in particular) have unequal importance for the reconstruction quality, and from the capability to assign different priority levels to different data portions, which is possible in ATM networks. For example, if speech is coded using code-excited linear prediction (CELP), then the pitch and prediction filter parameters are of considerably higher importance than the excitation codebook. An error in the prediction filter parameters can reduce the quality considerably and even result in an unstable system, while an error in the codebook index will generally result in little or even no loss of perceptual quality. This motivates the use of uneven protection for data units having unequal importance. In general, for linear prediction-based coders, ULP is applied by assigning high priority to the linear prediction coefficients (LPC) and pitch parameters and low priority to the excitation information. For waveform coders, ULP is applied by assigning high priority to the most significant bits (MSB) and low priority to the least significant bits (LSB) [21].
4.3.4 Interleaving
Interleaving is a useful packet-loss recovery technique for applications where end-to-end delay is of secondary importance [19]. If the size of a data unit produced at a time by a coder is smaller than the allowed payload size in a packet, then a few data units may be combined into a single packet. However, in order to reduce the packet-loss effects, the original data units are not combined in the same sequential order as produced by the coder but are interleaved by the transmitter. Units are resequenced before transmission so that originally adjacent units are separated by a guaranteed distance in the transmitted stream, and are returned to their original order at the receiver and provided to the decoder. As can be seen from figure 4.4, the effect of a packet loss (one packet contains a few data units) is distributed over small intervals corresponding to distributed data units instead of adjacent data units. For example, if units are 5 ms in length and packets 20 ms (i.e., 4 units/packet),
then the first packet would contain units 1, 5, 9, 13; the second units 2, 6, 10, 14; and so on, as illustrated in figure 4.4. It can be seen that the loss of a single packet from an interleaved stream results in multiple small gaps in the reconstructed stream, as opposed to the single large gap which would occur in a noninterleaved stream. The effect of a packet loss is reduced for the following reasons. The resulting small gap intervals typically correspond to speech intervals considerably shorter than a phoneme length; therefore, humans are able to mentally interpolate the gap intervals, and speech intelligibility is not decreased. This is in contrast to the noninterleaved situation, where a single lost packet can result in a complete phoneme being lost, which decreases the intelligibility of the speech. Moreover, if the receiver uses some form of error concealment (e.g. the gaps due to packet loss are filled using interpolation of the received adjacent data units, a receiver-based recovery technique), then higher performance is obtained if the interpolation is performed over small intervals instead of longer ones. The interleaving operation is described by the interleave length, L (the distance between subsequent units after interleaving), the bundling factor, B (the number of data units in one packet), and the sequence length, n, which are related as n = BL.

Figure 4.4: Interleaving of units across multiple packets (packetized data stream, interleaved data stream, data stream after experiencing loss, and reconstructed stream)
Increasing the interleave length, L, increases the distance between the resulting gaps due to a single packet loss, which minimizes the packet-loss effect without increasing the network load. At the same time, this increases the overall delay, which might be prohibitive for interactive applications. The choice of bundling factor, B, is defined by the receiver buffer size and the delay requirements.
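The interleaving operation defined above, with interleave length L and bundling factor B, can be sketched as follows (illustrative Python using the unit numbering of figure 4.4).

```python
def interleave(units, L, B):
    """Group a sequence of n = B*L units into packets of B units whose original
    indices are L apart (figure 4.4 example: L = 4, B = 4 gives packets
    [1,5,9,13], [2,6,10,14], ...)."""
    assert len(units) == B * L
    return [[units[p + L * b] for b in range(B)] for p in range(L)]

def deinterleave(packets, L, B):
    """Restore the original order; a lost packet is passed as None and its
    units appear as small, spread-out gaps (None) in the output."""
    out = [None] * (B * L)
    for p, packet in enumerate(packets):
        if packet is None:
            continue
        for b, unit in enumerate(packet):
            out[p + L * b] = unit
    return out

units = list(range(1, 17))            # sixteen 5 ms units
pkts = interleave(units, L=4, B=4)    # [[1,5,9,13], [2,6,10,14], [3,7,11,15], [4,8,12,16]]
pkts[2] = None                        # one packet lost in the network
print(deinterleave(pkts, L=4, B=4))   # gaps at units 3, 7, 11 and 15
```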
4.4 Receiver-based Recovery Techniques
Unlike sender-based recovery techniques, the receiver-based recovery techniques are independent and do not require any action to be taken by the source party. A closer look at figure 4.1 shows that all receiver-based recovery techniques except insertion-based recovery depend mainly on the properties of the voice signal, as the following discussion will show.
4.4.1 Insertion
Insertion-based repair schemes derive a replacement for a lost packet by inserting a simple fill-in. The simplest case is splicing, where a zero-length fill-in is used; an alternative is silence substitution, where a fill-in with the duration of the lost packet is substituted to maintain the timing of the stream. Better results are obtained by using noise or a repeat of the previous packet as the replacement. The distinguishing feature of insertion-based repair techniques is that the characteristics of the signal are not used to aid reconstruction. This makes these methods simple to implement, but results in generally poor performance [19].
4.4.1.1 Splicing
Lost units can be concealed by splicing together the audio on either side of the loss; no gap is left due to a missing packet, but the timing of the stream is disrupted. This technique has been shown to perform poorly [19]. Low loss rates and short clipping lengths (4-16 ms) fared best, but the results were intolerable for losses above 3 percent. The use of splicing can also interfere with the adaptive playout buffer required in a packet audio
system, because it makes a step reduction in the amount of data available to buffer. The adaptive playout buffer is used to allow for the reordering of misordered packets and the removal of network timing jitter, and poor performance of this buffer can adversely affect the quality of the entire system. It is clear, therefore, that splicing together audio on either side of a lost unit is not an acceptable repair technique [19].
4.4.1.2 Silence Substitution (SS)
The silence substitution method simply fills the gap left by a lost packet with silence (i.e., zeros) to maintain the speech timing sequence [22]. This is the simplest and least complex [22] method among all the techniques; however, it is not able to maintain an acceptable quality of playback audio in the event of a high loss rate and large packet size [23]. It is only effective with short packet lengths (< 4 ms) and low loss rates (< 2 percent) [19, 22]. The performance of silence substitution degrades rapidly as packet sizes increase, and quality is unacceptably bad for the 40 ms packet size in common use in network audio conferencing tools [19].
4.4.1.3 Noise Substitution (NS)
Instead of filling the gap left by a lost packet with silence, background noise is inserted [19]. A number of studies of the human perception of interrupted speech have shown that the ability of the human brain to subconsciously repair the missing segment of speech with the correct sound occurs for speech repair using noise substitution but not for silence substitution. In addition, when compared to silence, the use of white noise has been shown to give both subjectively better quality and improved intelligibility. It is therefore recommended as a replacement for silence substitution [19, 20].
4.4.1.4 Packet Repetition (PR)
The repetition method uses the packet preceding the lost packet as the substitution. Its complexity is also close to zero, the same as that of the silence substitution
method. However, the repetition method has better recovery performance than the silence substitution method. The reconstructed speech using this method can tolerate a packet loss rate of up to 4% [22]. The subjective quality of repetition can be improved by gradually fading the repeated units. The GSM system, for example, advocates repeating the first 20 ms with the same amplitude and then fading the repeated signal to zero amplitude over the next 320 ms. The use of repetition with fading is a good compromise between the poorly performing silence and noise substitution methods and the more complex interpolation-based methods [19] that will be discussed in the subsequent sections.
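A waveform-level sketch of repetition with fading is given below (illustrative Python; the 20 ms hold and 320 ms fade follow the GSM figures quoted above, while the linear fade shape and the 8 kHz sampling rate are assumptions made here).

```python
import numpy as np

def repetition_with_fading(last_packet, n_lost, fs=8000, hold_ms=20, fade_ms=320):
    """Conceal n_lost consecutive lost packets by repeating the last received
    packet, holding its amplitude for hold_ms and then fading to zero over fade_ms."""
    packet_len = len(last_packet)
    out = np.tile(np.asarray(last_packet, dtype=float), n_lost)
    hold = int(fs * hold_ms / 1000)
    fade = int(fs * fade_ms / 1000)
    for i in range(len(out)):
        gain = 1.0 if i < hold else max(0.0, 1.0 - (i - hold) / fade)
        out[i] *= gain
    return out.reshape(n_lost, packet_len)
```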
4.4.2 Interpolation
The following recovery techniques are categorized as interpolation methods because the missing speech segments are substituted by other segments from the same speech stream, with modifications that differ from method to method. The advantage of interpolation-based schemes over insertion-based techniques is that they account for the changing characteristics of the speech signal [19].
4.4.2.1 Waveform Substitution (WS)
In this method, when a segment of voice fails to arrive at the destination on time, the previous segment of voice is used to replace the missing segment. The assumption of this technique is that the speech characteristics have not changed much from the preceding speech segment, so it is logical to use the previous segment of speech to reconstruct the missing portion. This method does not work for large packet sizes, as the voice characteristics are likely to change noticeably from one packet to the next. Moreover, it also does not guard against the continuous loss of multiple packets, where the voice characteristics do not remain the same over the duration of the loss. As with silence substitution, it does not demand much processing power [23, 24]. Hence, it is used in some interactive voice communication applications [24, 25].
4.4.2.2 Sample Interpolation (SI)
This technique is similar to Waveform Substitution; however, it does not directly replace the missing audio segments with the previously received segments. It modifies the previous audio packets before substituting them for the missing audio segments. The method assumes that the audio characteristics change only slightly over a short period of time. In order to use previously received samples to replace the missing audio segments while accommodating this slight change in the audio attributes, the missing samples are estimated based on the previous samples' characteristics. A simple form of sample modification is linear interpolation of the audio. In comparison, it requires more processing power than Waveform Substitution, but it offers a better contingency solution. As with Waveform Substitution, it is not usable over a prolonged duration of packet loss, as the audio characteristics are likely to change significantly [23].
4.4.2.3 Pitch Waveform Replication (PWR)
The pitch waveform replication method uses two parallel detectors, which continually detect the positive and negative peaks of the speech respectively, to estimate the pitch of the packet before the lost one. The pitch search can be done on either side of the loss [19] or only on the preceding side. Figure 4.5 illustrates how the positive peak detector works. In figure 4.5, assume the speech signal is x(n). The positive peak detector updates the value of MAX with successive local maxima of the speech samples until no update has occurred for a given number of hold samples. Then, MAX decays exponentially by a factor (i.e., the value of the decreasing factor is smaller than one) until it is exceeded by a speech sample. The negative peak detector works analogously. From the two peak detectors, we obtain the four time intervals that separate the most recent three maxima and minima respectively. Using these four pitch estimates, PWR can decide whether the speech before the missing packet is voiced. If the speech is not voiced or the pitch detection fails, PWR uses the repetition method to recover the lost packet. If the speech is voiced, PWR reconstructs the missing packet by duplicating the pitch-period segment
preceding the missing packet throughout the region of the lost packet. The tolerable packet loss rate is up to about 10% [22]. PWR can be considered a refinement of waveform substitution [19].

Figure 4.5: Positive peak detector algorithm flowchart (MAX, the hold counter and the decreasing factor are initialized from the packet, then a per-sample loop either updates MAX and stores its position when x(n) exceeds it, or decrements the hold counter and decays MAX by the factor when the counter expires)
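Based on the textual description above, the positive peak detector can be sketched as follows (illustrative Python; the hold length and decay factor are example values, not taken from the thesis, and the negative peak detector is the mirror image). A full PWR implementation would then use the intervals between the most recently detected peaks as pitch estimates.

```python
def positive_peak_detector(x, hold=40, factor=0.95):
    """Track positive peaks of the speech signal x: MAX follows successive local
    maxima; if no new maximum occurs for `hold` samples, MAX decays by `factor`
    until a sample exceeds it. Returns the positions of the detected peaks."""
    peaks = []
    MAX = x[0]
    count = hold
    for n in range(1, len(x)):
        if x[n] > MAX:
            MAX = x[n]
            peaks.append(n)       # store the position of the new maximum
            count = hold          # restart the hold counter
        else:
            count -= 1
            if count == 0:
                MAX *= factor     # exponential decay of the tracking threshold
                count = hold
    return peaks
```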
The complexity of PWR can be reduced by converting the original speech waveform into a trinary one [26]. The method is based on the local periodicity of speech. By converting the original speech waveform into a trinary one, the amplitude of the waveform is simplified from arbitrary decimal numbers to definite trinary integers. Then, by encoding a trinary integer and its successive count into a pair, the number of data items is
greatly reduced, from the sample count to the pair count. In addition, by determining the local threshold optimally, only the locally highest positive and negative peaks are captured after converting the waveform into trinary form. This makes finding the minimum speech period easy and highly accurate. The found minimum period of the original speech waveform is substituted cyclically during the packet loss duration. Here, in order to minimize waveform discontinuity at the beginning and the end of the packet loss duration, interpolation between each two successive samples of the original waveform is performed so as to keep the continuity of the phase variation in the reproduced speech waveform [26]. The flowchart of the algorithm for this method is shown in figure 4.6a. A demonstration is shown in figure 4.6b, where T+ and T- are the positive and negative thresholds respectively, τ is the part of the found pitch period required to complete the last pitch period before the loss, q is the detected loss period, the A's indicate the numbers of samples that exceeded T+ and are encoded as +1, the B's indicate the numbers of samples that are less than T- and are encoded as -1, and the C's are the numbers of samples that have amplitudes between T+ and T- and are encoded as zero. Generally, there are three important factors that affect the speech quality, namely amplitude continuity, phase continuity and frequency continuity. Explicitly, the amplitudes, phases and frequencies have to be continuous at the boundaries between the substitution packets and their neighboring packets (including the previous and subsequent packets received); otherwise audible noise will occur [22]. It is noted that the PWR method has two shortcomings. First, PWR only copes with the continuity at the boundaries between the reconstructed packets and their previous packets when the speech is voiced, whereas the continuity at the boundaries between the reconstructed packets and the subsequent ones is not properly dealt with. This is referred to as the discontinuity problem of PWR. The second shortcoming of the PWR method is that it uses the repetition method to recover the lost packet when the speech is unvoiced or when the pitch detection fails.
Figure 4.6a: PWR with trinary encoding method flow chart [26] (determine the threshold levels T+ and T- from the maximum and minimum values of the original speech waveform over a definite length of duration; convert the original speech waveform into a trinary one according to T+ and T-; encode each trinary value and its successive count into a pair; when a packet loss is detected, do pattern matching with a definite maximum allowable difference to determine the first matched part and find the minimum period of the original waveform; substitute the found minimum period cyclically, starting at position S with displacement d from the beginning of the found minimum period, for the duration of the loss)

Figure 4.6b: Waveforms for PWR with trinary encoding method [25] (original speech waveform with packet loss, where T+ is the positive and T- the negative threshold; the trinary waveform; the ternary-encoded pairs; and the substituted reproduced waveform)

Figure 4.6: PWR with trinary encoding method
However, the reconstructed speech will not have acceptable quality when there is a transition from voiced to unvoiced speech at the substituted packets. This is referred to as the repetition problem of PWR [22].
4.4.2.4 Double Sided Pitch Waveform Replication (DSPWR)
DSPWR remedies the above-mentioned problems of PWR. While PWR searches for the pitch in the preceding packet only, DSPWR searches for it in both the preceding and following packets surrounding the loss. This adds the ability to detect whether a transition to a voiced segment occurred during the loss; in this case a careful repair is required, since a loss during such a transition has a severe impact on speech quality [27]. DSPWR can tolerate loss rates up to 30%. According to the status of the two pitch detectors (for the preceding and following packets), there are four repair procedures [22]:
• If both detectors succeeded in finding the pitch, a scheme called BV (both voiced) is used for recovery.
• If the detector for the preceding packet succeeded in finding the pitch, but the detector for the following packet failed to find it, a recovery scheme called PV (preceding voiced) is used.
• If only the pitch detector of the following packet succeeded, a recovery scheme called FV (following voiced) is used.
• If both pitch detectors failed, a scheme called BU (both unvoiced) is used for recovery.
Before discussing these recovery procedures, the problems of pitch segment adjustment, phase discontinuity, and amplitude adjustment have to be dealt with first.
Pitch Segment Adjustment (PSA)
The purpose of Pitch Segment Adjustment is to eliminate the amplitude discontinuity at the boundaries between the reconstructed pitch segments used to recover the lost packet [22]. Note that in the PWR method there may be amplitude discontinuity between the head and tail of the pitch segment and, due to its frequent occurrence, this
problem is as important as the one caused by the phase discontinuity [22]. Pitch Segment Adjustment will be abbreviated as PSA. The PSA is done as follows:
• Let x(n) be the speech signal and P the pitch.
• Search the region from x(n-1-P-3) to x(n-1-P+3) to find the sample x(n-1-P+i) whose value is closest to that of x(n-1).
• If i ≤ 3, then the amount of lagging is d = 3 - i; else the amount of leading is d = 2(i - 3).
• Calculate

diff = [x(n - P - d) - x(n - 1)] / (d + 1)     (4-1)

• For j = 1 to d do

x(n - P + j - 1) = x(n - 1) + j·diff     (4-2)

The PSA is depicted in figure 4.7.
Figure 4.7: Procedure PSA for the preceding packet, making an amplitude adjustment to the pitch (P) period segment: (a) the pitch lagging condition, (b) the pitch leading condition, (c) the searching procedure for the value of lagging/leading
Eliminating Phase Discontinuity
As mentioned in section 4.4.2.3, PWR suffers from phase discontinuity at the boundary between the recovered packet and the subsequent one. To eliminate this phase discontinuity, the phase of the beginning sample of the packet following the lost one is computed, both pitch segments are used to reconstruct the lost packet, and the difference between the pitches computed from the two sides of the lost packet is used to eliminate the phase difference. This procedure is called Phase Matching using Pitch difference (PMP) [22]. Suppose that the pitch of the preceding packet is PP, the pitch of the following packet is PF, the size of the packet is n, the phase of the beginning sample of the following packet is phase, the number of pitch segments of the preceding packet used to reconstruct the lost packet is a, the number of pitch segments of the following packet used to reconstruct the lost packet is b, and the number of remaining samples after the fill-up with the pitch segments is c. Then, the lost packet is reconstructed as illustrated in figure 4.8. From figure 4.8, we can derive the following equations:

a·PP + b·PF + c = n     (4-3)

hence

(a + b)·PP + b·(PF - PP) + c = n     (4-4)

hence

b·(PF - PP) + c = n - (a + b)·PP     (4-5)

The right-hand side is the phase that would result if the lost packet were built using the pitch found in the preceding packet only; hence

b·(PF - PP) + c = n mod PP + k·PP     (4-6)

To erase the phase discontinuity, c has to be equal to phase. Replacing c with phase in the above equation, we get

b·(PF - PP) + phase = n mod PP + k·PP     (4-7)

hence

b·(PF - PP) = n mod PP + k·PP - phase     (4-8)

hence

b · pitch difference = phase difference, where pitch difference = PF - PP     (4-9)

Using the above equation, a, b and c can be obtained.
The PMP can be implemented as follows:
• Step 1: Calculate the initial state values a = ⌊n/PP⌋, b = 0, c = n mod PP, phase_diff = c - phase, and pitch_diff = PF - PP.
• Step 2: If phase_diff = pitch_diff = 0, then finish and use PP or PF for building the lost packet.
• Step 3: Else if sign(pitch_diff) = sign(phase_diff), then b = round(phase_diff / pitch_diff), a = a - b, and c = c - b·pitch_diff. If (PP - phase_diff < c - phase), then a = ⌊n/PP⌋, b = 0, c = n mod PP.
• Step 4: Else if phase_diff > 0 and pitch_diff < 0, then phase_diff = phase_diff - PP and go to Step 3.
• Step 5: Else if phase_diff < 0 and pitch_diff > 0, then phase_diff = phase_diff + PP and go to Step 3.
The lost packet is reconstructed by using the pitch segments of the packets on both sides of the lost packet.

Figure 4.8: Illustration of procedure PMP
Adjusting Recovered Packet Amplitude
The amplitude of the reconstructed packet is adjusted in such a way that it is continuous inside the packet as well as with the neighboring packets. Two amplitude adjustment procedures are used; one of them, called FWAA (standing for forward amplitude adjustment), and the other, called BWAA (standing for backward
amplitude adjustment), are described below. Assume that the amplitude of the selected pitch segment of the previous packet is VP, the amplitude of the selected pitch segment of the following packet is VF, the packet size is n, and the signal segment to be adjusted runs from x(start) to x(stop) in the reconstructed packet [22]. The corresponding scenarios are given in figure 4.9.
Figure 4.9: Illustration of the procedures FWAA and BWAA, where the dotted lines represent the waveforms after reconstruction

Procedure FWAA
1. Compute factor = (VF - VP) / (VP · n)
2. For i = start to stop do x(i) = x(i)·(1 + factor·i)

Procedure BWAA
1. Compute factor = (VP - VF) / (VF · n)
2. For i = start to stop do x(i) = x(i)·(1 + factor·(n - i))

Scheme BV
For the case that both the preceding and following packets are voiced:
• Step 1. Use the procedure PSA to adjust the PP period segment just preceding the lost packet and the PF period segment just following the lost packet.
• Step 2. Compute the peak amplitude of the PP period segment just preceding the lost packet and denote this amplitude as VP. Also, compute the peak amplitude of the PF period segment just following the lost packet and denote this amplitude as VF.
• Step 3. Compute the phase of the beginning sample of the packet following the lost packet by the algorithm proposed in the phase-matching recovery method.
• Step 4. Use procedure PMP to derive the parameters a, b and c.
• Step 5. Copy a repetitions of the preceding pitch segment into the lost packet.
• Step 6. Then copy the first c samples of the preceding pitch segment into the lost packet.
• Step 7. Copy b repetitions of the following pitch segment into the lost packet.
• Step 8. Use procedure FWAA to adjust the amplitude of the leading pitch segment in the lost packet.
• Step 9. Use procedure BWAA to adjust the amplitude of the rear pitch segment in the lost packet.
If just one side of the pitch estimation is successful, we use the successful side to reconstruct the lost packet. First, we adjust the pitch segment consisting of pitch samples just preceding or following the lost packet by using the procedure PSA as depicted in
figure 4.7 and calculate the peak amplitudes of this pitch segment. Then, we reconstruct the missing packet by duplicating this pitch segment throughout the region of the lost packet. In addition, we adjust the amplitude of the reconstructed speech by using the procedure FWAA or BWAA to make the amplitude decrease linearly from the voiced side to the other side. Note that, according to the experiments reported in [22], we do not need to consider the phase discontinuity from voiced to unvoiced speech, since the energy of unvoiced speech is in general so small as to be barely audible. Schemes PV and FV are outlined below.
Scheme PV
For the case that the preceding packet is voiced:
• Step 1. Adjust the PP period segment just preceding the lost packet by using the procedure PSA.
• Step 2. Compute the peak amplitude of the PP period segments just preceding and following the lost packet.
• Step 3. Repeat the preceding segment throughout the region of the lost packet as the substitution.
• Step 4. Adjust the amplitude of the substitution using procedure FWAA.
Scheme FV
For the case that the following packet is voiced:
• Step 1. Adjust the PF period segment just following the lost packet by using the procedure PSA.
• Step 2. Compute the peak amplitude of the PF period segments just preceding and following the lost packet.
• Step 3. Repeat the following segment throughout the region of the lost packet as the substitution.
• Step 4. Adjust the amplitude of the substitution using procedure BWAA.
If the pitch estimation on both sides failed, we reconstruct the lost packet from the rear half of the preceding packet and the first half of the following packet. This method is easy to implement and is found to be effective in reducing the noise caused by the transition from voiced to unvoiced and vice versa. The scheme is as follows.
Scheme BU
For the case that both the preceding and following packets are unvoiced:
• Step 1. Copy the rear half of the preceding packet into the region of the first half of the lost packet.
• Step 2. Copy the first half of the following packet into the region of the rear half of the lost packet.
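As an illustration of the FWAA and BWAA amplitude ramps used by the schemes above, the two procedures can be transcribed directly (illustrative Python; start, stop, VP, VF and n are as defined earlier in this section, not values prescribed by the thesis).

```python
def fwaa(x, start, stop, VP, VF, n):
    """Forward amplitude adjustment: ramp x[start..stop] so that its amplitude
    moves from the preceding-packet level VP towards the following level VF."""
    factor = (VF - VP) / (VP * n)
    for i in range(start, stop + 1):
        x[i] *= (1.0 + factor * i)
    return x

def bwaa(x, start, stop, VP, VF, n):
    """Backward amplitude adjustment: the mirror ramp, weighted by (n - i)."""
    factor = (VP - VF) / (VF * n)
    for i in range(start, stop + 1):
        x[i] *= (1.0 + factor * (n - i))
    return x
```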
4.4.3 Regeneration Based Recovery Regenerative repair techniques use knowledge of the audio compression algorithm to derive codec parameters, such that audio in a lost packet can be synthesized. These techniques are necessarily codec-dependent but perform well because of the large amount of state information used in the repair. However, they are computationally intensive [19].
4.4.4 Model-Based Recovery
In model-based recovery, the speech on one or both sides of the loss is fitted to a model that is used to generate speech to cover the loss period. This technique works well for short blocks (8 or 16 ms), which ensures that the speech characteristics of the last received block have a high probability of still being relevant [19].
4.4.4.1 Interpolation of Transmitted State
For codecs based on linear prediction, it is possible for the decoder to interpolate between states. For example, the ITU G.723.1 speech coder interpolates the state of the linear predictor coefficients on either side of short losses and uses either a periodic excitation the same as the previous frame, or a gain-matched random number generator, depending on
whether the signal was voiced or unvoiced. For longer losses, the reproduced signal is gradually faded. The advantage of codecs that can interpolate state, rather than re-coding the audio on either side of the loss, is that there are no boundary effects due to changing codecs, and the computational load remains approximately constant. However, it should be noted that codecs to which interpolation may be applied typically have high processing demands [19]. From the above description of the different recovery techniques, it is noted that all sender-based recovery techniques require more bandwidth than the original stream of speech packets and also add delay. Regeneration-based and insertion-based techniques do not guard against consecutive packet loss, and the quality of speech recovered using these methods degrades rapidly as the loss rate increases. The interpolation-based techniques, which are receiver-based, are codec independent and do not require any extra transmission bandwidth, so they will be adopted throughout the rest of the thesis.
Chapter 5
Enhanced Recovery Techniques
All the recovery techniques explored in chapter 4 suffer from shortcomings, such as the high complexity of DSPWR or the non-optimum recovery quality of WS, SI, and PWR. Two kinds of improvement may be made: reducing complexity or enhancing performance. These two directions have led to the development of the two recovery techniques discussed in this chapter.
5.1 Switching Recovery Technique (SRT)
As is clear from chapter 4, SS, PR, WS, SI, PWR, and DSPWR can all be used for recovering lost voice packets. Liao et al. measured their complexity in terms of execution time normalized to that of the SS recovery technique [22]; Table 5.1 summarizes these results. The source code for these recovery techniques was not available, so the different recovery techniques were implemented and the results in Table 5.1 were reproduced. The source code was built using MATLAB: a flow chart of each recovery technique was constructed and then converted to code.
Table 5.1: Complexity of recovery techniques normalized to SS
Technique     SS    PR      WS    SI      PWR     DSPWR
Complexity    1     1.5-2   4-6   10-15   25-50   450-600
The Switching Recovery Technique (SRT) is a recovery technique that makes use of all the other recovery techniques. The idea of SRT is that at low loss rates the performance of all the recovery techniques is approximately the same, so the recovery technique with the least complexity can be used with only a small loss in quality. As the loss rate
increases, the SRT switches to the recovery technique that improves the quality with the least complexity. In this way, the SRT complexity increases as the loss rate increases, because it switches to a more complex recovery technique to improve the quality. The SRT is thus a recovery technique that compromises between complexity and recovered voice quality. In designing the SRT, two parameters should be defined. The first parameter is the transient packet loss period (TPLP), which is used to capture the packet loss period distribution; this value is calculated as the most recent packet loss period (the period between the current lost packet and the previous one). The second parameter is the average of the packet loss periods (APLP), which corresponds to the packet loss rate; this value is calculated as the mean of the most recent five packet loss periods [22]. To find where and when a recovery technique must be switched, the performance of the recovery techniques used in the SRT has to be studied over a range of packet loss rates to determine the switching points. The block diagram of the SRT implementation is shown in figure 5.1. As seen, the depacketizer feeds the Transient Packet Loss Meter and Controller 2 with the sequence numbers of the incoming packets. If a sequence number is missing, Controller 2 turns SW2 to the SRT via control link a. Controller 1, at the same time, uses the TPLP measured by the Transient Packet Loss Meter to tune SW1 to the proper recovery technique according to its parameter settings. The control links b, c, d, e are simultaneously used by Controller 1 to turn on the selected recovery technique. The SRT complexity can be determined according to the measured loss rate profile and the required quality. Let the probability of a transient loss rate r_i be P(r_i), where

r_i = 1 / TPLP     (5-1)

and let C_SRT, C_repetition, C_interpolation, C_PWR and C_DSPWR be the complexities of the SRT, the packet repetition method, the interpolation method, the PWR method, and the DSPWR method respectively. Then [22]

C_SRT = P(r_i < r_1)·C_repetition + P(r_1 ≤ r_i < r_2)·C_interpolation + P(r_2 ≤ r_i < r_3)·C_PWR + P(r_i ≥ r_3)·C_DSPWR     (5-2)
59
Chapter 5: Enhanced Recovery Techniques
where r_1, r_2, and r_3 are the transient loss rates at which the SRT switches. The amount of computation saved, C_saved, by using the SRT rather than DSPWR alone can be determined by [22]:

C_saved = (C_DSPWR - C_repetition)·P(r_i < r_1) + (C_DSPWR - C_interpolation)·P(r_1 ≤ r_i < r_2) + (C_DSPWR - C_PWR)·P(r_2 ≤ r_i < r_3)     (5-3)

Note that the complexities of the controllers and the meter are so small that they are neglected.

Figure 5.1: SRT block diagram (the input packet voice stream passes through the playout buffer and depacketizer to the decoder; the Transient Packet Loss Meter and Controllers 1 and 2 drive switches SW1 and SW2 over control lines a-e to select among the WS, SI, PWR and DSPWR recovery techniques that form the SRT)
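Equations (5-2) and (5-3) can be evaluated directly from a measured set of transient loss rates, as in the following sketch (illustrative Python; the default per-technique complexities are mid-range values taken from Table 5.1, and the thresholds r1-r3 must come from the switching-point study described above).

```python
def srt_complexity(transient_rates, r1, r2, r3,
                   c_rep=1.75, c_interp=12.0, c_pwr=37.0, c_dspwr=500.0):
    """Evaluate C_SRT of equation (5-2) and C_saved of equation (5-3)
    from a list of measured transient loss rates r_i = 1/TPLP."""
    n = len(transient_rates)
    p1 = sum(r < r1 for r in transient_rates) / n            # P(r_i < r1)
    p2 = sum(r1 <= r < r2 for r in transient_rates) / n      # P(r1 <= r_i < r2)
    p3 = sum(r2 <= r < r3 for r in transient_rates) / n      # P(r2 <= r_i < r3)
    p4 = sum(r >= r3 for r in transient_rates) / n           # P(r_i >= r3)
    c_srt = p1 * c_rep + p2 * c_interp + p3 * c_pwr + p4 * c_dspwr
    c_saved = (c_dspwr - c_rep) * p1 + (c_dspwr - c_interp) * p2 + (c_dspwr - c_pwr) * p3
    return c_srt, c_saved
```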
5.2 Parallel Recovery Technique (PRT)
As noted in chapter 4 and section 5.1, the SRT and all other receiver-based recovery techniques try to improve the overall decoded voice quality by enhancing the waveform reconstructed during the period of loss. Most codecs used for data networks are LPC-based or differential codecs, which generate low bit-rate traffic. These codecs
build the current waveform using the currently received packet and the decoder history (u(n), as shown in chapter 3). It is noted in [26] that the loss also affects the waveform decoded after the loss, which causes excess quality degradation even when a sophisticated recovery technique is used. In order to minimize the error in the decoded waveform after the loss period, the history of the decoder must be set as close as possible to the history it would have had if no loss were encountered. This may be done by re-encoding the waveform recovered by the recovery technique to generate parameters which may resemble those enclosed in the lost packets; however, this greatly increases the recovery technique complexity. Another method is to use Packet Repetition (PR) to copy the parameters of the last received packet before the loss, so that it is decoded instead of decoding null packets. In this way, the decoder history is maintained as close to the correct history as possible with minimal increase in the recovery technique complexity. Using Packet Repetition in parallel with any other known receiver-based recovery technique is called the Parallel Recovery Technique (PRT). The PRT is implemented for testing as in the block diagram in figure 5.2.
Figure 5.2: Block diagram for the PRT (playout buffer, depacketizer, controller, switches SW1 and SW2, the Packet Repetition branch feeding the decoder and the receiver-based recovery branch producing the output waveform; control lines a and b)
When the depacketizer detects a missing sequence number, it triggers the controller to send the control signals a and b, which turn the switches SW1 and SW2 to position 2: the input to the decoder becomes a copy of the last received packet, while the output waveform is recovered from the waveform decoded before the loss using a conventional recovery technique. The enhanced recovery using the PRT will be called PRT-"recovery technique"; for example, the enhanced WS will be called PRT-WS.
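This control logic can be outlined with the following MATLAB sketch (illustrative only: decodeframe and recoverwaveform are placeholders for the codec decoder and for whichever conventional receiver-based technique, WS, SI, PWR or DSPWR, is plugged into the PRT):

function out = prtreceive(packets, received, decodeframe, recoverwaveform)
% packets  : cell array of packet payloads (empty entries for lost packets)
% received : logical vector, true where the sequence number was present
out = [];
lastpkt = [];
for k = 1:length(received)
    if received(k)
        lastpkt = packets{k};
        frame = decodeframe(lastpkt);    % normal decoding (switches in position 1)
    else
        % Switches in position 2: decode a copy of the last received packet
        % so that the decoder history stays close to the encoder's ...
        decodeframe(lastpkt);            % decoder output discarded
        % ... while the audible frame is produced by the conventional
        % recovery technique from the waveform decoded so far.
        frame = recoverwaveform(out);
    end
    out = [out; frame(:)];
end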
Chapter 6
Performance Evaluation and Results
To evaluate the performance of a recovery technique for a certain voice codec over a range of loss rates, a communication link between the sender (including the coder) and the receiver (including the decoder and the recovery technique) through a data network has to be established, either in reality or by simulating each component of the link. In this chapter, the evaluation of the recovery techniques for the different codecs is carried out through simulation.
6.1 Test components

6.1.1 Coder and decoder
The codecs used in this work are LD-CELP, the GSM FR coder (RPE-LTP), the GSM HR coder (VSELP) and FS CELP. These codecs represent different bit rates and different excitation methods. Table 6.1 summarizes their distinctive features [13, 14, 16].

Table 6.1: Distinctive features of the codecs used in the tests

Codec     Bit rate (kbps)  Excitation method          Parameters in transmitted frame                                          Other
LD-CELP   16               Stochastic codebook        Codebook index                                                           Backward adaptation
GSM FR    13               Regular pulse excitation   Pulse amplitudes, grid position, STP coefficients, LTP parameters        Forward adaptation
GSM HR    5.6              Two stochastic codebooks   1st and 2nd codebook indices and gains, LTP parameters, STP parameters   Forward adaptation
FS CELP   4.8              Stochastic codebook        Codebook index and gain, LTP parameters, STP parameters                  Forward adaptation
6.1.2 Recovery techniques
The SS, WS, SI, PWR, DSPWR, SRT and PRT techniques are tested.
6.1.3 Network
The network is simulated using the Gilbert model (two-state Markov model), because a loss generator based on it produces single-packet as well as burst losses depending on the loss rate. Using a Markov-model-based loss generator to test the different codecs with the different recovery techniques is reasonable, because the aim of a loss model is to reproduce the loss frequency as well as the loss distribution observed in the packet stream.

6.1.3.1 Loss generation using the Gilbert model
The following steps are used for generating and allocating the lost packets in the time series {xi}:
• Observe a stream of m packets, so the time series is {xi}, i = 1, ..., m.
• Set all xi = 0, meaning that no loss has occurred yet.
• From the given loss rate r, calculate n1, the number of 1's in the time series, as n1 = ⌈r·m⌉.
• Calculate the probabilities f(k) of all possible loss run lengths using (2-4) for k = 1, 2, 3, ... and a typical value of q.
• Knowing f(k) and n1, the number of bursts of length k is calculated as ⌈f(k)·n1⌉.
• Randomize the locations of the bursts using random numbers drawn from a uniform distribution, as sketched below.
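A minimal MATLAB sketch of these steps is given below. It assumes the geometric run-length distribution f(k) = (1−q)·q^(k−1) implied by the two-state model for equation (2-4); the function name and the truncation of negligible run lengths are choices made here for illustration:

function x = gilbertloss(m, r, q)
% m : number of packets observed, r : target loss rate,
% q : probability of staying in the loss state of the two-state model.
x  = zeros(1, m);                        % 0 = received, 1 = lost
n1 = ceil(r*m);                          % required number of lost packets
kmax = 1;
while (1-q)*q^(kmax-1) > 1e-3            % ignore very unlikely run lengths
    kmax = kmax + 1;
end
for k = 1:kmax
    fk = (1-q)*q^(k-1);                  % probability of a burst of length k
    nb = ceil(fk*n1);                    % number of bursts of length k
    for b = 1:nb
        if nnz(x) >= n1, break; end      % stop once the target loss rate is met
        pos = randi(m-k+1);              % uniformly random burst location
        x(pos:pos+k-1) = 1;
    end
end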
6.1.4 Perceptual quality evaluation
The perceived voice quality is measured using ITU-T P.862, Perceptual Evaluation of Speech Quality (PESQ), an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs [28]. The standard is implemented as software that models human auditory perception. Its scale ranges from -0.5 to 4.5, corresponding to bad and excellent quality respectively. A MOS-like (Mean Opinion Score) value on the familiar five-grade scale (e.g., excellent = 5 and bad = 1) can be obtained directly from the PESQ output, which makes the measure convenient to use in applications.
6.2 Simulation system
The block diagram of the simulation system used for testing the recovery techniques is shown in figure 6.1. The input speech is coded, packetized and passed through the lossless network combined with the loss model; at the receiver, the playout buffer, depacketizer, decoder and recovery technique produce the output speech, whose quality is measured against the input speech using PESQ.

Figure 6.1: Simulation system used for testing recovery techniques
6.3 Waveform repair for ordinary recovery techniques
To show what a given recovery technique does to the recovered waveform, the waveform is observed closely in this thesis. This is done by framing a PCM-coded voice signal, deleting one or two frames, applying a recovery technique and noting the repaired waveform; no loss model is needed for these steps. A minimal sketch of this experiment is given below.
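In the MATLAB sketch that follows, the file name speech.wav is a placeholder and the repair shown is the simple repetition of the previous frame; any of the techniques of chapter 4 could be applied to the deleted frame instead:

[x, fs] = audioread('speech.wav');       % PCM speech, e.g. sampled at 8 kHz
N = 160;                                 % frame length (20 ms at 8 kHz)
nframes = floor(length(x)/N);
frames = reshape(x(1:nframes*N), N, nframes);

lost = 12;                               % index of the frame to delete
damaged = frames;
damaged(:, lost) = 0;                    % the deleted frame
repaired = damaged;
repaired(:, lost) = damaged(:, lost-1);  % repair: repeat the previous frame

plot([frames(:), damaged(:), repaired(:)]);  % compare the three waveforms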
6.3.1 WS and SI
Figure 6.2 compares repair using Waveform Substitution (WS) and Sample Interpolation (SI).
Figure 6.2: Waveform Substitution and Sample Interpolation: (a) original speech segment, (b) a lost part of the speech segment, (c) lost part recovered with Waveform Substitution, (d) lost part recovered with Sample Interpolation (normalized amplitude vs. sample number)
It is worth mentioning that the literature found on Sample Interpolation does not suggest any implementation algorithm, so simple algorithm equations are derived here and then tuned experimentally for the best reconstructed speech quality. To realize the Sample Interpolation method, a way is needed to detect whether the waveform of the preceding packet is increasing or decreasing. One method is to measure its first and last peaks; another is to measure the RMS values of the first and second halves of the preceding packet. In general, the lost packet samples are built using the interpolation equation

x̃(n) = x(n)·fSI(n)                                                            (6-1)

where x̃(n) is the estimated packet, x(n) are the original samples used to build the lost packet, and fSI(n) is the interpolation function.

Computing a linear fSI(n) using peaks:
• Divide the packet samples into two sets, x+(n) and x−(n), where

x+(n) = x(n) for x(n) ≥ 0, and 0 for x(n) < 0
x−(n) = x(n) for x(n) ≤ 0, and 0 for x(n) > 0

and n = 0, ..., N−1, where N is the packet length.
• Find the value of the first positive peak, V+i, and the value of the last positive peak, V+f, of the preceding packet.
• Find the value of the first negative peak, V−i, and the value of the last negative peak, V−f, of the preceding packet.
• Calculate the slopes

m+ = (V+f − V+i)/N,   m− = (V−f − V−i)/N

• Then

fSI(n) = 1 + n·m+/V+f for x+(n), and fSI(n) = 1 + n·m−/V−f for x−(n)
The RMS method is analogous to the peaks method, with V+i and V+f replaced by the RMS values of the first and second halves of x+(n), and V−i and V−f replaced by the RMS values of the first and second halves of x−(n). A sketch of the peak-based variant is given below.
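The sketch is illustrative only: a simple local-extremum search stands in for the peak detection, and no safeguards are included for packets without clear peaks.

function xhat = sipeaks(xprev)
% xprev : samples of the preceding packet, used to build the lost packet.
xprev = xprev(:);
N = length(xprev);
n = (0:N-1)';
xp = max(xprev, 0);                      % positive part x+(n)
xn = min(xprev, 0);                      % negative part x-(n)
ppk = find(xp(2:end-1) > xp(1:end-2) & xp(2:end-1) >= xp(3:end)) + 1;
npk = find(xn(2:end-1) < xn(1:end-2) & xn(2:end-1) <= xn(3:end)) + 1;
Vpi = xp(ppk(1));  Vpf = xp(ppk(end));   % first and last positive peaks
Vni = xn(npk(1));  Vnf = xn(npk(end));   % first and last negative peaks
mp = (Vpf - Vpi)/N;                      % m+
mn = (Vnf - Vni)/N;                      % m-
fsi = ones(N, 1);
fsi(xprev >= 0) = 1 + n(xprev >= 0)*mp/Vpf;
fsi(xprev <  0) = 1 + n(xprev <  0)*mn/Vnf;
xhat = xprev .* fsi;                     % estimated lost packet, equation (6-1)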
6.3.2 PWR and DSPWR
Figure 6.3 shows how the PWR peak detector traces the positive and negative peaks using the algorithm described in chapter 4. A sketch of such a tracer is given after the figure.

Figure 6.3: Peaks of the voice segment as detected by the PWR peak detector (speech segment, found +ve and -ve peaks, and the +ve and -ve peak traces; normalized amplitude vs. sample number)
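The sketch below follows the decaying-peak approach visible in the pitchfind listing of appendix A.3 (the traces start at plus and minus the RMS of the segment and decay by a factor f when no new peak is found); it is illustrative only and not the thesis code itself.

function [ptrace, ntrace] = peaktrace(x)
f = 0.996;                               % decay factor, as in appendix A.3
L = length(x);
vrms = sqrt(sum(x.*x)/L);
ppeak = vrms;  npeak = -vrms;            % start the traces at +/- RMS
ptrace = zeros(1, L);  ntrace = zeros(1, L);
for i = 1:L
    if x(i) > ppeak, ppeak = x(i); else, ppeak = ppeak*f; end
    if x(i) < npeak, npeak = x(i); else, npeak = npeak*f; end
    ptrace(i) = ppeak;                   % +ve peak trace
    ntrace(i) = npeak;                   % -ve peak trace
end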
Waveforms recovered using PWR and DSPWR are shown in figures 6.4 and 6.5.
Figure 6.4: PWR: (a) original voice segment, (b) voice segment with a lost part, (c) voice segment with the lost part recovered using PWR (normalized amplitude vs. sample number)
Figure 6.5: DSPWR: (a) original voice segment, (b) voice segment with a lost part, (c) voice segment with the lost part recovered using DSPWR (normalized amplitude vs. sample number)
6.4 Packet Loss vs. coding algorithm
Figures 6.6 through 6.9 show the results obtained when testing the coders. Note that LD-CELP behaves differently from all the other coders: the highest recovery quality was obtained with PR for LD-CELP, but with DSPWR for all the other codecs. This is because of the backward adaptation in LD-CELP and because the output frames of its encoder contain only the codebook information, whereas for all the other coders a lost packet means the loss of all the information the decoder needs from that packet. Hence LD-CELP requires neither the SRT nor the PRT, because it performs well with PR, which is the least complex recovery technique after SS.
Figure 6.6: PESQ vs. loss rate for LD-CELP (curves for SS, PR, WS, SI, PWR and DSPWR)
Figure 6.7: PESQ vs. loss rate for the GSM FR coder (curves for SS, PR, WS, SI, PWR and DSPWR)
Figure 6.8: PESQ vs. loss rate for the GSM HR coder (curves for SS, PR, WS, SI, PWR and DSPWR)
Figure 6.9: PESQ vs. loss rate for the FS CELP (FS1016) coder (curves for SS, PR, WS, SI, PWR and DSPWR)
These results are summarized in table 6.2.

Table 6.2: PESQ vs. loss rate for the different codec types
(quality measured using PESQ at 0%, 2%, 10%, 18% and 20% loss rate)

Codec     Recovery Method   0%     2%     10%    18%    20%
LD-CELP   SS                3.74   2.91   1.62   1.34   1.30
          PR                3.74   3.40   2.71   2.25   2.14
          WS                3.74   3.05   1.90   1.50   1.40
          SI                3.74   3.04   1.96   1.55   1.51
          PWR               3.74   3.08   2.03   1.70   1.62
          DSPWR             3.74   3.10   2.14   2.00   1.89
GSM FR    SS                3.57   2.75   1.84   1.68   1.60
          PR                3.57   3.20   2.25   1.76   1.71
          WS                3.57   3.03   2.23   1.74   1.72
          SI                3.57   3.25   2.31   1.88   1.72
          PWR               3.57   3.26   2.33   1.90   1.88
          DSPWR             3.57   3.26   2.50   2.02   1.99
GSM HR    SS                3.56   2.75   1.78   1.61   1.61
          PR                3.56   3.12   2.20   1.68   1.62
          WS                3.56   3.01   2.30   1.71   1.70
          SI                3.56   3.20   2.31   1.72   1.72
          PWR               3.56   3.75   2.32   1.78   1.74
          DSPWR             3.56   3.75   2.44   2.01   2.00
FS CELP   SS                2.86   2.10   1.25   1.06   1.06
          PR                2.86   2.51   1.60   1.12   1.02
          WS                2.86   2.40   1.62   1.21   1.12
          SI                2.86   2.60   1.70   1.26   1.15
          PWR               2.86   2.61   1.72   1.27   1.17
          DSPWR             2.86   2.62   1.87   1.40   1.33
6.5 Switched Recovery Technique (SRT)
The Switched Recovery Technique described in section 5.1 makes use of the other recovery techniques: at low loss rates, where the performance of all the recovery techniques is approximately the same, it uses the technique with the least complexity at a small cost in quality, and as the loss rate increases it switches to more complex techniques to maintain the quality. The switching decision is driven by the transient packet loss period (TPLP) and the average packet loss period (APLP) defined in section 5.1 [22]. To find where and when the recovery technique must be switched, figures 6.6 through 6.9 are examined. From them it is deduced that:
• The DSPWR method gives the highest PESQ for all loss rates and all codecs except LD-CELP, but with the highest computational load and delay.
• The silence substitution technique gives the worst PESQ for all loss rates and all codecs, so it is not used in the SRT.
• Packet repetition and waveform substitution have nearly the same PESQ for all codecs except LD-CELP and for all loss rates, so packet repetition is used because of its close-to-zero complexity and because it is better than waveform substitution at small loss rates.
• The SRT also needs to know where to switch to the next recovery technique. For this, a parameter called the Quality Loss (QL) is defined as the difference between the PESQ for DSPWR and the PESQ for the recovery method currently chosen by the SRT, both measured at the same loss rate. From the results presented in section 6.4, QL should be kept below 0.3 for proper operation of the SRT.
To see how QL controls the SRT, let the codec be GSM FR and the loss rate be 2%, so that QL ≈ 0.1 and the SRT selects the repetition recovery method. As the loss rate increases, QL increases; when it approaches 0.3 the SRT switches to the interpolation recovery method, and so on. The SRT is therefore set to choose between the recovery methods as follows: for loss rates below 3% packet repetition is chosen, between 3% and 7% interpolation is chosen, between 7% and 10% PWR is used, and above 10% DSPWR is used; i.e., the SRT switches at the 3%, 7% and 10% loss rates.
The SRT complexity can be determined from the measured loss rate profile and the required PESQ. Let the probability of an instantaneous loss rate ri be P(ri), and let CSRT, Crepetition, Cinterpolation, CPWR and CDSPWR be the complexities of the SRT, the packet repetition method, the interpolation method, the PWR method and the DSPWR method respectively. Then

CSRT = P(ri < 3%)·Crepetition + P(3% ≤ ri < 7%)·Cinterpolation + P(7% ≤ ri < 10%)·CPWR + P(ri ≥ 10%)·CDSPWR          (6-2)

The amount of computation saved, Csaved, by using the SRT rather than DSPWR alone is

Csaved = (CDSPWR − Crepetition)·P(ri < 3%) + (CDSPWR − Cinterpolation)·P(3% ≤ ri < 7%) + (CDSPWR − CPWR)·P(7% ≤ ri < 10%)          (6-3)

A sketch of this selection rule is given below.
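The selection rule, together with the expected-complexity computation of equations (6-2) and (6-3), can be sketched in MATLAB as follows (illustrative only; the complexity figures C are placeholders to be filled with measured values):

function method = srtselect(ri)
% ri : transient loss rate in percent (ri = 100/TPLP).
if ri < 3
    method = 'PR';                       % packet repetition
elseif ri < 7
    method = 'SI';                       % sample interpolation
elseif ri < 10
    method = 'PWR';                      % pitch waveform replication
else
    method = 'DSPWR';                    % double-sided pitch waveform replication
end

% Expected complexity and saving for a measured loss-rate profile:
% p = [P(ri<3) P(3<=ri<7) P(7<=ri<10) P(ri>=10)];   % from the loss meter
% C = [Crepetition Cinterpolation Cpwr Cdspwr];     % per-technique complexities
% Csrt   = p*C';                                    % equation (6-2)
% Csaved = (C(4)-C(1:3))*p(1:3)';                   % equation (6-3)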
Figure 6.10 shows the performance of the SRT for the GSM FR (RPE-LTP), GSM HR (VSELP) and FS CELP coders.

Figure 6.10: SRT applied to GSM FR, GSM HR and FS CELP
Table 6.3 shows the SRT results for the GSM FR, GSM HR and FS CELP coders at 0%, 2%, 10%, 18% and 20% loss ratios.

Table 6.3: The SRT applied to GSM FR, GSM HR and FS CELP
(quality measured using PESQ)

Codec       Recovery Method   0%      2%      10%     18%     20%
GSM FR      SRT               3.570   3.740   2.501   2.082   2.020
GSM HR      SRT               3.560   3.750   2.50    2.080   2.000
FS CELP     SRT               2.860   2.561   1.880   1.600   1.602

Note that the PESQ for the SRT up to 3% loss is nearly the same as for the PR method, while for loss ratios between 3% and 7% it is near that of SI over the same range. The same observation holds for the range 7% to 10% compared with PWR, and for DSPWR at loss ratios greater than 10%.
6.6 Parallel Recovery Technique (PRT)
The PRT was implemented using the block diagram in figure 5.2 as the recovery technique in the simulation system of figure 6.1. The quality measured using the PRT-"recovery technique" variants and the ordinary recovery techniques is shown in figures 6.11 through 6.22 for the GSM FR, GSM HR and FS CELP coders.
Figure 6.11: Enhancement for the WS technique by PRT-WS technique for GSM FR
Figure 6.12: Enhancement for SI technique by PRT-SI technique for GSM FR coder
Figure 6.13: Enhancement for the PWR technique by the PRT-PWR technique for the GSM FR coder
Figure 6.14: Enhancement for the DSPWR technique by the PRT-DSPWR technique for GSM FR coder
Figure 6.15: Enhancement for WS technique by the PRT-WS technique for GSM HR coder
Figure 6.16: Enhancement for the SI technique by the PRT-SI technique for GSM HR coder
Figure 6.17: Enhancement for the PWR technique by the PRT-PWR for GSM HR coder
Figure 6.18: Enhancement for the DSPWR technique by the PRT-DSPWR for the GSM HR coder
Figure 6.19: Enhancement for the WS technique by the PRT-WS technique for FS CELP coder
Figure 6.20: Enhancement for the SI technique by the PRT-SI technique for FS CELP coder
Figure 6.21: Enhancement for the PWR technique by the PRT-PWR technique for FS CELP coder
Figure 6.22: Enhancement for the DSPWR technique by the PRT-DSPWR for FS CELP coder
Table 6.4 displays the PESQ enhancement by the PRT technique.

Table 6.4: PESQ enhancement by the PRT technique
(quality measured using PESQ at 0%, 2%, 10%, 18% and 20% loss rate)

Codec     Recovery Method   0%     2%     10%    18%    20%
GSM FR    WS                3.57   3.11   2.25   1.70   1.60
          PRT-WS            3.57   3.18   2.38   1.90   1.80
          SI                3.57   3.20   2.41   1.85   1.80
          PRT-SI            3.57   3.25   2.30   2.05   2.00
          PWR               3.57   3.26   2.40   1.90   1.85
          PRT-PWR           3.57   3.30   2.49   2.22   2.20
          DSPWR             3.57   3.30   2.46   2.05   2.00
          PRT-DSPWR         3.57   3.35   2.58   2.26   2.20
GSM HR    WS                3.56   3.10   2.08   1.50   1.42
          PRT-WS            3.56   3.14   2.20   1.72   1.50
          SI                3.56   3.10   2.17   1.60   1.56
          PRT-SI            3.56   3.18   2.26   1.79   1.77
          PWR               3.56   3.20   2.23   1.62   1.60
          PRT-PWR           3.56   3.29   2.30   1.89   1.75
          DSPWR             3.56   3.20   2.25   1.77   1.66
          PRT-DSPWR         3.56   3.29   2.40   1.99   1.90
FS CELP   WS                2.86   2.51   1.60   1.21   1.07
          PRT-WS            2.86   2.60   1.70   1.26   1.23
          SI                2.86   2.55   1.70   1.23   1.20
          PRT-SI            2.86   2.63   1.80   1.41   1.39
          PWR               2.86   2.62   1.75   1.25   1.21
          PRT-PWR           2.86   2.70   1.85   1.47   1.48
          DSPWR             2.86   2.61   1.84   1.40   1.36
          PRT-DSPWR         2.86   2.70   1.94   1.63   1.60
As seen from the results, there is a noticeable improvement in the quality of the repaired stream, and the enhancement becomes more pronounced as the loss rate increases. This can be explained as follows: at low loss rates the losses tend to be single packets, so the decoder history does not deviate much from the encoder's and the improvement gained from the packet repetition branch of the PRT is small. At high loss rates the losses occur in multi-packet bursts as well as single packets, which drives the decoder history away from the encoder's; the packet repetition branch of the PRT then reduces this deviation, which greatly improves the quality as the loss rate increases. The results suggest that for loss rates greater than 20% the improvement would become even more noticeable, so the PRT is recommended when high loss rates are likely. Comparing the PRT with the Switched Recovery Technique (SRT): the average complexity of the PRT is not reduced but slightly increased, by the amount added by running the PR technique at the same time as the other recovery technique. Another difference is that the PRT uses the desired recovery technique at all loss rates, while in the SRT the recovery technique in use depends on the instantaneous loss rate, i.e. the inverse of the time between two consecutive losses.
Chapter 7
Conclusions
7.1 Conclusions
Data networks always suffer from problems such as packet loss, delay and delay jitter. When a real-time application such as voice is packetized, it suffers from the same problems as data; in addition, delay can cause packet loss if it exceeds the limit the application can tolerate. Loss in data networks can be modeled using random models, burst models or the two-state Markov model. For bit rates below 16 kbps, waveform coders fail and analysis-by-synthesis (AbS) coding must be used to reduce the bit rate of a voice coder.
Packet loss recovery techniques are categorized into sender-based and receiver-based techniques. Receiver-based techniques are suitable for recovering lost voice packets without loading the network with traffic beyond the voice packets themselves. On studying the performance of the considered codecs with all the recovery techniques, DSPWR gave the best performance but with the highest complexity, PWR performed below DSPWR, and SI performed below PWR. LD-CELP behaved differently, as PR gave the best performance with the least complexity. The SRT can compromise between quality and complexity and reduced the average complexity compared with using DSPWR alone.
All the other recovery techniques deal only with the loss periods; however, loss also causes degradation in the after-loss recovered voice because the decoder deviates from the correct history. The PRT uses PR together with any recovery technique to improve performance by reducing the quality degradation due to decoder mistracking, which drives its history away from the encoder's. The PRT enhanced the performance of the different recovery techniques for the
GSM HR, GSM FR and FS CELP coders, and the amount of enhancement increases as the loss rate increases.
7.2 Future Work
As an extension of this work, the codecs and recovery techniques could be studied over other network protocols, such as Frame Relay based networks. Considering other media types such as video, the performance of receiver-based techniques could likewise be evaluated. A further step following the simulation results of this thesis would be the implementation of the SRT and PRT in real Voice over Packet applications to improve their performance.
Appendix A
Source Code
The code in this appendix is written using MATLAB r13
A.1 Sample Interpolation (SI)

function xout=sampleinterpolation(xin)
% Estimate a lost packet from the preceding packet xin using the RMS-based
% linear interpolation function of section 6.3.1.
% NOTE: parts of this listing were garbled in the source; the statements for
% the negative part are reconstructed from the RMS method of section 6.3.1.
L=length(xin);
xinp=zeros(1,L); xinn=zeros(1,L);
for i=1:L
    if xin(i)>=0
        xinp(i)=xin(i);                       % positive part x+(n)
    else
        xinn(i)=xin(i);                       % negative part x-(n)
    end
end
h=floor(L/2);
vrmsp1=sqrt(sum(xinp(1:h).^2)/h);             % RMS of first half of x+(n)
vrmsp2=sqrt(sum(xinp(h+1:L).^2)/(L-h));       % RMS of second half of x+(n)
vrmsn1=sqrt(sum(xinn(1:h).^2)/h);             % RMS of first half of x-(n)
vrmsn2=sqrt(sum(xinn(h+1:L).^2)/(L-h));       % RMS of second half of x-(n)
mp=(vrmsp2-vrmsp1)/L;                         % slope of the positive envelope
mn=(vrmsn2-vrmsn1)/L;                         % slope of the negative envelope
xout=zeros(1,L);
for i=1:L
    if xin(i)>=0
        xout(i)=xin(i)*(1+i*mp/vrmsp2);
    else
        xout(i)=xin(i)*(1+i*mn/vrmsn2);
    end
end
return

A.2 Pitch Waveform Replication (PWR)

% Only the final part of this listing was recovered; the section heading,
% the packet handling and the call to pitchfind that precede it are missing.
% The recovered part rebuilds the lost packet pcktout by replicating the
% detected pitch cycle and then smooths the junction with the preceding
% packet pcktin.
if n>(length(pitch)-ppopitch)                 % left-hand side and operator assumed; garbled in the source
    np=floor(n/length(pitch));
    for i=1:np
        pckttmp=[pckttmp(:);pitch(:)];
    end
    ptmp=pitch(1:(n-length(pckttmp)));
    pcktout=[pckttmp;ptmp(:)];
else
    ptmp1=pitch((n-lppeak+1):length(pitch));
    %ptmp1=pitch((n-lppeak+1):length(pitch))
    n=n-length(ptmp1);
    np=floor(n/length(pitch));
    for i=1:np
        pckttmp=[pckttmp(:);pitch(:)];
    end
    ptmp2=pitch(1:(n-np*length(pitch)));
    %ptmp2=pitch(1:(n-np*length(pitch)-1));
    pcktout=[ptmp1(:);pckttmp;ptmp2(:)];
end
end                                           % closes an enclosing block whose opening was lost in the source
% Smooth the junction between the preceding packet and the rebuilt one:
x1=pcktin(length(pcktin));
x2=pcktout(3);
y=interpolate(x1,x2,2);
pcktout(1)=y(1);
pcktout(2)=y(2);
return

function y=interpolate(x1,x2,n)
% Linear interpolation of n samples between x1 and x2.
p1=polyfit([1,2+n],[x1,x2],1);
y1=polyval(p1,[1:2+n]);
y=y1(2:n+1);
return
A.3 Pitch Waveform Replication subroutine, "pitchfind.m"

% y=pitchfind(x,fb,m) searches the speech segment x for a pitch period
% satisfying the conditions fb, m.
% y is a structure variable with the following fields:
%   y.foundpitch  : the found pitch cycle
%   y.ppeaktracer : the +ve peak tracer
%   y.npraatracer : the -ve peak tracer
%   y.ppeaks      : the found +ve peak positions
%   y.npeaks      : the found -ve peak positions
%   y.state       : the status of the function (1 if a pitch cycle was found,
%                   0 if the search failed)
%   y.pp          : the peak positions of all found pitch cycles
% All fields but y.state return the empty matrix [] on failure.
% fb : must take one of the two values 'f' or 'b', where 'f' means forward
%      search (from the packet beginning to its end) and 'b' means backward search.
% m  : takes one of three values, 1, 2 or 3:
%      1 : the pitch period near the beginning of the segment
%      2 : the pitch period near the segment end
%      3 : the shortest pitch period in the segment
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function y=pitchfind(x,fb,m)
%%%%%%%%%% Variables in use %%%%%%%%%%
% L      : length of speech packet
% f      : decreasing (decay) factor
% ppeak  : +ve peak
% npeak  : -ve peak
% ppkp   : +ve peak positions
% npkp   : -ve peak positions
% vrms   : RMS of the segment
% peaks  : the set of peaks of the pitch cycles found
% ic     : number of columns in peaks
% vcut   : start and stop desired amplitude for the pitch period
% pcycle : the selected cycle
%==============================================================
% test variables:
%~~~~~~~~~~~~~~~~
%fb='f'; %m=3; %x=s1;
pk1=[];
pk2=[];
% Finding +ve and -ve peaks:
%---------------------------
clc;
format long;
L=length(x);
f=0.996;
ppkp=[];
npkp=[];
vrms=sqrt(sum(x.*x)/L);
% From segment start to end:
%~~~~~~~~~~~~~~~~~~~~~~~~~~~
if fb=='f'
    ppeak=vrms;
    npeak=-vrms;
    for i=2:L
        if x(i)>ppeak
            ppeak=x(i);
        else
            ppeak=ppeak*f;
        end
        % (the remainder of this loop, the recording of the peak positions and
        %  the backward-search branch were garbled in the source; the recovered
        %  fragment below applies the same decay rule to the peak tracker)
        if x(i)>ppeak
            ppeak=x(i);
        else
            ppeak=ppeak*f;
        end
if ppeakppkp(i) & npkp(j)