Tanta University Faculty of Engineering Department of Electronics and Electrical Communication Engineering
Packet Voice Transmission over Data Networks
A Thesis Submitted for the Degree of Master of Science in
Electronics and Electrical Communication Engineering By
Eng. Sameh Atef Napoleon Anis
Engineer at the Department of Electronics and Electrical Communication Engineering, Faculty of Engineering, Tanta University
Supervised by
Prof. Dr. Mohamed El-Said Nasr
Head of the Department of Electronics and Electrical Communication Engineering, Faculty of Engineering - Tanta University
Assoc. Prof. Salah Eldein Khamis
Associate Professor of Electrical Engineering, Department of Electronics and Electrical Communication Engineering, Faculty of Engineering - Tanta University
Tanta 2006
Abstract
Nowadays, voice transmission over data networks has become widespread. However, data networks always suffer from packet loss, delay, and delay jitter, which greatly affect the perceived voice quality. Delay jitter can be cured by using playout buffers, but delay can cause further loss of voice packets. Many techniques have been developed to recover lost voice packets; they are broadly divided into sender-based and receiver-based recovery techniques. Sender-based techniques require more transmission bandwidth and add delay, which is undesirable. Receiver-based techniques depend only on the received packets and on voice properties, which makes them attractive because they avoid the problems of sender-based recovery. Examples are Silence Substitution, Packet Repetition, Waveform Substitution, Sample Interpolation, Pitch Waveform Replication, and Double-Sided Pitch Waveform Replication. It is noted that the recovery quality increases as the complexity of the recovery technique increases. Two enhanced recovery techniques are developed in this thesis. The Switched Recovery Technique (SRT) is a compromise between complexity and quality: it selects a recovery technique according to an instantaneous measure of the loss rate. The other new technique is the Parallel Recovery Technique (PRT), which takes into account the effect of loss on the decoded voice after the loss by copying the last received packet to the input of the decoder, while the last decoded waveform before the loss is used to recover the voice waveform during the loss. In this way the decoder history is kept as close as possible to that at the coder, which minimizes the impact of loss. The proposed techniques were tested with different codecs, namely LD-CELP, GSM FR, GSM HR, and FS CELP. The loss distributions were generated using the two-state Markov model, and the codec performance was measured with the ITU-T P.862 (PESQ) standard. The SRT was found to reduce the average complexity while giving reasonable recovery quality, while the PRT led to improved quality. The LD-CELP coder has a special performance and needs neither SRT nor PRT.
Acknowledgement
I would like to express my deepest gratitude to both my supervisors, Prof. Dr. Mohamed El-Said Nasr, Head of the Department of Electronics and Electrical Communication Engineering, Faculty of Engineering, Tanta University, and Assoc. Prof. Salah Eldein Khamis, Associate Professor of Electrical Engineering, Department of Electronics and Electrical Communication Engineering, Faculty of Engineering, Tanta University, for their great support and constructive guidance. I would also like to thank my family and friends for their encouragement and motivation.
Publications
[1] Mohamed E. Nasr, Sameh A. Napoleon, "On Improving Voice Quality Degraded by Packet Loss in Data Networks," The 22nd National Radio Science Conference (NRSC'2005), March 15-17, 2005. Also in: IEEE AFRICON 2004, September 15-17, 2004.
[2] Salah Khamis, Sameh A. Napoleon, "Enhanced Recovery Technique for Improving Voice Quality Degraded by Packet Loss in Data Networks," submitted for publication in the Alexandria Engineering Journal.
Contents
Abstract ................................................................................................................................i Acknowledgement...............................................................................................................ii Publications ....................................................................................................................... iii Contents..............................................................................................................................iv List of Abbreviations....................................................................................................... vii List of Figures .....................................................................................................................x List of Tables................................................................................................................... xiii List of Symbols.................................................................................................................xiv Chapter 1 Introduction.....................................................................................................1 Chapter 2 Data Networks Overview................................................................................5 2.1 IP Networks ................................................................................................................5 2.1.1 Protocol Architecture...........................................................................................5 2.1.2 Real-Time Transport Protocol (RTP)..................................................................6 2.2 ATM Networks...........................................................................................................8 2.2.1 ATM protocol architecture ..................................................................................9 2.2.2 ATM Adaptation Layer (AAL) .........................................................................10 2.3 Frame Relay..............................................................................................................11 2.3.1 Control Plane .....................................................................................................12 2.3.2 User Plane..........................................................................................................12 2.3.3 User data transfer...............................................................................................13 2.4 Modeling network packet loss..................................................................................15 2.4.1 Modeling loss in IP networks ............................................................................15 2.4.2 Modeling loss in ATM networks.......................................................................17 iv
Chapter 3 Voice Compression Techniques ...................................................................20 3.1 Analysis by Synthesis (AbS) ....................................................................................22 3.1.1 The Short-term Predictor...................................................................................24 3.1.2 Autocorrelation Method ....................................................................................25 3.1.3 Long-Term Predictor (LTP) ..............................................................................26 3.2 Standard Voice Compression Techniques................................................................28 3.2.1 GSM Full Rate (RPE-LTP) coder .....................................................................28 3.2.2 Federal Standard CELP (FS1016) coder ...........................................................29 3.2.3 GSM Half Rate coder (GSM HR) .....................................................................30 3.2.4 LD-CELP...........................................................................................................31 Chapter 4 Packet Loss and Recovery Techniques .......................................................34 4.1 Packet loss in data networks.....................................................................................34 4.2 Packet loss recovery techniques ...............................................................................35 4.3 Sender-Based Packet Loss Recovery .......................................................................36 4.3.1 Automatic Repeat reQuest (ARQ) ....................................................................36 4.3.2 Media-independent FEC....................................................................................37 4.3.3 Media-specific FEC...........................................................................................38 4.3.4 Interleaving........................................................................................................40 4.4 Receiver-based Recovery Techniques......................................................................42 4.4.1 Insertion .............................................................................................................42 4.4.2 Interpolation ......................................................................................................44 4.4.3 Regeneration Based Recovery...........................................................................56 4.4.4 Model-Based Recovery .....................................................................................56 Chapter 5 Enhanced Recovery Techniques ..................................................................58 5.1 Switching Recovery Technique (SRT).....................................................................58 5.2 Parallel Recovery Technique (PRT).........................................................................60 Chapter 6 Performance Evaluation and Results..........................................................63 6.1 Test components.......................................................................................................63 v
6.1.1 Coder and decoder.............................................................................................63 6.1.2 Recovery techniques..........................................................................................64 6.1.3 Network .............................................................................................................64 6.1.4 Perceptual quality evaluation ............................................................................64 6.2 Simulation system ....................................................................................................65 6.3 Waveform repair for ordinary recovery techniques .................................................65 6.3.1 WS and SI..........................................................................................................66 6.3.2 PWR and DSPWR.............................................................................................68 6.4 Packet Loss vs. coding algorithm.............................................................................71 6.5 Switching Recovery Technique................................................................................74 6.6 Parallel Recovery Technique (PRT).........................................................................77 Chapter 7 Conclusions ....................................................................................................85 7.1 Conclusions ..............................................................................................................85 7.2 Future Work..............................................................................................................86 Appendix A Source Code ...............................................................................................87 References........................................................................................................................108 اﻟﻤﻠﺨﺺ اﻟﻌﺮﺑﻰ........................................................................................................................111
List of Abbreviations
AAL
ATM adaptation layer
AbS
Analysis by Synthesis
APLP
average of the packet loss periods
ARQ
Automatic Repeat reQuest
ATM
Asynchronous Transfer Mode
BU
Both packets Unvoiced
BV
Both Voiced
BWAA
Backward Amplitude Adjustment
CBR
Constant Bit Rate
CELP
Code Excited Linear Prediction
CLP
Conditional Loss Probability
DSPWR
Double Sided Pitch Waveform Replication
FEC
Forward Error Control
FP
Following Packet
FS CELP
Federal Standard Code Excited Linear Prediction
FV
Following packet Voiced
FWAA
Forward Amplitude Adjustment
GSM HR
GSM Half Rate
IETF
Internet Engineering Task Force
IP
Internet Protocols
LAN
Local Area Networks
LARs
Log-Area Ratios
LD-CELP
Low-Delay Code Excited Linear prediction
LP
Linear Prediction
LPC10
Linear Predictive Coder (Federal Standard 1015)
LSB
Least Significant Bits
LSFs
Line Spectrum Frequencies
LTP
Long-Term Predictor
MAN
Metropolitan Area Networks
MSB
Most Significant Bits
NS
Noise Substitution
PCM
Pulse Code Modulation
PESQ
Perceptual Evaluation of Speech Quality
PMP
Phase Matching using Pitch difference
PP
Preceding Packet
PR
Packet Repetition
PRT
Parallel Recovery Technique
PSA
Pitch Segment Adjustment
PV
Preceding packet Voiced
PWR
Pitch Waveform Replication
QL
Quality Loss
QoS
Quality-of-Service
RPE-LTP
Regular Pulse Excitation with Long-Term Prediction
RTCP
RTP Control Protocol
RTP
Real-time Transport Protocol
RUDP
Reliable User Data Protocol
SI
Sample Interpolation
SIP
Session Initiation Protocol
SMDS
Switched Multimegabit Data Service
SRT
Switched Recovery Technique
SS
Silence Substitution
TCP
Transmission Control Protocol
TPLP
transient packet loss period
UDP
User Datagram Protocol
ULP
Uneven Level Protection
VBR
Variable Bit Rate
VoIP
Voice over the IP network
VoP
Voice over Packet
VSELP
Vector Sum Excited Linear Prediction
WAN
Wide Area Networks
WS
Waveform Substitution
XOR
eXclusive OR
List of Figures
Figure 2.1: VoIP protocol architecture.................................................................................6 Figure 2.2: Header for IP packet carrying real-time application..........................................7 Figure 2.3: ATM reference model........................................................................................9 Figure 2.4: User-Network interface protocol architecture .................................................12 Figure 2.5: LAPF core format ............................................................................................14 Figure 2.6: Gilbert Model...................................................................................................16 Figure 2.7: Statistical multiplexing of voice sources. ........................................................18 Figure 3.1: Basic model for Analysis by Synthesis............................................................22 Figure 3.2: Coder and Decoder for the GSM FR ...............................................................29 Figure 3.3: The FS CELP coder block diagram .................................................................30 Figure 3.4: GSM HR coder block diagram ........................................................................31 Figure 3.5: LD-CELP coder ...............................................................................................32 Figure 4.1: Recovery Techniques.......................................................................................36 Figure 4.2: Media-independent FEC ..................................................................................38 Figure 4.3: Position of redundant data in case of, (a) low and medium network loads, and (b) heavy network load. ......................................................................................................39 Figure 4.4: Interleaving of units across multiple units.......................................................41 Figure 4.5: Positive peak detector algorithm flowchart .....................................................46 Figure 4.7: Procedure PSA .................................................................................................50 Figure 4.8: Illustration of procedure PMP..........................................................................52 Figure 4.9: Illustration of the procedure FWAA and BWAA, where the dotted lines represent the waveforms after reconstruction.....................................................................53 Figure 5.1: SRT block diagram
..........................................................................60
Figure 5.2: Block diagram for PRT
..........................................................................62
Figure 6.1: Simulation system used for testing recovery techniques .................................65 Figure 6.2: Waveform Substitution and Sample Interpolation...........................................66 Figure 6.3: Peaks of the voice segment as detected by the PWR peak detector ................68 Figure 6.4: PWR, (a) Original voice segment, (b) Voice segment with lost part, (c) Voice segment with lost part recovewred using PWR..................................................................69 Figure 6.5: DSPWR, (a) Original voice segment, (b) Voice segment with lost part, (c) Voice segment with lost part recovered using DSPWR.....................................................70 Figure 6.6: PESQ vs. loss rate for LD-CELP.....................................................................71 Figure 6.7: PESQ vs. loss rate for the GSM FR coder.......................................................72 Figure 6.8: PESQ vs. loss rate for the GSM HR coder ......................................................72 Figure 6.9: PESQ vs. loss rate for the FS CELP (FS1016) coder ......................................73 Figure 6.10: SRT applied for GSM FR, GSM HR, FS CELP............................................76 Figure 6.11: Enhancement for the WS technique by PRT-WS technique for GSM FR ...77 Figure 6.12: Enhancement for SI technique by PRT-SI technique for GSM FR coder....78 Figure 6.13: Enhancement for the PWR technique by the PRT-PWR technique for the GSM FR coder....................................................................................................................78 Figure 6.14: Enhancement for the DSPWR technique by the PRT-DSPWR technique for GSM FR coder....................................................................................................................79 Figure 6.15: Enhancement for WS technique by the PRT-WS technique for GSM HR coder ...................................................................................................................................79 Figure 6.16: Enhancement for the SI technique by the PRT-SI technique for GSM HR coder ...................................................................................................................................80 Figure 6.17: Enhancement for the PWR technique by the PRT-PWR for GSM HR coder ............................................................................................................................................80 Figure 6.18: Enhancement for the DSPWR technique by the PRT-DSPWR for the GSM HR coder.............................................................................................................................81
Figure 6.19: Enhancement for the WS technique by the PRT-WS technique for FS CELP coder ...................................................................................................................................81 Figure 6.20: Enhancement for the SI technique by the PRT-SI technique for FS CELP coder ...................................................................................................................................82 Figure 6.21: Enhancement for the PWR technique by the PRT-PWR technique for FS CELP coder.........................................................................................................................82 Figure 6.22: Enhancement for the DSPWR technique by the PRT-DSPWR for FS CELP coder ...................................................................................................................................83
List of Tables
Table 2.1: The functions of each layer for the ATM reference model...............................10 Table 5.1: Complexity of recovery techniques normalized to SS......................................58 Table 6.1: Distinct features of codecs used in test ............................................................63 Table 6.2: PESQ vs. loss rate for the different codec types .............................................73 Table 6.3: The SRT applied for GSM FR, GSM HR, FS CELP........................................76 Table 6.4: PESQ enhancement by the PRT technique .......................................................83
List of Symbols
ak : Short-term predictor coefficients
C : Link capacity in bps
E : Short-term average prediction error
e(j) : Prediction error
ew(j) : Perceptually weighted prediction error
f(k) : Probability of a loss run of length k
Gk : Long-term predictor gain
H(z) : The all-pole digital filter
L : Update frame length for the short-term predictor
La : Analysis frame length for the short-term predictor
LARi : Log area ratios
M : Total number of voice sources
Ma : Number of active voice sources
Ms : Number of voice sources that saturates the link
N : Analysis frame length for the long-term predictor
n : Number of elements in the binary time series {x_i}, i = 1, 2, ...
n0 : Number of zeros in the time series {x_i}
n1 : Number of ones in the time series {x_i}
n01 : Number of times that a 1 follows a 0 in {x_i}
n10 : Number of times that a 0 follows a 1 in {x_i}
p : Transition probability from 0 to 1 in the two-state Markov model
p̂ : Estimated value of p from {x_i}
Pl : Long-term predictor
Ps : Short-term predictor
q : Transition probability from 1 to 0 in the two-state Markov model
q̂ : Estimated value of q from {x_i}
r : Loss rate (or ratio)
r̂ : Estimated loss rate (or ratio)
R(j) : Autocorrelation coefficients
s(j) : Speech sample
s̃(j) : Predicted speech sample
u(j) : Excitation signal after the long-term predictor
v(j) : Excitation signal before the long-term predictor
w(j) : Window function
xi : Element of the binary time series {x_i}; takes the value 1 or 0
α : Long-term predictor delay
β : Short-term predictor order
π0 : The state-0 probability in the two-state Markov model
π1 : The state-1 probability in the two-state Markov model
ρ : Residual after the short-term predictor
Chapter 1
Introduction
Voice over packet data networks means transmitting voice over packet-switching data networks by converting it into packets, while keeping reliability and voice quality comparable to circuit-switched telephone networks and gaining cost savings [1]. In fact, the convergence of voice and data networks is rapidly taking place across the globe. Data networks can be classified according to their transfer protocols as IP (Internet Protocol) networks, ATM (Asynchronous Transfer Mode) networks, FR (Frame Relay) networks [2, 3], etc. The IP protocol is the most widely used protocol for transporting data, and ATM was designed to carry multimedia as well as data; hence, IP and ATM networks are adopted in this thesis. Packet-switched networks can also be classified by their coverage area into Wide Area Networks (WAN), Metropolitan Area Networks (MAN), and Local Area Networks (LAN) [4]. Packet-switched networks always suffer from packet loss, delivery delay, and delay jitter [1]. Packet loss occurs because packets are discarded during congestion periods, as in IP networks, or on buffer overflow, as in ATM switches [5, 6, 7], or because packets are dropped at the gateway or terminal due to late arrival or misdelivery caused by errors in the packet header. Delay is the consequence of long routes and queuing in routing nodes. Delay jitter is the variation in delay between received packets; it results from each packet taking a different route and hence experiencing different queuing delays. For multimedia applications such as voice, delay can cause packet loss due to discarding, so the first two impairments can be merged into packet loss. Delay jitter can be cured using a playout buffer, which adds delay to early-arriving packets to regularize the packet arrival times seen by the voice decoder. The impact of packet loss on perceived voice quality depends on several factors, including the loss pattern, the codec
type, and the packet size. Packet losses can be modeled using the Bernoulli model and the Gilbert model (two-state Markov model) for IP networks [8, 9, 10], and the random model or the burst model for ATM [11]. The growth in demand for voice over packet networks leads to excess demand for bandwidth, so using low bit-rate voice coders became inevitable [12]. Various standardized coders can be used, such as Low-Delay Code Excited Linear Prediction, LD-CELP (ITU-T G.728), at 16 kbps; Regular Pulse Excitation with Long-Term Prediction, RPE-LTP (GSM full-rate standard), at 13 kbps [13, 14, 15]; Vector Sum Excited Linear Prediction, VSELP (GSM half-rate standard), at 5.6 kbps; and Code Excited Linear Prediction, CELP FS1016 (Federal Standard), at 4.8 kbps [16]. For data packets and non-real-time applications, there is always time to repair the loss exactly via retransmissions or Forward Error Control (FEC). This is not applicable to real-time applications such as voice and telephony over packet networks or video conferencing, because delay in packet delivery causes packet discarding at the receiver end [18]. Hence the need for real-time recovery methods, which always rely on the characteristics of the voice signal. Recovery techniques used with packet voice can be categorized into two main categories: sender-based recovery and receiver-based recovery. In sender-based recovery, the sender is responsible for the recovery; it requires more bandwidth than the original packet voice stream and adds more network delay. In receiver-based recovery, the recovery process is initiated by the receiver and requires no extra transmission bandwidth, which is why the methods in this category are used for recovery in this thesis. This category includes Silence Substitution (SS), Packet Repetition (PR), Waveform Substitution (WS), Sample Interpolation (SI) [19, 20, 21, 22], Pitch Waveform Replication (PWR), and Double-Sided Pitch Waveform Replication (DSPWR), listed in ascending order of computational load (i.e., complexity) [22]. This thesis aims to enhance known receiver-based recovery techniques and to develop new receiver-based techniques. To enhance the recovery techniques, a performance
evaluation of the recovery techniques mentioned above is carried out with the proposed voice coders. The performance is measured using the ITU-T P.862 Perceptual Evaluation of Speech Quality (PESQ). It is found that the best recovery quality is obtained with DSPWR, then PWR, then SI, while WS and PR give nearly the same performance for all coders except LD-CELP, which achieves its best recovery performance with PR. SS is the worst recovery method for all coders. It is noted that as the complexity of the recovery technique increases, the recovered voice quality increases, with the exception of LD-CELP. A recovery technique can be enhanced in two ways: by reducing its complexity or by increasing its performance. The Switched Recovery Technique (SRT) was developed to compromise between complexity and quality. At low loss rates, all receiver-based recovery methods tend to improve the recovered voice quality by nearly the same amount, so it is logical to choose the least complex method. As the loss rate increases, the SRT switches to the recovery technique that improves the quality of the perceived voice with the lowest possible complexity. The average complexity of the SRT is found to be:

C_SRT = P(r_i < 3)·C_repetition + P(3 ≤ r_i < 7)·C_interpolation + P(7 ≤ r_i < 10)·C_PWR + P(r_i ≥ 10)·C_DSPWR

where r_i is the instantaneous loss rate with probability P(r_i), and C_SRT, C_repetition, C_interpolation, C_PWR, and C_DSPWR are the complexities of the SRT, the packet repetition method, the interpolation method, the PWR method, and the DSPWR method, respectively. The numerical values appearing in the equation are the loss rates at which the SRT switches; they are determined as described in chapter 5. Clearly, C_SRT is less than C_DSPWR. Hence voice packets are recovered over a range of loss rates with a reduced average complexity compared with using DSPWR alone, which gives the best recovery performance but at the highest complexity.
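To illustrate the switching rule and the averaging above, a minimal Python sketch is given below. The loss-rate thresholds (3, 7 and 10 percent) follow the formula; the normalized complexity weights and the probabilities of each loss-rate band are hypothetical placeholder values, not measurements from this thesis.

```python
# Sketch of the SRT switching rule and its average complexity.
# Thresholds follow the formula above; the complexity weights and the
# loss-rate band probabilities below are illustrative placeholders only.

def srt_choose(loss_rate_percent):
    """Pick the least complex recovery method adequate for the current loss rate."""
    if loss_rate_percent < 3:
        return "repetition"
    elif loss_rate_percent < 7:
        return "interpolation"
    elif loss_rate_percent < 10:
        return "PWR"
    return "DSPWR"

def srt_average_complexity(prob, complexity):
    """C_SRT = sum over methods of P(method selected) * C(method)."""
    return sum(prob[m] * complexity[m] for m in prob)

if __name__ == "__main__":
    # Hypothetical complexities normalized to Silence Substitution = 1.
    complexity = {"repetition": 2.0, "interpolation": 15.0, "PWR": 40.0, "DSPWR": 80.0}
    # Hypothetical probabilities of the instantaneous loss rate falling in each band.
    prob = {"repetition": 0.55, "interpolation": 0.25, "PWR": 0.12, "DSPWR": 0.08}
    print(srt_choose(5))                             # -> interpolation
    print(srt_average_complexity(prob, complexity))  # always <= complexity["DSPWR"]
```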
Quality enhancement is gained through the Parallel Recovery Technique (PRT), in which two recovery techniques run in parallel, one of them being the Packet Repetition method. LPC-based (differential) coders build the current waveform from the currently received packet and the decoder history, so a loss also affects the after-loss waveform, causing extra quality degradation even when a sophisticated recovery technique is used. The PRT is developed to enhance speech quality by repairing the loss location while also minimizing the effect of the loss on the after-loss stream. This is done by combining PR with any other recovery technique. Three voice compression methods are used to test the PRT: the GSM Full Rate, GSM Half Rate, and FS CELP (FS1016) coders. The results show an improvement in perceived voice quality with the addition of negligible complexity. The thesis is organized as follows. Chapter 2 reviews data networks (IP and ATM) and the modeling of loss. Chapter 3 is an overview of low bit-rate voice coding techniques. Chapter 4 gives a detailed description of the different recovery techniques. Chapter 5 describes the SRT and PRT techniques. The results of testing the ordinary and developed recovery techniques are presented in chapter 6. Finally, the thesis conclusions and future work are given in chapter 7.
Chapter 2
Data Networks Overview
Data networks can be classified according to coverage area or by the protocol used to transfer data over them. By the first classification, data networks can be Wide Area Networks (WANs), which cover a large area and distant computers; Metropolitan Area Networks (MANs), which cover a city, for example; or Local Area Networks (LANs), which connect workstations within a building, for example. When networks are classified by the transfer protocol used, they may be IP networks, i.e. networks that transfer data using the Internet Protocol, ATM networks for the ATM protocol, FR networks, and so on. In this thesis IP and ATM networks are investigated, because the IP protocol is the most common and widely used protocol for transferring data across the Internet, and ATM is designed to support data transfer as well as multimedia and real-time applications.
2.1 IP Networks

An IP network is based on the "best effort" principle, which means that the network makes no guarantees about packet loss rates, delays, or jitter. For voice traffic, the perceived voice quality will suffer from these impairments (loss, jitter, and delay). We will focus on transferring voice over the IP network, abbreviated VoIP.
2.1.1 Protocol Architecture

Voice over IP (VoIP) is the transmission of voice over a network using the Internet Protocol. Here we briefly introduce the VoIP protocol architecture, which is illustrated in figure 2.1. The protocols that provide basic transport (RTP), call-setup signaling (H.323, SIP), and QoS feedback (RTCP) [1] are shown.
Figure 2.1: VoIP protocol architecture
In this thesis, we focus on voice transmission and the signaling part is not considered.
2.1.2 Real-Time Transport Protocol (RTP)

Two main types of traffic ride upon the Internet Protocol (IP): User Datagram Protocol (UDP) and Transmission Control Protocol (TCP). In general, TCP is used when a reliable but not delay-sensitive connection is needed, and UDP when simplicity matters and reliability is not a main concern. Due to the time-sensitive nature of voice traffic, UDP/IP is the logical choice to carry voice. More information is needed on a packet-by-packet basis than UDP offers, however. So, for real-time or delay-sensitive traffic, the Internet Engineering Task Force (IETF) adopted the Real-time Transport Protocol (RTP). VoIP rides on top of RTP, which rides on top of UDP; therefore, VoIP is carried with an RTP/UDP/IP packet header. RTP is the standard for transmitting delay-sensitive traffic across packet-based networks and gives receiving stations information that is not in the connectionless UDP/IP streams. As shown in figure 2.2, two important pieces of information are the sequence number and the timestamp. RTP uses the sequence information to determine whether packets are arriving in order and whether a packet is lost, and it uses the timestamp information to determine the interarrival packet time (jitter).
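As a small illustration of how a receiver can exploit the RTP timestamps, the sketch below applies the running interarrival-jitter estimator defined in RFC 3550; it assumes arrival times and RTP timestamps are already expressed in the same clock units, and the packet samples are invented example values.

```python
# Sketch: interarrival-jitter estimate from RTP timestamps (RFC 3550 estimator).
# Assumes arrival times and RTP timestamps are in the same clock units.

def update_jitter(jitter, arrival_prev, ts_prev, arrival_cur, ts_cur):
    """One update of the running jitter estimate: J <- J + (|D| - J)/16."""
    d = (arrival_cur - arrival_prev) - (ts_cur - ts_prev)
    return jitter + (abs(d) - jitter) / 16.0

jitter = 0.0
packets = [(0, 0), (21, 20), (39, 40), (65, 60)]  # (arrival, timestamp) example samples
for (a0, t0), (a1, t1) in zip(packets, packets[1:]):
    jitter = update_jitter(jitter, a0, t0, a1, t1)
print(jitter)
```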
Figure 2.2: Header for IP packet carrying real-time application
RTP can be used for media on demand, as well as for interactive services such as Internet telephony. RTP (refer to figure 2.2) consists of a data part and a control part, the latter called RTP Control Protocol (RTCP). The data part of RTP is a thin protocol that provides support for applications with real-time properties, such as continuous media (for example, audio and video), including timing reconstruction, loss detection, and content identification. RTCP provides support for real-time conferencing of groups of any size within an Internet. This support includes source identification and support for gateways, such as audio and video bridges as well as multicast-to-unicast translators. It also offers QoS feedback from receivers to the multicast group, as well as support for the synchronization of different media streams. Using RTP is important for real-time traffic, but a few drawbacks exist. The IP/RTP/UDP headers are 20, 8, and 12 bytes, respectively. This adds up to a 40-byte header, which is big compared to the payload for packetized voice. Large RTP header can be compressed to 2 or 4 bytes by using RTP Header Compression (CRTP).
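To make the overhead figures concrete, the short calculation below compares the 40-byte IP/UDP/RTP header against a compressed 2- or 4-byte header; the 80-byte payload is only an assumed example (roughly 10 ms of 64 kbit/s PCM), not a value taken from this thesis.

```python
# Header overhead for packetized voice: 40-byte IP/UDP/RTP header vs. cRTP.
# The 80-byte payload is only an example (about 10 ms of 64 kbit/s PCM).
payload = 80
for header in (40, 4, 2):
    overhead = header / (header + payload)
    print(f"header {header:2d} bytes -> {overhead:.1%} of each packet is overhead")
```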
2.1.2.1 Reliable User Data Protocol

The Reliable User Data Protocol (RUDP) builds some reliability into the connectionless UDP protocol, without the need for a connection-based protocol such as TCP. The basic method of RUDP is to send multiple copies of the same packet and let the receiving station discard the unnecessary or redundant copies. This mechanism makes it more probable that one of the packets will make the journey from sender to receiver. This is also known as forward error correction (FEC). Few implementations of FEC exist due to bandwidth considerations (a doubling or tripling of the amount of bandwidth used). Customers that have almost unlimited bandwidth, however, consider FEC a worthwhile mechanism to enhance reliability and voice quality.
2.2 ATM Networks

ATM stands for Asynchronous Transfer Mode. ATM involves the transfer of data in discrete chunks and, like packet switching, allows multiple logical connections to be multiplexed over a single physical interface. In the case of ATM, the information flow on each logical connection is organized into fixed-size packets, called cells, with a length of 53 bytes, of which 5 bytes form the cell header [2]. ATM is a streamlined protocol with minimal error and flow control capabilities; this reduces the overhead of processing ATM cells and the number of overhead bits required with each cell, enabling ATM to operate at high data rates. Further, unlike IP, ATM uses fixed-size cells, which simplifies the processing required at each ATM node, again supporting the use of ATM at high data rates. Unlike IP networks, ATM is designed for real-time multimedia and data transport, not for data only [2]. ATM is also connection-oriented: in order for a sender to transmit data to a receiver, a connection has to be established first. The connection is established during the call set-up phase and is torn down when the transfer of data is completed. ATM, unlike IP networks, has built-in mechanisms for providing different quality-of-service (QoS) levels to different types of traffic [3]. As with IP networks, we will focus on voice traffic over ATM.
2.2.1 ATM protocol architecture

The reference model for the ATM protocol is depicted in figure 2.3. It consists of a user plane, a control plane, and a management plane (more details about these planes are given in [4, 5, 6]). Within the user and control planes is a hierarchical set of layers. The user plane defines a set of functions for the transfer of user information between communication end-points; the control plane defines control functions such as call establishment, call maintenance, and call release; and the management plane defines the operations necessary to control information flow between planes and layers and to maintain accurate and fault-tolerant network operation.
Figure 2.3: ATM reference model
Within the user and control planes there are three layers: the physical layer, the ATM layer, and the ATM adaptation layer (AAL). Table 2.1 summarizes the functions of each layer [7]. The physical layer performs primarily bit-level functions, the ATM layer is primarily responsible for the switching of ATM cells, and the ATM adaptation layer is responsible for the conversion of higher-layer protocol forms into ATM cells. The functions that the physical, ATM, and adaptation layers perform are described in more detail in the following:
Table 2.1: The functions of each layer of the ATM reference model

Higher layers: higher-layer functions.
AAL (PHY-independent):
• CS: convergence.
• SAR: segmentation and reassembly.
ATM layer (PHY-independent): generic flow control; cell-header generation/extraction; cell VPI/VCI translation; cell multiplexing and demultiplexing.
Physical layer (PHY-dependent):
• TC: cell-rate decoupling; HEC header-sequence generation/verification; cell delineation; transmission-frame adaptation; transmission-frame generation/recovery.
• PM: bit timing; physical medium.
Layer management spans all of the above layers.

AAL: ATM Adaptation Layer. CS: Convergence Sublayer. SAR: Segmentation And Reassembly. VPI: Virtual Path Identifier. VCI: Virtual Channel Identifier. HEC: Header Error Control. TC: Transmission Convergence. PM: Physical Medium.
2.2.2 ATM Adaptation Layer (AAL)

AAL Type 1: This service is used by applications that require a Constant Bit Rate (CBR), such as uncompressed voice and video, usually referred to as isochronous. This type of application is extremely time-sensitive, so end-to-end timing is paramount and must be supported. Isochronous traffic is assigned service class A.

AAL Type 2: This service is again used for compressed voice and video (packetized isochronous traffic); however, it is primarily developed for multimedia applications. The compression allows a Variable Bit Rate (VBR) service without losing voice and video quality. The compression of voice and video (class B), however, does not negate the need for end-to-end timing; timing is still important, and this type is assigned a service class just below that of AAL Type 1.

AAL Type 3/4: This adaptation layer supports both connection-oriented and connectionless transfer, and provides compatibility with IEEE 802.6, which is used by the Switched Multimegabit Data Service (SMDS). Connection-oriented AAL Type 3 and AAL Type 4 payloads are given service class C, while connectionless AAL Type 3 and AAL Type 4 payloads are
assigned the service class D. The support for IEEE 802.6 significantly increases cell overhead for data transfer when compared with AAL Type 5. AAL Type 5: For data transport, AAL Type 5 is the preferred AAL to be used by applications. Its connection-oriented mode guarantees delivery of data by the servicing applications and doesn’t add any cell overhead.
2.3 Frame Relay

Frame relay is a data link control facility designed to provide a streamlined capability for use over high-speed packet-switched networks [4]. In the early days of data transfer over packet-switched networks, the X.25 protocol was developed to transfer data and to resist data loss. The X.25 approach results in considerable overhead [4]. All of this overhead may be justified when there is a significant probability of error on any of the links in the network, but it is not the most appropriate approach for modern digital communication facilities. Today's networks employ reliable digital transmission technology over high-quality, reliable transmission links, many of which are optical fiber. In addition, with the use of optical fiber and digital transmission, high data rates can be achieved. In this environment, the overhead of X.25 is not only unnecessary but degrades the effective utilization of the available high data rates. Frame relay is designed to eliminate much of the overhead that X.25 imposes on end-user systems and on the packet-switching network. Figure 2.4 depicts the protocol architecture that supports the frame-mode bearer service. We need to consider two separate planes of operation: a control (C) plane, which is involved in the establishment and termination of logical connections, and a user (U) plane, which is responsible for the transfer of user data between subscribers. Thus, C-plane protocols are between a subscriber and the network, while U-plane protocols provide end-to-end functionality.
Figure 2.4: User-Network interface protocol architecture
2.3.1 Control Plane The control plane for frame-mode bearer services is similar to that for common-channel signaling in circuit-switching services, in that a separate logical channel is used for control information. At the data link layer, LAPD (Q.921) is used to provide a reliable data link control service, with error control and flow control, between user (TE) and network (NT) over the D channel. This data link service is used for the exchange of Q.933 control-signaling messages.
2.3.2 User Plane For the actual transfer of information between end users, the user-plane protocol is LAPF (Link Access Procedure for Frame-Mode Bearer Services), which is defined in Q.922. Q.922 is an enhanced version of LAPD (Q.921). Only the core functions of LAPF are used for frame relay: • Frame delimiting, alignment, and transparency • Frame multiplexing/demultiplexing using the address field
12
Chapter 2: Data Networks Overview
• Inspection of the frame to ensure that it consists of an integral number of octets prior to zero-bit insertion or following zero-bit extraction • Inspection of the frame to ensure that it is neither too long nor too short • Detection of transmission errors • Congestion control functions The last function listed above is new to LAPF, and is discussed in a later section. The remaining functions listed above are also functions of LAPD. The core functions of LAPF in the user plane constitute a sub layer of the data link layer; this provides the bare service of transferring data link frames from one subscriber to another, with no flow control or error control. Above this, the user may choose to select additional data link or network-layer end-to-end functions. These are not part of the frame-relay service. Based on the core functions, a network offers frame relaying as a connection-oriented link layer service with the following properties: • Preservation of the order of frame transfer from one edge of the network to the other • A small probability of frame loss
2.3.3 User data transfer

The operation of frame relay for user data transfer is best explained by beginning with the frame format, illustrated in figure 2.5a. This is the format defined for the minimum-function LAPF protocol (known as the LAPF core protocol). The format is similar to that of LAPD and LAPB with one obvious omission: there is no control field. This has the following implications:
• There is only one frame type, used for carrying user data. There are no control frames.
• It is not possible to use inband signaling; a logical connection can only carry user data.
• It is not possible to perform flow control and error control, as there are no sequence numbers.
Figure 2.5: LAPF core format
The flag and frame check sequence (FCS) fields function as in LAPD and LAPB. The information field carries higher-layer data. If the user selects to implement additional data link control functions end-to-end, then a data link frame can be carried in this field. Specifically, a common selection will be to use the full LAPF protocol (known as the LAPF control protocol) in order to perform functions above the LAPF core functions. Note that the protocol implemented in this fashion is strictly between the end subscribers and is transparent to ISDN. The address field has a default length of 2 octets and may be extended to 3 or 4 octets. It carries a data link connection identifier (DLCI) of 10, 17, or 24 bits. The DLCI serves the same function as the virtual circuit number in X.25: it allows multiple logical frame relay connections to be multiplexed over a single channel. As in X.25, the connection identifier
has only local significance; each end of the logical connection assigns its own DLCI from the pool of locally unused numbers, and the network must map from one to the other. The alternative, using the same DLCI on both ends, would require some sort of global management of DLCI values. The length of the address field, and hence of the DLCI, is determined by the address field extension (EA) bits. The C/R bit is application-specific and is not used by the standard frame relay protocol.
2.4 Modeling network packet loss

Packet loss in a data network is caused by several factors: buffer overflow in switches is common, long network delays can force a real-time application to discard late-delivered packets and consider them lost, and misrouting can occur due to header errors. The recovery techniques must be tested over a range of loss rates; hence the need to simulate network loss using a loss generator. For accurate and more realistic results, the loss in a network has to be modeled so that the model can be used inside the loss generator. Before discussing loss models, define a binary time series {x_i} in which x_i takes the value 0 if the i-th packet arrived successfully and the value 1 if it was lost. This is a discrete-valued time series which takes on values in the set {0, 1}.
2.4.1 Modeling loss in IP networks

2.4.1.1 Bernoulli Loss Model

In the Bernoulli loss model [1], also called the memoryless packet loss model [8], the sequence of random variables {x_i} is independent and identically distributed. That is, the probability of x_i being either 0 or 1 is independent of all other values of the time series, and the probabilities are the same irrespective of i. Thus this model is characterized by a single parameter, the loss rate r, whose estimated value r̂ can be calculated as [9]:

r̂ = n1 / n      (2-1)
where n1 is the number of ones in the time series of length n. The model can be implemented by picking a random number for each packet and deciding whether it is lost based on the value of that number. For a sequence of n packets, the number of lost packets tends to n·r for large values of n [8].

2.4.1.2 The Gilbert Model

The Gilbert model is also called the two-state Markov model; it can capture temporal loss dependency. In figure 2.6, p is the probability that the next packet is lost provided the previous one has arrived, and q is the opposite. 1 − q is the conditional loss probability (CLP). Normally p + q < 1. If (1 − q) > p, then a packet is more likely to be lost when the previous packet was lost than when the previous packet was successfully received. On the other hand, p > (1 − q) indicates that loss is more likely if the previous packet was not lost.

2.4.1.3 Markov Chain Model of k-th order

A k-th-order Markov chain model is a more general model for capturing dependencies among events. The next event is assumed to depend on the last k events, so the model needs 2^k states. Let x_i denote the binary event for the i-th packet, 1 for loss and 0 for non-loss. The parameters to be determined in a k-th-order Markov model are P(x_i | x_{i-1}, x_{i-2}, ..., x_{i-k}) for all combinations of x_i, x_{i-1}, ..., x_{i-k} [10]. Note that the Bernoulli model and the Gilbert model are special cases of the k-th-order Markov model with k = 0 and k = 1, respectively. In [9], end-to-end packet loss was measured for 128 hours, with 2 hours for each trace, and it was found that the Gilbert model can represent the packet loss of most traces. Therefore, the Gilbert model was used in simulating the network loss.
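A minimal sketch of a two-state Markov (Gilbert) loss generator, together with estimation of p̂ and q̂ from an observed binary trace, is given below; the transition counts follow the n01/n0 and n10/n1 convention of the symbol list, and the parameter values in the example run are arbitrary.

```python
# Sketch: two-state Markov (Gilbert) loss generator and parameter estimation.
# x_i = 1 marks a lost packet, x_i = 0 a received packet.
import random

def gilbert_losses(n, p, q, seed=0):
    """Generate n loss indicators; p = P(loss | previous received), 1-q = P(loss | previous lost)."""
    rng = random.Random(seed)
    state, trace = 0, []
    for _ in range(n):
        if state == 0:
            state = 1 if rng.random() < p else 0
        else:
            state = 0 if rng.random() < q else 1
        trace.append(state)
    return trace

def estimate_pq(trace):
    """p_hat = n01 / n0 and q_hat = n10 / n1, from transition counts in the trace."""
    n0 = trace[:-1].count(0)
    n1 = trace[:-1].count(1)
    n01 = sum(1 for a, b in zip(trace, trace[1:]) if a == 0 and b == 1)
    n10 = sum(1 for a, b in zip(trace, trace[1:]) if a == 1 and b == 0)
    return n01 / max(n0, 1), n10 / max(n1, 1)

trace = gilbert_losses(100000, p=0.05, q=0.5)
print("loss rate:", sum(trace) / len(trace))   # approximately p / (p + q)
print("p_hat, q_hat:", estimate_pq(trace))
```

The Bernoulli model is recovered as the special case in which the loss probability does not depend on the previous state, i.e. p = 1 − q.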
2.4.2 Modeling loss in ATM networks

Two simple ATM loss models are used in testing recovery techniques [11]: the random model and the burst model.

2.4.2.1 Random model

The random model assumes that cell losses in ATM are distributed randomly. This model generates losses (ones in the time series {x_i}) at the desired cell loss rate using a random number generator. This is a straightforward approach, as the characteristics of an ATM network are not directly involved in the computation of random cell losses [11]. The shortcomings of this model are that it does not take into consideration the behavior of voice traffic in an ATM network, which is burst-like in
its behavior [11], and that it is not affected by network parameters such as the network load and the link capacity [11].

2.4.2.2 Burst Model

In contrast to the random model, this model takes into account the nature of a voice source, which alternates between talk-spurts and silence periods. The voice source is considered active and generates cells only during talk-spurts, while during silence periods it is considered inactive [11]. The voice cells generated by the different voice sources are then fed to a common queue and transmitted over the link on a first-come, first-served basis. This is shown in figure 2.7.
Figure 2.7: Statistical multiplexing of voice sources.
When the queue reaches the congestion threshold, cells are discarded according to a priority-based cell discarding algorithm [11]. The receiver considers cells that have not been received within the delay budget of the playout buffer to be missing. The burst model is a mathematical model that can be used to characterize cell losses for voice traffic in ATM networks; it produces the distribution of equilibrium cell loss rates for a given set of ATM network parameters. Let r(Ma) be the equilibrium cell loss rate; then:

r(Ma) = 0                      if Ma ≤ Ms
r(Ma) = (Ma − Ms) / Ma         if Ma > Ms      (2-5)
where Ma is the number of active voice sources and Ms is the number of voice sources that saturates a link of capacity C bps. If Ma > Ms, the link saturates and (Ma − Ms) cells will be lost [11]. The probability of having Ma active voice sources among M voice sources can be obtained using the binomial distribution [11]. Since loss is more frequent in IP networks than in ATM networks, and since a loss generator based on the Markov model produces single-packet as well as burst losses depending on the loss rate, it is reasonable to use a Markov-model-based loss generator to simulate losses when testing the different recovery techniques discussed in chapter 4, because the aim of a loss model is to reproduce the loss frequency as well as the loss distribution observed in the packet stream.
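The short sketch below evaluates equation (2-5) and averages it over the binomial distribution of the number of active sources, as described above; the total number of sources, the saturation point, and the per-source activity probability used in the example are assumed illustrative values.

```python
# Sketch: burst-model cell loss, equation (2-5), averaged over the binomial
# distribution of the number of active voice sources. The source counts and
# the per-source activity probability below are assumed example values.
from math import comb

def cell_loss_rate(m_active, m_saturate):
    """Equation (2-5): r(Ma) = 0 if Ma <= Ms, else (Ma - Ms)/Ma."""
    if m_active <= m_saturate:
        return 0.0
    return (m_active - m_saturate) / m_active

def average_loss(total_sources, m_saturate, p_active):
    """Average r over Ma ~ Binomial(M, p_active)."""
    avg = 0.0
    for m in range(total_sources + 1):
        p_m = comb(total_sources, m) * p_active**m * (1 - p_active)**(total_sources - m)
        avg += p_m * cell_loss_rate(m, m_saturate)
    return avg

print(average_loss(total_sources=48, m_saturate=30, p_active=0.4))
```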
Chapter 3
Voice Compression Techniques
As digital communication and secure communication have become increasingly important, the theory and practice of data compression have received increased attention. While it is true that in many systems bandwidth is relatively inexpensive, e.g. over optical fiber, in most systems the growing amount of information that users wish to communicate or store necessitates some form of compression for efficient, secure, and reliable use of the communication or storage medium. An example where compression is required results from the fact that if speech is digitized using a simple PCM system, consisting of a sampler followed by a scalar quantizer, the resulting signal will no longer have a small enough bandwidth to fit on ordinary telephone channels; that is, digitization causes bandwidth expansion. Hence data compression is required if the original communication channel is to be used [12]. Voice coders are mainly classified into three types: waveform coders, vocoders, and hybrid coders.

Waveform coders

Waveform coders reproduce the analog waveform as accurately as possible, including background noise, and operate at high bit rates. G.711 is the waveform coder that represents speech as 8-bit compressed pulse code modulation (PCM) samples at a sampling rate of 8000 Hz. The standard has two forms, A-law and μ-law. The A-law G.711 PCM encoder converts 13-bit linear PCM samples into 8-bit compressed PCM samples, and the decoder performs the reverse conversion. The μ-law G.711 PCM encoder converts 16-bit linear PCM samples into 8-bit compressed PCM samples. G.726, based on the Adaptive Differential Pulse Code Modulation (ADPCM) technique, converts the 64 kbit/s A-law or μ-law pulse code
modulation (PCM) channel, or a 128 kbit/s linear PCM channel, to and from a 40, 32, 24 or 16 kbit/s channel. The ADPCM technique applies to all waveforms, high-quality audio, modem data, etc.

Vocoders

Vocoders do not reproduce the original waveform. The encoder derives a set of parameters, which are sent to the receiver and used to drive a speech production model. Linear Prediction Coding (LPC), for example, is used to derive the parameters of an adaptive (time-varying) digital filter that models the output of the speaker's vocal tract. The quality of vocoders is not good enough for use in telephony systems.

Hybrid coders

The prevalent speech coder for VoIP is the hybrid coder, which melds the attractive features of waveform coders and vocoders. It is also attractive because it operates at low bit rates, as low as 4-16 kbps. Hybrid coders use analysis-by-synthesis (AbS) techniques: an excitation signal is derived from the input speech signal in such a manner that the difference between the input and the synthesized speech is quite small. An enhancement to this operation is to use a pre-stored codebook of optimized parameters (a vector of elements) to encode a representative vector of the input speech signal; this technique is known as vector quantization (VQ). In this chapter, the discussion is focused on low bit-rate standard codecs because:
• their algorithms require segmentation of the speech stream, which suits packet networks that readily require such segmentation;
• they save capacity for data networks facing the growth of demand;
• most of the speech codecs used in data networks are of the analysis-by-synthesis type, i.e. low bit-rate codecs.
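Before turning to the AbS coders, a minimal sketch of μ-law companding is given below for illustration. It applies only the continuous μ-law compression curve to normalized samples; it is not the exact segmented 8-bit G.711 quantizer, merely the companding law that the quantizer is built on.

```python
# Sketch: continuous mu-law companding of normalized samples in [-1, 1].
# This illustrates the companding law only; the actual G.711 encoder uses a
# segmented 8-bit quantizer built on top of this curve.
import numpy as np

def mu_law_compress(x, mu=255.0):
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_expand(y, mu=255.0):
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu

x = np.array([-0.5, -0.01, 0.0, 0.01, 0.5])
y = mu_law_compress(x)
print(y)
print(mu_law_expand(y))   # recovers x (up to floating-point error)
```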
3.1 Analysis by Synthesis (AbS)

The basic model for analysis-by-synthesis predictive coding of speech is shown in figure 3.1. The model consists of three main parts. The first part is the synthesis filter, an all-pole time-varying filter that models the short-time spectral envelope of the speech waveform. It is often called the short-term correlation filter because its coefficients are computed by predicting a speech sample from a few previous samples (usually the previous 8-16 samples, hence the name short-term). The synthesis filter may also include a long-term correlation filter cascaded with the short-term correlation filter; the long-term predictor models the fine structure of the speech spectrum [13].
Figure 3.1: Basic model for Analysis by Synthesis (encoder: excitation generator, long-term predictor 1/(1 − Pl(z)) with Pl(z) = G·z^(−α), short-term predictor 1/(1 − Ps(z)) with Ps(z) = Σ_{k=1..β} ak·z^(−k), perceptual error weighting filter and error minimization loop; decoder: the same synthesis filters driven by the optimized excitation)

The second part of
the model is the excitation generator. This generator produces the excitation sequence which is to be fed to the synthesis filter to produce the reconstructed speech at the
Chapter 3: Voice Compression Techniques
receiver. The excitation is optimized by minimizing the perceptually weighted error between the original and synthesized speech. As shown in figure 3.1, a local decoder is present inside the encoder; the analysis method for optimizing the excitation uses the difference between the original and synthesized speech as an error criterion and chooses the excitation sequence which minimizes the weighted error. The efficiency of this analysis-by-synthesis method comes from the closed-loop optimization procedure, which allows the prediction residual to be represented at a very low bit rate while maintaining high speech quality. The key point in the closed-loop structure is that the prediction residual is quantized by minimizing the perceptually weighted error between the original and reconstructed speech, rather than minimizing the error between the residual and its quantized version as in open-loop structures. The third part of this model is the criterion used in the error minimization. The most common error minimization criterion is the mean squared error (mse). In this model, a subjectively meaningful error minimization criterion is used, where the error e(j) is passed through a perceptual weighting filter which shapes the noise spectrum so that its power is concentrated at the formant frequencies of the speech spectrum and the noise is masked by the speech signal. The encoding procedure includes two steps: first, the synthesis filter parameters are determined from the speech samples (10-30 ms of speech) outside the optimization loop; second, the optimum excitation sequence for this filter is determined by minimizing the weighted error criterion. The excitation optimization interval is usually in the range of 4-7.5 ms, which is shorter than the LPC parameter update frame. The speech frame is therefore divided into sub-blocks, or sub-frames, where the excitation is determined individually for each sub-frame. The quantized filter parameters and the quantized excitation are sent to the receiver. The decoding procedure is performed by passing the decoded excitation signal through the synthesis filters to produce the reconstructed speech.
In the following subsections, we discuss the LPC synthesis and pitch synthesis filters and the computation of their parameters, as well as the error weighting filter and the selection of the error criterion. Each excitation method will be discussed in a separate section.
3.1.1 The Short-term Predictor
The short-term predictor models the short-time spectral envelope of the speech. The spectral envelope of a speech segment of length L samples can be approximated by the transfer function of an all-pole digital filter of the form [13]

H(z) = \frac{1}{1 - P_s(z)} = \frac{1}{1 - \sum_{k=1}^{\beta} a_k z^{-k}}     (3-1)

where

P_s(z) = \sum_{k=1}^{\beta} a_k z^{-k}     (3-2)

is the short-term predictor. The coefficients {a_k} are computed using the method of Linear Prediction (LP). The set of coefficients {a_k} is called the LPC parameters or the predictor coefficients. The number of coefficients \beta is called the predictor order. The basic idea behind linear predictive analysis is that a speech sample can be approximated as a linear combination of past speech samples (8-16 samples), i.e.

\tilde{s}(j) = \sum_{k=1}^{\beta} a_k s(j-k)     (3-3)

where s(j) is the speech sample and \tilde{s}(j) is the predicted speech sample at sampling instant j. The prediction error e(j) is defined as

e(j) = s(j) - \tilde{s}(j) = s(j) - \sum_{k=1}^{\beta} a_k s(j-k)     (3-4)

The short-term average prediction error is defined as

E = \sum_j e^2(j) = \sum_j \left[ s(j) - \sum_{k=1}^{\beta} a_k s(j-k) \right]^2     (3-5)
To find the values of {ak} that minimize E, we set ∂E / ∂ai = 0 for i=1,…, β. Then after differentiating and rearranging
\sum_{k=1}^{\beta} a_k \phi(i,k) = \phi(i,0), \quad i = 1, \ldots, \beta     (3-6)

where

\phi(i,k) = \sum_j s(j-i)\, s(j-k), \quad i = 1, \ldots, \beta     (3-7)

Equation (3-6) defines a set of β equations in β unknowns, {a_k}. The most common and widely used method for solving this set of equations is the autocorrelation method.
3.1.2 Autocorrelation Method
In this approach, we assume that the error in Equation (3-5) is computed over the infinite duration -∞ < j < ∞. Since this cannot be done in practice, it is assumed that the waveform segment is identically zero outside the interval 0 ≤ j ≤ L_a - 1, where L_a is the LPC analysis frame length [13, 14]. This is equivalent to multiplying the input speech by a finite-length window w(j) that is identically zero outside the interval 0 ≤ j ≤ L_a - 1. Using a rectangular window causes sharp truncation of the speech segment, which increases the prediction error, so it is plausible to use a tapered window such as the Hamming window given below [13]:

w(j) = 0.54 - 0.46 \cos\left( \frac{2\pi j}{L_a - 1} \right), \quad 0 \le j \le L_a - 1     (3-8)

Considering Equation (3-5), e(j) is nonzero only in the interval 0 ≤ j ≤ L_a + β - 1. Thus

\phi(i,k) = \sum_{j=0}^{L_a + \beta - 1} s(j-i)\, s(j-k), \quad i = 1, \ldots, \beta, \quad k = 0, \ldots, \beta     (3-9)

Setting m = j - i, equation (3-9) can be rewritten as

\phi(i,k) = \sum_{m=0}^{L_a - 1 - (i-k)} s(m)\, s(m + i - k)     (3-10)

So \phi(i,k) is the short-term autocorrelation of s(m) evaluated at (i - k). That is,

\phi(i,k) = R(i - k)     (3-11)

where

R(\lambda) = \sum_{j=\lambda}^{L_a - 1} s(j)\, s(j - \lambda)     (3-12)

Using (3-11), the set of equations (3-6) can be expressed in matrix form as
\begin{pmatrix}
R(0) & R(1) & R(2) & \cdots & R(\beta-1) \\
R(1) & R(0) & R(1) & \cdots & R(\beta-2) \\
R(2) & R(1) & R(0) & \cdots & R(\beta-3) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
R(\beta-1) & R(\beta-2) & R(\beta-3) & \cdots & R(0)
\end{pmatrix}
\begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_\beta \end{pmatrix}
=
\begin{pmatrix} R(1) \\ R(2) \\ R(3) \\ \vdots \\ R(\beta) \end{pmatrix}     (3-13)
The most efficient algorithm for finding the a_i from this matrix equation is Durbin's recursion, which is as follows:

E(0) = R(0)
For i = 1 to \beta do
    k_i = \left[ R(i) - \sum_{j=1}^{i-1} a_j^{(i-1)} R(i-j) \right] / E(i-1)
    a_i^{(i)} = k_i
    For j = 1 to i-1 do
        a_j^{(i)} = a_j^{(i-1)} - k_i\, a_{i-j}^{(i-1)}
    E(i) = (1 - k_i^2)\, E(i-1)

The final solution is given as a_j = a_j^{(\beta)}. The k_i are known as the reflection coefficients and range from -1 to 1. A sufficient condition for a stable predictor is

|k_i| \le 1
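A compact Python transcription of this recursion is given below for illustration (the thesis implementation was in MATLAB). The autocorrelation values R(0)..R(β) are assumed to come from a Hamming-windowed analysis frame, as in equations (3-8) and (3-12).

```python
import numpy as np

def autocorrelation(frame, order):
    """R(0)..R(order) of a Hamming-windowed analysis frame, eqs. (3-8)/(3-12)."""
    La = len(frame)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(La) / (La - 1))
    s = frame * w
    return np.array([np.dot(s[lag:], s[:La - lag]) for lag in range(order + 1)])

def durbin(R, order):
    """Durbin's recursion: solve (3-13) for the LPC coefficients a_1..a_beta."""
    a = np.zeros(order + 1)          # a[j] holds a_j; a[0] is unused
    E = R[0]                         # E(0) = R(0)
    k = np.zeros(order + 1)          # reflection coefficients k_1..k_beta
    for i in range(1, order + 1):
        k[i] = (R[i] - np.dot(a[1:i], R[i - 1:0:-1])) / E
        a_prev = a.copy()
        a[i] = k[i]
        for j in range(1, i):
            a[j] = a_prev[j] - k[i] * a_prev[i - j]
        E = (1.0 - k[i] ** 2) * E
    return a[1:], k[1:], E           # predictor coeffs, reflection coeffs, residual energy
```

For a 10th-order predictor, durbin(autocorrelation(frame, 10), 10) returns the coefficients a_1..a_10, the reflection coefficients k_i, and the final prediction error energy for one analysis frame.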
The k_i are then converted to log-area ratios (LARs), which are suitable for transmission as they are less sensitive to channel errors. The LARs are computed as

LAR_i = \log\left( \frac{1 - k_i}{1 + k_i} \right)     (3-14)

Parameters that are even less sensitive to errors in the transmission channel are the Line Spectrum Frequencies (LSFs). More details about calculating the LSFs from the a_i are found in [14, 15].
3.1.3 Long-Term Predictor (LTP)
While the short-term predictor models the spectral envelope of the speech segment being analyzed, the long-term predictor (LTP), or the pitch predictor, is used to model the
fine structure of that envelope [13, 14]. Inverse filtering of the speech input removes some of the redundancy in the speech by subtracting from each speech sample its predicted value computed from the past β samples (usually β = 10). The short-term prediction residual, however, still exhibits some periodicity (or redundancy) related to the pitch period of the original speech when it is voiced. This periodicity is on the order of 20-160 samples (50-400 Hz pitch frequencies). Adding the pitch predictor to the inverse filter further removes the redundancy in the residual signal and turns it into a noise-like process. It is called a pitch predictor since it removes the pitch periodicity, or a long-term predictor since the predictor delay is between 20 and 160 samples. The long-term predictor is essential in low bit rate speech coders, as in CELP, where the excitation signal is modeled by a Gaussian process; long-term prediction is therefore necessary to ensure that the prediction residual is very close to a random Gaussian noise process. The general form of the LTP is [13]

\frac{1}{1 - P_l(z)} = \frac{1}{1 - \sum_{k=-m_1}^{m_2} G_k z^{-(\alpha + k)}}     (3-15)
Taking m_1 = m_2 = 0 is common, and in this case the LTP is called a one-tap predictor with P_l(z) = G z^{-\alpha}, where \alpha and G are calculated by minimizing the mean squared residual error after the short-term and long-term predictors. The order of this filter is \alpha, which ranges from 20 to 160. The long-term prediction error is

e(j) = \rho(j) - G\, \rho(j - \alpha)     (3-16)

where \rho(j) is the residual after the short-term predictor. Doing so, the gain G and the lag \alpha can be calculated from the following equations [13]:

G = \frac{\sum_{j=0}^{N-1} \rho(j)\, \rho(j-\alpha)}{\sum_{j=0}^{N-1} [\rho(j-\alpha)]^2}     (3-17)

\alpha = \arg\max_{\alpha} \frac{\left[ \sum_{j=0}^{N-1} \rho(j)\, \rho(j-\alpha) \right]^2}{\sum_{j=0}^{N-1} [\rho(j-\alpha)]^2}     (3-18)

where N is the number of samples over which the long-term prediction is calculated.
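A minimal Python sketch of this open-loop search is shown below for illustration; for simplicity the correlation sums are restricted to lags available inside the analysed residual frame, whereas a real coder also uses residual samples from previous frames.

```python
import numpy as np

def open_loop_ltp(rho, lag_min=20, lag_max=160):
    """One-tap open-loop LTP search over the short-term residual rho(j):
    pick the lag alpha maximizing the criterion of (3-18), then compute
    the gain G of (3-17) for that lag."""
    N = len(rho)
    best_alpha, best_gain, best_metric = lag_min, 0.0, -1.0
    for alpha in range(lag_min, min(lag_max, N - 1) + 1):
        num = np.dot(rho[alpha:N], rho[0:N - alpha])      # sum rho(j) * rho(j - alpha)
        den = np.dot(rho[0:N - alpha], rho[0:N - alpha])  # sum rho(j - alpha)^2
        if den > 0.0 and num * num / den > best_metric:
            best_metric = num * num / den
            best_alpha = alpha
            best_gain = num / den
    return best_alpha, best_gain
```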
3.1.3.1 LTP using the adaptive codebook method
The above calculation for the LTP is an open-loop solution. However, a significant improvement in results is achieved by using a closed-loop solution, i.e. searching inside the AbS loop, which is described in detail in [13].
3.2 Standard Voice Compression Techniques
3.2.1 GSM Full Rate (RPE-LTP) coder
The GSM Full Rate coder (GSM FR for short) is also called the Regular Pulse Excited coder with Long Term Predictor (RPE-LTP). This coder has an STP of order 10 and a one-tap LTP. Its excitation signal is a set of ten regularly spaced pulses per subframe, each subframe having 40 samples. The frame size of this coder is 160 samples, i.e. each frame has four subframes. The excitation pulse grid can take one of four positions. The coder output frame contains the 10 STP filter coefficients represented by LSFs (Line Spectral Frequencies), the LTP gain and delay, and the ten pulse amplitudes and their grid position. Figure 3.2 shows the encoder and decoder for this codec. The bit rate of this codec is 13 kbps [15].
Figure 3.2: Coder and Decoder for the GSM FR
3.2.2 Federal Standard CELP (FS1016) coder Federal Standard CELP (or FS CELP) stands for the Federal Standard Code Excited Linear Prediction. The excitation signal is chosen from a stochastic codebook. The STP order is 10 and the LTP uses the adaptive codebook approach. The bit rate for this codec is 4.8 Kbps. The coder is depicted in figure 3.3 [16].
Figure 3.3: The FS CELP coder block diagram
3.2.3 GSM Half Rate coder (GSM HR)
GSM HR stands for the GSM Half Rate coder, also called VSELP (Vector Sum Excited Linear Prediction). Its bit rate is 5.6 kbps. Its excitation is drawn from two stochastic codebooks, each shorter than that of FS CELP, to reduce the codebook index search complexity. An LTP with the adaptive codebook approach is used, and the STP order is 10. Its coder is depicted in figure 3.4 [16].
Figure 3.4: GSM HR coder block diagram
3.2.4 LD-CELP
LD-CELP, the ITU-T G.728 standard, stands for Low Delay CELP. The general block diagram of the LD-CELP codec is given in figure 3.5 [17].
Figure 3.5: LD-CELP coder
At bit rates of around 16 kbit/s and lower, the voice quality of waveform codecs falls rapidly. Thus at these rates CELP codecs and their derivatives tend to be used. However, because of the forward adaptive determination of the short-term filter coefficients used in most of these codecs, they tend to have high delays. The delay of a voice codec is defined as the time from when a voice sample arrives at the input of its encoder to when the corresponding sample is produced at the output of its decoder, assuming the bit stream from the encoder is fed directly to the decoder. For a typical hybrid voice codec this delay will be of the order of 50 to 100 ms, and such a high delay can cause problems. Therefore the ITU released a set of requirements for a new 16 kbit/s standard, the chief requirements being that the codec should have voice quality comparable to the 32 kbit/s waveform codec (G.721) in both error-free conditions and over noisy channels, and should have a delay of less than 5 ms and ideally less than 2 ms. All the ITU requirements were met by a backward adaptive CELP codec, which was developed at AT&T Bell Labs and standardized as G.728. This codec uses backward adaptation to calculate the short-term filter coefficients, which means that rather than buffering 20 ms or so of the input speech to calculate the filter coefficients, they are found from the past reconstructed speech. This means that the codec can use a much shorter frame length than traditional CELP codecs, and G.728 uses a frame length of only 5 samples, giving it a total delay of less than 2 ms. A high-order (β=50) short-term predictor is used, and this eliminates the need for any long-term predictor. Thus all ten bits, which are available for each five-sample vector at 16 kbit/s, are used to represent the fixed codebook excitation. Of these ten bits, seven are used to transmit the fixed codebook index and the other three are used to represent the excitation gain. Backward gain adaptation is used to aid the quantization of the excitation gain, and at the decoder a post filter is used to improve the perceptual quality of the reconstructed voice [17].
Chapter 4
Packet Loss and Recovery Techniques
Data transmitted across packet-switched networks is normally subject to delay, delay jitter, resequencing of packets, and loss of packets [18]. Recently, voice over packet (VoP) applications have gained increasing interest since they can be realized using inexpensive network services. However, due to the unreliable nature of packet delivery, the quality of the received stream is adversely affected by packet loss or delay [19].
4.1 Packet loss in data networks
Packet loss occurs due to the discarding of packets in congestion periods, as in IP networks, and buffer overflow, as in ATM switches, or by dropping packets at the gateway/terminal due to late arrival or misdelivery caused by errors in the packet header. Delay and delay variation (jitter) are the main network impairments that affect voice quality [1]. The end-to-end delay is the time elapsed between sending and receiving a packet. It mainly consists of the following components [1]:
• Propagation delay: depends only on the physical distance of the communications path and the communication medium. When transmitted over fiber, coax or twisted wire pairs, packets incur a one-way delay of 5 μs/km.
• Transmission delay: the time it takes the network interface to send out the packet.
• Queuing delay: the time a packet has to spend in the queues at the input and output ports before it can be processed. It is mainly caused by network congestion.
• Codec processing delay: including the codec's algorithmic delay and lookahead delay.
• Packetization/de-packetization delay: the time needed to build data packets at the sender, as well as to strip off packet headers at the receiver.
• Playout buffer delay: the time a packet waits in the playout buffer at the receiving terminal.
The ITU has recommended one-way delays no greater than 150 ms for most applications, with a limit of 400 ms for acceptable voice communications [1]. For multimedia applications such as voice, delay can cause packet loss because packets that arrive too late are discarded, so loss due to delay can be merged into packet loss [10]. Delay jitter can be cured using a playout buffer that adds delay to early-arriving packets to regulate the packet arrival time as seen by the voice decoder.
4.2 Packet loss recovery techniques
Recovery techniques may be divided into two classes: sender-based and receiver-based. In sender-based recovery, the sender participates in the loss recovery. The sender-based recovery techniques can be subdivided into active techniques, such as retransmission, and passive techniques, such as interleaving and Forward Error Correction (FEC). FEC can be media-independent or media-specific [19]. The receiver-based recovery techniques enable destinations to recover lost packets independently of the sender, making them more flexible. They can be classified as insertion repair, interpolation repair or regeneration repair [19]. Figure 4.1 shows a more detailed classification of the different recovery techniques.
Figure 4.1: Recovery Techniques
4.3 Sender-Based Packet Loss Recovery
The basic sender-based mechanisms available to recover from packet loss are: Automatic Repeat reQuest (ARQ), media-independent/specific Forward Error Correction (FEC), FEC with uneven level protection (ULP), and interleaving.
4.3.1 Automatic Repeat reQuest (ARQ)
ARQ is considered an active recovery technique. Using ARQ [18], a lost packet is retransmitted to the receiver by the sender. ARQ-based schemes consist of three parts:
• Lost data detection: by the receiver or by the sender (timeout).
• Acknowledgment strategy: the receiver sends acknowledgments that indicate which data are received or which data are missing.
• Retransmission strategy: it determines which data are retransmitted by the sender.
Despite its robustness against burst losses, ARQ cannot be used in real-time applications, such as VoP, because of the large delay and bandwidth overhead.
4.3.2 Media-independent FEC
In FEC, lost data can be recovered at the receiver without further reference to the sender. Both the original data and redundant information are transmitted to the receiver. There are two kinds of redundant information: that which is independent of the media stream and that which depends on it. Media-independent FEC does not need to know the original data type (speech, audio, or video). In media-independent FEC, the original data together with some redundant data, called parities, are transmitted to the receiver. The redundant data are derived from the original data either using the exclusive-OR (XOR) operation, where one parity packet is generated for the given original packets, or using Reed-Solomon codes, where multiple independent parities can be computed for the same set of packets. Reed-Solomon codes achieve optimal loss protection but lead to higher processing costs than schemes based on the XOR operation. Even though the XOR operation results in sub-optimal protection, it is preferred for practical implementations since several parity packets can be computed with lower processing cost than with the Reed-Solomon method. The FEC transmits k original data packets (D) and h additional redundant parity packets (P). Figure 4.2 shows an example for k=3 and h=2. The FEC encoder produces two redundant packets (P1, P2) from three data packets. If one data packet (D3) and one parity packet (P1) are dropped, the receiver can recover the data packet (D3) by using the successfully received packets D1, D2, and P2.
Figure 4.2: Media-independent FEC (the FEC encoder produces parity packets P1 and P2 from data packets D1-D3; D3 and P1 are lost in the network, and the FEC decoder recovers D3 from D1, D2 and P2)
FEC is effective even for a small h/k ratio. At the FEC decoder, losses of consecutive packets can be corrected for large values of k, since the decoder uses not only the parity bits but also the data bits. However, if k increases, the reconstruction delay at the receiver also increases. There are several advantages of media-independent FEC schemes: they are source independent, since the operation of FEC does not depend on the contents of the original data and the repair is an exact replacement for a lost packet; and the original data packets can be used by receivers which are not capable of FEC, since the redundant data are usually sent as a separate stream. The main disadvantage is that FEC coding requires additional delay or bandwidth for efficient encoding and decoding: increasing k adds delay, while increasing h adds bandwidth.
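The XOR-based variant described above can be sketched as follows (illustrative Python, not part of the thesis implementation): one parity packet is generated over a block of k equal-length data packets, and any single lost packet in the block can be rebuilt from the survivors.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length packets."""
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(data_packets):
    """XOR parity over a block of k equal-length data packets."""
    parity = bytes(len(data_packets[0]))
    for p in data_packets:
        parity = xor_bytes(parity, p)
    return parity

def recover_single_loss(received, parity):
    """received: the block with exactly one entry set to None (the lost packet).
    The lost packet equals the XOR of the parity with all surviving packets."""
    missing = parity
    for p in received:
        if p is not None:
            missing = xor_bytes(missing, p)
    return missing

# Example: k = 3 data packets plus one XOR parity packet; D3 is lost in the network.
D = [b'\x01\x02', b'\x03\x04', b'\x05\x06']
P = make_parity(D)
print(recover_single_loss([D[0], D[1], None], P))  # -> b'\x05\x06'
```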
4.3.3 Media-specific FEC
4.3.3.1 Embedded Speech Coding technique
The embedded speech coding technique is introduced in [20]. This technique implements voice recovery by adding redundancies to the audio packets sent by the source. In order to implement this technique without incurring excessive bandwidth, toll and non-toll quality voice coding algorithms are used for the primary and redundant transmissions respectively. Therefore, the redundant voice segment is of lower quality than the primary voice data. The use of different coding algorithms is necessary, as better quality voice coding algorithms demand higher network bandwidth and more processing
power. In this way, the output speech waveform at the receiver will consist of periods of toll quality speech interspersed with periods of synthetic quality speech. The synthetic quality speech coding algorithm, Linear Predictive Coding (LPC), is used for the redundant voice encoding. The placement of the redundancies depends on the network load condition. Under light and intermediate loads, losses are essentially nonconsecutive for an audio stream; under heavy loads the behavior is similar, but consecutive losses are more prevalent. Hence, [20] proposed that placing the redundancy immediately in the following packet works well under light and medium network loads, whereas placing the redundancy a number of packets later is more suitable for heavy loads. Figure 4.3 is a pictorial description of the packet structure. This technique requires low delay to synthesize the quantized signal at the decoder, since only a single-packet delay is usually added. This makes it suitable for interactive applications, such as VoP.

Figure 4.3: Position of redundant data in the case of (a) low and medium network loads, and (b) heavy network load (each packet carries a primary speech coding block and a secondary coding block)
4.3.3.2 Uneven Level Protection
The idea of Uneven Level Protection (ULP) arises from the fact that different portions of data (of speech data, in particular) have unequal importance for the reconstruction quality, and from the capability to assign different priority levels to different data portions, which is possible in ATM networks. For example, if speech is coded using code-excited linear prediction (CELP), then the pitch and prediction filter parameters are of considerably higher importance than the excitation codebook. An error in the prediction filter parameters can reduce the quality considerably and even result in an unstable system, while an error in the codebook index will generally result in little or even no loss of perceptual quality. This motivates the use of uneven protection for data units having unequal importance. In general, for linear prediction-based coders, ULP is applied by assigning high priority to the linear prediction coefficients (LPC) and pitch parameters and low priority to the excitation information. For waveform coders, ULP is applied by assigning high priority to the most significant bits (MSB) and low priority to the least significant bits (LSB) [21].
4.3.4 Interleaving
Interleaving is a useful packet-loss recovery technique for applications where end-to-end delay is of secondary importance [19]. If the size of a data unit produced at a time by a coder is smaller than the allowed payload size in a packet, then a few data units may be combined into a single packet. However, in order to reduce the packet-loss effects, the original data units are not combined in the same sequential order as produced by the coder but are interleaved by the transmitter. Units are resequenced before transmission so that originally adjacent units are separated by a guaranteed distance in the transmitted stream, and are returned to their original order at the receiver and provided to the decoder. As can be seen from figure 4.4, the effect of a packet loss (one packet contains a few data units) is distributed over small intervals corresponding to distributed data units instead of adjacent data units. For example, if units are 5 ms in length and packets 20 ms (i.e., 4 units/packet),
then the first packet would contain units 1, 5, 9, 13; the second units 2, 6, 10, 14; and so on, as illustrated in figure 4.4. It can be seen that the loss of a single packet from an interleaved stream results in multiple small gaps in the reconstructed stream, as opposed to the single large gap which would occur in a noninterleaved stream. The effect of a packet loss is reduced for the following reasons. The resulting small gap intervals typically correspond to speech intervals considerably shorter than a phoneme length; therefore, humans are able to mentally interpolate the gap intervals, and speech intelligibility is not decreased. This is in contrast to the noninterleaved situation, where a single lost packet can result in a complete phoneme being lost, which decreases the intelligibility of the speech. Moreover, if the receiver uses some form of error concealment (e.g. the gaps due to packet loss are filled using interpolation of the received adjacent data units, a receiver-based recovery technique), then higher performance is obtained if the interpolation is performed over small intervals instead of longer ones. The interleaving operation is described by the interleave length, L (the distance between subsequent units after interleaving), the bundling factor, B (the number of data units in one packet), and the sequence length, n, which are related as n = BL.

Figure 4.4: Interleaving of units across multiple packets (packetized data stream, interleaved data stream, data stream after experiencing loss, and reconstructed stream)
Increasing the interleave length, L, increases the distance between the resulting gaps due to a single packet loss, which minimizes the packet-loss effect without increasing the network load. At the same time, this increases the overall delay, which might be prohibitive for interactive applications. The choice of bundling factor, B, is defined by the receiver buffer size and the delay requirements.
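The interleaving operation defined above, with interleave length L and bundling factor B, can be sketched as follows (illustrative Python using the unit numbering of figure 4.4).

```python
def interleave(units, L, B):
    """Group a sequence of n = B*L units into packets of B units whose original
    indices are L apart (figure 4.4 example: L = 4, B = 4 gives packets
    [1,5,9,13], [2,6,10,14], ...)."""
    assert len(units) == B * L
    return [[units[p + L * b] for b in range(B)] for p in range(L)]

def deinterleave(packets, L, B):
    """Restore the original order; a lost packet is passed as None and its
    units appear as small, spread-out gaps (None) in the output."""
    out = [None] * (B * L)
    for p, packet in enumerate(packets):
        if packet is None:
            continue
        for b, unit in enumerate(packet):
            out[p + L * b] = unit
    return out

units = list(range(1, 17))            # sixteen 5 ms units
pkts = interleave(units, L=4, B=4)    # [[1,5,9,13], [2,6,10,14], [3,7,11,15], [4,8,12,16]]
pkts[2] = None                        # one packet lost in the network
print(deinterleave(pkts, L=4, B=4))   # gaps at units 3, 7, 11 and 15
```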
4.4 Receiver-based Recovery Techniques
Unlike sender-based recovery techniques, the receiver-based recovery techniques are independent and do not require any action to be taken by the source party. A closer look at figure 4.1 shows that all receiver-based recovery techniques except insertion-based recovery depend mainly on the properties of the voice signal, as the following discussion will show.
4.4.1 Insertion
Insertion-based repair schemes derive a replacement for a lost packet by inserting a simple fill-in. The simplest case is splicing, where a zero-length fill-in is used; an alternative is silence substitution, where a fill-in with the duration of the lost packet is substituted to maintain the timing of the stream. Better results are obtained by using noise or a repeat of the previous packet as the replacement. The distinguishing feature of insertion-based repair techniques is that the characteristics of the signal are not used to aid reconstruction. This makes these methods simple to implement, but results in generally poor performance [19].
4.4.1.1 Splicing
Lost units can be concealed by splicing together the audio on either side of the loss; no gap is left due to a missing packet, but the timing of the stream is disrupted. This technique has been shown to perform poorly [19]. Low loss rates and short clipping lengths (4-16 ms) fared best, but the results were intolerable for losses above 3 percent. The use of splicing can also interfere with the adaptive playout buffer required in a packet audio
system, because it makes a step reduction in the amount of data available to buffer. The adaptive playout buffer is used to allow for the reordering of misordered packets and the removal of network timing jitter, and poor performance of this buffer can adversely affect the quality of the entire system. It is clear, therefore, that splicing together audio on either side of a lost unit is not an acceptable repair technique [19].
4.4.1.2 Silence Substitution (SS)
The silence substitution method simply fills the gap left by a lost packet with silence (i.e., zeros) to maintain the speech timing sequence [22]. This is the simplest and least complex [22] method among all the techniques; however, it is not able to maintain an acceptable quality of playback audio in the event of a high loss rate and large packet size [23]. It is only effective with short packet lengths (< 4 ms) and low loss rates (< 2 percent) [19, 22]. The performance of silence substitution degrades rapidly as packet sizes increase, and quality is unacceptably bad for the 40 ms packet size in common use in network audio conferencing tools [19].
4.4.1.3 Noise Substitution (NS)
Instead of filling the gap left by a lost packet with silence, background noise is inserted [19]. A number of studies of the human perception of interrupted speech have shown that the ability of the human brain to subconsciously repair the missing segment of speech with the correct sound occurs for speech repair using noise substitution but not for silence substitution. In addition, when compared to silence, the use of white noise has been shown to give both subjectively better quality and improved intelligibility. It is therefore recommended as a replacement for silence substitution [19, 20].
4.4.1.4 Packet Repetition (PR)
The repetition method uses the packet preceding the lost packet as the substitution. Its complexity is also close to zero, the same as that of the silence substitution
method. However, the repetition method has better recovery performance than the silence substitution method. The reconstructed speech using this method can tolerate a packet loss rate of up to 4% [22]. The subjective quality of repetition can be improved by gradually fading the repeated units. The GSM system, for example, advocates repeating the first 20 ms with the same amplitude and then fading the repeated signal to zero amplitude over the next 320 ms. The use of repetition with fading is a good compromise between the poorly performing silence and noise substitution methods and the more complex interpolation-based methods [19] that will be discussed in the subsequent sections.
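A waveform-level sketch of repetition with fading is given below (illustrative Python; the 20 ms hold and 320 ms fade follow the GSM figures quoted above, while the linear fade shape and the 8 kHz sampling rate are assumptions made here).

```python
import numpy as np

def repetition_with_fading(last_packet, n_lost, fs=8000, hold_ms=20, fade_ms=320):
    """Conceal n_lost consecutive lost packets by repeating the last received
    packet, holding its amplitude for hold_ms and then fading to zero over fade_ms."""
    packet_len = len(last_packet)
    out = np.tile(np.asarray(last_packet, dtype=float), n_lost)
    hold = int(fs * hold_ms / 1000)
    fade = int(fs * fade_ms / 1000)
    for i in range(len(out)):
        gain = 1.0 if i < hold else max(0.0, 1.0 - (i - hold) / fade)
        out[i] *= gain
    return out.reshape(n_lost, packet_len)
```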
4.4.2 Interpolation
The following recovery techniques are categorized as interpolation methods because the missing speech segments are substituted by other segments from the same speech stream, with modifications that differ from method to method. The advantage of interpolation-based schemes over insertion-based techniques is that they account for the changing characteristics of the speech signal [19].
4.4.2.1 Waveform Substitution (WS)
In this method, when a segment of voice fails to arrive at the destination on time, the previous segment of voice is used to replace the missing segment. The assumption of this technique is that the speech characteristics have not changed much from the preceding speech segment, so it is logical to use the previous segment of speech to reconstruct the missing portion. This method does not work for large packet sizes, as the voice characteristics are likely to change noticeably from one packet to the next. Moreover, it also does not guard against the continuous loss of multiple packets, where the voice characteristics do not remain the same over the duration of the loss. As with silence substitution, it does not demand much processing power [23, 24]. Hence, it is used in some interactive voice communication applications [24, 25].
4.4.2.2 Sample Interpolation (SI)
This technique is similar to Waveform Substitution; however, it does not directly replace the missing audio segments with the previously received segments. It modifies the previous audio packets before substituting them for the missing audio segments. The method assumes that the audio characteristics change only slightly over a short period of time. In order to use previously received samples to replace the missing audio segments while accommodating this slight change in the audio attributes, the missing samples are estimated based on the previous samples' characteristics. A simple form of sample modification is linear interpolation of the audio. In comparison, it requires more processing power than Waveform Substitution, but it offers a better contingency solution. As with Waveform Substitution, it is not usable over a prolonged duration of packet loss, as the audio characteristics are likely to change significantly [23].
4.4.2.3 Pitch Waveform Replication (PWR)
The pitch waveform replication method uses two parallel detectors, which continually detect the positive and negative peaks of the speech respectively, to estimate the pitch of the packet before the lost one. The pitch search can be done on either side of the loss [19] or only on the preceding side. Figure 4.5 illustrates how the positive peak detector works. In figure 4.5, assume the speech signal is x(n). The positive peak detector updates the value of MAX with successive local maxima of the speech samples until no update has occurred for a given number of hold samples. Then, MAX decays exponentially by a factor (i.e., the value of the decreasing factor is smaller than one) until it is exceeded by a speech sample. The negative peak detector works analogously. From the two peak detectors, we obtain the four time intervals that separate the most recent three maxima and minima respectively. Using these four pitch estimates, PWR can decide whether the speech before the missing packet is voiced. If the speech is not voiced or the pitch detection fails, PWR uses the repetition method to recover the lost packet. If the speech is voiced, PWR reconstructs the missing packet by duplicating the pitch-period segment
preceding the missing packet throughout the region of the lost packet. The tolerable packet loss rate is up to about 10% [22]. PWR can be considered a refinement of waveform substitution [19].

Figure 4.5: Positive peak detector algorithm flowchart (MAX, the hold counter and the decreasing factor are initialized from the packet, then a per-sample loop either updates MAX and stores its position when x(n) exceeds it, or decrements the hold counter and decays MAX by the factor when the counter expires)
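Based on the textual description above, the positive peak detector can be sketched as follows (illustrative Python; the hold length and decay factor are example values, not taken from the thesis, and the negative peak detector is the mirror image). A full PWR implementation would then use the intervals between the most recently detected peaks as pitch estimates.

```python
def positive_peak_detector(x, hold=40, factor=0.95):
    """Track positive peaks of the speech signal x: MAX follows successive local
    maxima; if no new maximum occurs for `hold` samples, MAX decays by `factor`
    until a sample exceeds it. Returns the positions of the detected peaks."""
    peaks = []
    MAX = x[0]
    count = hold
    for n in range(1, len(x)):
        if x[n] > MAX:
            MAX = x[n]
            peaks.append(n)       # store the position of the new maximum
            count = hold          # restart the hold counter
        else:
            count -= 1
            if count == 0:
                MAX *= factor     # exponential decay of the tracking threshold
                count = hold
    return peaks
```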
The complexity of PWR can be reduced by converting the original speech waveform into a trinary one [26]. The method is based on the local periodicity of speech. By converting the original speech waveform into a trinary one, the amplitude of the waveform is simplified from arbitrary decimal numbers to definite trinary integers. Then, by encoding a trinary integer and its successive count into a pair, the number of data items is
greatly reduced, from the sample count to the pair count. In addition, by determining the local threshold optimally, only the locally highest positive and negative peaks are captured after converting the waveform into trinary form. This makes finding the minimum speech period easy and highly accurate. The found minimum period of the original speech waveform is substituted cyclically during the packet loss duration. Here, in order to minimize waveform discontinuity at the beginning and the end of the packet loss duration, interpolation between each two successive samples of the original waveform is performed so as to keep the continuity of the phase variation in the reproduced speech waveform [26]. The flowchart of the algorithm for this method is shown in figure 4.6a. A demonstration is shown in figure 4.6b, where T+ and T- are the positive and negative thresholds respectively, τ is the part of the found pitch period required to complete the last pitch period before the loss, q is the detected loss period, the A's indicate the numbers of samples that exceeded T+ and are encoded as +1, the B's indicate the numbers of samples that are less than T- and are encoded as -1, and the C's are the numbers of samples that have amplitudes between T+ and T- and are encoded as zero. Generally, there are three important factors that affect the speech quality, namely amplitude continuity, phase continuity and frequency continuity. Explicitly, the amplitudes, phases and frequencies have to be continuous at the boundaries between the substitution packets and their neighboring packets (including the previous and subsequent packets received); otherwise audible noise will occur [22]. It is noted that the PWR method has two shortcomings. First, PWR only copes with the continuity at the boundaries between the reconstructed packets and their previous packets when the speech is voiced, whereas the continuity at the boundaries between the reconstructed packets and the subsequent ones is not properly dealt with. This is referred to as the discontinuity problem of PWR. The second shortcoming of the PWR method is that it uses the repetition method to recover the lost packet when the speech is unvoiced or when the pitch detection fails.
Figure 4.6a: PWR with trinary encoding method flow chart [26] (determine the threshold levels T+ and T- from the maximum and minimum values of the original speech waveform over a definite length of duration; convert the original speech waveform into a trinary one according to T+ and T-; encode each trinary value and its successive count into a pair; when a packet loss is detected, do pattern matching with a definite maximum allowable difference to determine the first matched part and find the minimum period of the original waveform; substitute the found minimum period cyclically, starting at position S with displacement d from the beginning of the found minimum period, for the duration of the loss)

Figure 4.6b: Waveforms for PWR with trinary encoding method [25] (original speech waveform with packet loss, where T+ is the positive and T- the negative threshold; the trinary waveform; the ternary-encoded pairs; and the substituted reproduced waveform)

Figure 4.6: PWR with trinary encoding method
However, the reconstructed speech will not have acceptable quality when there is a transition from voiced to unvoiced speech at the substituted packets. This is referred to as the repetition problem of PWR [22].
4.4.2.4 Double Sided Pitch Waveform Replication (DSPWR)
DSPWR remedies the above-mentioned problems of PWR. While PWR searches for the pitch in the preceding packet only, DSPWR searches for it in both the preceding and following packets surrounding the loss. This adds the ability to detect whether a transition to a voiced segment occurred during the loss; in this case a careful repair is required, since a loss during such a transition has a severe impact on speech quality [27]. DSPWR can tolerate loss rates up to 30%. According to the status of the two pitch detectors (for the preceding and following packets), there are four repair procedures [22]:
• If both detectors succeeded in finding the pitch, a scheme called BV (both voiced) is used for recovery.
• If the detector for the preceding packet succeeded in finding the pitch, but the detector for the following packet failed to find it, a recovery scheme called PV (preceding voiced) is used.
• If only the pitch detector of the following packet succeeded, a recovery scheme called FV (following voiced) is used.
• If both pitch detectors failed, a scheme called BU (both unvoiced) is used for recovery.
Before discussing these recovery procedures, the problems of pitch segment adjustment, phase discontinuity, and amplitude adjustment have to be dealt with first.
Pitch Segment Adjustment (PSA)
The purpose of Pitch Segment Adjustment is to eliminate the amplitude discontinuity at the boundaries between the reconstructed pitch segments used to recover the lost packet [22]. Note that in the PWR method there may be amplitude discontinuity between the head and tail of the pitch segment and, due to its frequent occurrence, this
problem is as important as the one caused by the phase discontinuity [22]. Pitch Segment Adjustment will be abbreviated as PSA. The PSA is done as follows:
• Let x(n) be the speech signal and P the pitch.
• Search the region from x(n-1-P-3) to x(n-1-P+3) to find the sample x(n-1-P+i) whose value is closest to that of x(n-1).
• If i ≤ 3, then the amount of lagging is d = 3 - i; else the amount of leading is d = 2(i - 3).
• Calculate

diff = [x(n - P - d) - x(n - 1)] / (d + 1)     (4-1)

• For j = 1 to d do

x(n - P + j - 1) = x(n - 1) + j·diff     (4-2)

The PSA is depicted in figure 4.7.
Figure 4.7: Procedure PSA for the preceding packet, making an amplitude adjustment to the pitch (P) period segment: (a) the pitch lagging condition, (b) the pitch leading condition, (c) the searching procedure for the value of lagging/leading
Eliminating Phase Discontinuity
As mentioned in section 4.4.2.3, PWR suffers from phase discontinuity at the boundary between the recovered packet and the subsequent one. To eliminate this phase discontinuity, the phase of the beginning sample of the packet following the lost one is computed, both pitch segments are used to reconstruct the lost packet, and the difference between the pitches computed from the two sides of the lost packet is used to eliminate the phase difference. This procedure is called Phase Matching using Pitch difference (PMP) [22]. Suppose that the pitch of the preceding packet is PP, the pitch of the following packet is PF, the size of the packet is n, the phase of the beginning sample of the following packet is phase, the number of pitch segments of the preceding packet used to reconstruct the lost packet is a, the number of pitch segments of the following packet used to reconstruct the lost packet is b, and the number of remaining samples after the fill-up with the pitch segments is c. Then, the lost packet is reconstructed as illustrated in figure 4.8. From figure 4.8, we can derive the following equations:

a·PP + b·PF + c = n     (4-3)

hence

(a + b)·PP + b·(PF - PP) + c = n     (4-4)

hence

b·(PF - PP) + c = n - (a + b)·PP     (4-5)

The right-hand side is the phase that would result if the lost packet were built using the pitch found in the preceding packet only; hence

b·(PF - PP) + c = n mod PP + k·PP     (4-6)

To erase the phase discontinuity, c has to be equal to phase. Replacing c with phase in the above equation, we get

b·(PF - PP) + phase = n mod PP + k·PP     (4-7)

hence

b·(PF - PP) = n mod PP + k·PP - phase     (4-8)

hence

b · pitch difference = phase difference, where pitch difference = PF - PP     (4-9)

Using the above equation, a, b and c can be obtained.
The PMP can be implemented as follows:
• Step 1: Calculate the initial state values a = ⌊n/PP⌋, b = 0, c = n mod PP, phase_diff = c - phase, and pitch_diff = PF - PP.
• Step 2: If phase_diff = pitch_diff = 0, then finish and use PP or PF for building the lost packet.
• Step 3: Else if sign(pitch_diff) = sign(phase_diff), then b = round(phase_diff / pitch_diff), a = a - b, and c = c - b·pitch_diff. If (PP - phase_diff < c - phase), then a = ⌊n/PP⌋, b = 0, c = n mod PP.
• Step 4: Else if phase_diff > 0 and pitch_diff < 0, then phase_diff = phase_diff - PP and go to Step 3.
• Step 5: Else if phase_diff < 0 and pitch_diff > 0, then phase_diff = phase_diff + PP and go to Step 3.
The lost packet is reconstructed by using the pitch segments of the packets on both sides of the lost packet.

Figure 4.8: Illustration of procedure PMP
Adjusting Recovered Packet Amplitude
The amplitude of the reconstructed packet is adjusted in such a way that it is continuous inside the packet as well as with the neighboring packets. Two amplitude adjustment procedures are used; one of them, called FWAA (standing for forward amplitude adjustment), and the other, called BWAA (standing for backward
amplitude adjustment), are described below. Assume that the amplitude of the selected pitch segment of the previous packet is VP, the amplitude of the selected pitch segment of the following packet is VF, the packet size is n, and the signal segment to be adjusted runs from x(start) to x(stop) in the reconstructed packet [22]. The corresponding scenarios are given in figure 4.9.
Figure 4.9: Illustration of the procedures FWAA and BWAA, where the dotted lines represent the waveforms after reconstruction

Procedure FWAA
1. Compute factor = (VF - VP) / (VP · n)
2. For i = start to stop do x(i) = x(i)·(1 + factor·i)

Procedure BWAA
1. Compute factor = (VP - VF) / (VF · n)
2. For i = start to stop do x(i) = x(i)·(1 + factor·(n - i))

Scheme BV
For the case that both the preceding and following packets are voiced:
• Step 1. Use the procedure PSA to adjust the PP period segment just preceding the lost packet and the PF period segment just following the lost packet.
• Step 2. Compute the peak amplitude of the PP period segment just preceding the lost packet and denote this amplitude as VP. Also, compute the peak amplitude of the PF period segment just following the lost packet and denote this amplitude as VF.
• Step 3. Compute the phase of the beginning sample of the packet following the lost packet by the algorithm proposed in the phase-matching recovery method.
• Step 4. Use procedure PMP to derive the parameters a, b and c.
• Step 5. Copy a repetitions of the preceding pitch segment into the lost packet.
• Step 6. Then copy the first c samples of the preceding pitch segment into the lost packet.
• Step 7. Copy b repetitions of the following pitch segment into the lost packet.
• Step 8. Use procedure FWAA to adjust the amplitude of the leading pitch segment in the lost packet.
• Step 9. Use procedure BWAA to adjust the amplitude of the rear pitch segment in the lost packet.
If just one side of the pitch estimation is successful, we use the successful side to reconstruct the lost packet. First, we adjust the pitch segment consisting of pitch samples just preceding or following the lost packet by using the procedure PSA as depicted in
figure 4.7 and calculate the peak amplitudes of this pitch segment. Then, we reconstruct the missing packet by duplicating this pitch segment throughout the region of the lost packet. In addition, we adjust the amplitude of the reconstructed speech by using the procedure FWAA or BWAA to make the amplitude decrease linearly from the voiced side to the other side. Note that, according to the experiments reported in [22], we do not need to consider the phase discontinuity from voiced to unvoiced speech, since the energy of unvoiced speech is in general so small as to be barely audible. Schemes PV and FV are outlined below.
Scheme PV
For the case that the preceding packet is voiced:
• Step 1. Adjust the PP period segment just preceding the lost packet by using the procedure PSA.
• Step 2. Compute the peak amplitude of the PP period segments just preceding and following the lost packet.
• Step 3. Repeat the preceding segment throughout the region of the lost packet as the substitution.
• Step 4. Adjust the amplitude of the substitution using procedure FWAA.
Scheme FV
For the case that the following packet is voiced:
• Step 1. Adjust the PF period segment just following the lost packet by using the procedure PSA.
• Step 2. Compute the peak amplitude of the PF period segments just preceding and following the lost packet.
• Step 3. Repeat the following segment throughout the region of the lost packet as the substitution.
• Step 4. Adjust the amplitude of the substitution using procedure BWAA.
If the pitch estimation on both sides failed, we reconstruct the lost packet from the rear half of the preceding packet and the first half of the following packet. This method is easy to implement and is found to be effective in reducing the noise caused by the transition from voiced to unvoiced and vice versa. The scheme is as follows.
Scheme BU
For the case that both the preceding and following packets are unvoiced:
• Step 1. Copy the rear half of the preceding packet into the region of the first half of the lost packet.
• Step 2. Copy the first half of the following packet into the region of the rear half of the lost packet.
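As an illustration of the FWAA and BWAA amplitude ramps used by the schemes above, the two procedures can be transcribed directly (illustrative Python; start, stop, VP, VF and n are as defined earlier in this section, not values prescribed by the thesis).

```python
def fwaa(x, start, stop, VP, VF, n):
    """Forward amplitude adjustment: ramp x[start..stop] so that its amplitude
    moves from the preceding-packet level VP towards the following level VF."""
    factor = (VF - VP) / (VP * n)
    for i in range(start, stop + 1):
        x[i] *= (1.0 + factor * i)
    return x

def bwaa(x, start, stop, VP, VF, n):
    """Backward amplitude adjustment: the mirror ramp, weighted by (n - i)."""
    factor = (VP - VF) / (VF * n)
    for i in range(start, stop + 1):
        x[i] *= (1.0 + factor * (n - i))
    return x
```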
4.4.3 Regeneration Based Recovery Regenerative repair techniques use knowledge of the audio compression algorithm to derive codec parameters, such that audio in a lost packet can be synthesized. These techniques are necessarily codec-dependent but perform well because of the large amount of state information used in the repair. However, they are computationally intensive [19].
4.4.4 Model-Based Recovery
In model-based recovery, the speech on one or both sides of the loss is fitted to a model that is used to generate speech to cover the loss period. This technique works well for short blocks (8 or 16 ms), which ensures that the speech characteristics of the last received block have a high probability of still being relevant [19].
4.4.4.1 Interpolation of Transmitted State
For codecs based on linear prediction, it is possible for the decoder to interpolate between states. For example, the ITU G.723.1 speech coder interpolates the state of the linear predictor coefficients on either side of short losses and uses either a periodic excitation the same as the previous frame, or a gain-matched random number generator, depending on
whether the signal was voiced or unvoiced. For longer losses, the reproduced signal is gradually faded. The advantage of codecs that can interpolate state, rather than re-coding the audio on either side of the loss, is that there are no boundary effects due to changing codecs, and the computational load remains approximately constant. However, it should be noted that codecs to which interpolation may be applied typically have high processing demands [19]. From the above description of the different recovery techniques, it is noted that all sender-based recovery techniques require more bandwidth than the original stream of speech packets and also add delay. Regeneration-based and insertion-based techniques do not guard against consecutive packet loss, and the quality of speech recovered using these methods degrades rapidly as the loss rate increases. The interpolation-based techniques, which are receiver-based, are codec independent and do not require any extra transmission bandwidth, so they will be adopted throughout the rest of the thesis.
Chapter 5
Enhanced Recovery Techniques
All the recovery techniques explored in chapter 4 suffer from shortcomings, such as the high complexity of DSPWR or the non-optimum recovery quality of WS, SI, and PWR. Two kinds of improvement may be made: reducing complexity or enhancing performance. These two directions have led to the development of the two recovery techniques discussed in this chapter.
5.1 Switching Recovery Technique (SRT)
As is clear from chapter 4, SS, PR, WS, SI, PWR, and DSPWR can all be used for recovering lost voice packets. Liao et al. measured their complexity in terms of execution time normalized to that of the SS recovery technique [22]; Table 5.1 summarizes these results. The source code for these recovery techniques was not available, so the different recovery techniques were implemented and the results in Table 5.1 were reproduced. The source code was built using MATLAB: a flow chart of each recovery technique was constructed and then converted to code.
Table 5.1: Complexity of recovery techniques normalized to SS
Technique     SS    PR      WS    SI      PWR     DSPWR
Complexity    1     1.5-2   4-6   10-15   25-50   450-600
The Switching Recovery Technique (SRT) is a recovery technique that makes use of all the other recovery techniques. The idea of SRT is that at low loss rates the performance of all the recovery techniques is approximately the same, so the recovery technique with the least complexity can be used with only a small loss in quality. As the loss rate
increases, the SRT switches to the recovery technique that improves the quality with the least complexity. In this way, the SRT complexity increases as the loss rate increases, because it switches to a more complex recovery technique to improve the quality. The SRT is thus a recovery technique that compromises between complexity and recovered voice quality. In designing the SRT, two parameters should be defined. The first parameter is the transient packet loss period (TPLP), which is used to capture the packet loss period distribution; this value is calculated as the most recent packet loss period (the period between the current lost packet and the previous one). The second parameter is the average of the packet loss periods (APLP), which corresponds to the packet loss rate; this value is calculated as the mean of the most recent five packet loss periods [22]. To find where and when a recovery technique must be switched, the performance of the recovery techniques used in the SRT has to be studied over a range of packet loss rates to determine the switching points. The block diagram of the SRT implementation is shown in figure 5.1. As seen, the depacketizer feeds the Transient Packet Loss Meter and Controller 2 with the sequence numbers of the incoming packets. If a sequence number is missing, Controller 2 turns SW2 to the SRT via control link a. Controller 1, at the same time, uses the TPLP measured by the Transient Packet Loss Meter to tune SW1 to the proper recovery technique according to its parameter settings. The control links b, c, d, e are simultaneously used by Controller 1 to turn on the selected recovery technique. The SRT complexity can be determined according to the measured loss rate profile and the required quality. Let the probability of a transient loss rate r_i be P(r_i), where

r_i = 1 / TPLP     (5-1)

and let C_SRT, C_repetition, C_interpolation, C_PWR and C_DSPWR be the complexities of the SRT, the packet repetition method, the interpolation method, the PWR method, and the DSPWR method respectively. Then [22]

C_SRT = P(r_i < r_1)·C_repetition + P(r_1 ≤ r_i < r_2)·C_interpolation + P(r_2 ≤ r_i < r_3)·C_PWR + P(r_i ≥ r_3)·C_DSPWR     (5-2)
59
Chapter 5: Enhanced Recovery Techniques
where r_1, r_2, and r_3 are the transient loss rates at which the SRT switches. The amount of computation saved, C_saved, by using the SRT rather than DSPWR alone can be determined by [22]:

C_saved = (C_DSPWR - C_repetition)·P(r_i < r_1) + (C_DSPWR - C_interpolation)·P(r_1 ≤ r_i < r_2) + (C_DSPWR - C_PWR)·P(r_2 ≤ r_i < r_3)     (5-3)

Note that the complexities of the controllers and the meter are so small that they are neglected.

Figure 5.1: SRT block diagram (the input packet voice stream passes through the playout buffer and depacketizer to the decoder; the Transient Packet Loss Meter and Controllers 1 and 2 drive switches SW1 and SW2 over control lines a-e to select among the WS, SI, PWR and DSPWR recovery techniques that form the SRT)
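Equations (5-2) and (5-3) can be evaluated directly from a measured set of transient loss rates, as in the following sketch (illustrative Python; the default per-technique complexities are mid-range values taken from Table 5.1, and the thresholds r1-r3 must come from the switching-point study described above).

```python
def srt_complexity(transient_rates, r1, r2, r3,
                   c_rep=1.75, c_interp=12.0, c_pwr=37.0, c_dspwr=500.0):
    """Evaluate C_SRT of equation (5-2) and C_saved of equation (5-3)
    from a list of measured transient loss rates r_i = 1/TPLP."""
    n = len(transient_rates)
    p1 = sum(r < r1 for r in transient_rates) / n            # P(r_i < r1)
    p2 = sum(r1 <= r < r2 for r in transient_rates) / n      # P(r1 <= r_i < r2)
    p3 = sum(r2 <= r < r3 for r in transient_rates) / n      # P(r2 <= r_i < r3)
    p4 = sum(r >= r3 for r in transient_rates) / n           # P(r_i >= r3)
    c_srt = p1 * c_rep + p2 * c_interp + p3 * c_pwr + p4 * c_dspwr
    c_saved = (c_dspwr - c_rep) * p1 + (c_dspwr - c_interp) * p2 + (c_dspwr - c_pwr) * p3
    return c_srt, c_saved
```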
5.2 Parallel Recovery Technique (PRT)
As noted in chapter 4 and section 5.1, the SRT and all other receiver-based recovery techniques try to improve the overall decoded voice quality by enhancing the waveform reconstructed during the period of loss. Most codecs used for data networks are LPC-based or differential codecs, which generate low bit-rate traffic. These codecs
build the current waveform using the currently received packet and the decoder history (u(n), as shown in chapter 3). It is noted in [26] that the loss also affects the waveform decoded after the loss, which causes excess quality degradation even when a sophisticated recovery technique is used. In order to minimize the error in the decoded waveform after the loss period, the history of the decoder must be set as close as possible to the history it would have had if no loss were encountered. This may be done by re-encoding the waveform recovered by the recovery technique to generate parameters which may resemble those enclosed in the lost packets; however, this greatly increases the recovery technique complexity. Another method is to use Packet Repetition (PR) to copy the parameters of the last received packet before the loss, so that it is decoded instead of decoding null packets. In this way, the decoder history is maintained as close to the correct history as possible with minimal increase in the recovery technique complexity. Using Packet Repetition in parallel with any other known receiver-based recovery technique is called the Parallel Recovery Technique (PRT). The PRT is implemented for testing as in the block diagram in figure 5.2.
Figure 5.2: Block diagram for the PRT (playout buffer, depacketizer, controller, switches SW1 and SW2, the Packet Repetition branch feeding the decoder and the receiver-based recovery branch producing the output waveform; control lines a and b)
When the depacketizer detects a missing sequence number, it triggers the controller to send the control signals a and b, which turn the switches SW1 and SW2 to position 2: the input to the decoder becomes a copy of the last received packet, while the output waveform is recovered from the waveform decoded before the loss using a conventional recovery technique. The enhanced recovery using the PRT will be called PRT-"recovery technique"; for example, the enhanced WS will be called PRT-WS.
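This control logic can be outlined with the following MATLAB sketch (illustrative only: decodeframe and recoverwaveform are placeholders for the codec decoder and for whichever conventional receiver-based technique, WS, SI, PWR or DSPWR, is plugged into the PRT):

function out = prtreceive(packets, received, decodeframe, recoverwaveform)
% packets  : cell array of packet payloads (empty entries for lost packets)
% received : logical vector, true where the sequence number was present
out = [];
lastpkt = [];
for k = 1:length(received)
    if received(k)
        lastpkt = packets{k};
        frame = decodeframe(lastpkt);    % normal decoding (switches in position 1)
    else
        % Switches in position 2: decode a copy of the last received packet
        % so that the decoder history stays close to the encoder's ...
        decodeframe(lastpkt);            % decoder output discarded
        % ... while the audible frame is produced by the conventional
        % recovery technique from the waveform decoded so far.
        frame = recoverwaveform(out);
    end
    out = [out; frame(:)];
end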
Chapter 6
Performance Evaluation and Results
To evaluate the performance of a recovery technique for a certain voice codec over a range of loss rates, a communication link between the sender (including the coder) and the receiver (including the decoder and the recovery technique) through a data network has to be established, either in reality or by simulating each component of the link. In this chapter, the evaluation of the recovery techniques for the different codecs is carried out through simulation.
6.1 Test components

6.1.1 Coder and decoder
The codecs used in this work are LD-CELP, the GSM FR coder (RPE-LTP), the GSM HR coder (VSELP) and FS CELP. These codecs represent different bit rates and different excitation methods. Table 6.1 summarizes their distinctive features [13, 14, 16].

Table 6.1: Distinctive features of the codecs used in the tests

Codec     Bit rate (kbps)  Excitation method          Parameters in transmitted frame                                          Other
LD-CELP   16               Stochastic codebook        Codebook index                                                           Backward adaptation
GSM FR    13               Regular pulse excitation   Pulse amplitudes, grid position, STP coefficients, LTP parameters        Forward adaptation
GSM HR    5.6              Two stochastic codebooks   1st and 2nd codebook indices and gains, LTP parameters, STP parameters   Forward adaptation
FS CELP   4.8              Stochastic codebook        Codebook index and gain, LTP parameters, STP parameters                  Forward adaptation
6.1.2 Recovery techniques
The SS, WS, SI, PWR, DSPWR, SRT and PRT techniques are tested.
6.1.3 Network
The network is simulated using the Gilbert model (two-state Markov model), because a loss generator based on it produces single-packet as well as burst losses depending on the loss rate. Using a Markov-model-based loss generator to test the different codecs with the different recovery techniques is reasonable, because the aim of a loss model is to reproduce the loss frequency as well as the loss distribution observed in the packet stream.

6.1.3.1 Loss generation using the Gilbert model
The following steps are used for generating and allocating the lost packets in the time series {xi}:
• Observe a stream of m packets, so the time series is {xi}, i = 1, ..., m.
• Set all xi = 0, meaning that no loss has occurred yet.
• From the given loss rate r, calculate n1, the number of 1's in the time series, as n1 = ⌈r·m⌉.
• Calculate the probabilities f(k) of all possible loss run lengths using (2-4) for k = 1, 2, 3, ... and a typical value of q.
• Knowing f(k) and n1, the number of bursts of length k is calculated as ⌈f(k)·n1⌉.
• Randomize the locations of the bursts using random numbers drawn from a uniform distribution, as sketched below.
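A minimal MATLAB sketch of these steps is given below. It assumes the geometric run-length distribution f(k) = (1−q)·q^(k−1) implied by the two-state model for equation (2-4); the function name and the truncation of negligible run lengths are choices made here for illustration:

function x = gilbertloss(m, r, q)
% m : number of packets observed, r : target loss rate,
% q : probability of staying in the loss state of the two-state model.
x  = zeros(1, m);                        % 0 = received, 1 = lost
n1 = ceil(r*m);                          % required number of lost packets
kmax = 1;
while (1-q)*q^(kmax-1) > 1e-3            % ignore very unlikely run lengths
    kmax = kmax + 1;
end
for k = 1:kmax
    fk = (1-q)*q^(k-1);                  % probability of a burst of length k
    nb = ceil(fk*n1);                    % number of bursts of length k
    for b = 1:nb
        if nnz(x) >= n1, break; end      % stop once the target loss rate is met
        pos = randi(m-k+1);              % uniformly random burst location
        x(pos:pos+k-1) = 1;
    end
end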
6.1.4 Perceptual quality evaluation
The perceived voice quality is measured using ITU-T P.862, Perceptual Evaluation of Speech Quality (PESQ), an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs [28]. The standard is implemented as software that models human auditory perception. Its scale ranges from -0.5 to 4.5, corresponding to bad and excellent quality respectively. A MOS-like (Mean Opinion Score) value on the familiar five-grade scale (e.g., excellent = 5 and bad = 1) can be obtained directly from the PESQ output, which makes the measure convenient to use in applications.
6.2 Simulation system
The block diagram of the simulation system used for testing the recovery techniques is shown in figure 6.1. The input speech is coded, packetized and passed through the lossless network combined with the loss model; at the receiver, the playout buffer, depacketizer, decoder and recovery technique produce the output speech, whose quality is measured against the input speech using PESQ.

Figure 6.1: Simulation system used for testing recovery techniques
6.3 Waveform repair for ordinary recovery techniques
To show what a given recovery technique does to the recovered waveform, the waveform is observed closely in this thesis. This is done by framing a PCM-coded voice signal, deleting one or two frames, applying a recovery technique and noting the repaired waveform; no loss model is needed for these steps. A minimal sketch of this experiment is given below.
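In the MATLAB sketch that follows, the file name speech.wav is a placeholder and the repair shown is the simple repetition of the previous frame; any of the techniques of chapter 4 could be applied to the deleted frame instead:

[x, fs] = audioread('speech.wav');       % PCM speech, e.g. sampled at 8 kHz
N = 160;                                 % frame length (20 ms at 8 kHz)
nframes = floor(length(x)/N);
frames = reshape(x(1:nframes*N), N, nframes);

lost = 12;                               % index of the frame to delete
damaged = frames;
damaged(:, lost) = 0;                    % the deleted frame
repaired = damaged;
repaired(:, lost) = damaged(:, lost-1);  % repair: repeat the previous frame

plot([frames(:), damaged(:), repaired(:)]);  % compare the three waveforms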
6.3.1 WS and SI
Figure 6.2 compares repair using Waveform Substitution (WS) and Sample Interpolation (SI).
Figure 6.2: Waveform Substitution and Sample Interpolation: (a) original speech segment, (b) a lost part of the speech segment, (c) lost part recovered with Waveform Substitution, (d) lost part recovered with Sample Interpolation (normalized amplitude vs. sample number)
It is worth mentioning that the literature found on Sample Interpolation does not suggest any implementation algorithm, so simple algorithm equations are derived here and then tuned experimentally for the best reconstructed speech quality. To realize the Sample Interpolation method, a way is needed to detect whether the waveform of the preceding packet is increasing or decreasing. One method is to measure its first and last peaks; another is to measure the RMS values of the first and second halves of the preceding packet. In general, the lost packet samples are built using the interpolation equation

x̃(n) = x(n)·fSI(n)                                                            (6-1)

where x̃(n) is the estimated packet, x(n) are the original samples used to build the lost packet, and fSI(n) is the interpolation function.

Computing a linear fSI(n) using peaks:
• Divide the packet samples into two sets, x+(n) and x−(n), where

x+(n) = x(n) for x(n) ≥ 0, and 0 for x(n) < 0
x−(n) = x(n) for x(n) ≤ 0, and 0 for x(n) > 0

and n = 0, ..., N−1, where N is the packet length.
• Find the value of the first positive peak, V+i, and the value of the last positive peak, V+f, of the preceding packet.
• Find the value of the first negative peak, V−i, and the value of the last negative peak, V−f, of the preceding packet.
• Calculate the slopes

m+ = (V+f − V+i)/N,   m− = (V−f − V−i)/N

• Then

fSI(n) = 1 + n·m+/V+f for x+(n), and fSI(n) = 1 + n·m−/V−f for x−(n)
The RMS method is analogous to the peaks method, with V+i and V+f replaced by the RMS values of the first and second halves of x+(n), and V−i and V−f replaced by the RMS values of the first and second halves of x−(n). A sketch of the peak-based variant is given below.
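The sketch is illustrative only: a simple local-extremum search stands in for the peak detection, and no safeguards are included for packets without clear peaks.

function xhat = sipeaks(xprev)
% xprev : samples of the preceding packet, used to build the lost packet.
xprev = xprev(:);
N = length(xprev);
n = (0:N-1)';
xp = max(xprev, 0);                      % positive part x+(n)
xn = min(xprev, 0);                      % negative part x-(n)
ppk = find(xp(2:end-1) > xp(1:end-2) & xp(2:end-1) >= xp(3:end)) + 1;
npk = find(xn(2:end-1) < xn(1:end-2) & xn(2:end-1) <= xn(3:end)) + 1;
Vpi = xp(ppk(1));  Vpf = xp(ppk(end));   % first and last positive peaks
Vni = xn(npk(1));  Vnf = xn(npk(end));   % first and last negative peaks
mp = (Vpf - Vpi)/N;                      % m+
mn = (Vnf - Vni)/N;                      % m-
fsi = ones(N, 1);
fsi(xprev >= 0) = 1 + n(xprev >= 0)*mp/Vpf;
fsi(xprev <  0) = 1 + n(xprev <  0)*mn/Vnf;
xhat = xprev .* fsi;                     % estimated lost packet, equation (6-1)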
6.3.2 PWR and DSPWR
Figure 6.3 shows how the PWR peak detector traces the positive and negative peaks using the algorithm described in chapter 4. A sketch of such a tracer is given after the figure.

Figure 6.3: Peaks of the voice segment as detected by the PWR peak detector (speech segment, found +ve and -ve peaks, and the +ve and -ve peak traces; normalized amplitude vs. sample number)
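The sketch below follows the decaying-peak approach visible in the pitchfind listing of appendix A.3 (the traces start at plus and minus the RMS of the segment and decay by a factor f when no new peak is found); it is illustrative only and not the thesis code itself.

function [ptrace, ntrace] = peaktrace(x)
f = 0.996;                               % decay factor, as in appendix A.3
L = length(x);
vrms = sqrt(sum(x.*x)/L);
ppeak = vrms;  npeak = -vrms;            % start the traces at +/- RMS
ptrace = zeros(1, L);  ntrace = zeros(1, L);
for i = 1:L
    if x(i) > ppeak, ppeak = x(i); else, ppeak = ppeak*f; end
    if x(i) < npeak, npeak = x(i); else, npeak = npeak*f; end
    ptrace(i) = ppeak;                   % +ve peak trace
    ntrace(i) = npeak;                   % -ve peak trace
end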
Waveforms recovered using PWR and DSPWR are shown in figures 6.4 and 6.5.
Figure 6.4: PWR: (a) original voice segment, (b) voice segment with a lost part, (c) voice segment with the lost part recovered using PWR (normalized amplitude vs. sample number)
Figure 6.5: DSPWR: (a) original voice segment, (b) voice segment with a lost part, (c) voice segment with the lost part recovered using DSPWR (normalized amplitude vs. sample number)
6.4 Packet Loss vs. coding algorithm
Figures 6.6 through 6.9 show the results obtained when testing the coders. Note that LD-CELP behaves differently from all the other coders: the highest recovery quality was obtained with PR for LD-CELP, but with DSPWR for all the other codecs. This is because of the backward adaptation in LD-CELP and because the output frames of its encoder contain only the codebook information, whereas for all the other coders a lost packet means the loss of all the information the decoder needs from that packet. Hence LD-CELP requires neither the SRT nor the PRT, because it performs well with PR, which is the least complex recovery technique after SS.
Figure 6.6: PESQ vs. loss rate for LD-CELP (curves for SS, PR, WS, SI, PWR and DSPWR)
Figure 6.7: PESQ vs. loss rate for the GSM FR coder (curves for SS, PR, WS, SI, PWR and DSPWR)
Figure 6.8: PESQ vs. loss rate for the GSM HR coder (curves for SS, PR, WS, SI, PWR and DSPWR)
Figure 6.9: PESQ vs. loss rate for the FS CELP (FS1016) coder (curves for SS, PR, WS, SI, PWR and DSPWR)
These results are summarized in table 6.2.

Table 6.2: PESQ vs. loss rate for the different codec types
(quality measured using PESQ at 0%, 2%, 10%, 18% and 20% loss rate)

Codec     Recovery Method   0%     2%     10%    18%    20%
LD-CELP   SS                3.74   2.91   1.62   1.34   1.30
          PR                3.74   3.40   2.71   2.25   2.14
          WS                3.74   3.05   1.90   1.50   1.40
          SI                3.74   3.04   1.96   1.55   1.51
          PWR               3.74   3.08   2.03   1.70   1.62
          DSPWR             3.74   3.10   2.14   2.00   1.89
GSM FR    SS                3.57   2.75   1.84   1.68   1.60
          PR                3.57   3.20   2.25   1.76   1.71
          WS                3.57   3.03   2.23   1.74   1.72
          SI                3.57   3.25   2.31   1.88   1.72
          PWR               3.57   3.26   2.33   1.90   1.88
          DSPWR             3.57   3.26   2.50   2.02   1.99
GSM HR    SS                3.56   2.75   1.78   1.61   1.61
          PR                3.56   3.12   2.20   1.68   1.62
          WS                3.56   3.01   2.30   1.71   1.70
          SI                3.56   3.20   2.31   1.72   1.72
          PWR               3.56   3.75   2.32   1.78   1.74
          DSPWR             3.56   3.75   2.44   2.01   2.00
FS CELP   SS                2.86   2.10   1.25   1.06   1.06
          PR                2.86   2.51   1.60   1.12   1.02
          WS                2.86   2.40   1.62   1.21   1.12
          SI                2.86   2.60   1.70   1.26   1.15
          PWR               2.86   2.61   1.72   1.27   1.17
          DSPWR             2.86   2.62   1.87   1.40   1.33
6.5 Switched Recovery Technique (SRT)
The Switched Recovery Technique described in section 5.1 makes use of the other recovery techniques: at low loss rates, where the performance of all the recovery techniques is approximately the same, it uses the technique with the least complexity at a small cost in quality, and as the loss rate increases it switches to more complex techniques to maintain the quality. The switching decision is driven by the transient packet loss period (TPLP) and the average packet loss period (APLP) defined in section 5.1 [22]. To find where and when the recovery technique must be switched, figures 6.6 through 6.9 are examined. From them it is deduced that:
• The DSPWR method gives the highest PESQ for all loss rates and all codecs except LD-CELP, but with the highest computational load and delay.
• The silence substitution technique gives the worst PESQ for all loss rates and all codecs, so it is not used in the SRT.
• Packet repetition and waveform substitution have nearly the same PESQ for all codecs except LD-CELP and for all loss rates, so packet repetition is used because of its close-to-zero complexity and because it is better than waveform substitution at small loss rates.
• The SRT also needs to know where to switch to the next recovery technique. For this, a parameter called the Quality Loss (QL) is defined as the difference between the PESQ for DSPWR and the PESQ for the recovery method currently chosen by the SRT, both measured at the same loss rate. From the results presented in section 6.4, QL should be kept below 0.3 for proper operation of the SRT.
To see how QL controls the SRT, let the codec be GSM FR and the loss rate be 2%, so that QL ≈ 0.1 and the SRT selects the repetition recovery method. As the loss rate increases, QL increases; when it approaches 0.3 the SRT switches to the interpolation recovery method, and so on. The SRT is therefore set to choose between the recovery methods as follows: for loss rates below 3% packet repetition is chosen, between 3% and 7% interpolation is chosen, between 7% and 10% PWR is used, and above 10% DSPWR is used; i.e., the SRT switches at the 3%, 7% and 10% loss rates.
The SRT complexity can be determined from the measured loss rate profile and the required PESQ. Let the probability of an instantaneous loss rate ri be P(ri), and let CSRT, Crepetition, Cinterpolation, CPWR and CDSPWR be the complexities of the SRT, the packet repetition method, the interpolation method, the PWR method and the DSPWR method respectively. Then

CSRT = P(ri < 3%)·Crepetition + P(3% ≤ ri < 7%)·Cinterpolation + P(7% ≤ ri < 10%)·CPWR + P(ri ≥ 10%)·CDSPWR          (6-2)

The amount of computation saved, Csaved, by using the SRT rather than DSPWR alone is

Csaved = (CDSPWR − Crepetition)·P(ri < 3%) + (CDSPWR − Cinterpolation)·P(3% ≤ ri < 7%) + (CDSPWR − CPWR)·P(7% ≤ ri < 10%)          (6-3)

A sketch of this selection rule is given below.
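The selection rule, together with the expected-complexity computation of equations (6-2) and (6-3), can be sketched in MATLAB as follows (illustrative only; the complexity figures C are placeholders to be filled with measured values):

function method = srtselect(ri)
% ri : transient loss rate in percent (ri = 100/TPLP).
if ri < 3
    method = 'PR';                       % packet repetition
elseif ri < 7
    method = 'SI';                       % sample interpolation
elseif ri < 10
    method = 'PWR';                      % pitch waveform replication
else
    method = 'DSPWR';                    % double-sided pitch waveform replication
end

% Expected complexity and saving for a measured loss-rate profile:
% p = [P(ri<3) P(3<=ri<7) P(7<=ri<10) P(ri>=10)];   % from the loss meter
% C = [Crepetition Cinterpolation Cpwr Cdspwr];     % per-technique complexities
% Csrt   = p*C';                                    % equation (6-2)
% Csaved = (C(4)-C(1:3))*p(1:3)';                   % equation (6-3)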
Figure 6.10 shows the performance of the SRT for the GSM FR (RPE-LTP), GSM HR (VSELP) and FS CELP coders.

Figure 6.10: SRT applied to GSM FR, GSM HR and FS CELP
Table 6.3 shows the SRT results for the GSM FR, GSM HR and FS CELP coders at 0%, 2%, 10%, 18% and 20% loss ratios.

Table 6.3: The SRT applied to GSM FR, GSM HR and FS CELP
(quality measured using PESQ)

Codec       Recovery Method   0%      2%      10%     18%     20%
GSM FR      SRT               3.570   3.740   2.501   2.082   2.020
GSM HR      SRT               3.560   3.750   2.50    2.080   2.000
FS CELP     SRT               2.860   2.561   1.880   1.600   1.602

Note that the PESQ for the SRT up to 3% loss is nearly the same as for the PR method, while for loss ratios between 3% and 7% it is near that of SI over the same range. The same observation holds for the range 7% to 10% compared with PWR, and for DSPWR at loss ratios greater than 10%.
6.6 Parallel Recovery Technique (PRT)
The PRT was implemented using the block diagram in figure 5.2 as the recovery technique in the simulation system of figure 6.1. The quality measured using the PRT-"recovery technique" variants and the ordinary recovery techniques is shown in figures 6.11 through 6.22 for the GSM FR, GSM HR and FS CELP coders.
Figure 6.11: Enhancement for the WS technique by PRT-WS technique for GSM FR
Figure 6.12: Enhancement for SI technique by PRT-SI technique for GSM FR coder
Figure 6.13: Enhancement for the PWR technique by the PRT-PWR technique for the GSM FR coder
Figure 6.14: Enhancement for the DSPWR technique by the PRT-DSPWR technique for GSM FR coder
Figure 6.15: Enhancement for WS technique by the PRT-WS technique for GSM HR coder
Figure 6.16: Enhancement for the SI technique by the PRT-SI technique for GSM HR coder
Figure 6.17: Enhancement for the PWR technique by the PRT-PWR for GSM HR coder
Figure 6.18: Enhancement for the DSPWR technique by the PRT-DSPWR for the GSM HR coder
Figure 6.19: Enhancement for the WS technique by the PRT-WS technique for FS CELP coder
Figure 6.20: Enhancement for the SI technique by the PRT-SI technique for FS CELP coder
Figure 6.21: Enhancement for the PWR technique by the PRT-PWR technique for FS CELP coder
Figure 6.22: Enhancement for the DSPWR technique by the PRT-DSPWR for FS CELP coder
Table 6.4 displays the PESQ enhancement by the PRT technique.

Table 6.4: PESQ enhancement by the PRT technique
(quality measured using PESQ at 0%, 2%, 10%, 18% and 20% loss rate)

Codec     Recovery Method   0%     2%     10%    18%    20%
GSM FR    WS                3.57   3.11   2.25   1.70   1.60
          PRT-WS            3.57   3.18   2.38   1.90   1.80
          SI                3.57   3.20   2.41   1.85   1.80
          PRT-SI            3.57   3.25   2.30   2.05   2.00
          PWR               3.57   3.26   2.40   1.90   1.85
          PRT-PWR           3.57   3.30   2.49   2.22   2.20
          DSPWR             3.57   3.30   2.46   2.05   2.00
          PRT-DSPWR         3.57   3.35   2.58   2.26   2.20
GSM HR    WS                3.56   3.10   2.08   1.50   1.42
          PRT-WS            3.56   3.14   2.20   1.72   1.50
          SI                3.56   3.10   2.17   1.60   1.56
          PRT-SI            3.56   3.18   2.26   1.79   1.77
          PWR               3.56   3.20   2.23   1.62   1.60
          PRT-PWR           3.56   3.29   2.30   1.89   1.75
          DSPWR             3.56   3.20   2.25   1.77   1.66
          PRT-DSPWR         3.56   3.29   2.40   1.99   1.90
FS CELP   WS                2.86   2.51   1.60   1.21   1.07
          PRT-WS            2.86   2.60   1.70   1.26   1.23
          SI                2.86   2.55   1.70   1.23   1.20
          PRT-SI            2.86   2.63   1.80   1.41   1.39
          PWR               2.86   2.62   1.75   1.25   1.21
          PRT-PWR           2.86   2.70   1.85   1.47   1.48
          DSPWR             2.86   2.61   1.84   1.40   1.36
          PRT-DSPWR         2.86   2.70   1.94   1.63   1.60
As seen from the results, there is a noticeable improvement in the quality of the repaired stream, and the enhancement becomes more pronounced as the loss rate increases. This can be explained as follows: at low loss rates the losses tend to be single packets, so the decoder history does not deviate much from the encoder's and the improvement gained from the packet repetition branch of the PRT is small. At high loss rates the losses occur in multi-packet bursts as well as single packets, which drives the decoder history away from the encoder's; the packet repetition branch of the PRT then reduces this deviation, which greatly improves the quality as the loss rate increases. The results suggest that for loss rates greater than 20% the improvement would become even more noticeable, so the PRT is recommended when high loss rates are likely. Comparing the PRT with the Switched Recovery Technique (SRT): the average complexity of the PRT is not reduced but slightly increased, by the amount added by running the PR technique at the same time as the other recovery technique. Another difference is that the PRT uses the desired recovery technique at all loss rates, while in the SRT the recovery technique in use depends on the instantaneous loss rate, i.e. the inverse of the time between two consecutive losses.
Chapter 7
Conclusions
7.1 Conclusions
Data networks always suffer from problems such as packet loss, delay and delay jitter. When a real-time application such as voice is packetized, it suffers from the same problems as data; in addition, delay can cause packet loss if it exceeds the limit the application can tolerate. Loss in data networks can be modeled using random models, burst models or the two-state Markov model. For bit rates below 16 kbps, waveform coders fail and analysis-by-synthesis (AbS) coding must be used to reduce the bit rate of a voice coder.
Packet loss recovery techniques are categorized into sender-based and receiver-based techniques. Receiver-based techniques are suitable for recovering lost voice packets without loading the network with traffic beyond the voice packets themselves. On studying the performance of the considered codecs with all the recovery techniques, DSPWR gave the best performance but with the highest complexity, PWR performed below DSPWR, and SI performed below PWR. LD-CELP behaved differently, as PR gave the best performance with the least complexity. The SRT can compromise between quality and complexity and reduced the average complexity compared with using DSPWR alone.
All the other recovery techniques deal only with the loss periods; however, loss also causes degradation in the after-loss recovered voice because the decoder deviates from the correct history. The PRT uses PR together with any recovery technique to improve performance by reducing the quality degradation due to decoder mistracking, which drives its history away from the encoder's. The PRT enhanced the performance of the different recovery techniques for the
GSM HR, GSM FR and FS CELP coders, and the amount of enhancement increases as the loss rate increases.
7.2 Future Work
As an extension of this work, the codecs and recovery techniques could be studied over other network protocols, such as Frame Relay based networks. Considering other media types such as video, the performance of receiver-based techniques could likewise be evaluated. A further step following the simulation results of this thesis would be the implementation of the SRT and PRT in real Voice over Packet applications to improve their performance.
Appendix A
Source Code
The code in this appendix is written using MATLAB r13
A.1 Sample Interpolation (SI)

function xout=sampleinterpolation(xin)
% Estimate a lost packet from the preceding packet xin using the RMS-based
% linear interpolation function of section 6.3.1.
% NOTE: parts of this listing were garbled in the source; the statements for
% the negative part are reconstructed from the RMS method of section 6.3.1.
L=length(xin);
xinp=zeros(1,L); xinn=zeros(1,L);
for i=1:L
    if xin(i)>=0
        xinp(i)=xin(i);                       % positive part x+(n)
    else
        xinn(i)=xin(i);                       % negative part x-(n)
    end
end
h=floor(L/2);
vrmsp1=sqrt(sum(xinp(1:h).^2)/h);             % RMS of first half of x+(n)
vrmsp2=sqrt(sum(xinp(h+1:L).^2)/(L-h));       % RMS of second half of x+(n)
vrmsn1=sqrt(sum(xinn(1:h).^2)/h);             % RMS of first half of x-(n)
vrmsn2=sqrt(sum(xinn(h+1:L).^2)/(L-h));       % RMS of second half of x-(n)
mp=(vrmsp2-vrmsp1)/L;                         % slope of the positive envelope
mn=(vrmsn2-vrmsn1)/L;                         % slope of the negative envelope
xout=zeros(1,L);
for i=1:L
    if xin(i)>=0
        xout(i)=xin(i)*(1+i*mp/vrmsp2);
    else
        xout(i)=xin(i)*(1+i*mn/vrmsn2);
    end
end
return

A.2 Pitch Waveform Replication (PWR)

% Only the final part of this listing was recovered; the section heading,
% the packet handling and the call to pitchfind that precede it are missing.
% The recovered part rebuilds the lost packet pcktout by replicating the
% detected pitch cycle and then smooths the junction with the preceding
% packet pcktin.
if n>(length(pitch)-ppopitch)                 % left-hand side and operator assumed; garbled in the source
    np=floor(n/length(pitch));
    for i=1:np
        pckttmp=[pckttmp(:);pitch(:)];
    end
    ptmp=pitch(1:(n-length(pckttmp)));
    pcktout=[pckttmp;ptmp(:)];
else
    ptmp1=pitch((n-lppeak+1):length(pitch));
    %ptmp1=pitch((n-lppeak+1):length(pitch))
    n=n-length(ptmp1);
    np=floor(n/length(pitch));
    for i=1:np
        pckttmp=[pckttmp(:);pitch(:)];
    end
    ptmp2=pitch(1:(n-np*length(pitch)));
    %ptmp2=pitch(1:(n-np*length(pitch)-1));
    pcktout=[ptmp1(:);pckttmp;ptmp2(:)];
end
end                                           % closes an enclosing block whose opening was lost in the source
% Smooth the junction between the preceding packet and the rebuilt one:
x1=pcktin(length(pcktin));
x2=pcktout(3);
y=interpolate(x1,x2,2);
pcktout(1)=y(1);
pcktout(2)=y(2);
return

function y=interpolate(x1,x2,n)
% Linear interpolation of n samples between x1 and x2.
p1=polyfit([1,2+n],[x1,x2],1);
y1=polyval(p1,[1:2+n]);
y=y1(2:n+1);
return
A.3 Pitch Waveform Replication subroutine, "pitchfind.m"

% y=pitchfind(x,fb,m) searches the speech segment x for a pitch period
% satisfying the conditions fb, m.
% y is a structure variable with the following fields:
%   y.foundpitch  : the found pitch cycle
%   y.ppeaktracer : the +ve peak tracer
%   y.npraatracer : the -ve peak tracer
%   y.ppeaks      : the found +ve peak positions
%   y.npeaks      : the found -ve peak positions
%   y.state       : the status of the function (1 if a pitch cycle was found,
%                   0 if the search failed)
%   y.pp          : the peak positions of all found pitch cycles
% All fields but y.state return the empty matrix [] on failure.
% fb : must take one of the two values 'f' or 'b', where 'f' means forward
%      search (from the packet beginning to its end) and 'b' means backward search.
% m  : takes one of three values, 1, 2 or 3:
%      1 : the pitch period near the beginning of the segment
%      2 : the pitch period near the segment end
%      3 : the shortest pitch period in the segment
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function y=pitchfind(x,fb,m)
%%%%%%%%%% Variables in use %%%%%%%%%%
% L      : length of speech packet
% f      : decreasing (decay) factor
% ppeak  : +ve peak
% npeak  : -ve peak
% ppkp   : +ve peak positions
% npkp   : -ve peak positions
% vrms   : RMS of the segment
% peaks  : the set of peaks of the pitch cycles found
% ic     : number of columns in peaks
% vcut   : start and stop desired amplitude for the pitch period
% pcycle : the selected cycle
%==============================================================
% test variables:
%~~~~~~~~~~~~~~~~
%fb='f'; %m=3; %x=s1;
pk1=[];
pk2=[];
% Finding +ve and -ve peaks:
%---------------------------
clc;
format long;
L=length(x);
f=0.996;
ppkp=[];
npkp=[];
vrms=sqrt(sum(x.*x)/L);
% From segment start to end:
%~~~~~~~~~~~~~~~~~~~~~~~~~~~
if fb=='f'
    ppeak=vrms;
    npeak=-vrms;
    for i=2:L
        if x(i)>ppeak
            ppeak=x(i);
        else
            ppeak=ppeak*f;
        end
        % (the remainder of this loop, the recording of the peak positions and
        %  the backward-search branch were garbled in the source; the recovered
        %  fragment below applies the same decay rule to the peak tracker)
        if x(i)>ppeak
            ppeak=x(i);
        else
            ppeak=ppeak*f;
        end
if ppeakppkp(i) & npkp(j)