School of Electrical Engineering and Telecommunications The University of New South Wales

Wideband Extension of Narrowband Speech for Enhancement and Coding

Julien Epps

A thesis submitted to fulfil the requirements of the degree of Doctor of Philosophy at The University of New South Wales, September 2000

Abstract

Most existing telephone networks transmit narrowband coded speech which has been bandlimited to 4 kHz. Compared with normal speech, this speech has a muffled quality and reduced intelligibility, which is particularly noticeable in sounds such as /s/, /f/ and /sh/. Speech which has been bandlimited to 8 kHz is often coded for this reason, but this requires an increase in the bit rate. Wideband enhancement is a scheme that adds a synthesized highband signal to narrowband speech to produce a higher quality wideband speech signal. The synthesized highband signal is based entirely on information contained in the narrowband speech, and is thus achieved at zero increase in the bit rate from a coding perspective. Wideband enhancement can function as a post-processor to any narrowband telephone receiver, or alternatively it can be combined with any narrowband speech coder to produce a very low bit rate wideband speech coder. Applications include higher quality mobile, teleconferencing, and internet telephony. This thesis examines in detail each component of the wideband enhancement scheme: highband excitation synthesis, highband envelope estimation, and narrowband-highband envelope continuity. Objective and subjective test measures are formulated to assess existing and new methods for all components, and the likely limitations to the performance of wideband enhancement are also investigated. A new method for highband excitation synthesis is proposed that uses a combination of sinusoidal transform coding-based excitation and random excitation. Several new techniques for highband spectral envelope estimation are also developed. The performance of these techniques is shown to be approaching the limit likely to be achieved. Subjective tests demonstrate that wideband speech synthesized using these techniques has higher quality than the input narrowband speech.
Finally, a new paradigm for very low bit rate wideband speech coding is presented in which the quality of the wideband enhancement scheme is improved further by allocating a very small bitstream for highband envelope and gain coding. Thus, this thesis demonstrates that wideband speech can be communicated at or near the bit rate of a narrowband speech coder.

Dedication

To my wife Anne.

Acknowledgments

The author would like to express his sincere thanks to his supervisors Prof. W. Harvey Holmes and Dr Mark Thomson for their excellent advice and guidance and for their unceasing encouragement and patience.

The author gratefully acknowledges the support of Motorola Australia Pty Ltd and The University of New South Wales through the Motorola University Postgraduate Research Scholarship. In particular thanks are extended to the Motorola Australian Research Centre and the Motorola Human Interface Laboratories, Schaumburg, Ill., for their suggestions and assistance in performing subjective listening tests.

The support of the Gowrie Trust Fund, through the Gowrie Trust Fund Postgraduate Research Scholarship, is greatly appreciated. The late Prof./1st Lieutenant N. T. M. Yeates would have been delighted with the generosity shown towards his grandson.

Thanks are due to Dr Eliathamby Ambikairajah, Dr Andrew Bradley, Dr W.R. Epps and Mrs R. Epps for reviewing and proof-reading the first draft of this thesis. The valuable comments of all who read and contributed to this thesis have resulted in a substantial improvement in its quality.

This thesis represents a component of ongoing research by the UNSW Speech Laboratory into methods for low bit rate, high quality speech coding and speech classification techniques. Thanks are extended to all speech group members who have participated in Speech Laboratory discussions over the last three years.

Many subjects volunteered their time to participate in listening tests in return for Coke, chocolate or other insubstantial enticements, and their cooperation is acknowledged.

Contents

Abstract
Acknowledgments
Publication List
Acronyms and Abbreviations

Chapter 1. Introduction
  1.1 Speech communication
  1.2 Speech coding
  1.3 Parameterization of speech
  1.4 Quality improvement of coded speech
  1.5 Wideband enhancement
  1.6 Applications of wideband enhancement
  1.7 Thesis objectives
  1.8 Structure
  1.9 Novel contributions

Chapter 2. Highband Excitation Synthesis
  2.0 Overview
  2.1 Existing approaches to highband excitation synthesis
    2.1.1 Non-linear transformation
    2.1.2 Pulse excitation
    2.1.3 Codebook mapping of excitation
    2.1.4 Spectral folding and spectral translation
    2.1.5 Noise modulation
    2.1.6 AM-FM synthesis
    2.1.7 Sinusoidal synthesis
  2.2 Existing methods for speech synthesis using Sinusoidal Transform Coding
    2.2.1 A sinusoidal model for speech
    2.2.2 Frequency interpolation and tracking
    2.2.3 Amplitude interpolation
    2.2.4 Phase interpolation
    2.2.5 Phase estimation using the minimum phase model
    2.2.6 Modelling of unvoiced components
  2.3 Existing methods for mixed voiced/unvoiced excitation synthesis
    2.3.1 Binary excitation
    2.3.2 Voicing dependent cut-off frequency
    2.3.3 Multi-band excitation
    2.3.4 Single band voicing
    2.3.5 Application of mixed excitation models to wideband enhancement
  2.4 New approach: STC-based highband excitation synthesis
    2.4.1 New model for highband voicing
    2.4.2 Periodic component
    2.4.3 Random component
  2.5 Evaluation of selected highband excitation generation methods
    2.5.1 Excitation spectra
    2.5.2 Excitation spectrograms
    2.5.3 Subjective excitation assessment
  2.6 Conclusion

Chapter 3. Existing Methods for Highband Spectral Envelope Estimation
  3.0 Overview
  3.1 Codebook mapping
    3.1.1 Introduction to vector quantization (VQ)
    3.1.2 Distance measures
    3.1.3 Codebook design for VQ
    3.1.4 Codebook mapping
    3.1.5 Codebook pair design for codebook mapping
    3.1.6 Codebook size
    3.1.7 Codebook mapping with interpolation
  3.2 Statistical Recovery
    3.2.1 Envelope estimation using Statistical Recovery (SR)
    3.2.2 Relationship between Statistical Recovery and codebook mapping with interpolation
    3.2.3 Statistical Recovery parameter design using the Expectation Maximization algorithm
    3.2.4 Implementation of the Expectation Maximization algorithm for Statistical Recovery parameter design
  3.3 Linear methods
    3.3.1 Linear mapping
    3.3.2 Piecewise linear mapping
    3.3.3 Linear filtering
    3.3.4 Multidimensional linear filter banks with inter-frame filtering
    3.3.5 Straight line extension
    3.3.6 Flat highband envelope
    3.3.7 Fixed envelope
  3.4 Conclusion

Chapter 4. New Methods for Highband Spectral Envelope Estimation
  4.0 Overview
    4.0.1 Highband spectral distortion criterion
  4.1 Codebook mapping - analysis and new methods
    4.1.1 Narrowband-highband correlation
    4.1.2 Codebook size and training ratio
    4.1.3 Comparison of codebook pair design techniques
    4.1.4 Comparison of extension bandwidths
    4.1.5 Comparison of different parameterizations for codebook mapping
    4.1.6 A new method for codebook mapping with codebooks split by voicing
    4.1.7 A novel method for codebook mapping with phonetically dependent codebooks
    4.1.8 A new method for codebook mapping on a sub-frame basis
    4.1.9 Novel methods for codebook mapping with interpolation
    4.1.10 A new method for codebook mapping with reduced storage requirements
  4.2 Statistical Recovery - alternative parameterizations
    4.2.1 Line spectral frequencies
    4.2.2 Cepstral coefficients
  4.3 Linear methods - new methods and improvements
    4.3.1 A new method for linear mapping using narrowband voicing and gain
    4.3.2 Improved straight line highband envelope estimation
  4.4 Objective assessment
    4.4.1 Speech data
    4.4.2 Comparison of highband envelope estimation methods
    4.4.3 Limits to highband envelope estimation
  4.5 Conclusion

Chapter 5. Narrowband-Highband Spectral Envelope Continuity
  5.0 Overview
    5.0.1 Highband spectral distortion criterion
  5.1 Existing methods for highband gain estimation
    5.1.1 Range matching
    5.1.2 Range matching with correction factor
    5.1.3 Statistical recovery
    5.1.4 Straight line extrapolation
  5.2 New methods for narrowband-highband envelope matching
    5.2.1 Highband gain estimation with linear interpolation
    5.2.2 Highband gain estimation with splicing
    5.2.3 Point matching
    5.2.4 Reference envelopes
    5.2.5 Reference envelopes: Voicing-controlled straight log envelope
    5.2.6 Reference envelopes: Analog envelope extension
  5.3 Objective assessment of gain estimation
    5.3.1 Comparison of gain estimation techniques
  5.4 Assessment of highband envelope and gain estimation
    5.4.1 Objective assessment of limits to highband envelope and gain estimation
    5.4.2 Subjective assessment
  5.5 Conclusion

Chapter 6. Very Low Bit Rate Wideband Speech Coding
  6.0 Overview
  6.1 Existing schemes for wideband excitation coding
  6.2 Existing techniques for wideband spectral coding
    6.2.1 Wideband VQ
    6.2.2 Split band VQ
    6.2.3 Highband low order LP
    6.2.4 Fixed highband envelope and log gain quantization
    6.2.5 Codebook mapping with differential log gain quantization
  6.3 A new wideband speech coder with a narrowband bit rate
    6.3.1 Overview
    6.3.2 Highband excitation synthesis
    6.3.3 Highband spectral coding
    6.3.4 Highband gain considerations
    6.3.5 Implementation considerations
  6.4 Objective assessment
    6.4.1 Comparison criteria and methods
    6.4.2 Results
  6.5 Subjective assessment
  6.6 Conclusion

Chapter 7. Conclusion
  7.1 Wideband enhancement and coding
  7.2 Limitations of the research
    7.2.1 Highband excitation synthesis
    7.2.2 Highband envelope estimation
    7.2.3 Very low bit rate wideband coding
  7.3 Suggestions for future research
    7.3.1 Highband excitation synthesis
    7.3.2 Highband spectral envelope estimation
    7.3.3 Highband gain estimation
    7.3.4 Very low bit rate wideband coding
    7.3.5 Wideband subjective testing
    7.3.6 Wideband auditory masking

Appendix A. Subjective Assessment
  A.1 Listener opinion tests
    A.1.1 Absolute category rating
    A.1.2 Degradation category rating
    A.1.3 Absolute vs. comparative rating for wideband enhanced speech
  A.2 Experiment design
  A.3 Analysis

Appendix B. Alternative Distance Measures

Appendix C. Proof of Jensen's Inequality

Appendix D. Statistical Recovery Algorithm

References

Publication List

Epps, J., and Holmes, W. H. (1998). “Speech enhancement using STC-based bandwidth extension”, in Proc. Int. Conf. on Sp. Lang. Process. (Sydney, Australia), vol. 2, pp. 519-522, December.

Epps, J., and Holmes, W. H. (1999). “A new technique for wideband enhancement of narrowband coded speech”, in Proc. IEEE Workshop on Speech Coding (Porvoo, Finland), pp. 174-176, June.

Epps, J., and Holmes, W. H. (2000). “Wideband speech coding at narrowband bit rates”, to appear in The Aust. Int. Conf. on Speech Science and Tech. (Canberra, Australia).

Acronyms and Abbreviations

ACR      Absolute Category Rating
ADPCM    Adaptive Differential Pulse Code Modulation
AM       Amplitude Modulation
CELP     Code-Excited Linear Prediction
CI       Confidence Interval
CL       Competitive Learning (algorithm)
CODEC    Coder-Decoder
DCR      Degradation Category Rating
EM       Expectation Maximization
ETR      Equality Threshold Rating
FIR      Finite Impulse Response
FM       Frequency Modulation
GLA      Generalized Lloyd Algorithm
GMM      Gaussian Mixture Model
HB       Highband
LAR      Log Area Ratio
LBG      Linde-Buzo-Gray (algorithm)
LES      Log Envelope Samples
LP       Linear Prediction
LPC      Linear Prediction Coefficients
LSF      Line Spectral Frequency
MBE      Multi-Band Excitation
MELP     Mixed Excitation Linear Prediction
MOS      Mean Opinion Score
MSVQ     Multi-Stage Vector Quantization
NB       Narrowband
NLIVQ    Non-Linear Interpolative Vector Quantization
PCM      Pulse Code Modulation
PWI      Prototype Waveform Interpolation
SR       Statistical Recovery
STC      Sinusoidal Transform Coding
TIMIT    Texas Instruments / Massachusetts Institute of Technology (speech corpus)
VQ       Vector Quantization
WB       Wideband

Ch 1 / Introduction 1

Chapter 1. Introduction

1.1 Speech communication Speech has arguably been the most important form of human communication since languages were first conceived. Over the ages, many forms of communication have been developed to convey information across a distance, but the relatively recent invention of the telephone has revolutionized this process. The demand for spoken communication at a distance has provoked the emergence of a vast system of telecommunications infrastructure, which allows millions of simultaneous conversations between people in locations around the planet. This infrastructure continues to grow at a rapid rate, accommodating a steadily increasing consumer population.

Limits to the capacity of existing infrastructure have seen huge investments in its expansion and in the adoption of newer, wider bandwidth technologies. Demand for more mobile and convenient forms of communication has seen an explosion in the use of cellular and satellite telephony, both of which have significant capacity constraints. The purpose of speech coding research is to address the problem of accommodating more users over such limited-capacity media by compressing speech before transmitting it across a network.

1.2 Speech coding In a digital communications system, speech is represented in terms of a stream of bits which are transmitted at a rate often determined by the available channel capacity. Speech coding research develops and evaluates solutions to an optimization problem, whose objective is to minimize the number of bits used to represent speech, under the constraints that the output speech quality, algorithmic delay, computational complexity and data storage are within acceptable limits.


Achieving high quality in the output speech is a very important requirement, and one which ultimately can only be assessed subjectively. Minimizing delay is desirable for natural conversation between users of all telephony services. Computational complexity and data storage are constraints imposed by the hardware platform on which the speech coder is to be implemented. Faster and more compact processor and memory integrated circuits are continually being made available; however, their capabilities are still taxed by the ever more complex algorithms being employed to code speech.

Speech coding methods are commonly classified into waveform coders and parametric coders. Waveform coders perform various operations on the speech signal to reduce redundancy, and then code the resulting time domain waveform. As their name suggests, parametric coders decompose speech into a set of parameters which characterize the signal for a short duration (or frame), and these parameters are then coded. A synthesis algorithm is then used in the decoder to reconstruct the speech from its constituent parameters. Examples of waveform and parametric coders are illustrated in Fig. 1.1. The efficiency with which speech can be represented using parameterizations and the difficulty of reconstructing the speech signal from them are reasons why parametric coders are generally associated with lower bit rates and slightly poorer speech quality.

Figure 1.1. Examples of (a) a waveform coder (simplified linear prediction-based coder), and (b) a parametric coder (simplified sinusoidal coder). In (a), the encoder performs linear prediction analysis on the input speech and quantizes the residual waveform and envelope; the decoder decodes the residual and envelope and applies linear prediction filtering to produce the output speech. In (b), the encoder takes a discrete Fourier transform of the input speech, picks the peaks of the spectrum and quantizes the spectral parameters; the decoder decodes the spectral parameters and synthesizes a sum of sinusoids to produce the output speech.
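The analysis/synthesis idea behind the waveform coder of Fig. 1.1(a) can be sketched numerically. The following is an illustrative toy only (not the thesis's implementation): it assumes NumPy, estimates a 10th-order predictor by directly solving the autocorrelation normal equations, inverse-filters a frame to obtain the residual, and resynthesizes with the matching all-pole filter, which inverts the analysis exactly.

```python
import numpy as np

def lp_coeffs(x, order=10):
    """Estimate linear-prediction coefficients by the autocorrelation method."""
    r = np.array([x[:len(x) - k] @ x[k:] for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])  # predictor: x[n] ~ sum_k a[k] * x[n-1-k]

def lp_residual(x, a):
    """Inverse-filter the speech frame to obtain the prediction residual."""
    e = x.copy()
    for n in range(len(x)):
        for k in range(len(a)):
            if n - 1 - k >= 0:
                e[n] -= a[k] * x[n - 1 - k]
    return e

def lp_synthesize(e, a):
    """All-pole synthesis: reconstruct the frame from residual plus envelope."""
    y = np.zeros(len(e))
    for n in range(len(e)):
        y[n] = e[n] + sum(a[k] * y[n - 1 - k]
                          for k in range(len(a)) if n - 1 - k >= 0)
    return y

# A crude 'voiced' test frame: a damped sinusoid plus a little deterministic
# noise (the noise keeps the normal equations well conditioned).
fs = 8000
n = np.arange(160)
rng = np.random.default_rng(1)
x = np.sin(2 * np.pi * 500 * n / fs) * np.exp(-n / 80.0) \
    + 0.01 * rng.standard_normal(len(n))

a = lp_coeffs(x)
e = lp_residual(x, a)
y = lp_synthesize(e, a)
assert np.allclose(x, y)  # analysis and synthesis filters are exact inverses
```

In a real waveform coder, the residual and envelope are quantized between these two stages; here they are passed through unchanged to show that the decomposition itself is lossless.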


1.3 Parameterization of speech Speech is a time-varying signal which is often analyzed in frames of around 20 ms, during which the signal can be assumed to be locally stationary [DEL93]. During ‘voiced’ speech (e.g. Fig. 1.2a), the signal primarily consists of a quasi-periodic, harmonic component which originates from vibrations of the vocal cords. During ‘unvoiced’ speech (e.g. Fig. 1.2b), the signal mainly comprises a random component, which originates from air passing through a constriction in the vocal tract. Generally, speech excitation comprises a mixture of quasi-periodic and random components [DEL93], and their relative contribution is often parameterized as the ‘degree of voicing’.


Figure 1.2. Time-domain plots of (a) the voiced frame /i:/ as in ‘she’ and (b) the unvoiced frame /s/ as in ‘suit’, and magnitude spectra (solid) and all-pole spectral envelopes (dashed) of (c) /i:/ and (d) /s/.
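The ‘degree of voicing’ can be illustrated with a deliberately simple measure (a hypothetical sketch, not the method used in this thesis, and assuming NumPy): the peak normalized cross-correlation over candidate pitch lags, which is close to 1 for a quasi-periodic frame like Fig. 1.2a and small for a noise-like frame like Fig. 1.2b.

```python
import numpy as np

def voicing_degree(frame, fs, fmin=60.0, fmax=400.0):
    """Crude degree-of-voicing estimate: peak normalized cross-correlation
    of the frame with itself over plausible pitch lags (fmin..fmax in Hz)."""
    frame = frame - frame.mean()
    best = 0.0
    for k in range(int(fs / fmax), int(fs / fmin) + 1):
        a, b = frame[:-k], frame[k:]
        denom = np.sqrt((a @ a) * (b @ b))
        if denom > 0:
            best = max(best, (a @ b) / denom)
    return best

fs = 16000
n = np.arange(480)                                   # one 30 ms frame
voiced = np.sin(2 * np.pi * 120 * n / fs)            # quasi-periodic, F0 = 120 Hz
unvoiced = np.random.default_rng(0).standard_normal(len(n))  # /s/-like noise

assert voicing_degree(voiced, fs) > 0.9
assert voicing_degree(unvoiced, fs) < 0.5
```

Practical voicing estimators (see e.g. [ATA76, COH91]) are considerably more elaborate, but the underlying intuition of comparing periodic to aperiodic energy is the same.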

The source-filter model of speech production [FAN60] assumes that speech consists of an excitation signal which has been acoustically filtered by the vocal tract. This model is central to the majority of speech processing research, and allows the decomposition of a frame of speech into an excitation signal and a spectral envelope. The acoustical properties of the vocal tract can be accurately modelled by an all-pole filter, whose magnitude spectrum contains peaks or ‘formants’ occurring near some of the pole frequencies.

Thus, voiced speech can be locally characterized in terms of its fundamental frequency, an all-pole spectral envelope which approximates the magnitudes of each harmonic, and a series of harmonic phases. Whilst unvoiced excitation is not always physically filtered by the vocal tract, it can similarly be characterized in terms of a random excitation signal and an all-pole spectral envelope. The overall shape of the spectral envelope often decreases with frequency for voiced frames, and increases with frequency for unvoiced frames, as seen in the example spectra of Fig. 1.2c and 1.2d respectively.

1.4 Quality improvement of coded speech Although speech coder design goals are often expressed in terms of achieving a given speech quality at a bit rate below that of an existing coder, an equivalent research goal is to achieve higher speech quality at the same bit rate. Many different types of quality improvement are possible without increasing the bit rate, including enhancement of speech in background noise [e.g. BER79, EPH90], the use of auditory perception models to mask quantization noise [e.g. SEN93], improvements to parameter estimation [e.g. MAL99] and improved speech synthesis algorithms [e.g. MCA86, GRI88, KLE95b].

One class of quality improvement which has not received much attention is enhancement by broadening the bandwidth of coded speech without an increase in the bit rate. This is surprising since the notion of quality as a function of speech bandwidth is anticipated to become more pervasive [JAY92, ADO95]. Roughly 80% of the perceptually important spectral information in speech resides in the frequencies below 4 kHz [OSH87], implying that the higher frequencies need not be coded as precisely as those in the narrowband. Further, it appears likely [ROY91] that, for the same bit rate, the overall perceptual quality of a wideband coder is higher than that of a narrowband coder.


Historically, speech has been coded up to 4 kHz, a bandwidth that represents a reasonable compromise between the bit rate and speech quality for voiced speech. However, this is an unsuitable bandwidth for unvoiced speech, which typically has much of its energy above 4 kHz. In comparison with wider bandwidth speech, narrowband (4 kHz bandlimited) speech is often described as having a ‘muffled’ quality [CAR94b, CHA96]. Wideband (8 kHz bandlimited) speech is preferred in subjective tests, and is perceived as more intelligible and natural [ORD91, NOL93] and as requiring less effort to listen to than narrowband speech [CRO72]. Whilst coding speech up to a wider bandwidth reduces the ‘muffled’ effect, this is achieved at the expense of an increase in the bit rate. Hence, there is a need for a quality improvement technique which produces wider bandwidth speech with no (or little) associated increase in the bit rate.

1.5 Wideband enhancement Wideband enhancement is a scheme which seeks to enhance narrowband speech by extending its bandwidth to produce a higher quality estimated wideband speech signal. Given narrowband speech which has been sampled at 8 kHz, it is possible to produce speech sampled at 16 kHz simply by interpolating by a factor of two, that is, inserting a sample between each pair of narrowband samples and determining its amplitude based on the amplitudes of the surrounding narrowband samples. Unfortunately, interpolated speech does not contain any high frequencies; interpolation merely produces 4 kHz bandlimited speech with a sampling rate of 16 kHz rather than 8 kHz. A ‘highband’ signal containing frequencies above 4 kHz therefore needs to be added to the interpolated narrowband speech in order to form an estimated wideband speech signal, as seen in Fig. 1.3.
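The claim that interpolation alone adds no high frequencies can be checked numerically. The sketch below is illustrative only (it assumes NumPy and a simple 101-tap windowed-sinc interpolation filter): it upsamples a 1 kHz narrowband tone from 8 kHz to 16 kHz sampling by zero insertion and low-pass filtering, then confirms that essentially no energy appears above 4 kHz.

```python
import numpy as np

fs_nb, fs_wb = 8000, 16000
n = np.arange(2048)
nb = np.sin(2 * np.pi * 1000 * n / fs_nb)      # narrowband tone at 1 kHz

# Upsample by 2: insert a zero between samples, then low-pass filter at 4 kHz
# to remove the spectral image (here at 7 kHz) created by zero insertion.
stuffed = np.zeros(2 * len(nb))
stuffed[::2] = nb
taps = np.arange(-50, 51)
h = np.sinc(taps / 2.0) * np.hamming(101)      # windowed-sinc LPF, cutoff fs_wb/4
interp = np.convolve(stuffed, h, mode='same')

# The result is sampled at 16 kHz but is still 4 kHz bandlimited speech:
seg = interp[200:-200]                         # discard filter edge transients
spec = np.abs(np.fft.rfft(seg)) ** 2
freqs = np.fft.rfftfreq(len(seg), d=1.0 / fs_wb)
high = spec[freqs > 4500].sum()
assert high < 0.01 * spec.sum()                # negligible energy above 4 kHz
```

This is exactly why a separate highband signal must be synthesized and added: the interpolated signal has the right sampling rate but an empty 4-8 kHz band.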

Figure 1.3. Block diagram of a generalized wideband enhancement scheme. The narrowband speech is passed through a highband speech synthesis block to produce synthetic highband speech, and in parallel is upsampled by 2 and low-pass filtered at 3.8 kHz to give interpolated narrowband speech; the sum of the two signals forms the estimated wideband speech.


Thus, wideband enhancement is a scheme which adds a synthesized highband (4-8 kHz) signal to narrowband (0-4 kHz) speech to produce a wideband (0-8 kHz) speech signal. The synthesis of the highband signal is based entirely on the information available in the narrowband speech. Hence, the accuracy of highband synthesis is dependent on the degree of correlation between the narrowband and highband portions of the speech signal. This general model of wideband enhancement has been widely used throughout the literature [e.g. CRO72, MAK79, PAT81, CAR94b, CHE94, YOS94, YAS96a, ENB98, TOL98, PAR00].

As mentioned in section 1.3, speech can be modelled in terms of a mixed harmonic and random source, a spectral envelope and a gain. This model of speech production suggests a division of the highband speech synthesis problem into component tasks. These tasks are illustrated in the block diagram of Fig. 1.4, which is a proposed generic model for wideband enhancement (for alternative, less generic models refer to [CHA96, ENB98]). Note that here only the highband speech is synthesized, since the narrowband speech is assumed to be of sufficiently high quality already.

Figure 1.4. Block diagram of the proposed wideband enhancement scheme. Narrowband analysis of the narrowband speech drives three tasks: highband envelope estimation, narrowband-highband envelope continuity, and highband excitation synthesis. The synthesized highband excitation is shaped by excitation filtering with the estimated envelope to give the synthetic highband speech, which is added to the narrowband speech (upsampled by 2 and low-pass filtered at 3.8 kHz) to form the estimated wideband speech.

Referring to Fig. 1.4, highband excitation synthesis concerns methods for constructing highband harmonic and random excitation in a manner perceptually consistent with the excitation present in the narrowband signal. Estimation of the correct highband spectral envelope shape, based only on information from the narrowband speech, is the subject of highband envelope estimation. Enforcing continuity between the narrowband and estimated highband spectral envelopes is the approach proposed here for addressing the highband gain estimation task.


1.6 Applications of wideband enhancement The most obvious application of wideband enhancement is the improvement of speech quality in a variety of contexts. Wideband enhancement can be used for various purposes: as a post-processor, in conjunction with any part of a narrowband telecommunications network, to improve the quality of hands-free telephony; for teleconferencing; or in internet telephony and handset-based telephony. Interfaces between narrowband and wideband telecommunications infrastructure, for example talk-back radio broadcasts or telephone reports from foreign correspondents on television news broadcasts, could also benefit from wideband enhancement.

If wideband speech can be synthesized from narrowband speech with sufficient accuracy then only the narrowband portion needs to be transmitted. Wideband enhancement also offers an original perspective on the problems of wideband speech coding, since it combines an embedded narrowband codec with parametric highband excitation, and suggests efficient methods for highband spectral envelope quantization.

1.7 Thesis objectives The principal objective of this thesis is to investigate the components of wideband enhancement, in an attempt to maximize the quality of the estimated wideband speech obtained from the overall wideband enhancement scheme. This broad objective may be expressed in terms of a number of aims:
- To review literature relevant to the components of wideband enhancement;
- To simulate, investigate and compare the relative merits of existing methods for wideband enhancement;
- To propose new techniques for elements of wideband enhancement;
- To develop appropriate measures for the assessment of wideband enhancement schemes; and
- To objectively and subjectively assess the efficacy of existing and new techniques for wideband enhancement.


A secondary major objective is to investigate possibilities for the application of wideband enhancement to wideband speech coding.

1.8 Structure This thesis is primarily structured according to the model of the wideband enhancement scheme developed in section 1.5 and illustrated in Fig. 1.4. Thus, the tasks of highband excitation synthesis, highband envelope estimation and narrowband-highband spectral envelope continuity are treated as individual problems in separate chapters.

Perhaps the most obvious form of correlation between the narrowband and highband portions of the speech signal is the harmonic relationship between voiced components of speech excitation. Chapter 2 reviews existing material on highband excitation synthesis, details a new sinusoidal transform coding (STC) based method for excitation synthesis, and compares selected methods.

The observation that voiced and unvoiced speech tend to exhibit spectral envelopes with respectively decreasing and increasing magnitudes suggests the existence of correlation between the narrowband and highband spectral envelope shapes. In chapter 3, existing techniques for the estimation of highband spectral envelopes from narrowband information are investigated. Chapter 4 presents a number of new techniques for highband envelope estimation, and objectively compares these with many techniques from chapter 3. Details of an investigation into the likely limits to the accuracy of highband spectral envelope estimation are also given in chapter 4.

Estimation of the highband gain based only on narrowband information is a difficult problem, and inaccurate highband gain estimates can produce artefacts which are perceptually unacceptable. Highband gain estimation and matching of the estimated highband envelope to the narrowband envelope is addressed in chapter 5, and various existing and new techniques are compared. The likely limits to the accuracy of highband spectral envelope estimation after narrowband-highband envelope matching are also considered in chapter 5.


Chapter 6 reports the application of wideband enhancement to wideband coding, and includes new methods for wideband spectral coding with objective and subjective comparisons.

1.9 Novel contributions This thesis makes a number of novel contributions to the field of wideband enhancement and more generally to speech coding research.

A new technique for synthesizing high quality highband excitation using sinusoidal transform coding (STC) methods is presented in this thesis. This technique shows that from narrowband pitch and voicing information, high quality speech excitation may be generated up to any desired bandwidth. Comparisons with existing techniques show that STC-based excitation synthesis provides improved modelling of highband harmonics and highband voicing.

Three novel methods for highband spectral envelope estimation are proposed in this thesis. An extensive objective comparison is made between the three new methods and existing approaches to highband envelope estimation, which reveals that the new methods produce more accurate estimates of the highband envelope.

In this thesis, the problem of highband gain estimation is considered, and approaches for improving the perceptual quality of gain estimation are investigated. Two novel techniques are presented which estimate the highband gain by ensuring continuity between the estimated highband spectral envelope and the narrowband envelope. The performance limitations of highband envelope estimation both before and after highband gain estimation are also estimated.

Finally, a new scheme for wideband coding is introduced which is based upon wideband enhancement combined with any existing narrowband codec and a very small highband bitstream. Highband speech is synthesized in this coder using the STC-based excitation synthesis and a novel highband spectral envelope/gain vector quantizer. Subjective tests reveal that this coder produces estimated wideband speech of higher quality than that produced by a narrowband coder operating at around the same bit rate.

Ch 2 / Highband Excitation Synthesis 11

Chapter 2. Highband Excitation Synthesis


2.0 Overview An important component of any wideband enhancement scheme is the modelling and synthesis of the highband speech excitation. Key requirements of excitation modelling and excitation synthesis include:
- a near harmonic relationship between sinusoids in purely voiced excitation components;
- an approximately flat excitation magnitude spectrum;
- minimization of perceptual artefacts, such as frame boundary effects; and
- accurate modelling of the mix of quasi-periodic and random excitation components.

This chapter examines existing approaches to highband excitation, presents a novel method for extending the excitation bandwidth of narrowband speech based upon sinusoidal synthesis, and compares the new method with the existing approach of spectral folding.

The structure of this chapter is as follows. Section 2.1 reviews previous approaches to the generation of voiced excitation in wideband enhancement. Sinusoidal transform coding (STC) is reviewed in greater detail in section 2.2 as background to the novel technique proposed in section 2.4. Section 2.3 reviews models of voiced/unvoiced excitation both from speech coding and previous wideband enhancement schemes. Section 2.4 presents a new technique for highband excitation synthesis, based upon STC. Section 2.5 evaluates the new excitation synthesis technique and compares it with spectral folding, an existing technique.


Note that this chapter does not provide any review of techniques from speech analysis, despite making reference to them. The interested reader is referred to such references as [NOL66, MAR72, HES83, NEY83, GRI88, MCA90, DEL93, LI99] on pitch estimation and [ATA76, CAM86, COH91, GHI91, STE96] on determination of the degree of voicing.

2.1 Existing approaches to highband excitation synthesis

Approaches to synthesizing highband excitation based on narrowband information found in the literature include non-linear transformation [PAT81], pulse excitation [YOS94], codebook mapping of excitation [YOS94], spectral folding [MAK79], spectral translation [MAK79], noise modulation [MCCR00], AM-FM synthesis [TOL98] and sinusoidal synthesis [CHA96].

This section deals mostly with highband voiced excitation synthesis. Many existing highband voicing schemes do not explicitly include an unvoiced component (see section 2.3.5). Unvoiced highband excitation has usually been produced by a bandpass random signal [PAT81, CHE94, CHA96, SCHN98, TAO00]. One exception to this is codebook mapping from narrowband unvoiced excitation to wideband unvoiced excitation [YOS94], explained more fully in section 2.1.3.

2.1.1 Non-linear transformation

Given a sinusoid of frequency ω, a simple means of generating a second harmonic of this signal is to square it:

    cos(2ω) ≡ 2cos²ω − 1.    (2-1)

In general, the k'th harmonic component can be generated from

    cos(kω) = Σ_{l=0}^{k} a_l cos^l ω,    (2-2)

where the a_l ∈ ℝ are constants which can be calculated using the following identities:

    cos(kω) + j sin(kω) ≡ (cos ω + j sin ω)^k,    (2-3a)

    cos²ω + sin²ω ≡ 1.    (2-3b)

Thus, given a signal consisting only of the fundamental component of a frame of voiced speech, excitation may be synthesized up to any desired frequency. This method was used to synthesize voiced excitation in an early vocoder [FUJ70], and could equally be applied to highband excitation synthesis.
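As a concrete illustration of (2-2), the k'th harmonic of a unit-amplitude sinusoid can be generated from samples of the fundamental alone, since cos(kω) = T_k(cos ω), where T_k is the k'th Chebyshev polynomial of the first kind. The sketch below is illustrative only and is not claimed to be the form used in [FUJ70]:

```python
import numpy as np
from numpy.polynomial import chebyshev

# Sketch of (2-1)/(2-2): the k'th harmonic of a unit-amplitude sinusoid is a
# polynomial in the fundamental, cos(k*w*n) = T_k(cos(w*n)), where T_k is the
# k'th Chebyshev polynomial of the first kind.
def harmonic_from_fundamental(fund, k):
    """Generate samples of cos(k*w*n) from fund = cos(w*n), without knowing w."""
    coeffs = np.zeros(k + 1)
    coeffs[k] = 1.0                      # select T_k alone
    return chebyshev.chebval(fund, coeffs)

n = np.arange(256)
w = 2 * np.pi * 0.03                     # fundamental (normalized frequency)
fund = np.cos(w * n)
third = harmonic_from_fundamental(fund, 3)   # numerically equals cos(3*w*n)
```

For k = 3 this reduces to the familiar identity cos(3ω) = 4cos³ω − 3cos ω; higher k follow the same pattern, which is why raising the narrowband signal to a power spreads energy across several harmonics at once.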

Rather than derive the a_l, many researchers have sought simpler, more computationally efficient means of generating higher harmonics directly from the narrowband excitation. In [PAT81], two types of non-linear transformations are compared:
- application of a 'W' instantaneous non-linearity (c.f. Fig. 2.1) to the narrowband speech;
- raising the narrowband speech signal to the third power.

Both methods were reported to have similar performance in subjective experiments, although their use in enhancing the narrowband speech was only considered advantageous for /s/ sounds. In [YAS96b] a non-linear harmonic extension technique consisting of full-wave rectification is proposed.

Figure 2.1. Non-linear characteristic of the spectral spreading element described in [PAT81].


None of the methods of this section employs spectral flattening after non-linear processing, and thus they produce highband excitation with varying amplitude spectra. Spectra resulting from the application of the various methods to an example frame of voiced narrowband speech illustrate this shortcoming (c.f. Fig. 2.2b, 2.2c and 2.2d). Spectral flattening, for example using an inverse LP filter, would improve these methods.

Figure 2.2. Magnitude spectrum of an example wideband residual signal (a). The narrowband portion of this wideband residual was used to produce estimated wideband excitation signals using the W transfer function, raising the narrowband signal to the third power and half-wave rectification. The resulting magnitude spectra are shown in (b), (c) and (d) respectively.

2.1.2 Pulse excitation

Since the fundamental frequency can be determined from the narrowband speech, harmonic excitation up to any desired bandwidth can be obtained by generating a train of pulses separated in time by the pitch period. This approach is derived from pulse-excited LP coding techniques, for example code excited linear prediction (CELP) coders [SCHR85, FED91, DEL93, KLE95a] and multipulse excited LP coders [ATA86]. Excitation of this type has been employed in wideband extension [YOS94]. The challenge in a pulse-excited scheme is to accurately replicate the pulse shape of the voiced speech residual, thereby avoiding a loss in perceptual quality.


2.1.3 Codebook mapping of excitation

CELP is a widely employed method for speech coding in which the speech excitation is vector quantized. Narrowband excitation vectors can be transformed into wideband excitation vectors by means of a one-to-one mapping between codebooks of narrowband and wideband excitations, respectively. The details and design of codebook mapping methods are described in detail in sections 3.1 and 4.1.

In research reported in [YOS94], candidate narrowband excitation code vectors were compared with the frame of narrowband speech under analysis using an analysis-by-synthesis approach. The index of the excitation vector most similar to the excitation signal derived from the narrowband speech under analysis was then used to select a wideband excitation vector from a codebook of wideband excitation sequences, as seen in Fig. 2.3.

The excitation codebooks contained representative waveforms of both voiced and unvoiced excitation segments. If the narrowband speech under analysis was voiced, then the highband synthesis was performed using pitch-synchronous overlap-addition; otherwise frame-wise overlap-addition was used for highband synthesis. Subjective comparison between this method of excitation and the pulse excitation of section 2.1.2 revealed identical preference scores (50%) for each method [YOS94].

Figure 2.3. Excitation codebook mapping


2.1.4 Spectral folding and spectral translation

In the spectral folding method [MAK79], highband excitation is generated by upsampling the narrowband residual (assumed to have a flat spectrum) by an appropriate factor. Thus, an 8 kHz bandlimited residual may be generated from a 4 kHz bandlimited residual through upsampling by a factor of 2. Whereas during normal sampling rate conversions, upsampling would be followed by low pass filtering to remove the alias (in the 4-8 kHz frequency band), in this method the alias is retained and actually forms the highband excitation. This method of excitation synthesis has been popular with many researchers [e.g. CAR94b, AVE95, YAS95b, NAKA97, ENB99], presumably for its simplicity.
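The folding operation can be sketched in a few lines; this is illustrative only (a real system operates frame-by-frame on the LP residual), but it shows why the 4-8 kHz band ends up as a mirror image of the 0-4 kHz band:

```python
import numpy as np

# Minimal sketch of spectral folding [MAK79]: upsampling by 2 *without* the
# usual anti-imaging lowpass filter leaves the alias (mirror image) of the
# narrowband spectrum in the 4-8 kHz band, which serves as highband excitation.
def spectral_fold(residual_nb):
    """Zero-insertion upsampling by 2; the spectral image is deliberately kept."""
    wide = np.zeros(2 * len(residual_nb))
    wide[::2] = residual_nb
    return wide

rng = np.random.default_rng(0)
nb = rng.standard_normal(512)            # stand-in for a narrowband residual
wb = spectral_fold(nb)
mag = np.abs(np.fft.rfft(wb))
# The highband half of the magnitude spectrum mirrors the lowband half.
```

Because the residual is assumed spectrally flat in [MAK79], the mirrored alias can be used directly as excitation; for a non-flat residual the mirroring also reflects any lowband spectral tilt into the highband.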

Spectral translation produces an estimate of the wideband residual by shifting a copy of the narrowband residual spectrum to a higher frequency, and adding this to the narrowband residual [MAK79]. One technique for performing this transformation is shown in Fig. 2.4.

Figure 2.4. Spectral translation scheme for conversion of a 4 kHz bandlimited residual to an 8 kHz bandlimited residual signal (n is the discrete-time sample index)
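One possible arrangement of the translation scheme can be sketched with ideal FFT-domain filters standing in for the LPF and BPF of Fig. 2.4; this is an illustrative approximation, not an implementation of [MAK79]. Modulating by (−1)^n at the narrowband rate before upsampling shifts the residual spectrum, so that a component at frequency f reappears at 4 kHz + f:

```python
import numpy as np

# Sketch of spectral translation (cf. Fig. 2.4), with ideal FFT-domain
# filters in place of real LPF/BPF designs. fs_nb = 8 kHz, fs_wide = 16 kHz.
def spectral_translate(residual_nb):
    n_nb = len(residual_nb)
    n_wide = 2 * n_nb
    # Branch 1: upsample by 2, ideal LPF with 4 kHz cutoff.
    up = np.zeros(n_wide)
    up[::2] = residual_nb
    spec = np.fft.rfft(up)
    spec[n_nb // 2 + 1:] = 0.0
    lowband = np.fft.irfft(spec, n_wide)
    # Branch 2: modulate by (-1)^n at the narrowband rate, upsample by 2,
    # ideal BPF with 4-8 kHz passband; this places a translated (not
    # mirrored) copy of the narrowband residual in the highband.
    mod = residual_nb * (-1.0) ** np.arange(n_nb)
    up2 = np.zeros(n_wide)
    up2[::2] = mod
    spec2 = np.fft.rfft(up2)
    spec2[: n_nb // 2] = 0.0
    highband = np.fft.irfft(spec2, n_wide)
    return lowband + highband

nb_tone = np.cos(2 * np.pi * 40 * np.arange(512) / 512)   # tone at 625 Hz
wb = spectral_translate(nb_tone)
# The tone now appears at 625 Hz and, translated, at 4625 Hz.
```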

In general, neither spectral folding nor spectral translation produces highband periodic components that are harmonically related to the narrowband harmonics. Further deficiencies of these approaches arise from two sources of spectral distortion:
- The magnitude spectrum of narrowband speech drops below its wideband value close to 4 kHz, due to bandlimiting. This produces a spectral notch centred around 4 kHz in the wideband speech (see for example Fig. 2.10a of section 2.3.5). It may reasonably be inferred from [MOO89] that any spectral notch of this kind with depth greater than a few decibels will be perceptible.
- Spectral flattening using inverse LP filtering (as used in [MAK79]) will not in general produce a truly flat spectrum.

Thus, there is some highband spectral distortion introduced even before the excitation is shaped by the highband envelope. However, informal subjective experiments reported in [AVE95] suggested that no adverse perceptual effects were associated with the spectral folding method.

2.1.5 Noise modulation

According to the critical band model [SCHA70] of the auditory system, the frequency resolution of human hearing above 4 kHz is sufficiently poor that only harmonics with fundamental frequencies above 400 Hz can be individually resolved in this region. This suggests that in critical bands above 4 kHz, pitch periodicity is perceived through the time-domain envelope of the bandpass speech signal.

In the parametric highband speech model of [MCCR00], the time domain envelope of the 3-4 kHz band of narrowband speech is extracted, and this envelope is used to modulate highband noise to produce the highband excitation. In strongly voiced frames, the 3-4 kHz bandpass signal has a strongly periodic time domain waveform, and thus the highband excitation will also take on a strong pitch modulation. This method is very computationally efficient and is motivated by human perception, but it has the disadvantage of relying on the assumption that the time domain envelope of the 3-4 kHz speech band is identical to that of the 4-8 kHz band.
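The noise-modulation idea can be sketched as follows; this is a rough illustration of the principle in [MCCR00] only (ideal FFT-domain filters, no framing or gain control, and a crude sample-repetition upsampling of the envelope):

```python
import numpy as np

# Rough sketch of noise modulation: the time-domain envelope of the 3-4 kHz
# band of the narrowband speech modulates noise, which is then highpass
# filtered to form the highband excitation.
def band_envelope(x, fs, f_lo, f_hi):
    """Envelope of the f_lo..f_hi band, via the magnitude of the analytic signal."""
    spec = np.fft.fft(x)
    freqs = np.fft.fftfreq(len(x), d=1.0 / fs)
    analytic = np.where((freqs >= f_lo) & (freqs <= f_hi), 2.0 * spec, 0.0)
    return np.abs(np.fft.ifft(analytic))

def modulated_highband_noise(x_nb, fs_wide, rng):
    # Crude 2x upsampling of the envelope by sample repetition (sketch only).
    env = np.repeat(band_envelope(x_nb, fs_wide // 2, 3000.0, 4000.0), 2)
    noise = rng.standard_normal(len(env))
    spec = np.fft.rfft(env * noise)
    spec[: int(3800 * len(env) / fs_wide)] = 0.0   # keep only the highband
    return np.fft.irfft(spec, len(env))

# A narrowband test tone at 3.5 kHz whose amplitude pulses at 250 Hz:
t = np.arange(1024) / 8000.0
x_nb = (1.0 + 0.9 * np.cos(2 * np.pi * 250.0 * t)) * np.cos(2 * np.pi * 3500.0 * t)
env = band_envelope(x_nb, 8000, 3000.0, 4000.0)    # tracks the 250 Hz pulsing
excite = modulated_highband_noise(x_nb, 16000, np.random.default_rng(1))
```

For this pitch-modulated input the extracted envelope follows the 250 Hz pulsing, so the highband noise inherits the same periodic amplitude modulation, which is what the critical band argument above says the ear perceives as pitch in this region.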

2.1.6 AM-FM synthesis

The amplitude and frequency modulation model of speech [MARA93] comprises a sum of N AM-FM signals, each of which models a formant. The discrete-time speech signal is synthesized as


    x(n) = Σ_{i=1}^{N} a_i(n) cos( 2π [ Ω_{c,i} n + Ω_m ∫_0^n q_i(k) dk ] ),    (2-4)

where Ω_{c,i} is the formant ('carrier') frequency, Ω_m is the modulation frequency, a_i(n) is the discrete-time amplitude envelope for the ith formant, q_i(n) is the frequency modulating signal, and the instantaneous formant frequency is Ω_{inst,i}(n) = Ω_{c,i} + Ω_m q_i(n). The Ω_{c,i} are estimated using formant tracking techniques, while Ω_m, the maximum frequency deviation from the Ω_{c,i}, is determined using methods outlined in [MARA93]. Interpolation of parameters between successive frames is performed using the overlap-add method [OPP75].

Since the sinusoid amplitudes are determined from the centre frequency and bandwidth of each formant, determined in turn from the spectral envelope, achieving any given spectral shape using AM-FM synthesis is relatively simple. The AM-FM model has been used to produce highband harmonic excitation which enhanced narrowband speech quality in informal listening tests [TOL98].

2.1.7 Sinusoidal synthesis

Sinusoidal speech synthesis [HED81, ALM82, MCA86] is based on a bank of oscillators operating at frequencies, amplitudes and phases determined from components of the original speech. This means that excitation up to any desired bandwidth can be achieved simply by including an appropriate number of oscillators in the bank. A significant advantage of sinusoidal synthesis is that the sinusoid amplitudes are determined directly by the spectral envelope, and thus spectral flatness is not an issue. Sinusoidal synthesis allows a choice between overlap-add synthesis (as used in most methods from 2.1.1 to 2.1.6) and parameter interpolation. Previous work [MCA95] suggests that parameter interpolation tends to reduce the discontinuities produced by overlap-add synthesis at frame boundaries. Since phase discontinuities can cause perceptual artefacts, parameter interpolation is thus preferable if the extra computation involved can be justified.


Multi-band excitation (MBE) [GRI88] is a form of sinusoidal coding which allows different frequency bands to take on different voicing properties (see section 2.3.3). In wideband extension work by Chan and Hui [CHA96], narrowband CELP coded speech was analyzed and re-synthesized as wideband speech using MBE synthesis. Good results were reported from their mean opinion score (MOS) listening tests. Sinusoidal synthesis has also been successfully applied to the problem of lowband extension in narrowband speech which is bandlimited to 0.3-4 kHz (or similar) [CAR94b].

Sinusoidal transform coding (STC) [MCA86] and prototype waveform interpolation (PWI) [KLE95b,c] offer other possibilities for sinusoidal synthesis, differing from the MBE model in various aspects. STC is reviewed in detail in section 2.2.

2.2 Existing methods for speech synthesis using Sinusoidal Transform Coding

2.2.1 A sinusoidal model for speech

The essence of STC is the representation of a speech signal s(n) in terms of a sum of sinusoids [MCA86]

    s(n) = Re( Σ_{m=1}^{M^k} γ_m^k exp(j n ω_m^k) ),    (2-5)

where γ_m^k = A_m^k exp(j φ_m^k) is the complex amplitude with real amplitude A_m^k and phase φ_m^k for the mth component of the M^k sinusoids in the kth frame. If s(n) is assumed to be purely voiced, then the sinusoids can be constrained to be harmonically related, yielding an estimate ŝ(n) of s(n) calculated as follows:

    ŝ(n; Â_m^k, ω̂_0^k, φ̂_m^k) = Σ_{m=1}^{M(ω_0)} Â_m^k exp[ j(n m ω̂_0^k + φ̂_m^k) ].    (2-6)

This harmonic model is based only on estimates of the fundamental frequency ω_0, amplitudes A_m^k, and phases φ_m^k within the kth frame of speech. A shortcoming of the harmonic model is its poorer representation of unvoiced speech relative to the general sinusoidal model of (2-5), particularly when the fundamental frequency is large. This problem is discussed further in section 2.2.6.

Whilst it is possible to synthesize harmonic speech directly from equation (2-6) at low computational expense using overlap-add techniques [MCA88], this results in 'rough' sounding synthesized speech. Sections 2.2.2 through 2.2.5 review methods designed to overcome these perceptual problems by smoothly interpolating the estimated frequencies mω̂_0, amplitudes Â_m^k, and phases θ̂_m^k = mω̂_0^k + φ̂_m^k from the (k−1)'th to the kth frame.

2.2.2 Frequency interpolation and tracking

Frequency interpolation between consecutive frames may be linear, or it may be determined by continuity constraints on the frequencies and phases at frame boundaries (see section 2.2.4). Accounting for the presence, absence and smooth transitions of the various sinusoid frequencies from frame to frame requires tracking of the sinusoid frequencies. The concepts of 'birth' and 'death' of sinusoidal components [MCA86] can be adopted for tracking purposes. Sinusoidal components in one frame, whose frequencies differ by more than a matching interval Δ from those of all components in the next frame, are declared 'dead'. All other components in that frame are matched, and the remaining unmatched components in the next frame are declared 'born'. An example of this frequency tracking method is illustrated in Fig. 2.5. Components which are declared 'dead' or 'born' retain the same frequency between the present and next frames, and the amplitudes of these components are linearly ramped down to or up from zero respectively.


Figure 2.5. Frequency track matching using birth-death frequency tracking

Large differences in fundamental frequency can occur between adjacent frames, due to effects such as word boundaries, resulting in inaccurate tracking of harmonics from one frame to the next using the birth-death frequency tracker. In MBE synthesis [GRI88], the harmonic frequencies are linearly interpolated between adjacent frames if fundamental frequency changes of less than 10% are encountered. If the fundamental frequency change is greater than 10%, the present frame and next frame are considered to be followed and preceded respectively by unvoiced components. This means that all harmonic components in the previous frame ‘die’ and all harmonic components in the present frame are ‘born’ in the same sense as that of [MCA86].

Another method for sinusoid tracking [TAO99] is based on analysis-synthesis of individual sinusoid tracks. Each candidate sinusoid track is synthesized, and this is then compared with the original speech in the time domain. The candidate sinusoid track which produces the smallest error signal is chosen as the correct sinusoid track. Once all matches have been made, any remaining sinusoidal components in the previous frame ‘die’, while any remaining sinusoidal components in the present frame are ‘born’.
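A toy sketch of birth-death matching is given below. It uses a simple greedy nearest-frequency rule within the matching interval Δ; real trackers (e.g. [MCA86]) impose further ordering constraints, so this is illustrative only:

```python
# Toy sketch of birth-death frequency matching: greedily match each sinusoid
# frequency in the previous frame to the nearest unmatched frequency in the
# present frame within a matching interval `delta` (Hz). Unmatched
# previous-frame components 'die'; unmatched present-frame components are
# 'born'.
def match_tracks(freqs_prev, freqs_curr, delta):
    matches, used = [], set()
    for i, f in enumerate(freqs_prev):
        cands = [(abs(f - g), j) for j, g in enumerate(freqs_curr)
                 if j not in used and abs(f - g) <= delta]
        if cands:
            _, j = min(cands)
            used.add(j)
            matches.append((i, j))
    matched_prev = {m[0] for m in matches}
    deaths = [i for i in range(len(freqs_prev)) if i not in matched_prev]
    births = [j for j in range(len(freqs_curr)) if j not in used]
    return matches, deaths, births

matches, deaths, births = match_tracks([100.0, 200.0, 305.0],
                                       [102.0, 198.0, 410.0], delta=20.0)
# 100<->102 and 200<->198 are matched; 305 Hz dies; 410 Hz is born.
```

Per the text above, 'dead' and 'born' tracks would then have their amplitudes ramped linearly to or from zero at a constant frequency.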

2.2.3 Amplitude interpolation

Having tracked the sinusoidal components, the instantaneous amplitudes of the sinusoids must be determined in order to ensure continuity between frames. The sinusoid amplitudes are usually linearly interpolated between frames. That is, the instantaneous amplitude of each sinusoid is calculated as


    Ã(n) = Â^{k−1} + (Â^k − Â^{k−1}) n / N,    (2-7)

where n = 0, . . . , N − 1 are samples of the (k−1)'th frame and N is the frame length in samples.
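A direct transcription of this linear ramp (the frame length of 160 samples below is an arbitrary example value):

```python
import numpy as np

# Direct transcription of (2-7): linear interpolation of a sinusoid's
# amplitude across the N samples separating two frame boundaries.
def interp_amplitude(a_prev, a_curr, n_frame):
    n = np.arange(n_frame)
    return a_prev + (a_curr - a_prev) * n / n_frame

amps = interp_amplitude(0.5, 1.0, 160)   # ramps from 0.5 towards 1.0
```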

2.2.4 Phase interpolation

Interpolation of the phase θ̂_m^k is a complex task, since continuity of both phase and frequency (the derivative of phase) should be preserved at frame boundaries. This constraint is satisfied for each sinusoid using a cubic phase interpolation function [MCA86], specified in continuous variables as

    θ̃(t) = ζ + γt + αt² + βt³.    (2-8)

ζ and γ are determined by the values of phase and angular frequency respectively from the previous frame (defined as t = 0),

    θ̃(0) = ζ = θ^{k−1},    (2-9a)

    θ̃′(0) = γ = ω^{k−1},    (2-9b)

and α and β are determined by the values of phase and angular frequency respectively for the present frame (defined as t = T, where T is the frame length in seconds),

    θ̃(T) = θ̂^{k−1} + ω̂^{k−1}T + αT² + βT³ = θ̂^k + 2πP,    (2-9c)

    θ̃′(T) = ω̂^{k−1} + 2αT + 3βT² = ω̂^k,    (2-9d)

where P ∈ ℤ accounts for phase wrapping. Rearranging (2-9c) and (2-9d) gives a solution for α and β in terms of P:

    [ α(P) ]   [  3/T²   −1/T  ] [ θ̂^k − θ̂^{k−1} − ω̂^{k−1}T + 2πP ]
    [ β(P) ] = [ −2/T³   1/T²  ] [ ω̂^k − ω̂^{k−1}               ].    (2-10)

Since the measured phase only takes on values between 0 and 2π, synthesis must include phase unwrapping. P is determined by minimizing the variation in the phase for a constant frequency. This is achieved by substituting the continuous variable x for P and minimizing

    f(x) = ∫_0^T ( θ̃″(t; x) )² dt    (2-11)

over x. The value of x which minimizes (2-11) is

    x* = (1/2π) [ (θ̂^{k−1} + ω̂^{k−1}T − θ̂^k) + (ω̂^k − ω̂^{k−1}) T/2 ],    (2-12)

and the optimum value of P, P*, is chosen as the closest integer to x*. The evaluation of α = α(P*) and β = β(P*) completely specifies the interpolation function (2-8). Other phase interpolation models, for example quadratic phase interpolation [GRI88], have also been employed in sinusoidal synthesis applications.
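These equations transcribe directly into code. The sketch below (frame-boundary phases and frequencies assumed given) solves for α, β and P*, and returns the interpolated phase track; by construction the track meets the previous-frame phase exactly and the present-frame phase modulo 2π:

```python
import numpy as np

# Sketch of the cubic phase interpolation of (2-8)-(2-12) for one sinusoid.
def cubic_phase_track(theta_prev, omega_prev, theta_curr, omega_curr, T, t):
    # (2-12): continuous minimizer of the phase variation, then P* = round(x*).
    x_star = ((theta_prev + omega_prev * T - theta_curr)
              + (omega_curr - omega_prev) * T / 2.0) / (2.0 * np.pi)
    p_star = np.round(x_star)
    # (2-10): solve for alpha(P*) and beta(P*).
    d_theta = theta_curr - theta_prev - omega_prev * T + 2.0 * np.pi * p_star
    d_omega = omega_curr - omega_prev
    alpha = 3.0 / T**2 * d_theta - 1.0 / T * d_omega
    beta = -2.0 / T**3 * d_theta + 1.0 / T**2 * d_omega
    # (2-8): zeta = theta_prev, gamma = omega_prev.
    return theta_prev + omega_prev * t + alpha * t**2 + beta * t**3

T = 0.01                                   # 10 ms frame
t = np.linspace(0.0, T, 101)
track = cubic_phase_track(0.3, 2 * np.pi * 100.0, 2.0, 2 * np.pi * 110.0, T, t)
```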

For frames in which a frequency track is born, McAulay and Quatieri [MCA95] suggest that the startup phase be defined from

    θ^{k−1} = θ^k − ω^k S,    (2-13)

where S is the number of samples between the previous and present frame. Experimental work for this thesis suggested that where the MBE frequency tracking algorithm was used, this startup phase produced audible pulses. One alternative, which produced a slight perceptual improvement, was to apply phase interpolation as per equation (2-8), except with α and β determined under the constraints θ̂^{k−1} = 0 and ω̂^k = ω̂^{k−1}.


2.2.5 Phase estimation using the minimum phase model

The phases θ̂_m^k can be estimated based only on knowledge of the spectral envelope, if the assumption is made that the spectral envelope is minimum phase. This results in the following model [MCA87]:

    ŝ(n) = Σ_{m=1}^{M} A(mω̂_0) exp[ j( (n − n_0) mω̂_0 + Φ_s(mω̂_0) + ε(mω̂_0) ) ],    (2-14)

where A(ω) is the amplitude, ω̂_0 is the estimated fundamental frequency, n_0 is the onset time, Φ_s(ω) is the vocal tract system phase, ε(mω̂_0) is the residual phase of the mth harmonic, and M is the number of sinusoids. Details of the onset time calculation are given in [MCA92]. If the vocal tract system transfer function H_s(ω), including both the glottal and vocal tract filters, is assumed to be minimum-phase [MCA91] with magnitude response A_s(ω) and phase response Φ_s(ω), then

    log H_s(ω) = Σ_{m=−∞}^{∞} c_m exp(−jmω),    (2-15)

from which it follows that

    log A_s(ω) = c_0 + 2 Σ_{m=1}^{∞} c_m cos(mω),    (2-16a)

    Φ_s(ω) = −2 Σ_{m=1}^{∞} c_m sin(mω)    (2-16b)

[OPP75]. The real cepstral parameters {c_m}_{m=0}^{M}, and hence the vocal tract system phase component Φ_s(ω), can thus be estimated using the Hilbert transform of the spectral envelope A_s(ω). The assumption that the vocal tract filter has minimum phase is known to be deficient, and possibilities for improvement of this model include correction using all-pole filtering or a Rosenberg pulse model [ROS71] to increase the phase accuracy [SUN97].
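The cepstral relations (2-15)-(2-16) correspond to the standard homomorphic construction of a minimum-phase response from a magnitude spectrum. A sketch (on a discrete FFT grid, so the infinite sums are truncated and the cepstrum is mildly aliased):

```python
import numpy as np

# Sketch of (2-15)/(2-16): estimate the phase of a minimum-phase system from
# its magnitude spectrum alone, via the real cepstrum. `mag` is |Hs| sampled
# on a full n_fft-point FFT grid.
def minimum_phase_from_magnitude(mag):
    n_fft = len(mag)
    c = np.fft.ifft(np.log(mag)).real            # real cepstral parameters c_m
    # Fold the cepstrum: keep c_0, double the causal part (cf. (2-16a)/(2-16b)).
    fold = np.zeros(n_fft)
    fold[0] = c[0]
    fold[1:n_fft // 2] = 2.0 * c[1:n_fft // 2]
    fold[n_fft // 2] = c[n_fft // 2]
    log_h = np.fft.fft(fold)                     # log Hs on the same grid
    return np.imag(log_h)                        # phase response Phi_s(w)

# Check against a known minimum-phase filter 1 - 0.9 z^{-1} (zero at 0.9,
# inside the unit circle).
h = np.array([1.0, -0.9])
H = np.fft.fft(h, 512)
phase_est = minimum_phase_from_magnitude(np.abs(H))
```

For this filter the cepstrally estimated phase matches np.angle(H) to numerical precision; for envelopes with non-minimum-phase components the match degrades, which is the deficiency noted above.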

In the case of harmonic sinusoidal synthesis, the phases of the sinusoids are integer multiples of the fundamental phase, and hence the sinusoid phases will be locked at all times, removing the necessity to compute the onset time n_0 [MCA95]. Since the fundamental phase for the kth frame is given by

    φ_0(kT) = φ_0((k−1)T) + (ω_0^{k−1} + ω_0^k) T/2,    (2-17)

the phase for the mth harmonic becomes

    θ̂(mω_0) = mφ_0(kT) + Φ_s(mω_0) + ε̂(mω_0),    (2-18)

where T is the frame length (in seconds), and Φ_s and ε are determined by equations (2-16b) and (2-20) respectively. Phase estimation using these minimum phase models is useful in very low bit rate speech coding applications where phase information is not explicitly transmitted.

2.2.6 Modelling of unvoiced components

Experiments [MCA87] have shown that during voiced speech, the phase model of section 2.2.5 is sufficiently accurate to consider the phase residual ε(mω_0) to be zero, while during unvoiced speech, the phase residuals can be randomly distributed over the interval (−π, π]. The residual phase estimate ε̂(mω_0) of [MCA87] is based upon the degree to which the present frame of speech is voiced, P_V ∈ [0, 1], and Θ_m, a uniformly distributed random variable on (−π, π]. It is given by

    ε(mω_0) = Θ_m (1 − P_V).    (2-19)


If the phase model of section 2.2.5 is employed in conjunction with the voicing-dependent cutoff frequency model [MAK78] discussed in section 2.3.2, the residual component can be modelled as

    ε(ω) = 0           if ω ≤ ω_c,
           U(−π, π]    if ω > ω_c,    (2-20)

where ω_c is the voicing-dependent cutoff frequency (see section 2.3.2) and U(−π, π] is a uniformly distributed random variable on (−π, π] [MCA95]. Sinusoidal components above ω_c are forced to be more closely spaced than those below ω_c in order to better approximate unvoiced excitation.

2.3 Existing methods for mixed voiced/unvoiced excitation synthesis

2.3.1 Binary excitation

The binary voicing decision, seen in Fig. 2.6, is probably the oldest model of speech excitation (refer to [DEL93] for example), and assumes that any frame of speech contains either purely periodic or purely random excitation.

Figure 2.6. Binary model of mixed voiced/unvoiced excitation

2.3.2 Voicing dependent cut-off frequency

Observations that the degree of voicing varies as a function of frequency within a single frame of speech led to models of mixed excitation which allow simultaneous voiced and unvoiced excitation. Among the first of these was the voicing dependent cut-off frequency model of Makhoul et al. [MAK78], which assumes that the lower portion of the speech spectrum is voiced, while the higher portion is unvoiced. The cut-off frequency, which divides the two portions, thus determines the extent to which the synthesized speech is voiced, as shown in Fig. 2.7.

Figure 2.7. Voicing dependent cut-off frequency model

2.3.3 Multi-band excitation

Improved flexibility in the modelling of voicing can also be achieved using the multi-band excitation model [GRI88], which allows voicing decisions to be made for each of a number of bands across the speech bandwidth. Each band may take on any degree of voicing from unvoiced to voiced. In speech coding, the decisions are often binary (as shown in Fig. 2.8) due to bit rate considerations.

Figure 2.8. MBE voicing model (V = voiced, U = unvoiced)

2.3.4 Single band voicing

While binary approximations to the degree of voicing are often made in speech coding, the degree of voicing tends to vary continuously across the spectrum of natural speech. Since periodic excitation occupies infinitely narrow bands of the spectrum, another possible voicing model is to simply add gain-adjusted random excitation across the entire band of interest (c.f. Fig. 2.9). The value of the gain is determined by the average degree of voicing across the entire spectrum. This model should be used with care, since excessive random excitation could be produced in low frequency bands of frames whose voicing state varies from highly voiced to highly unvoiced with increasing frequency.

Figure 2.9. Single band voicing model

2.3.5 Application of mixed excitation models to wideband enhancement

Since voiced speech tends to become more unvoiced with increasing frequency, a reasonable approximation to the highband excitation is purely random excitation. This approximation has been made by many researchers [PAT81, SCHN98, CHE94], who consider that the principal advantage of highband excitation is to improve the intelligibility of unvoiced phonemes. Translating some of the highband random excitation into the narrowband has even been shown to improve intelligibility in narrowband coding [HEI98].

CELP-based excitation schemes such as [YOS94] use a binary excitation model to switch between periodic and random wideband excitation. A variant on the voicing dependent cut-off frequency was employed in a high quality wideband speech synthesizer [HOL90].

Many researchers [e.g. CAR94b, NAKA97, YAS95b, YOS94] have opted for the spectral folding scheme of section 2.1.4 [MAK79] to generate the highband excitation. In the example of Fig. 2.10a, wideband excitation was generated from 4 kHz bandlimited excitation by spectral folding, producing periodic components close to 8 kHz. This is totally unlike the spectral characteristics of true wideband excitation, seen in Fig. 2.10b, except that the non-harmonic highband components tend to be perceived as noise. From this perspective spectral folding operates similarly to the voicing dependent cut-off frequency of section 2.3.2. Perceptual artefacts introduced by spurious sinusoids close to 8 kHz may rarely be noticeable, due to the small magnitudes typically found in this spectral region for voiced phonemes.

Figure 2.10a. Magnitude spectrum of wideband excitation generated from an example frame of narrowband speech using spectral folding.

Figure 2.10b. Magnitude spectrum of true wideband excitation for the same frame.

[CHA96] uses wideband MBE excitation, and binary voicing decisions are made over small bands between 0 and 8 kHz. This offers considerable flexibility in modelling highband speech excitation, but requires the use of codebooks to predict the voicing status of bands in the highband, introducing possible inaccuracies to their degree of voicing.

A very recent and promising model for highband voicing is to estimate the degree of voicing in the 3-4 kHz band of the narrowband speech and use this estimate for the highband degree of voicing [MCCR00]. This is an effective estimation method because if the 3-4 kHz band is strongly unvoiced, then the highband will also be strongly unvoiced, while if the 3-4 kHz band is strongly voiced, then a large part of the highband may also be voiced.


2.4 New approach: STC-based highband excitation synthesis

From the review of selected existing sinusoidal synthesis techniques in section 2.2, it is apparent that sinusoidal synthesis has potential for producing high quality highband speech excitation. This arises from the ease of obtaining a highband signal with the following properties:
- harmonic voiced highband components;
- a flat highband spectrum (before spectral shaping);
- flexibility in the implementation of highband voiced and unvoiced components; and
- smooth phase transitions at frame boundaries.

In this research, the harmonic STC model was chosen for highband excitation synthesis because it provides a high quality, bandwidth-scalable synthesis of voiced and unvoiced speech, and has the additional advantage of offering possibilities for the estimation of instantaneous highband phases from the estimated highband spectral envelope.

2.4.1 New model for highband voicing

Inspection of magnitude spectra of wideband speech (for example those of Fig. 2.12a and 2.13a of section 2.5.1) reveals that speech can take on a highly voiced or a highly unvoiced characteristic above 4 kHz. The voicing model employed in this technique is based upon single band voicing control (c.f. section 2.3.4), and allows a mixture of voiced and unvoiced excitation across the full highband excitation bandwidth, as seen in Fig. 2.11.

Figure 2.11. Voicing model for STC-based highband excitation synthesis


The random component of the excitation is derived from two sources, highband random excitation and randomization of the sinusoid phases, and these are discussed further in sections 2.4.3 and 2.4.2 respectively. Both of these sources of randomization are controlled by the factor

    α_r = 10^{−2(1−ν̂)},    (2-21)

where ν̂ ∈ [0, 1] is the narrowband degree of voicing (0 = unvoiced, 1 = voiced) estimated using [ATA76].

2.4.2 Periodic component

The periodic component of the excitation synthesis scheme was based upon the harmonic sinusoidal model of section 2.2.1 and the minimum-phase filter model for harmonic synthesis of section 2.2.5 [MCA95]. Periodic highband speech was modelled using

    ŝ_h(n) = Σ_{m=M_N+1}^{M_W} Â(mω̂_0) exp[ j( nmω̂_0 + mφ̂_0 + Φ̂_s(mω̂_0) + α_r ε ) ],    (2-22)

where Â(ω) is the estimated spectral envelope, ω̂_0 is the estimated fundamental frequency, Φ̂_s(ω) is the estimated vocal tract system phase, φ̂_0 is the estimated fundamental phase, M_N and M_W are the number of harmonics in the narrowband and wideband respectively, and ε ∈ (−π, π] is a uniformly distributed random variable. The fundamental frequency ω_0 was estimated using the STC-based method of [MCA90]. The amplitudes Â(mω̂_0) and the vocal tract system phases Φ̂_s(mω̂_0) were evaluated from the highband spectral envelope (c.f. section 2.2.5), and the fundamental phase for the kth frame was calculated using (2-17). Note that a similar highband phase estimation technique was later used by Aguilar et al. [AGU00] for low bit rate wideband speech coding.


Parameter interpolation between frames was achieved by combining the MBE harmonic frequency tracking technique of section 2.2.2 [GRI88], linear interpolation of the sinusoid amplitudes (see section 2.2.3) and cubic phase interpolation [MCA86] (see section 2.2.4).

Note that this synthesis technique can be modified to produce wideband sinusoidal excitation simply by setting MN = 0. In a speech enhancement application, the narrowband speech could optionally be re-synthesized with very little additional computation. In a wideband speech coding application, where the narrowband coding was achieved using sinusoidal methods, this technique for producing highband voiced excitation could be integrated with considerable savings in overall complexity.

2.4.3 Random component

Unvoiced components in the highband were modelled using a combination of the randomized harmonic phases of equation (2-22) and random excitation shaped by the spectral envelope (2-23). The random excitation component for each frame of speech was synthesized as

$$\hat{S}_r(\omega) = D_r \hat{A}(\omega) H_{HP}(\omega) W(\omega)$$  (2-23)

where $\hat{A}(\omega)$ is the all-pole transfer function of the wideband spectral envelope comprising the narrowband and estimated highband spectral envelopes, $H_{HP}(\omega)$ is a highpass filter with 3.8 kHz cutoff frequency, and $W(\omega)$ is the spectrum of a wideband uniformly distributed random sequence with unity mean square value. The time domain random component $\hat{s}_r(n)$ was constructed from $\hat{S}_r(\omega)$ using the overlap-add technique [OPP75].

The output highband speech was synthesized as the sum of the harmonic and random components (illustrated in Fig. 2.11 of section 2.4.1)

$$\hat{s}_H(n) = \hat{s}_h(n) + \hat{s}_r(n)$$  (2-24)
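A single frame of the shaped random component of (2-23) can be sketched in Python as below, using an ideal (brick-wall) highpass mask in place of a designed $H_{HP}(\omega)$ and a tabulated envelope callable in place of the all-pole evaluation; the frame-to-frame overlap-add and all names are illustrative assumptions.

```python
import numpy as np

def synth_random_frame(A_hat, D_r, N, fs=16000, f_hp=3800.0, rng=None):
    """One frame of the spectrally shaped random component of (2-23).

    The spectrum of a unit-mean-square random sequence is shaped by
    D_r, the wideband envelope A_hat (callable, Hz -> magnitude) and
    an ideal highpass mask with 3.8 kHz cutoff, then returned to the
    time domain. Successive frames would be combined by overlap-add.
    """
    rng = np.random.default_rng() if rng is None else rng
    w = rng.uniform(-np.sqrt(3), np.sqrt(3), N)   # unity mean square value
    W = np.fft.rfft(w)
    f = np.fft.rfftfreq(N, d=1.0 / fs)
    shape = np.array([A_hat(fk) for fk in f]) * (f >= f_hp)
    return np.fft.irfft(D_r * shape * W, n=N)
```

Per (2-24), the frame of output highband speech is then simply the sample-wise sum of this random component and the harmonic component.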


2.5 Evaluation of selected highband excitation generation methods

In this section, comparisons are made between the novel STC-based highband excitation synthesis technique of section 2.4 and the spectral folding method [MAK79] (c.f. section 2.1.4). Spectral folding was chosen for comparison because it is a computationally inexpensive method which has been widely employed in previous research. When examining different methods of highband excitation synthesis, it is difficult to define a single optimum objective criterion for their comparison. In previous research, comparisons have been made using spectra, spectrograms, or subjective listening tests. Throughout this section, the true highband spectral envelope is employed in place of the estimated highband spectral envelope, in order to concentrate on the differences in highband excitation due to the various techniques.

2.5.1 Excitation spectra

The magnitude spectra were calculated for example frames of voiced and unvoiced narrowband speech which were enhanced using both the novel STC-based highband excitation synthesis method of section 2.4 and spectral folding [MAK79]. Inspection of the spectra (figures 2.12 and 2.13) clearly reveals that STC-based highband excitation synthesis (c.f. section 2.4) can automatically generate mixed excitation appropriate to the required degree of voicing. On the other hand, inspection of spectra (figures 2.14 and 2.15) generated using spectral folding on the same voiced and unvoiced speech frames demonstrates its deficiencies relative to STC-based highband excitation. While spectral folding appears adequate for the synthesis of highband excitation for strongly unvoiced frames (e.g. Fig. 2.15), it erroneously introduces (inharmonic) sinusoids throughout the highband for a strongly voiced frame, as seen in Fig. 2.14. The spectral notch around 4 kHz produced by spectral folding is also clearly evident in Fig. 2.14b.


Figure 2.12. Magnitude spectra of (a) a highly voiced frame of wideband speech and (b) the same speech but with the highband synthesized from the 0-4 kHz portion using the novel technique of section 2.4.

Figure 2.13. Magnitude spectra of (a) a highly unvoiced frame of wideband speech and (b) the same speech but with the highband synthesized from the 0-4 kHz portion using the novel technique of section 2.4.


Figure 2.14. Magnitude spectra of (a) a highly voiced frame of wideband speech and (b) the same speech but with the highband synthesized from the 0-4 kHz portion using spectral folding.

Figure 2.15. Magnitude spectra of (a) a highly unvoiced frame of wideband speech and (b) the same speech but with the highband synthesized from the 0-4 kHz portion using spectral folding.

2.5.2 Excitation spectrograms

One method for analyzing the spectral characteristics of different highband excitation techniques over several frames is to employ a spectrogram. Figures 2.16, 2.17 and 2.18 are spectrograms of the 0.6 second speech segment “she had”, uttered by a male speaker, displaying respectively the true wideband speech signal, wideband speech with the highband synthesized using the STC-based method, and wideband speech with the highband synthesized using the spectral folding method.

Figure 2.18 shows that spectral folding creates highband sinusoid tracks which move in contrary directions to the harmonic tracks in the narrowband, and examples of this are marked by dashed regions. Possible improvements to the STC-based highband excitation (Fig. 2.17) could include the use of a more accurate pitch estimation algorithm.

Figure 2.16. Spectrogram of the 0.6 second wideband speech segment “she had” (male speaker).

Figure 2.17. Spectrogram of the 0.6 second wideband speech segment “she had” (male speaker), where the 3.8-8 kHz region was synthesized from the 0-4 kHz speech using STC-based highband excitation.


Figure 2.18. Spectrogram of the 0.6 second wideband speech segment “she had” (male speaker), where the 3.8-8 kHz region was synthesized from the 0-4 kHz speech using spectral folding.

2.5.3 Subjective excitation assessment

In order to assess the performance of the new STC-based highband excitation relative to narrowband and true wideband speech, 28 listeners (16 male and 12 female, between the ages of 20 and 35) were presented with samples of speech prepared according to different conditions. The listeners were asked to rate the conditions on a five-point absolute category rating (ACR) quality scale from ‘bad’ to ‘excellent’ (see Appendix A for details). The STC-based highband excitation was synthesized using the true highband envelopes in this test. The resulting mean opinion scores (MOS) for the conditions are shown in Table 2.1.

Table 2.1. ACR listening test results for synthetic excitation

  Condition                                                     MOS    95% CI
  True wideband speech                                          4.25   ±0.13
  Synthetic highband excitation, true highband envelope/gain    3.31   ±0.15
  True narrowband speech                                        2.74   ±0.17

These results show that listeners noted a significant preference for speech which had been enhanced using the STC-based excitation over the narrowband speech, while small artefacts in the highband excitation caused it to score lower than the true wideband speech.

A number of informal listening tests were also conducted. Comparisons between the spectral folding method and the STC-based technique confirmed that the harmonic sinusoid-derived excitation was preferred to the spectral folding excitation, particularly during voiced segments of speech. Tests on the STC-based excitation using harmonic-only and random-only excitation revealed that the novel mixed excitation synthesis model described in section 2.4 provides the better model for highband speech. During informal listening tests conducted through speakers, rather than through headphones (as in the ACR tests of Table 2.1), subjects had difficulty differentiating the narrowband speech with synthetic STC-based highband excitation from the true wideband speech.

2.6 Conclusion

In this chapter, a number of existing highband excitation and speech synthesis methods were reviewed, and their relative merits discussed. A new technique, STC-based highband excitation synthesis, was introduced for the synthesis of high quality highband excitation using only information from the narrowband speech. STC-based excitation synthesis offers mixed voiced/unvoiced excitation with harmonic sinusoid components in voiced speech, a flat spectral characteristic (before spectral shaping), and satisfactory phase modelling.

STC-based highband synthesis was compared with spectral folding, revealing its ability to better model a range of different types of highband excitation and to accurately reproduce highband voiced components. In subjective assessments, listeners indicated a firm preference for narrowband speech enhanced using STC-based highband synthesis and the true highband spectral envelopes over the narrowband speech without enhancement.

Ch 3 / Existing Methods for Highband Spectral Envelope Estimation 39

Chapter 3. Existing Methods for Highband Spectral Envelope Estimation

[Block diagram of wideband enhancement: narrowband analysis of the input narrowband speech drives highband envelope estimation, narrowband-highband envelope continuity and highband excitation synthesis; the excitation is filtered to give the synthetic highband speech, which is added to the upsampled (×2, LPF 3.8 kHz) narrowband speech to form wideband speech.]

3.0 Overview

In this chapter, existing methods for the estimation of the highband spectral envelope shape from narrowband information are reviewed. The organization of this chapter is based around three broad classes of estimation techniques: section 3.1 discusses a vector quantization-based approach, known as codebook mapping or non-linear interpolative vector quantization; section 3.2 examines work on a maximum likelihood method known as statistical recovery; and section 3.3 surveys a number of different linear approaches to envelope estimation. Concepts from this review underlie many of the new highband envelope estimation techniques presented in chapter 4.

This chapter does not provide any explanation of techniques from speech analysis, despite frequent reference to methods for estimating and parameterizing the spectral envelope. The reader is referred to such sources as [MAK75a, ITA75a, MAR76, PAUL81, ELJ91, DEL93, MAL99] for further detail. Note that the term ‘highband envelope’ can be interpreted loosely to indicate any spectral envelope whose upper frequency limit is greater than that of the narrowband envelope, but whose lower frequency limit may lie within, or even at the lower limit of, the narrowband.


3.1 Codebook mapping

Codebook mapping relies on the principles of vector quantization, and hence this subject is briefly reviewed first by way of introduction.

3.1.1 Introduction to vector quantization (VQ)

Vector quantization [LIND80, GER83, GRA84, MAK85, ABU90, GER93] approximates a vector containing some component of an input signal by another vector from a finite set of predetermined representative vectors, and is thus an important signal compression tool. This quantization operation can be described mathematically as

$$Q: \mathbb{R}^k \to C,$$  (3-1a)

$$Q(\mathbf{x}) = \hat{\mathbf{x}}_i \;\text{ where }\; i = \arg\min_j d(\mathbf{x}, \hat{\mathbf{x}}_j),$$  (3-1b)

where $\mathbf{x} \in \mathbb{R}^k$ is the input vector, $C = \{\hat{\mathbf{x}}_i \in \mathbb{R}^k \mid 1 \le i \le N\}$ is a codebook of N code vectors $\hat{\mathbf{x}}_i$ of dimension k, and d is a distortion criterion. The quantization operator Q implicitly defines a partitioning of $\mathbb{R}^k$ into N regions

$$R_i = \{\mathbf{x} \in \mathbb{R}^k \mid d(\mathbf{x}, \hat{\mathbf{x}}_i) \le d(\mathbf{x}, \hat{\mathbf{x}}_j) \;\forall j = 1, 2, \ldots, N,\; j \ne i\},$$  (3-2)

whose centroids $\hat{\mathbf{x}}_i$, the code vectors of C, each optimally represent their respective regions according to the distortion criterion d. A two-dimensional example of this partitioning is shown in Fig. 3.1 for a codebook of size 20.
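A minimal sketch of the quantization operation (3-1), assuming squared Euclidean distortion for d (the function name is illustrative):

```python
import numpy as np

def vq_quantize(x, codebook):
    """Nearest-neighbour quantization of (3-1) under squared Euclidean
    distortion: return the index i minimizing d(x, x_i) and the
    corresponding code vector from the (N, k) codebook array."""
    d = np.sum((codebook - x) ** 2, axis=1)
    i = int(np.argmin(d))
    return i, codebook[i]
```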


Figure 3.1. Two-dimensional example of a VQ partitioning of vectors $\mathbf{x} = [x_1 \; x_2]^T$, where the dots represent the code vectors (centroids).

The advantage of VQ in a coding configuration is that, if there is sufficient memory in both the encoder and the decoder to store the codebook (e.g. Fig. 3.2), the code vector index i can be transmitted in place of the input vector, allowing a reduction in the number of bits required for its transmission.

Figure 3.2. Spectral coding using vector quantization: the encoder transmits the codebook index i over the channel, and the decoder looks up the output spectral envelope in an identical codebook.

More generally, vector quantization is a pattern-matching technique which can be used to design various kinds of decision-making algorithms with applications including classification, approximation, enhancement and pattern recognition.

3.1.2 Distance measures

A key consideration in vector quantizer design is the sense in which vector comparisons are made, that is, the distance or distortion measure d. In order to optimize vector quantizer design according to a given objective evaluation criterion, the same criterion should be used as the distance measure throughout the design process. Probably the most common method for objective evaluation of spectral coding techniques is spectral distortion, mathematically described as


$$d(A, \hat{A}) = \frac{2}{\omega_s} \int_0^{\omega_s/2} \left[ 20 \log_{10} A(\omega) - 20 \log_{10} \hat{A}(\omega) \right]^2 d\omega,$$  (3-3)

where $A(\omega)$ and $\hat{A}(\omega)$ are the magnitude spectra of the envelopes of two speech segments being compared and $\omega_s$ is the sampling frequency. Calculation of spectral distortion is performed using a discrete version of (3-3), expressed as

$$d(A, \hat{A}) = \frac{1}{K} \sum_{k=1}^{K} \left[ 20 \log_{10} A(\omega_k) - 20 \log_{10} \hat{A}(\omega_k) \right]^2,$$  (3-4)

where the K frequencies $\omega_k$, at which the magnitude spectra of the envelopes are evaluated, are usually equally spaced.

Spectral distortion will be used as a distance and evaluation criterion throughout this thesis. There are, of course, many other distortion measures for spectral coding, and some other common measures are outlined in Appendix B. More extensive reviews can be found, for example, in [GRA80, QUA88, DIM89, KUB91, DEL93, RAB93].
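The discrete measure (3-4) is straightforward to compute. A sketch follows, with the inputs assumed to be sampled envelope magnitudes at K equally spaced frequencies and the function name illustrative:

```python
import numpy as np

def spectral_distortion(A, A_hat):
    """Discrete spectral distortion of (3-4) between two sampled
    envelope magnitude spectra A and A_hat (length-K arrays evaluated
    at equally spaced frequencies), in dB^2."""
    diff = 20.0 * np.log10(A) - 20.0 * np.log10(A_hat)
    return float(np.mean(diff ** 2))
```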

3.1.3 Codebook design for VQ

The objective of vector quantization codebook design is the minimization of quantization distortion over all training data $\mathbf{x}_t$, that is,

$$\min(E) = \min_{\{\hat{\mathbf{x}}_i\}} \left( \sum_t d(\mathbf{x}_t, Q(\mathbf{x}_t)) \right),$$  (3-5)

where $\hat{\mathbf{x}}_i = Q(\mathbf{x})$ is the quantized value of a given input vector $\mathbf{x}$. Since the cost function E is a function of the N code vectors, it is often a complicated surface which contains more than one minimum. In such instances, location of the global minimum is a highly computationally intensive procedure and is rarely attempted in practice. There is a large body of literature on the training and optimization of codebooks for VQ (see for example [LIND80, ABU90, GER93, SAD95]), and the summary which follows is only a selective introduction to the subject.

Codebook design can be considered to comprise four steps:

1. Selection of an appropriate set of training data.
2. Initial codebook design: using a database of training vectors of size M, a codebook of the desired size N is generated either by ‘pruning’ or by ‘growing’ methods.
3. Local optimization: the codebook is modified with the objective of minimizing the total distortion of the codebook relative to the training database. In general the global total distortion minimum is not achieved.
4. Attempted global optimization: the local optimization step is repeated under different initial conditions in an attempt to design the codebook which achieves the global total distortion minimum.

Selection of training data

The data used to train codebooks should ideally be a good approximation of the wide range of different vectors to be represented. In the case of spectral quantization, the training data should encompass as many speakers as possible, from both genders, preferably with different native tongues, and should contain as many different utterances as possible. In general, larger training sets produce better codebooks, and it is of interest to determine the number of training vectors required to design a low distortion codebook of the desired size. In vector quantization literature a training ratio is often defined, being the number of training vectors per codebook vector above which the incremental codebook improvement is ‘small’. This ratio is given as a minimum of 20 in the spectral envelope quantization work described in [MAK85] and around 5 in the speech example of [LIND80], but may be larger for other sources.

Initial codebook design

One class of codebook initialization techniques forms a codebook by pruning or combining vectors from the training database. The pairwise nearest neighbour algorithm (Algorithm 3.1) [EQU87] begins with the entire training set, and iteratively combines vectors which are very close to each other, until a codebook of size N is obtained.

1. Set n = 0. Form a codebook $C_0$ consisting of all M narrowband training vectors and decide on a desired codebook size N, where N < M.
2. Find the pair of vectors $(\mathbf{x}_j, \mathbf{x}_k)$, $j \ne k$, within $C_n$ which have the smallest distance.
3. Replace this pair of vectors in $C_n$ by their centroid.
4. Set $n \leftarrow n + 1$ and return to step 2 until $C_n$ has size N.

Algorithm 3.1. Pairwise Nearest Neighbour codebook initialization algorithm
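Algorithm 3.1 can be sketched in Python as below. This naive version searches all pairs at every iteration (roughly O(M^3) overall) and merges each pair into its unweighted midpoint; a practical implementation would weight centroids by cluster size and cache pair distances.

```python
import numpy as np

def pnn_codebook(training, N):
    """Pairwise Nearest Neighbour initialization (Algorithm 3.1):
    repeatedly merge the closest pair of vectors (squared Euclidean
    distance) into their centroid until only N vectors remain."""
    vecs = [np.asarray(v, dtype=float) for v in training]
    while len(vecs) > N:
        best = None
        for j in range(len(vecs)):                  # step 2: closest pair
            for k in range(j + 1, len(vecs)):
                d = np.sum((vecs[j] - vecs[k]) ** 2)
                if best is None or d < best[0]:
                    best = (d, j, k)
        _, j, k = best
        merged = 0.5 * (vecs[j] + vecs[k])          # step 3: replace by centroid
        vecs = [v for i, v in enumerate(vecs) if i not in (j, k)] + [merged]
    return np.array(vecs)
```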

An alternative algorithm starts from the optimal codebook of size one (the centroid of the entire training data), and increasingly larger codebooks are produced by iteratively splitting the codebook [LIND80]. The first codeword, $\hat{\mathbf{x}}_1$, is split into $\hat{\mathbf{x}}_1$ and $\hat{\mathbf{x}}_1 + \boldsymbol{\varepsilon}_1$, where $\boldsymbol{\varepsilon}_1$ is a vector of small Euclidean norm, as seen in Algorithm 3.2. At subsequent splitting stages, vectors $\boldsymbol{\varepsilon}_i$ are used to perturb codewords $\hat{\mathbf{x}}_i$. One choice for $\boldsymbol{\varepsilon}_i$ which tends to perturb the split codewords into higher distortion regions is the vector whose k'th component is the standard deviation of the k'th component of all training vectors in partition $R_i$ [GER93]. After splitting, the new codebook has an extra level of resolution and can have a distortion no greater than that of the previous codebook, since it contains the previous codebook. Note that this algorithm is considerably more computationally efficient than the pairwise nearest neighbour algorithm.

1. Set n = 0. Choose the narrowband codebook $C_0$ as a single vector, calculated as the centroid of the entire set of training vectors. Determine the desired codebook size N and calculate the perturbation vector $\boldsymbol{\varepsilon}_1$ for the first iteration.
2. Given a codebook $C_n$ of size M, split each code vector $\hat{\mathbf{x}}_i$ into $\hat{\mathbf{x}}_i$ and $\hat{\mathbf{x}}_i + \boldsymbol{\varepsilon}_i$, forming a new codebook of size 2M.
3. Locally optimize this codebook (e.g. using Algorithm 3.3) to form $C_{n+1}$.
4. If the present codebook size is still less than N, set $n \leftarrow n + 1$, calculate the $\boldsymbol{\varepsilon}_i$ and go to step 2. Otherwise, set $C = C_{n+1}$.

Algorithm 3.2. Codebook initialization by splitting
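One splitting stage (step 2 of Algorithm 3.2) might look as follows, using the per-component standard deviation choice for $\boldsymbol{\varepsilon}_i$ suggested in [GER93]. This is a sketch only; the fallback perturbation for an empty partition and the function name are assumptions.

```python
import numpy as np

def split_codebook(codebook, training, assignments):
    """One splitting stage of Algorithm 3.2: each code vector x_i is
    split into x_i and x_i + eps_i, where eps_i holds the per-component
    standard deviations of the training vectors in partition R_i.
    `assignments[t]` is the codebook index that quantizes training[t]."""
    new = []
    for i, c in enumerate(codebook):
        members = training[assignments == i]
        eps = members.std(axis=0) if len(members) else 1e-3 * np.ones_like(c)
        new.extend([c, c + eps])
    return np.array(new)
```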


Local optimization

Having formed an initial codebook of the desired size N, the application of an improvement step algorithm produces a locally optimum codebook. Possibly the most famous improvement algorithm in speech coding is the Generalized Lloyd Algorithm (GLA), a generalization of [LLO82], which is illustrated in Fig. 3.3 and described in Algorithm 3.3. Whilst GLA is guaranteed not to increase the total distortion at each iteration, in general the distortion will converge to a local minimum rather than a global one. The combination of GLA with codebook initialization by splitting is commonly known as the LBG algorithm [LIND80]. Note that there are a number of alternatives to the LBG algorithm, many of which produce smaller total codebook distortion (e.g. [KAT94, CHEN96]).

1. Set n = 0. Choose an initial codebook $C_0$, and use it to encode the entire training data using vector quantization. Calculate the total distortion due to the quantization process

$$D_n = \sum_{m=1}^{M} d(\mathbf{x}_m, \hat{\mathbf{x}}_m),$$  (3-6)

where $\mathbf{x}_m$ and $\hat{\mathbf{x}}_m$ are respectively the mth unquantized and quantized vectors of the M training vectors.
2. Replace each code vector in $C_n$ by the average of the training vectors that were quantized by it, forming $C_{n+1}$.
3. Encode the entire training data using $C_{n+1}$ (as per step 1) and calculate the new total distortion $D_{n+1}$ using (3-6).
4. If the relative change of the distortion $(D_n - D_{n+1})/D_n$ is less than a pre-determined threshold, then stop the algorithm. Otherwise, set $n \leftarrow n + 1$ and go to step 2.

Algorithm 3.3. Generalized Lloyd Algorithm
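Algorithm 3.3 under squared Euclidean distortion can be sketched as below. This is illustrative only: the empty-partition handling, the iteration cap and the function name are assumptions. Combined with repeated splitting (Algorithm 3.2) this yields the LBG design procedure.

```python
import numpy as np

def gla(training, codebook, threshold=1e-4, max_iter=100):
    """Generalized Lloyd Algorithm (Algorithm 3.3), squared Euclidean
    distortion: alternately move each code vector to the mean of its
    partition and re-encode the training data, stopping when the
    relative distortion change falls below the threshold."""
    codebook = np.array(codebook, dtype=float)
    prev = None
    for _ in range(max_iter):
        # encode: index of nearest code vector for every training vector
        d2 = ((training[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        idx = d2.argmin(axis=1)
        D = d2[np.arange(len(training)), idx].sum()   # total distortion (3-6)
        if prev is not None and (prev == 0 or (prev - D) / prev < threshold):
            break
        prev = D
        # centroid update: empty partitions keep their old code vector
        for i in range(len(codebook)):
            if np.any(idx == i):
                codebook[i] = training[idx == i].mean(axis=0)
    return codebook, D
```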


Figure 3.3. Generalized Lloyd Algorithm flow chart [GER93]

Attempted global optimization

There are no known closed-form solutions to the problem of optimal vector quantization. Techniques such as simulated annealing and stochastic relaxation (see for example [ROS92, GER93]) may be employed to improve the probability of an improvement algorithm, such as GLA, converging to the global optimum.

Codebook structure and computational complexity of vector quantization

The design of codebooks for vector quantization is often very computationally expensive; however, the codebooks usually only need to be designed once. During their use, in speech coding for example, codebooks need to be searched as part of the quantization process for every frame of speech. Since this is a search for the nearest of many multi-dimensional vectors to the input vector, a large amount of computation is required. Performing a full codebook search means that the complexity is proportional to the codebook size, which is often $2^{10}$ or larger (see for example [PAL93]).

The large complexity of the codebook search required to process each frame has motivated the development of a number of different codebook structures which simplify the search. One example of a codebook structure which substantially reduces the search complexity relative to a single large codebook is multi-stage VQ (MSVQ) [GER93]. Whilst complexity is not considered in depth in this work, any implementation of VQ-based techniques should take advantage of the computational savings of methods such as MSVQ.

3.1.4 Codebook mapping

Codebook mapping is an important highband envelope estimation scheme which is based upon a generalization of VQ, whereby the space of highband spectral envelopes is classified according to the shape of the narrowband spectral envelope. The highband spectral envelope is determined from the highband codebook vector whose corresponding narrowband code vector is closest in shape to the spectral envelope of the present frame of narrowband speech under analysis. Codebook mapping schemes, to date, map one-to-one from the narrowband to the highband codebook, as seen in Fig. 3.4.

Figure 3.4. Block diagram of codebook mapping: the most similar envelope to the input narrowband envelope is selected from the narrowband codebook, and the highband envelope is read from the corresponding entry of the highband codebook.

If the narrowband codebook $C_L = \{\hat{\mathbf{x}}_i \in \mathbb{R}^p \mid 1 \le i \le N\}$ and highband codebook $C_H = \{\hat{\mathbf{y}}_i \in \mathbb{R}^q \mid 1 \le i \le N\}$ comprise narrowband parameter vectors $\hat{\mathbf{x}}_i$ of dimension p and highband parameter vectors $\hat{\mathbf{y}}_i$ of dimension q respectively, then codebook mapping may be mathematically described as (c.f. (3-1))

$$F: \mathbb{R}^p \to C_H,$$  (3-7a)

$$F(\mathbf{x}) = \hat{\mathbf{y}}_i \;\text{ where }\; i = \arg\min_j d(\mathbf{x}, \hat{\mathbf{x}}_j),$$  (3-7b)


where $\mathbf{x} \in \mathbb{R}^p$ is the input narrowband parameter vector. F implicitly incorporates a partition of the narrowband parameter space into regions (c.f. (3-2))

$$R_i = \{\mathbf{x} \in \mathbb{R}^p \mid F(\mathbf{x}) = \hat{\mathbf{y}}_i\}.$$  (3-8)

Thus, for an input narrowband parameter vector $\mathbf{x} \in R_i$, the mapping

$$\mathbf{x} \to \hat{\mathbf{x}}_i \to \hat{\mathbf{y}}_i$$  (3-9)

is made. Note that $\mathbf{x}$ and $\mathbf{y}$ could take on any parameterization. Existing literature provides little guidance as to the choice of parameterization, which will be analyzed in section 4.1.5 in the next chapter.
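The mapping (3-7)/(3-9) amounts to a narrowband VQ search followed by a table lookup in the highband codebook. A sketch, assuming squared Euclidean distortion and illustrative names:

```python
import numpy as np

def codebook_map(x, C_L, C_H):
    """One-to-one codebook mapping of (3-7): quantize the narrowband
    parameter vector x against the narrowband codebook C_L (N, p) and
    return the highband code vector with the same index from C_H (N, q)."""
    i = int(np.sum((C_L - x) ** 2, axis=1).argmin())
    return i, C_H[i]
```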

Although a large number of researchers have proposed highband envelope estimation schemes based on codebook mapping, for example [YOS94, YAS95b, NAKA97, SCHN98, ENB99], the most comprehensive work in this area to date is that of [CAR94a, CAR94b]. In narrowband speech coding, line spectral frequency (LSF) codebook mapping was used in a two sub-band, variable-rate split VQ scheme to reduce the average bit rate [MCC97], and codebook mapping in a multi-stage configuration was employed in MBE [NIS93].

Other applications which have employed codebook mapping, or non-linear interpolative VQ (NLIVQ), include low-band extension of telephone speech [MIE00], speaker adaptation [SHI86, KUW95, TAN97], alaryngeal speech enhancement [BI97], classification [MIL96], noisy source compression [RAO96], high dimensional signal recovery from lower dimensional data [GER90], reconstruction of high resolution images from lower resolution images [PAN98, FEK98a, FEK98b, SHE98b], image coding [WANG96], image enhancement [PAN98, SHE98a, SHE98b], and enhancement of delta modulation/LPC tandem connections [NAY90].


3.1.5 Codebook pair design for codebook mapping

Selection of training data

Codebook pair design begins with the selection of a large database of wideband speech from a diverse range of speakers, which is then downsampled to form a narrowband version of the same database. These databases are then used to form training databases $T_L$ and $T_H$ consisting of M parameterized narrowband and highband envelopes respectively. It is important that narrowband and highband envelopes arising from the same frame of wideband speech have the same time index in each training database. $T_L$ and $T_H$ are then used to design the codebooks $C_L$ and $C_H$, as shown in Fig. 3.5. Unlike narrowband codebook design (c.f. section 3.1.3), there is no guide for the training ratio (i.e. the size of $T_L$ and $T_H$) in the literature, and this is the subject of experiments in section 4.1.2 of chapter 4.

Figure 3.5. General scheme of codebook design for highband envelope estimation

Codebook pair design using Carl’s method or NLIVQ

The commonest codebook pair design algorithm in the literature was first proposed for the highband envelope estimation application by Carl [CAR94a,b]. As detailed in Algorithm 3.4, the narrowband training data are quantized by a predetermined narrowband codebook, and each highband code vector of index i is formed from the centroid of all highband training envelopes whose corresponding narrowband envelopes were quantized by the narrowband code vector with index i. This method is conceptually identical to non-linear interpolative VQ (NLIVQ), earlier proposed by Gersho for image coding and enhancement applications [GER90, GER93]. Gersho's analysis shows that for a given narrowband codebook $C_L$ and training data $(T_L, T_H)$, the NLIVQ training procedure results in the design of the optimum highband codebook $C_H$ over that training data. An example of the application of this algorithm is illustrated in the block diagram of Fig. 3.6.

1. Design a narrowband codebook $C_L$ from the narrowband training data $T_L$ of size M, for example using Algorithms 3.2 and 3.3.
2. Quantize $T_L$ using $C_L$, and store the resulting indices $i_n$, where $1 \le n \le M$.
3. Calculate the j'th code vector of $C_H$ as the mean of all vectors in $T_H$ whose indices $i_n$ are equal to j.

Algorithm 3.4. Codebook pair design using Carl's method (NLIVQ)
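Steps 2 and 3 of Algorithm 3.4 can be sketched as below, given a fixed narrowband codebook designed beforehand. This is an illustrative Python fragment; the handling of empty cells (a highband code vector whose narrowband partner quantized no training vectors) is an assumption.

```python
import numpy as np

def carl_highband_codebook(T_L, T_H, C_L):
    """Carl's method / NLIVQ (Algorithm 3.4): quantize the narrowband
    training data T_L (M, p) with the fixed narrowband codebook C_L
    (N, p), then set the j'th highband code vector to the mean of all
    highband training vectors in T_H (M, q) whose narrowband partners
    were quantized to index j."""
    d2 = ((T_L[:, None, :] - C_L[None, :, :]) ** 2).sum(axis=2)
    idx = d2.argmin(axis=1)                      # step 2: indices i_n
    C_H = np.zeros((len(C_L), T_H.shape[1]))
    for j in range(len(C_L)):
        if np.any(idx == j):                     # step 3: highband centroids
            C_H[j] = T_H[idx == j].mean(axis=0)
    return C_H
```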

Figure 3.6. Block diagram demonstrating Carl's algorithm for highband codebook design: the highband training data are clustered using the same indices as the narrowband clustering produced by narrowband codebook design.

Where Carl’s method has been employed in previous literature [CAR94b, YOS94, SCHN98], narrowband codebook design has been achieved using the LBG algorithm, except in one implementation where the competitive learning algorithm was used [ENB99].


Codebook pair design using Chan and Hui's method

An alternative to Algorithm 3.4, employed in [CHA96], is to design a wideband codebook; during codebook mapping, the wideband envelope is then selected based on spectral distance measures over only the narrowband portion of the wideband codebook. This approach (c.f. Algorithm 3.5) is equivalent to designing a pair of narrowband-highband codebooks, except that the spectral distortion is minimized over the entire wideband during codebook design. A disadvantage of Chan and Hui's method is that for the same frame of speech the narrowband portion of a wideband envelope differs slightly from the envelope of the narrowband signal, due to bandlimiting effects and differences in the envelope modelling.

1. Form an initial wideband codebook $C_W$ from the wideband training data.
2. Design a wideband codebook $C_W$ from the initial codebook according to a local optimization technique such as Algorithm 3.3 (GLA). The vectors in the narrowband and highband codebooks comprise the narrowband and highband portions of the wideband code vectors from $C_W$ respectively.

Algorithm 3.5. Chan and Hui's method of codebook pair design.

Further optimization of the codebook pair

Any improvements to the narrowband codebook before the application of Carl’s method, for example using simulated annealing or stochastic relaxation, would be likely to improve the performance of the codebook mapping scheme. Neither Carl’s nor Chan and Hui’s methods explicitly jointly optimize the codebook pair in terms of both narrowband and highband distortion. Therefore, it is suggested here that an alternative codebook pair design could employ a joint optimization technique in the style of the deterministic annealing approach of [RAO96, MIL96, FEK98a, FEK98b].

3.1.6 Codebook size

Despite the number of researchers who have employed codebook mapping for highband envelope estimation, the only work on codebook size to date is reported in [YOS94].


Here a wideband spectral distortion criterion was applied to test speech data which was independent of the training data. Their results (see Figure 7 in [YOS94]) reveal a gradual increase in spectral distortion with increasing codebook size. It is apparent that this differs from the decreasing spectral distortion with increasing codebook size characteristic that is familiar from VQ theory [GRA84], and this forms the motivation for further investigation reported in section 4.1.2 of chapter 4.

3.1.7 Codebook mapping with interpolation

One possibility for improving the accuracy of the highband envelope estimate resulting from codebook mapping is to combine, or interpolate between, the estimates resulting from more than one mapping. The simplest method of interpolation between code vectors is to choose the K nearest neighbours {x̂_i1, x̂_i2, ..., x̂_iK} to the narrowband input vector x, and construct the highband vector as a weighted average of the highband code vectors ŷ_ik with the same indices as those nearest neighbours

$$\hat{\mathbf{y}} = \sum_{k=1}^{K} w_k \, \hat{\mathbf{y}}_{i_k}, \qquad \sum_{k=1}^{K} w_k = 1, \qquad (3\text{-}10)$$

where w_k ∈ ℝ are constants. This produces the mapping scheme shown in block diagram form in Fig. 3.7.

Figure 3.7. Block diagram of codebook mapping using interpolation (with K = 3 nearest neighbours).


A nearest neighbour Euclidean average can be obtained by choosing w_1 = w_2 = ... = w_K = 1/K, which averages the highband envelopes corresponding to the K narrowband envelopes nearest to the envelope of the input narrowband frame. For narrowband and highband codebooks with vectors {x̂_i ∈ ℝ^p | 1 ≤ i ≤ N} and {ŷ_i ∈ ℝ^q | 1 ≤ i ≤ N} respectively, a fuzzy mapping from a narrowband parameter vector x to a highband parameter vector y can be achieved by applying the following weights

$$w_k = \left[ \sum_{l=1}^{K} \left( \frac{d_{i_k}}{d_{i_l}} \right)^{\frac{1}{m-1}} \right]^{-1}, \qquad (3\text{-}11)$$

where w_k ∈ [0, 1] are the fuzzy membership coefficients, d_il = ||x − x̂_il||² is a mean square distance (c.f. Appendix B), m is the fuzziness, and K is the number of nearest neighbours. The output vector ŷ is calculated using (3-10). More detail on fuzzy vector quantization can be found in [BEZ81]. To date, these interpolation techniques have only been used in speaker adaptation applications, such as those reported in [NAK89, NAK90]. Their application to wideband enhancement is investigated in section 4.1.9 of chapter 4.
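As a concrete illustration, the sketch below implements a nearest-neighbour search over the narrowband codebook, the fuzzy weights of (3-11), and the weighted average of (3-10). The codebooks here are random placeholders rather than trained envelope codebooks, so it shows only the mechanics of the mapping.

```python
# Sketch of codebook mapping with fuzzy interpolation, per (3-10) and (3-11).
# The codebooks and the query vector are synthetic, not trained on speech.
import numpy as np

def fuzzy_codebook_map(x, nb_codebook, hb_codebook, K=3, m=2.0):
    """Estimate a highband vector from the narrowband vector x.

    nb_codebook: (N, p) narrowband code vectors; hb_codebook: (N, q)
    highband code vectors with the same indices. The K nearest neighbours
    are combined using fuzzy membership weights with fuzziness m.
    """
    # Mean square distance from x to every narrowband code vector
    d = np.mean((nb_codebook - x) ** 2, axis=1)
    idx = np.argsort(d)[:K]                 # K nearest neighbours, ascending
    dk = np.maximum(d[idx], 1e-12)          # guard against a zero distance
    # Fuzzy membership weights (3-11); they sum to one by construction
    w = dk ** (-1.0 / (m - 1.0))
    w /= w.sum()
    # Weighted average of the corresponding highband code vectors (3-10)
    return w @ hb_codebook[idx], w

rng = np.random.default_rng(0)
nb_cb = rng.standard_normal((64, 10))       # toy narrowband codebook
hb_cb = rng.standard_normal((64, 8))        # paired highband codebook
y_hat, w = fuzzy_codebook_map(nb_cb[5] + 0.01, nb_cb, hb_cb)
```

With m = 2 the weights reduce to inverse-distance weighting, so the nearest narrowband code vector contributes most to the highband estimate.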

3.2 Statistical Recovery

3.2.1 Envelope estimation using Statistical Recovery (SR)

Codebook mapping implicitly assumes that any pair of narrowband and highband spectral envelopes (x,y) may be adequately represented by a single pair of narrowband and highband code vectors from a codebook of finite size. Codebook mapping with interpolation (c.f. section 3.1.7) assumes that (x,y) may be represented as a combination of several narrowband and highband code vectors. A generalization of codebook mapping with interpolation is to assume that x contains contributions from all narrowband code vectors, while y contains contributions from all highband code vectors. Both statistical recovery (SR) [CHE92a, CHE92b, CHE94] and the Gaussian


mixture model (GMM) based technique of [PAR00] take a maximum-likelihood approach to this form of generalized codebook mapping with interpolation.

In SR, the narrowband and wideband envelopes are each assumed to be generated by a narrowband and a wideband source. These sources are linear combinations of narrowband and wideband sources {λ_i | 1 ≤ i ≤ N} and {θ_j | 1 ≤ j ≤ M} respectively. Hence in SR, the mean vectors μ_i and ν_j of each source are used in place of narrowband and highband code vectors x̂_i and ŷ_i. During SR mapping, the relative contribution p(x | λ_i) of each narrowband source to the input narrowband envelope parameter vector x is determined, and the output wideband envelope parameterization y is then calculated as

$$\mathbf{y} = \frac{\sum_{i=1}^{N} \sum_{j=1}^{M} \boldsymbol{\nu}_j \, \alpha_{ij} \, p(\mathbf{x} \mid \lambda_i) \, p(\lambda_i)}{p(\mathbf{x})}, \qquad (3\text{-}12)$$

where

$$p(\mathbf{x}) = \sum_{i=1}^{N} p(\mathbf{x} \mid \lambda_i) \, p(\lambda_i), \qquad (3\text{-}13)$$

and α_ij = p(θ_j | λ_i) is the statistical dependency of wideband source θ_j on narrowband source λ_i. The narrowband and wideband mean vectors μ_i and ν_j, and the constants α_ij and p(λ_i), are determined during the parameter design stage (c.f. section 3.2.4). In previous research [CHE94, PAR00], the probability density function of the narrowband and wideband parameter vectors is assumed to be multivariate Gaussian. A two-dimensional graphical interpretation of this mapping is shown in Fig. 3.8.


Figure 3.8. Example of two-dimensional narrowband to wideband mapping using the statistical recovery function, where x is a narrowband parameter vector, μ_i, 1 ≤ i ≤ 4, are narrowband mean vectors, y is a wideband parameter vector, and ν_j, 1 ≤ j ≤ 4, are wideband mean vectors. The concentric ellipses represent contours of equal probability for the bivariate Gaussian distribution associated with each source.
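A minimal numerical sketch of the SR mapping (3-12)-(3-13) is given below. It assumes isotropic Gaussian sources for simplicity, and all means, priors and dependencies α_ij are synthetic placeholders rather than EM-trained parameters (c.f. section 3.2.4).

```python
# Minimal sketch of the statistical recovery mapping (3-12)-(3-13), using
# isotropic Gaussian sources with synthetic, untrained parameters.
import numpy as np

def sr_map(x, mu, nu, alpha, prior, var=1.0):
    """mu: (N, p) narrowband mean vectors; nu: (M, q) wideband mean vectors;
    alpha[i, j] = p(theta_j | lambda_i); prior[i] = p(lambda_i)."""
    # p(x | lambda_i) for an isotropic Gaussian source of variance var
    log_lik = -0.5 * np.sum((x - mu) ** 2, axis=1) / var
    lik = np.exp(log_lik - log_lik.max())   # rescaled for numerical stability
    post = lik * prior
    post /= post.sum()                      # p(x | lambda_i) p(lambda_i) / p(x)
    # y = sum_i sum_j nu_j alpha_ij p(x | lambda_i) p(lambda_i) / p(x)
    return (post @ alpha) @ nu

rng = np.random.default_rng(1)
N, M, p, q = 4, 4, 6, 5
mu = rng.standard_normal((N, p))
nu = rng.standard_normal((M, q))
alpha = rng.random((N, M))
alpha /= alpha.sum(axis=1, keepdims=True)   # rows sum to 1: p(theta_j | lambda_i)
prior = np.full(N, 1.0 / N)
y = sr_map(mu[2], mu, nu, alpha, prior)
```

Because the posterior weights and each row of α sum to one, the output y is a convex combination of the wideband mean vectors, which is the interpolation property exploited in section 3.2.2.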

3.2.2 Relationship between Statistical Recovery and codebook mapping with interpolation

Statistical recovery can be interpreted in terms of codebook mapping with interpolation if the wideband mean vectors are transformed as follows

$$\mathbf{c}_i = \sum_{j=1}^{M} \alpha_{ij} \, \boldsymbol{\nu}_j. \qquad (3\text{-}14)$$

SR can then be considered as a codebook mapping from a codebook of narrowband mean vectors {μ_i | 1 ≤ i ≤ N} to a codebook of wideband mean vectors {c_i | 1 ≤ i ≤ N}, where the output vector y is a linear combination of all transformed wideband mean vectors c_i


$$\mathbf{y} = \sum_{i=1}^{N} \frac{p(\mathbf{x} \mid \lambda_i) \, p(\lambda_i)}{p(\mathbf{x})} \, \mathbf{c}_i = \frac{\sum_{i=1}^{N} w_i \, \mathbf{c}_i}{\sum_{i=1}^{N} w_i}, \qquad (3\text{-}15)$$

with w_i = p(x | λ_i) p(λ_i). Equation (3-15) has the same form as equation (3-10), except that (3-15) combines all N wideband vectors, whereas (3-10) combines only the K nearest neighbours. Thus SR is a form of codebook mapping with interpolation, as illustrated in Fig. 3.9.

Figure 3.9. Interpretation of SR in terms of codebook mapping with interpolation.

3.2.3 Statistical Recovery parameter design using the Expectation Maximization algorithm

The Expectation Maximization (EM) algorithm [DEM77] is an iterative technique which can be used to simultaneously estimate the parameters Ξ = {{μ_i}, {ν_j}, {α_ij}} which maximize the joint probability p(Ξ, X, Y) = p(Ξ) p(X, Y | Ξ) over a given set of T training vectors (X, Y) = {(x_t, y_t) | 1 ≤ t ≤ T}. Equivalently, the log likelihood log p(X, Y | Ξ) may be maximized to estimate Ξ. This maximization must occur over all s_k, where s_k represents the kth combination (state path) of a possible (NM)^T combinations of narrowband and wideband sources over all T.

The maximization process can be simplified through the use of Baum’s auxiliary function [BAU72]. Consider


$$p(\mathbf{X}, \mathbf{Y}, s_k \mid \Xi) = p(\mathbf{X}, \mathbf{Y} \mid \Xi) \, p(s_k \mid \mathbf{X}, \mathbf{Y}, \Xi), \qquad (3\text{-}16)$$

which may be rewritten as

$$\log p(\mathbf{X}, \mathbf{Y} \mid \Xi) = \log p(\mathbf{X}, \mathbf{Y}, s_k \mid \Xi) - \log p(s_k \mid \mathbf{X}, \mathbf{Y}, \Xi), \qquad (3\text{-}17)$$

where the optimum combination of narrowband and wideband sources s_k is unknown, and hence the terms on the right hand side of (3-17) are unknown. The training data (X, Y) in conjunction with an estimate Ξ_0 for Ξ can give an estimate for s_k, and hence the expectation of (3-17), conditioned on (X, Y) and Ξ_0, is taken:

$$E\{\log p(\mathbf{X}, \mathbf{Y} \mid \Xi) \mid \mathbf{X}, \mathbf{Y}, \Xi_0\} = E\{\log p(\mathbf{X}, \mathbf{Y}, s_k \mid \Xi) \mid \mathbf{X}, \mathbf{Y}, \Xi_0\} - E\{\log p(s_k \mid \mathbf{X}, \mathbf{Y}, \Xi) \mid \mathbf{X}, \mathbf{Y}, \Xi_0\}. \qquad (3\text{-}18)$$

Define

$$L(\Xi) = E\{\log p(\mathbf{X}, \mathbf{Y} \mid \Xi) \mid \mathbf{X}, \mathbf{Y}, \Xi_0\} = \log p(\mathbf{X}, \mathbf{Y} \mid \Xi), \qquad (3\text{-}19a)$$

$$Q(\Xi \mid \Xi_0) = E\{\log p(\mathbf{X}, \mathbf{Y}, s_k \mid \Xi) \mid \mathbf{X}, \mathbf{Y}, \Xi_0\} = \sum_{k=1}^{(NM)^T} \log p(\mathbf{X}, \mathbf{Y}, s_k \mid \Xi) \, p(s_k \mid \mathbf{X}, \mathbf{Y}, \Xi_0), \qquad (3\text{-}19b)$$

and

$$H(\Xi \mid \Xi_0) = E\{\log p(s_k \mid \mathbf{X}, \mathbf{Y}, \Xi) \mid \mathbf{X}, \mathbf{Y}, \Xi_0\} = \sum_{k=1}^{(NM)^T} \log p(s_k \mid \mathbf{X}, \mathbf{Y}, \Xi) \, p(s_k \mid \mathbf{X}, \mathbf{Y}, \Xi_0). \qquad (3\text{-}19c)$$

Rewriting (3-18) gives

$$L(\Xi) = Q(\Xi \mid \Xi_0) - H(\Xi \mid \Xi_0), \qquad (3\text{-}20)$$

which, combined with Jensen's inequality H(Ξ | Ξ_0) ≤ H(Ξ_0 | Ξ_0) (refer to Appendix C for proof), means that whenever Q(Ξ | Ξ_0) > Q(Ξ_0 | Ξ_0), then L(Ξ) > L(Ξ_0). Thus, the log likelihood L(Ξ) = log p(X, Y | Ξ) can be maximized by maximizing Q(Ξ | Ξ_0). The complete expectation maximization process is summarized in Algorithm 3.6.

E-step: Based upon a set of initial parameter values Ξ_0, compute the expectation of the log likelihood

$$Q(\Xi \mid \Xi_0) = E\{\log p(\Xi, \mathbf{X}, \mathbf{Y}, s_k) \mid \Xi_0\}. \qquad (3\text{-}21)$$

M-step: Maximize Q(Ξ | Ξ_0) by adjusting Ξ, and return to the E-step.

Algorithm 3.6. The EM algorithm, in terms of the parameters of [CHE94].

3.2.4 Implementation of the Expectation Maximization algorithm for Statistical Recovery parameter design

The EM algorithm was formulated for SR parameter design in [CHE92a, CHE92b, CHE94] using autoregressive Gaussian sources, with mean vectors parameterized as autocorrelation sequences r_i and r_j. Maximization of Q(Ξ | Ξ_0) at each iteration was achieved through Lagrange optimization of α_ij = p(θ_j | λ_i) under the constraint Σ_{j=1}^{M} p(θ_j | λ_i) = 1. This yields (see [BAU72] for details)

$$\alpha_{ij} = \frac{\sum_{k=1}^{(NM)^T} p(\Xi_0, \mathbf{X}, \mathbf{Y}, s_k) \, c_{ij}(s_k)}{\sum_{j=1}^{M} \sum_{k=1}^{(NM)^T} p(\Xi_0, \mathbf{X}, \mathbf{Y}, s_k) \, c_{ij}(s_k)}, \qquad (3\text{-}22)$$

where c_ij(s_k) is the accumulated count of the state vector (λ_i, θ_j) on the state path s_k. The implementation of this algorithm, including some extra detail for clarification purposes, is shown in Algorithm D.1 of Appendix D. The algorithm stops when the evolution of the parameters between consecutive iterations is 'small'. Experimental work performed for this thesis found that by around 10 iterations the parameter evolution was sufficiently small.


3.3 Linear methods

3.3.1 Linear mapping

If the narrowband envelope is characterized by a parameter vector x = [x_1, x_2, ..., x_p]^T and the highband envelope by a parameter vector y = [y_1, y_2, ..., y_q]^T, where T denotes transpose, then it is possible to construct a matrix A such that narrowband envelopes are mapped to highband envelopes as

$$F: \mathbb{R}^p \to \mathbb{R}^q, \quad F(\mathbf{x}) = \mathbf{A}\mathbf{x}, \qquad (3\text{-}23)$$

in some optimal sense (illustrated in Fig. 3.10). Examples of parameterizations x and y which could be used include LP parameters, LSFs, cepstral coefficients and log area ratios (LARs).

Figure 3.10. Linear mapping.

A useful design value for the matrix A is that which minimizes the mean square error

$$E = \sum_{j=1}^{q} (\hat{y}_j - y_j)^2, \qquad (3\text{-}24)$$

between the true and estimated highband parameter vectors y and ŷ, using linear multivariate regression. This requires the formation of matrices X and Y from at least p narrowband and highband vector pairs taken from suitable training data, and the evaluation of

$$\mathbf{A} = \left[\mathbf{X}^T \mathbf{X}\right]^{-1} \mathbf{X}^T \mathbf{Y}. \qquad (3\text{-}25)$$


Equation (3-25) is familiar from linear algebra theory, and is known to produce a least squares solution to the system y = Ax.
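As a sketch, the normal-equations design of (3-25) can be realized with a standard least squares solver. Here X and Y hold synthetic training vectors as rows, so the fitted matrix maps a row vector x to ŷ = xA (the transpose of the A in (3-23)); the data are placeholders, not speech-derived parameters.

```python
# Hedged sketch of the linear mapping design (3-25): rows of X are narrowband
# parameter vectors and rows of Y the corresponding highband vectors.
import numpy as np

rng = np.random.default_rng(2)
T, p, q = 500, 10, 8
A_true = rng.standard_normal((p, q))
X = rng.standard_normal((T, p))
Y = X @ A_true + 0.01 * rng.standard_normal((T, q))   # noisy linear relation

# A = (X^T X)^{-1} X^T Y; lstsq solves the same normal equations more robustly
A, *_ = np.linalg.lstsq(X, Y, rcond=None)

y_hat = X[0] @ A                                      # map one narrowband frame
```

When the true narrowband-highband relationship really is linear, as constructed here, the regression recovers it almost exactly; the discussion below notes why real envelope data are unlikely to be this cooperative.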

Linear mapping has been applied to the low-band extension of telephone speech in recent research [MIE00], where LSFs were used to parameterize the spectral envelopes. The principal disadvantage of the linear mapping method is that a linear relationship between the narrowband and highband parameters may be an unnecessarily restrictive constraint.

3.3.2 Piecewise linear mapping

Given the probable non-linearity of the envelope mapping problem, a more promising approach is to allow a number of different linear mappings between smaller disjoint regions of the narrowband and highband parameter spaces. This hybrid codebook mapping / linear mapping method can be described as

$$F: \mathbb{R}^p \to \mathbb{R}^q, \quad F(\mathbf{x}) = \mathbf{A}_i \mathbf{x} \quad \text{if } \mathbf{x} \in R_i, \qquad (3\text{-}26)$$

where R_i is the ith region. A block diagram of the piecewise linear mapping scheme is shown in Fig. 3.11.

Figure 3.11. Piecewise linear mapping.

A least squares approach to designing the Ai can be taken by forming Xi and Yi from the ith partition of the training data, according to


$$\mathbf{A}_i = \left[\mathbf{X}_i^T \mathbf{X}_i\right]^{-1} \mathbf{X}_i^T \mathbf{Y}_i. \qquad (3\text{-}27)$$

Piecewise linear mapping also requires a method for partitioning ℝ^p into regions R_i. One possibility is to use vector quantization, where the regions are defined by the code vectors x̂_i using (3-2). The implicit classification performed by vector quantization (c.f. (3-8)) can be used to form training data (X_i, Y_i) for the ith partition from the complete training data (X, Y).
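The partition-then-regress design can be sketched as follows: a VQ-style nearest-centroid classification partitions the training data, and one regression matrix is fitted per region as in (3-27). The centroids stand in for a trained codebook and the training data are synthetic, so this illustrates the structure rather than an evaluated system.

```python
# Sketch of piecewise linear mapping: a VQ partition of the narrowband
# parameter space selects one of L regression matrices, each fitted by least
# squares as in (3-27). Centroids and training data are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(3)
L, p, q, T = 4, 6, 4, 800
centroids = rng.standard_normal((L, p))        # stands in for a trained VQ codebook
X = rng.standard_normal((T, p))
Y = np.tanh(X) @ rng.standard_normal((p, q))   # a non-linear narrowband-highband relation

# Classify each training vector to its nearest centroid
labels = np.argmin(((X[:, None, :] - centroids) ** 2).sum(axis=2), axis=1)

# Fit one matrix A_i per region from (X_i, Y_i); undersized regions get zeros
A = []
for i in range(L):
    Xi, Yi = X[labels == i], Y[labels == i]
    A.append(np.linalg.lstsq(Xi, Yi, rcond=None)[0] if len(Xi) >= p
             else np.zeros((p, q)))

def pw_linear_map(x):
    """Map one narrowband vector using the matrix of its Voronoi region."""
    i = int(np.argmin(((centroids - x) ** 2).sum(axis=1)))
    return x @ A[i]

y_hat = pw_linear_map(X[0])
```

The storage cost noted below is visible here directly: L matrices of p×q coefficients must be kept alongside the L centroids.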

[NAKA97] describes how a wideband spectral envelope may be generated from a narrowband spectral envelope using piecewise linear mapping. In this implementation VQ was used for partitioning, and y was estimated using the minimum mean square approach within each region. Rather than using only a single transformation matrix for each mapping, fuzzy interpolation was used (similar to that of section 3.1.7) to determine the contribution made by each matrix A_i. This method was reported [NAKA97] to achieve a modest improvement in spectral distortion over the codebook mapping scheme of [YOS94]. The memory requirements of piecewise linear mapping are relatively high, since for a narrowband codebook of size N, N matrices of size p×q must be stored. Experiments conducted in this thesis found that piecewise linear mapping actually requires substantially more storage than codebook mapping to achieve the same accuracy, rather than less as claimed in [NAKA97].

Piecewise linear mapping has also been applied to voice conversion [MAT92, VAL92, IWA95], where it was found to produce discontinuous spectral evolution when different mappings were applied to consecutive frames. This problem was addressed in [BI97] by allowing the partitions Ri to overlap when calculating the matrices Ai.

3.3.3 Linear filtering

A variation on linear mapping is the use of a linear filter to estimate the highband envelope for a given narrowband envelope. A constraint on the use of filtering for envelope estimation is that filtering operates on temporal information rather than frequency domain information. In one application [TOL98] of linear filtering to envelope estimation, the narrowband speech signal was split into N bandpass signals


using a filter bank, and each bandpass signal was then input to a Wiener filter Hi(z) to generate a highband bandpass signal.

In this approach, based upon AM-FM synthesis, the bandpass filters had bandwidth 400 Hz and covered the entire narrowband spectrum. Estimates of the highband bandpass signals sˆiHB (n) were generated from the narrowband bandpass signals siNB (n) using

$$\hat{s}_i^{HB}(n) = \sum_{j=1}^{M} b_i(j) \, s_i^{NB}(n - j), \qquad (3\text{-}28)$$

where b_i(n) ∈ ℝ are the coefficients of the ith filter H_i(z), M is the length of the filter, and n is the discrete time index. The filter coefficients b_i(n) were estimated from training data using least squares techniques, as illustrated in Fig. 3.12.

Figure 3.12. Least squares estimation of filter coefficients as used in [TOL98].
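The least squares design of one such filter can be sketched as below. For simplicity the taps act on s_nb[n], ..., s_nb[n-M+1] (indexed from zero rather than from one as in (3-28)), and the "bandpass" signals are synthetic; no filter bank is implemented.

```python
# Sketch of least squares FIR filter design in the style of (3-28): estimate
# a "highband" signal as a tapped combination of "narrowband" samples.
import numpy as np

rng = np.random.default_rng(4)
n_samples, M = 2000, 8
s_nb = rng.standard_normal(n_samples)        # stand-in narrowband bandpass signal
b_true = rng.standard_normal(M)
# Target built so that an exact M-tap solution exists
s_hb = np.convolve(s_nb, b_true, mode="full")[:n_samples]

# Regression matrix of delayed samples: column j holds s_nb delayed by j
D = np.column_stack([np.concatenate([np.zeros(j), s_nb[:n_samples - j]])
                     for j in range(M)])
b_hat = np.linalg.lstsq(D, s_hb, rcond=None)[0]
```

Because the target was generated by an M-tap filter, the least squares solve recovers the taps exactly; with real speech the residual error would set a floor on the attainable highband accuracy.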

3.3.4 Multidimensional linear filter banks with inter-frame filtering

Another possibility for linear filtering is to use the subframe-to-subframe variation of envelope parameters as an input signal. This type of estimation has been attempted using a multidimensional FIR filter bank operating on narrowband and wideband cepstral coefficients [AVE95]. In this approach, the narrowband cepstral coefficients C_i(k) of the kth 10 ms sub-frame are input to an FIR filter with coefficients W_{i,r}(k) to produce an estimate Ĉ′_r(k) of the true wideband cepstral coefficient C′_r(k), as seen in Fig. 3.13. Thus

$$\hat{C}'_r(k) = \sum_{i=1}^{p} \sum_{l=-M}^{M} W_{i,r}(l) \, C_i(k - l), \qquad (3\text{-}29)$$


where p is the order of the narrowband all-pole model used to derive the C_i(k), and M defines the window size over which filtering occurs (set to 100 ms, or M = 5, in [AVE95]). The filter coefficients W_{i,r}(k) are determined such that Ĉ′_r(k) is the least squares estimate of C′_r(k). An advantage and unique aspect of this approach is its use of the inter-frame correlation of spectral parameters to determine the shape of the highband spectral envelope. A likely disadvantage of this method is its reliance on a linear filter to approximate the probable non-linear relationship between narrowband and highband spectral envelopes. Results from listening tests reported in [AVE95] revealed that this technique often produces severe artefacts owing to the over- or under-estimation of the highband spectral envelope.

Figure 3.13. Estimation of wideband cepstral coefficients using multidimensional filter banks.

3.3.5 Straight line extension

In [YAS96a], a straight-line approximation to the narrowband log spectral envelope is found, and the line is then extended beyond 4 kHz to represent the highband spectral envelope, as seen in Fig. 3.14. In this instance the accuracy of the highband envelope shape is entirely dependent on the correlation between the average gradients of the narrowband and wideband log envelopes. This method also implicitly determines the highband gain, and the efficacy of this gain determination technique is discussed in chapter 5.

Figure 3.14. Highband envelope estimation using straight line extension. The average gradient of a smooth log envelope is calculated, and this is used to calculate a straight line log envelope which is extrapolated into the highband.

3.3.6 Flat highband envelope

A flat 6-7 kHz highband envelope was employed in a wideband extension scheme integrated into a wideband coder [PAU96a, SCHN98]. In this instance, a flat highband envelope should usually provide a reasonable estimate of the true envelope (c.f. Fig. 3.15). However, this would not be true if a flat envelope were applied to a 4-8 kHz highband region.

Figure 3.15. Example of the application of a flat spectral envelope to the (6-7 kHz) highband.

3.3.7 Fixed envelope

The application of a shaping filter, such as that of [YAS95a] (see Fig. 3.16), to spectrally flat wideband excitation produces highband spectra which drop off in magnitude towards 8 kHz. The resulting spectral shape is characteristic of wideband voiced speech, but not of unvoiced speech in general.

Figure 3.16. Magnitude response of a fixed shaping filter similar to that of [YAS95a].

In another implementation of wideband enhancement using a fixed envelope [SUZ96], the highband envelope comprises fixed formant frequencies. One method for determining either the shape of the fixed envelope or the frequencies of the highband formants is to average a large number of highband envelopes from a training database (the location of the formant frequencies is not specified in [SUZ96]). It is anticipated that using a fixed highband envelope of any kind would result in a large spectral distortion, similar to that obtained by codebook mapping using a codebook size of one.

3.4 Conclusion

In this chapter, many different existing methods for the estimation of the highband spectral envelope shape were reviewed. Vector quantization and codebook mapping (non-linear interpolative vector quantization) for highband envelope estimation were examined in section 3.1. In section 3.2, a maximum-likelihood statistical technique was analyzed. A number of linear methods for estimation of the highband spectral envelope were surveyed in section 3.3. Approaches and avenues for further work arising from this review form an important basis for the original work contained in chapters 4 through 6.

Although an objective comparison between the various methods is made in chapter 4, the literature suggests that linear techniques for envelope mapping may not be sufficiently accurate to model the relationship between narrowband and highband envelopes. Codebook mapping, statistical recovery and piecewise linear mapping are thus the most promising methods based upon results reported in the existing literature.

Ch 4 / New Methods for Highband Spectral Envelope Estimation 66

Chapter 4. New Methods for Highband Spectral Envelope Estimation

[Block diagram of the wideband enhancement scheme: narrowband speech undergoes narrowband analysis; highband envelope estimation (highlighted in this chapter) and narrowband-highband envelope continuity, together with highband excitation synthesis and excitation filtering, produce synthetic highband speech, which is combined with the upsampled (×2, low-pass filtered at 3.8 kHz) narrowband speech to give wideband speech.]

4.0 Overview

In this chapter, some methods of chapter 3 are analyzed in greater detail, a number of new techniques for the estimation of the highband spectral envelope are presented, and objective comparisons are made between many methods to determine the optimum method for highband envelope estimation.

The structure of this chapter is as follows. Section 4.1 experimentally determines the optimum methods for codebook design, the optimum codebook size, and the optimum spectral envelope parameterization. New codebook mapping based highband envelope estimation techniques and a new implementation structure for codebook mapping are also proposed. In section 4.2, the application of alternative envelope parameterizations to the statistical recovery technique described in section 3.2 is investigated. Section 4.3 reports new work on linear approaches to highband envelope estimation. In section 4.4 the performance of all methods covered in this chapter and most methods from chapter 3 are compared. Finally, conclusions are provided in section 4.5.

4.0.1 Highband spectral distortion criterion

Highband spectral distortion is used extensively for objective assessment in both this chapter and chapter 5. For the purposes of comparing highband spectral envelope estimation schemes, the spectral distortion measure used was calculated over K frames as follows


$$D_{HC} = \left\{ \frac{1}{K} \sum_{k=1}^{K} \frac{2}{(1-\eta)\omega_s} \int_{\eta\omega_s/2}^{\omega_s/2} \left[ 20 \log_{10}\!\left( \frac{A_k(\omega)}{\tilde{A}_k(\omega)} \right) + G_H \right]^2 d\omega \right\}^{1/2}, \qquad (4\text{-}1)$$

where

$$G_H = \frac{2}{(1-\eta)\omega_s} \int_{\eta\omega_s/2}^{\omega_s/2} 20 \log_{10}\!\left( \frac{\tilde{A}_k(\omega)}{A_k(\omega)} \right) d\omega, \qquad (4\text{-}2)$$

A_k(ω) and Ã_k(ω) are the envelopes of the kth temporally aligned frames of the true and synthesized wideband speech respectively, ω_s is the wideband sampling frequency and η is the proportion of the wideband bandwidth occupied by the narrowband bandwidth. In this instance η = 0.475, corresponding to a 3.8 kHz narrowband bandwidth. The compensating gain factor, G_H, has the effect of removing the mean difference between the two log envelopes in the highband, and thus D_HC measures only the spectral distortion between the highband envelope shapes, as seen in Fig. 4.1.

Figure 4.1. The highband spectral distortion D_HC can be interpreted as the area between the highband portions of two log spectral envelopes after the difference in their log gains has been adjusted to minimize this area.

This means that in this chapter, highband envelope estimation methods are assessed according to the distortion between the shapes of the estimated log highband envelope and the true log highband envelope. A second highband spectral distortion measure, DH, which also accounts for highband gain distortion, is introduced in section 5.0.1.
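The following sketch computes a discrete approximation to D_HC: the integral of (4-1) is replaced by a mean over uniformly spaced log envelope samples, the gain term of (4-2) is implemented as per-frame mean removal over the highband, and the RMS (square-root) form is assumed. The envelopes are toy data, not measured speech spectra.

```python
# Discrete approximation to the highband shape distortion D_HC of (4-1)-(4-2).
import numpy as np

def d_hc(true_log_env, est_log_env, eta=0.475):
    """Log envelopes in dB, sampled uniformly over 0..omega_s/2 (rows = frames).
    Only the highband fraction (1 - eta) is compared, after removing the
    per-frame mean highband level difference (the gain term G_H)."""
    true_log_env = np.atleast_2d(true_log_env)
    est_log_env = np.atleast_2d(est_log_env)
    start = int(round(eta * true_log_env.shape[1]))
    diff = true_log_env[:, start:] - est_log_env[:, start:]
    diff -= diff.mean(axis=1, keepdims=True)   # per-frame gain compensation
    return np.sqrt(np.mean(diff ** 2))         # RMS over frequency and frames

env = np.linspace(40.0, 10.0, 64)              # toy true log envelope (dB)
d_same = d_hc(env, env + 6.0)                  # same shape, 6 dB gain offset
```

Since a constant gain offset is removed before the distortion is accumulated, `d_same` is zero, illustrating that D_HC is insensitive to highband gain errors, which is exactly why the gain-sensitive measure D_H of section 5.0.1 is needed as a complement.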


4.1 Codebook mapping - analysis and new methods

4.1.1 Narrowband-highband correlation

Codebook mapping is based upon the assumption that similarity between narrowband spectral envelope shapes implies similarity between highband envelope shapes. A simple means of testing the validity of this assumption is to compare narrowband and highband spectral distortion measurements between two pairs of narrowband-highband envelopes. A set of 50 envelope pairs was selected at random from a database of narrowband and highband envelope pairs derived from many speakers, and a second, much larger set was formed using a random selection of other envelope pairs from the same database. A narrowband-highband envelope pair was selected from the first set, another from the second set, and the narrowband and highband spectral distortions were calculated between the two envelope pairs. This process was repeated for all combinations of envelope pairs in the sets, and 10000 of the resulting narrowband and highband spectral distortions are plotted in Fig. 4.2.

Figure 4.2. Highband vs. narrowband spectral distortion for 10000 combinations of envelope pairs.

In this experiment, if similar narrowband envelope shapes do imply similar highband envelope shapes, then the data could be expected to be relatively closely clustered about a line of the form: highband spectral distortion ∝ narrowband spectral distortion. From the large spread of data seen in Fig. 4.2, it is apparent that the correlation between narrowband and highband distortion is likely to be a weak one. A similar conclusion is drawn from recent research [NIL00], which shows that there is a lower bound of 0.1 bits of shared information between the narrowband and the highband spectral envelope slopes. Both Fig. 4.2 and [NIL00] suggest that during codebook mapping, the choice of a highband spectral envelope based upon narrowband spectral distortion will not in general be optimal in terms of the highband spectral distortion.

4.1.2 Codebook size and training ratio

In order to quantify the relationship between codebook size and training database size, a number of codebooks of different sizes were designed on training data of various sizes, using a log envelope sample parameterization and Algorithm 3.4 for codebook design. The highband spectral distortion DHC was measured for each codebook size using equation (4-1) (see section 4.0.1) over different training set sizes for a test speech segment of length 96 s, and the results are seen in Fig. 4.3.

Figure 4.3. Highband spectral distortion D_HC for codebook mapping over a range of codebook sizes (averaged over 96 s of test speech). Codebooks were trained using 5214, 10427, 20854 and 83418 training vectors.

The effects of decreasing training ratio, or increasing codebook size for a given training set size, are evident in the increase in spectral distortion at larger codebook sizes in Fig. 4.3. Figure 4.3 also shows that the incremental benefit of using larger training set sizes to produce larger codebook pairs decreases with increasing codebook size. A spectral distortion of 3.31 dB was obtained by training a codebook pair of size 128 from 41709 training vector pairs. By doubling the training set size and increasing the codebook size to 512, a mere 0.025 dB improvement was obtained (at the expense of a large increase in codebook search computation).

The minimum point of each curve in Fig. 4.3 represents the codebook size which produces the optimum highband spectral distortion for the given training set size. These points are tabulated in Table 4.1.

Table 4.1. Optimum codebook sizes for different training set sizes

  training set size    optimum codebook size    training ratio
  5214                 16                       326
  10427                32                       326
  20854                128                      163
  41709                128                      326
  83418                512                      163

The mean of the training ratios from Table 4.1 is around 250, which provides a rough guide to the minimum number of training vectors required per code vector for effective codebook pair design. This figure is substantially larger than, for example, the minimum training ratio of 20 training vectors per codebook vector suggested for vector quantization of spectral envelopes [MAK85]. In light of this experiment, a likely explanation for the increase in speaker-independent highband spectral distortion with codebook size seen in Figure 7 of [YOS94] would be a lack of sufficient training data (no training set size is documented in [YOS94]). Note that small improvements in spectral distortion with increased codebook size were also reported recently in [NIL00].

Another method of determining the training ratio is to compare results for open and closed testing, in a similar fashion to the experiment illustrated in Figure 25 of [MAK85]. Figure 4.4 shows the spectral distortion D_HC (c.f. equation (4-1)) obtained for codebook mapping with codebook size 64, over different numbers of training vector pairs. The closed testing data were obtained by applying codebook mapping to the training set data, while the open testing results were obtained by applying codebook mapping to speech data which were independent of the training data.


Figure 4.4. Spectral distortion D_HC vs. training database size for open testing and closed testing (averaged over 96 s of test speech).

By around 10427 to 20854 training vectors, the open testing results approach the closed testing results, indicating that a satisfactory training ratio has been reached. This yields a training ratio estimate of between 163 and 326, in agreement with the earlier ratio estimate of 250.

4.1.3 Comparison of codebook pair design techniques

As seen in section 3.1.5, codebook design techniques for highband envelope estimation include Carl’s method (Algorithm 3.4) and Chan and Hui’s method (Algorithm 3.5). Carl’s method is based upon a narrowband codebook design, while Chan and Hui’s method is based upon a wideband codebook design. A further novel possibility for codebook pair design is to start with a highband codebook design, and then form the narrowband codebook, as detailed in Algorithm 4.1.


1. Form a highband codebook CH from the highband training data using Algorithms 3.2 and 3.3, or another codebook design technique.
2. Quantize the highband training data using CH, and store the resulting indices i_n, where 1 ≤ n ≤ M.
3. Calculate the jth narrowband code vector of CL as the mean of all vectors in the narrowband training data for which i_n = j.

Algorithm 4.1. Codebook pair design based upon minimization of highband distortion.
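The steps of Algorithm 4.1 can be sketched as follows. A basic k-means loop stands in for the codebook design of Algorithms 3.2/3.3, and the training data are random placeholders rather than envelope parameters.

```python
# Sketch of Algorithm 4.1: design the highband codebook first, then form each
# narrowband code vector as the mean of the narrowband training vectors whose
# highband partner quantizes to that highband code vector. Data are synthetic.
import numpy as np

rng = np.random.default_rng(5)
M_train, p, q, N = 1000, 6, 4, 8
nb_train = rng.standard_normal((M_train, p))
hb_train = rng.standard_normal((M_train, q))

# Step 1: highband codebook C_H via a few k-means iterations
C_H = hb_train[rng.choice(M_train, N, replace=False)].copy()
for _ in range(10):
    idx = np.argmin(((hb_train[:, None, :] - C_H) ** 2).sum(axis=2), axis=1)
    for j in range(N):
        if np.any(idx == j):
            C_H[j] = hb_train[idx == j].mean(axis=0)

# Step 2: quantize the highband training data with the final codebook
idx = np.argmin(((hb_train[:, None, :] - C_H) ** 2).sum(axis=2), axis=1)

# Step 3: average the paired narrowband vectors for each highband cell
C_L = np.vstack([nb_train[idx == j].mean(axis=0) if np.any(idx == j)
                 else np.zeros(p) for j in range(N)])
```

Note that, unlike Carl's method, the narrowband codebook here inherits the partition induced by the highband data, which is the design difference evaluated in Table 4.2.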

In order to establish the best minimization criterion for codebook design, codebooks of 32 narrowband and 32 highband (4-8 kHz) log envelope samples were trained from a large multi-speaker speech database using distortion minimization over the narrowband (Algorithm 3.4), the wideband (Algorithm 3.5) and the highband (Algorithm 4.1). Codebook mapping was performed and the highband spectral distortion D_HC was then calculated using equation (4-1) over a 96 s test segment of speech. The results for a codebook size of 128 are shown in Table 4.2, from which it is clear that codebook design beginning with minimization of distortion over the narrowband envelopes (Carl's method, Algorithm 3.4) results in the smallest highband distortion. Accordingly, throughout the remainder of this work codebooks are designed using Carl's method.

Table 4.2. Spectral distortion for different methods of codebook pair design

  Distortion minimization   Method due to           Algorithm   Spectral distortion DHC (dB)
  Narrowband                Carl [CAR94b]           3.4         3.30
  Highband                  new                     4.1         6.63
  Wideband                  Chan and Hui [CHA96]    3.5         4.02

A likely reason for these results is that minimization of distortion should occur over the frequency band in which the spectral envelope shape has the largest standard deviation as a function of frequency. Preliminary investigations using wideband envelopes parameterized using 64 log envelope samples showed that the 0-4 kHz band (more specifically, the 0-2 kHz band) had the largest standard deviation in spectral shape.


4.1.4 Comparison of extension bandwidths

The majority of highband envelope estimation schemes discussed in chapter 3 produce highband or wideband spectral envelopes up to 8 kHz based upon narrowband speech information. It was anticipated that employing a highband upper frequency limit of less than 8 kHz would produce more accurate highband estimation. An experiment was conducted in which codebook mapping using codebooks of size 512 was performed on 96s of speech data (independent from the codebook training data), and the highband spectral distortion DHC was measured using (4-1). This was repeated using DHC calculated with upper frequency limits of 7.75 kHz, 7.5 kHz, . . . , 4.25 kHz. The results, illustrated in Fig. 4.5, suggest that the incremental spectral distortion increase obtained by increasing the upper limit of the highband bandwidth from 5.5 kHz to 8 kHz is quite small. This justifies the choice of a high frequency such as 8 kHz as an upper limit.

Figure 4.5. Highband spectral distortion DHC measured up to various bandwidths for codebook mapping with codebook size 512 (averaged over 96 seconds of test speech). Log envelope samples were used to parameterize the envelopes, and codebook design was performed using Algorithm 3.4.

4.1.5 Comparison of different parameterizations for codebook mapping

An important consideration in exploiting correlation between the narrowband and wideband spectral envelopes is their parameterization. In order to establish the efficacy of various envelope parameterizations for highband envelope estimation, the performance of codebook mapping with these parameterizations was measured. Using a mean square distance measure (see Appendix B) between parameter vectors to train codebooks and perform the codebook mapping for each parameterization, DHC was calculated using (4-1) over a 96s test speech segment. The results of this experiment are shown in Table 4.3.

Table 4.3. Spectral distortion DHC of codebook mapping using various parameterizations, with codebook size 512

  Description of parameterization                                             DHC (dB)
  10th order narrowband and 18th order wideband LP envelopes                  3.50
  10th order narrowband and 18th order wideband LSFs                          3.50
  10th order narrowband and 18th order wideband reflection coefficients       3.44
  10th order narrowband and 18th order wideband log area ratios               3.45
  10th order narrowband and 18th order wideband cepstral coefficients         3.44
  19th order narrowband and 24th order wideband Mel frequency
    cepstral coefficients                                                     3.64
  Normalized narrowband and wideband autocorrelation sequences,
    RN[n] (1 ≤ n ≤ 10) and RW[n] (1 ≤ n ≤ 18) respectively                    3.53
  32 and 64 evenly spaced samples of the log magnitude responses of 10th
    order narrowband and 18th order wideband LP envelopes respectively        3.29

Although a large amount of correlation exists between narrowband and wideband autocorrelation sequences, autocorrelation coefficients did not perform particularly well under the spectral distortion criterion, as seen in Table 4.3. Further investigation revealed that the log spectral envelope was highly sensitive to very small changes in the autocorrelation coefficients.

Log spectral envelope samples (LES) were the most effective parameters under a spectral distortion criterion, which is not surprising since the LES mean square error distance measure is equivalent to spectral distortion. Note that if the storage requirements were to be minimized, LSFs [ITA75a] would probably be used to store the envelopes, due to their superior quantization properties. The use of other envelope parameterizations, such as formant frequency and amplitude, and pole locations, was also investigated, but they were found to be too sensitive to error to be suitable for codebook mapping.


4.1.6 A new method for codebook mapping with codebooks split by voicing

In order to maximize the likelihood of the correct highband envelope vector being selected, codebook mapping can be modified to include information other than the narrowband spectral envelope. One narrowband feature, which is not utilized in previous codebook mapping methods (c.f. section 3.1) and which has some bearing on the highband envelope shape, is the narrowband degree of voicing. The degree of voicing is reasonably well correlated with the tilt of the speech spectrum, to the extent that spectral tilt has been included as one feature determining the degree of voicing in many algorithms.

Given that voiced and unvoiced envelopes generally exhibit a different overall shape, it is proposed in this thesis that codebook pairs be split into voiced and unvoiced codebook pairs. Thus, if an input frame is determined to be unvoiced, there is a reduced likelihood of its being mapped to a highband envelope typical of a voiced frame. This scheme is illustrated in Fig. 4.6. Split codebooks, or classified VQ, have previously been employed in narrowband coding, producing increased efficiency over standard VQ [GER93, PAK93, DAS95].

Figure 4.6. Block diagram of codebook mapping using codebooks split by voicing
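The selection logic of Fig. 4.6 is straightforward; the sketch below assumes precomputed voiced and unvoiced codebook pairs, and all names are illustrative:

```python
import numpy as np

def map_highband_envelope(nb_env, voiced, codebooks):
    """Codebook mapping with codebooks split by voicing (sketch).

    codebooks: {'voiced': (C_L_v, C_H_v), 'unvoiced': (C_L_u, C_H_u)},
    each narrowband codebook C_L of shape (N, d_nb) paired one-to-one
    with a highband codebook C_H of shape (N, d_hb).
    """
    # The voicing decision selects the codebook pair; the narrowband
    # envelope then selects the most similar narrowband code vector.
    C_L, C_H = codebooks['voiced' if voiced else 'unvoiced']
    j = ((C_L - nb_env) ** 2).sum(axis=1).argmin()
    return C_H[j]
```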


4.1.7 A novel method for codebook mapping with phonetically dependent codebooks

A generalization of section 4.1.6 is to perform codebook mapping separately for each phonetic class, thereby exploiting the narrowband-highband correlation of the spectral envelopes of that class. A major difficulty with this is the requirement for automatic phonetic classification, a complex problem for which solutions are often both inaccurate and computationally expensive.

With the aim of exploring the potential for phonetically classified codebook mapping, the spectral distortion characteristics of a number of different phonemes were measured. For each phoneme, speech data were extracted from the hand-labelled TIMIT database, used to design codebooks, and DHC was calculated over an independent test segment of the TIMIT database containing only that phoneme. Due to the small amount of available training data for each phoneme, codebooks of size 16 were used, and it should be noted that all results (c.f. Table 4.4) could be substantially improved using more training data and larger codebooks. As anticipated, the results vary considerably with each phoneme; however, the average spectral distortion over all phonemes of 3.02 dB suggests that a substantial enhancement over standard codebook mapping (DHC = 3.35 dB for codebook size 16) could be made if near-perfect phonetic classification were available.

Table 4.4. Phoneme-dependent spectral distortion results

  Phoneme type      TIMIT symbol   Example word   DHC (dB)
  stop              d              day            2.74
  stop              k              key            3.49
  affricate         jh             joke           2.74
  fricative         f              fin            2.47
  fricative         s              sea            4.58
  fricative         sh             she            3.06
  fricative         v              van            2.60
  nasal             m              mum            3.10
  nasal             ng             sing           3.55
  nasal             nx             winner         3.09
  semivowel/glide   r              ray            2.15
  semivowel/glide   w              way            2.54
  semivowel/glide   hh             hay            3.21
  vowel             aa             bottle         2.39
  vowel             iy             beat           3.57
  vowel             ay             bite           3.06
  vowel             ow             boat           3.07
  vowel             ah             but            2.93


4.1.8 A new method for codebook mapping on a sub-frame basis

Phonetic classification in speech recognition is achieved in part using analysis of the temporal evolution of the spectral envelope. Temporal information can also be exploited to improve highband spectral envelope estimation, and this has been attempted previously in [AVE95] (c.f. section 3.3.4). Investigations in this thesis into how temporal information could best be applied to a codebook mapping approach revealed that the use of two sub-frame spectral envelopes produced the best improvement to highband spectral envelope estimation using codebook mapping.

In this scheme, extended narrowband code vectors of length 64 are formed by concatenating pairs of 32 log envelope samples from consecutive narrowband sub-frames. Codebooks are then designed using these vectors, in place of single narrowband 32 sample envelope vectors. The mapping is a one-to-one mapping between the extended narrowband codebook and the highband codebook, similar to that described in section 3.1.4. The resulting scheme is illustrated in the block diagram of Fig. 4.7.

Figure 4.7. Sub-frame codebook mapping

Results of an objective comparison of this technique with other highband envelope estimation methods are given in Table 4.5 of section 4.4.2, and show a small but significant improvement over the codebook methods of section 3.1.4. This suggests that including additional information about the temporal evolution of the spectral envelope, along with the narrowband envelope shape, can improve the accuracy of highband envelope estimation.


4.1.9 Novel methods for codebook mapping with interpolation

Previous research (c.f. section 3.1.7) has seen the application of interpolation to various other problems, and its application to highband envelope estimation by codebook mapping is examined in this thesis. In designing a codebook mapping with nearest neighbour averaging scheme (c.f. section 3.1.7), the parameter of interest is K, the number of nearest neighbours. In order to determine the optimum value of K, the spectral distortion performance of codebook mapping using log envelope samples and codebook size 512 was tested for different values of K, and the results are graphed in Fig. 4.8. A substantial improvement is made by introducing interpolation (K > 1) to codebook mapping (K = 1), with the optimum parameter choice occurring at K = 6. Note that K = 2 also gives a near optimum parameter choice with fewer distance calculations required.
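A sketch of codebook mapping with nearest neighbour averaging follows; the inverse-distance weighting used here is an assumption made for illustration, not necessarily the weighting of section 3.1.7, and all names are hypothetical:

```python
import numpy as np

def map_with_interpolation(nb_env, C_L, C_H, K=6, eps=1e-12):
    """Codebook mapping with K-nearest-neighbour averaging (sketch).

    Rather than returning the single highband partner of the closest
    narrowband code vector (K = 1), the highband partners of the K
    closest narrowband code vectors are combined.
    """
    d = np.sqrt(((C_L - nb_env) ** 2).sum(axis=1))
    nn = np.argsort(d)[:K]
    w = 1.0 / (d[nn] + eps)  # inverse-distance weights (illustrative choice)
    w /= w.sum()
    return w @ C_H[nn]
```

With K = 1 this reduces to standard codebook mapping; the results in Fig. 4.8 suggest K = 6 as the optimum and K = 2 as a cheaper near-optimum.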

Figure 4.8. Highband spectral distortion DHC vs. number of nearest neighbours K for codebook mapping with interpolation, codebook size 512.

Fuzzy codebook mapping requires the choice of a further parameter, the fuzziness m. Taking the value of K = 6 determined from above, the fuzziness m was varied and the highband spectral distortion of fuzzy codebook mapping using log envelope samples and codebook size 512 was measured. The results are given in Fig. 4.9, where the optimum parameter choice occurs at m = 1.4, although the variation in highband spectral distortion for any choice of m > 1.1 is very small. Interestingly, the optimum values K = 6 and m = 1.4 only differed very slightly from the values K = 6 and m = 1.6 reported in [NAK89], where the application was different.
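The fuzzy interpolation step can be sketched using fuzzy c-means style memberships with fuzziness m; this is one plausible realisation, with illustrative names:

```python
import numpy as np

def map_fuzzy(nb_env, C_L, C_H, K=6, m=1.4, eps=1e-12):
    """Fuzzy codebook mapping (sketch).

    The K nearest narrowband code vectors receive fuzzy memberships
    u_i = 1 / sum_j (d_i / d_j)^(2/(m-1)), which sum to one, and the
    highband estimate is the membership-weighted sum of their highband
    partners. As m -> 1 the mapping approaches a hard (K = 1) mapping.
    """
    d = np.sqrt(((C_L - nb_env) ** 2).sum(axis=1)) + eps
    nn = np.argsort(d)[:K]
    dk = d[nn]
    u = 1.0 / ((dk[:, None] / dk[None, :]) ** (2.0 / (m - 1.0))).sum(axis=1)
    return u @ C_H[nn]
```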


Figure 4.9. Highband spectral distortion DHC vs. fuzziness m for fuzzy codebook mapping, codebook size 512, K = 6.

Either interpolation method could be used in conjunction with any other codebook mapping methods to produce an improvement in spectral distortion, although in the case of fuzzy codebook mapping (seen in Fig. 4.9), this improvement is very slight.

4.1.10 A new method for codebook mapping with reduced storage requirements

One shortcoming of codebook mapping, particularly when compared with other techniques such as linear mapping, is that it requires a considerable amount of memory to store the narrowband and highband code vectors. Since the narrowband codebook is designed using a narrowband spectral distortion criterion, reducing its size produces a coarser classification and an increase in the spectral distortion of the highband envelope estimate, assuming sufficient training data have been utilized during codebook design. The highband codebook, on the other hand, is not optimized according to a highband spectral distortion criterion, and thus it is possible that some code vectors within the highband codebook may be quite similar to each other.

These shortcomings suggest a new and more memory efficient variant of codebook mapping, between a narrowband codebook of size N and a highband codebook of size L < N which has been optimized to some extent using a highband spectral distortion criterion. This scheme is illustrated in Fig. 4.10.


Figure 4.10. Block diagram of reduced storage codebook mapping. The integers stored alongside the narrowband code vectors are indices to the highband codebook.

An important design question for reduced storage codebook mapping is how to construct the highband codebook and the mapping from the narrowband to highband code vectors. One method proposed in this thesis, based upon pairwise nearest neighbour codebook design, is given in Algorithm 4.2. The objective of this algorithm is to compress the highband codebook from size N to size L < N by combining vectors which contain very similar envelope shapes.

1. Form a pair of codebooks CL and CH of size N from the training data using Carl’s method (Algorithm 3.4).
2. Form a new highband codebook CRH of size L < N from CH using the pairwise nearest neighbour algorithm (Algorithm 3.1).
3. Quantize CH using CRH, and associate the indices i1, . . . , iN resulting from this quantization with the N narrowband code vectors.

Algorithm 4.2. Reduced storage codebook design
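Algorithm 4.2 can be sketched as follows; a greedy size-weighted pairwise nearest neighbour merge stands in for Algorithm 3.1, and all names are illustrative:

```python
import numpy as np

def pnn_reduce(C_H, L):
    """Reduce a highband codebook of size N to L < N vectors (sketch).

    Greedily merges the pair of centroids with the smallest size-weighted
    merge cost until L remain. Returns the reduced codebook and the index
    map from the N original vectors to the L reduced vectors.
    """
    cells = [[i] for i in range(len(C_H))]
    cents = [C_H[i].astype(float).copy() for i in range(len(C_H))]
    sizes = [1] * len(C_H)
    while len(cents) > L:
        best, bi, bj = None, -1, -1
        for i in range(len(cents)):
            for j in range(i + 1, len(cents)):
                # merge cost: size-weighted squared centroid distance
                cost = (sizes[i] * sizes[j]) / (sizes[i] + sizes[j]) * \
                       ((cents[i] - cents[j]) ** 2).sum()
                if best is None or cost < best:
                    best, bi, bj = cost, i, j
        cents[bi] = (sizes[bi] * cents[bi] + sizes[bj] * cents[bj]) / \
                    (sizes[bi] + sizes[bj])
        cells[bi] += cells[bj]
        sizes[bi] += sizes[bj]
        del cents[bj], cells[bj], sizes[bj]
    C_RH = np.array(cents)
    index_map = np.empty(len(C_H), dtype=int)
    for j, cell in enumerate(cells):
        index_map[cell] = j
    return C_RH, index_map
```

Quantizing the original CH of size N with the reduced CRH then yields the N indices stored alongside the narrowband code vectors, as in Fig. 4.10.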

To obtain an idea of how far the highband codebook can be reduced in size before adverse highband spectral distortion occurs, a codebook pair of size 512 was designed using Carl’s method (Algorithm 3.4) and then reduced to various sizes using Algorithm 4.2. The resulting pairs of codebooks (CL, CRH) were used to estimate highband envelopes for a 96s segment of speech independent from the training data, and the highband spectral distortion DHC (c.f. equation (4-1)) was calculated. The results are illustrated in Fig. 4.11, from which it can be seen that the use of the novel Algorithm 4.2 allows a reduction in highband codebook size by a factor of three with virtually no associated change in codebook mapping performance. This represents a considerable decrease in the memory required to implement codebook mapping.

Figure 4.11. Highband spectral distortion DHC for a narrowband codebook size of 512 and various highband codebook sizes.

4.2 Statistical Recovery - alternative parameterizations

4.2.1 Line spectral frequencies

It was hypothesized that, in order to optimize the highband spectral distortion resulting from envelope estimation using the SR function, the probability density functions (equations (D-1a) and (D-1b) in Appendix D) should represent envelope parameterizations other than signal autocorrelation vectors (as in [CHE94]). Alternative envelope parameterizations may provide the advantage of distributions more closely resembling the Gaussian distributions assumed to underlie SR. A second possible advantage would be the choice of parameterizations more closely correlated with the log spectral envelope, and hence more closely related to models of human perception.

Since LSFs have a roughly Gaussian distribution (see for example [SOO93]), an LSF parameterization was investigated with LP analysis orders (vector dimensions) pN = 10 and pW = 18. Since the Itakura-Saito distance (see (B-4) in Appendix B) cannot be directly applied to LSFs, equations (D-1a) and (D-1b) were reformulated with a Mahalanobis distance (see (B-3) in Appendix B) in place of the Itakura-Saito distance:

\[
p(\mathbf{x}_t \mid \boldsymbol{\lambda}_i) \propto \exp\!\left[ -\sum_{k=1}^{p_N} \frac{\bigl(x_t(k) - \lambda_i(k)\bigr)^2}{2\sigma_{x_t}^2(k)} \right], \tag{4-3a}
\]

\[
p(\mathbf{y}_t \mid \boldsymbol{\mu}_j) \propto \exp\!\left[ -\sum_{k=1}^{p_W} \frac{\bigl(y_t(k) - \mu_j(k)\bigr)^2}{2\sigma_{y_t}^2(k)} \right], \tag{4-3b}
\]

where σ²_{x_t}(k) and σ²_{y_t}(k) are the variances of the k’th elements of the narrowband and wideband LSF training vectors x_t and y_t respectively over all 1 ≤ t ≤ T.

4.2.2 Cepstral coefficients

Since a mixture Gaussian density is often employed for cepstral coefficients in applications such as speech recognition [JUA96], a cepstral parameterization was also investigated. Equations (D-1a) and (D-1b) were once again replaced by equations (4-3a) and (4-3b), but in this instance σ²_{x_t}(k) and σ²_{y_t}(k) denote the variances of the k’th elements of the narrowband and wideband cepstral coefficient training vectors x_t and y_t respectively over all 1 ≤ t ≤ T. The same LP analysis orders, pN = 10 and pW = 18, were employed in this experiment.

Objective comparisons between the performance of SR with these modifications and the original implementation can be seen in Table 4.5 in section 4.4.2.

4.3 Linear methods - new methods and improvements

4.3.1 A new method for linear mapping using narrowband voicing and gain

In section 4.1.6, voicing information was applied to codebook mapping to produce more reliable highband spectral envelope estimates. Voicing can easily be incorporated into linear mapping simply by including the narrowband degree of voicing ν in an augmented input vector x′ = [x | ν], and calculating the output (highband envelope) vector as ŷ = A′x′ in place of equation (3-23). A′ is calculated according to (3-25), with a matrix X′, comprising vectors of the form of x′, in place of X. Similarly, the narrowband gain can be incorporated into the mapping to provide additional information about the narrowband frame for highband envelope estimation. While the gain does not have a direct relationship to the envelope shape, to some extent different gains can be correlated with different phonemes and silence. The improvement in highband spectral distortion resulting from the inclusion of the narrowband degree of voicing and the narrowband gain in linear mapping is shown in Table 4.5 of section 4.4.2.
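A least-squares sketch of the augmented mapping follows; numpy.linalg.lstsq stands in for the closed form of (3-25), and all names are illustrative:

```python
import numpy as np

def train_augmented_linear_map(X, Y, nu):
    """Train a linear map from augmented narrowband features to highband
    envelopes (sketch). X: (T, d_nb) narrowband envelope vectors,
    Y: (T, d_hb) highband envelope vectors, nu: (T,) degree of voicing."""
    Xa = np.hstack([X, nu[:, None]])            # rows are x' = [x | nu]
    A, *_ = np.linalg.lstsq(Xa, Y, rcond=None)  # least-squares fit Y ~ Xa A
    return A.T                                  # (d_hb, d_nb + 1): y_hat = A' x'

def apply_augmented_linear_map(A_aug, x, nu):
    # Estimate one highband envelope from a narrowband envelope and voicing.
    return A_aug @ np.concatenate([x, [nu]])
```

The narrowband gain can be appended to x′ in exactly the same way as ν.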

4.3.2 Improved straight line highband envelope estimation

In a simulation of the straight line envelope extrapolation technique of [YAS96a] over a 96 s speaker-independent test speech segment, a large spectral distortion of DHC = 6.40 dB was calculated. It was hypothesized that a better estimate of the average highband spectral slope μ_H could be obtained using a linear combination of the average narrowband spectral slope μ_N and the narrowband degree of voicing ν:

\[
\mu_H = c_1 \nu + c_2 \mu_N + c_3, \tag{4-4}
\]

where c_1, c_2, c_3 ∈ ℝ are constants estimated from a large database using multivariate linear regression. The spectral distortion DHC resulting from this technique was substantially smaller than that of straight line extension, as seen in Table 4.5.

4.4 Objective assessment

4.4.1 Speech data

Around half an hour (83418 frames of 20 ms duration) of 8 kHz bandlimited speech uttered by many speakers, male and female, was collated from TIMIT. This was low pass filtered to a cutoff frequency of 3.8 kHz and downsampled to form the narrowband database. For each synchronized 20 ms frame of narrowband and wideband speech, 10th and 18th order LP envelopes were calculated, and from these, 32-sample and 64-sample log envelope parameterizations were derived to form the narrowband and wideband training databases respectively. These training databases were then used to design the codebooks, matrices or other parameters required for each method.

A test speech database of 96 s length (4800 frames) was also formed from TIMIT, using several speakers not represented in the training database.

4.4.2 Comparison of highband envelope estimation methods

The results of the application of the highband spectral distortion criterion (4-1) to the methods described in chapters 3 and 4 are summarized in Table 4.5.

Table 4.5. Comparison of highband envelope estimation methods

  Description of method                                                   Section   DHC (dB)
  Codebook mapping, LSFs, codebook size 256 [CAR94a,b]                    3.1.4     3.49
  Codebook mapping, log envelope samples (LES), codebook size 512         3.1.4     3.29
  Codebook mapping with codebooks split by voicing, LES,
    total codebook size 256 *                                             4.1.6     3.25
  Sub-frame codebook mapping, LES, codebook size 1024 *                   4.1.8     3.25
  Codebook mapping with nearest-neighbour Euclidean interpolation,
    LES, codebook size 512, K = 6 nearest neighbours *                    4.1.9     3.22
  Codebook mapping with fuzzy interpolation, LES, codebook size 512,
    K = 6 nearest neighbours, m = 1.4 *                                   4.1.9     3.22
  Statistical recovery, N = 256, M = 256                                  3.2       3.47
  Statistical recovery using LSFs, N = 256, M = 256 *                     4.2.1     3.49
  Statistical recovery using cepstral coefficients, N = 256, M = 256 *    4.2.2     3.40
  Fixed shaping filter (average of many highband envelopes)               3.3.7     3.49
  Flat highband envelope                                                  3.3.6     4.15
  Straight line extension                                                 3.3.5     7.08
  Improved straight line envelope *                                       4.3.2     3.66
  Linear mapping                                                          3.3.1     3.72
  Linear mapping with narrowband voicing *                                4.3.1     3.47
  Linear mapping with narrowband gain *                                   4.3.1     3.43
  Piecewise linear mapping, L = 8 partitions                              3.3.2     3.57
  Multidimensional filter banks                                           3.3.4     5.90

  * New methods developed in this thesis


In all cases, the parameters chosen for each method were optimal for that method, and were determined by calculating DHC on the same test data over a range of parameter values. Voicing estimation for all methods employing narrowband voicing was achieved using the technique described in [ATA76].

Codebook mapping generally produced the smallest highband spectral distortion in this comparison. In all instances codebook mapping was improved by the inclusion of new narrowband information such as narrowband voicing, temporal evolution of the narrowband envelope or interpolation between several mappings. The most effective method in terms of computation and storage is probably the fixed shaping filter, based on an average of many highband envelopes, which produces a spectral distortion within 0.27 dB of the minimum obtained. Linear mapping with narrowband gain information also performs well considering the small memory and computational requirements.

While most new methods outperformed the existing methods from which they were derived, some of these improvements are in the order of 0.1 dB (3%). It is important to note that these small improvements produce larger improvements under a spectral distortion criterion which does not remove the average highband log spectral difference (5-1), in contrast to that of (4-1). Thus an improvement of 0.1 dB in ‘shape only’ highband spectral distortion DHC may produce an improvement of 0.8 dB or more in ‘gain-shape’ highband spectral distortion DH (5-1). It was observed that smaller ‘shape only’ distortion was almost always matched by smaller ‘gain-shape’ distortion, although their relationship was not proportional in general. Some results for ‘gain-shape’ distortion can be found in Table 5.1 of section 5.3.1.

4.4.3 Limits to highband envelope estimation

This chapter has examined many methods for highband envelope estimation designed to produce small highband spectral distortion, and an important remaining question is the likely minimum spectral distortion which can be obtained under present constraints. One means of estimating these limits is to examine the extent to which two similar narrowband envelopes have similar corresponding highband envelopes.


Suppose two narrowband-highband spectral envelope pairs are selected at random from independent speech databases. Their narrowband spectral distortion is calculated and their highband spectral distortion is also computed according to (4-1). If this procedure is then repeated for all combinations of codebook and database envelope pairs, then the resulting data can be used to gain an idea of the distribution of highband spectral distortion results which could be expected from any codebook mapping scheme with a given maximum narrowband spectral distortion.
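The pairing procedure just described can be sketched as follows; removing the mean log difference approximates the gain correction of (4-2), and all names are illustrative:

```python
import numpy as np

def log_sd(a, b):
    """RMS distance (dB) between two log-envelope vectors after removing
    the mean difference, approximating gain-corrected spectral distortion."""
    d = a - b
    d = d - d.mean()
    return np.sqrt((d ** 2).mean())

def highband_sd_vs_narrowband_sd(nb1, hb1, nb2, hb2):
    """For every cross pair drawn from two independent sets of
    (narrowband, highband) log-envelope vectors, record the point
    (narrowband SD, highband SD). Binning these points by narrowband SD
    and taking percentiles per bin gives contours like those of Fig. 4.12."""
    pts = []
    for a_nb, a_hb in zip(nb1, hb1):
        for b_nb, b_hb in zip(nb2, hb2):
            pts.append((log_sd(a_nb, b_nb), log_sd(a_hb, b_hb)))
    return np.array(pts)
```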

The 10%, 50% and 90% contours of the highband spectral distortion distribution are plotted against the maximum tolerated narrowband spectral distortion in Fig. 4.12.

Figure 4.12. Distribution of highband spectral distortion DHC as a function of tolerated (maximum) narrowband spectral distortion. Results are based on around 5×10⁶ data points. The percentile contours shown are the 10%, 50% (median) and 90%. Note that there are relatively few data points at 1 dB tolerated narrowband distortion.

The predominantly flat characteristic of Fig. 4.12 shows that highband spectral distortion is only weakly correlated with narrowband spectral distortion, except for small narrowband spectral distortions. A similar study [NIL00] also found that the correlation between narrowband and highband envelope shape was very limited, with a lower bound of 0.1 bits of mutual information. Good performance (a median highband spectral distortion of 2 dB) with codebook mapping schemes is therefore possible only if the narrowband spectral distortion can be contained to around 1 dB. This would require codebooks consisting of roughly 2³⁰ vectors, a size which is barely feasible under present storage constraints.

Figure 4.12 also suggests that the better methods from Table 4.5 are able to perform at least as well as the median expected highband spectral distortion. For example, the novel highband envelope estimation technique of codebook mapping with codebooks split by voicing achieves a highband spectral distortion DHC of 3.25 dB for a codebook size of 256. The narrowband codebook of size 256 used in this technique produces a mean narrowband spectral distortion of 2.6 dB, and a maximum narrowband spectral distortion of about 5.5 dB, which yields an expected median highband spectral distortion DHC of approximately 3.8 dB.

4.5 Conclusion

In this chapter, new highband envelope estimation methods which exploit narrowband information to a greater extent were developed. The importance of using very large training databases for highband envelope estimation was established, and an estimate for the training ratio was made. The benefits of maximizing the use of available narrowband information such as voicing, sub-frame information and gain in the estimation of the highband envelope were also shown. An extensive spectral distortion comparison was made between many highband envelope estimation methods, with codebook mapping-based methods proving more accurate than other techniques. Some much simpler techniques, such as the novel linear mapping with narrowband gain, achieved a highband spectral distortion DHC within 0.2 to 0.3 dB of the best methods. Limits to the performance of future codebook mapping schemes were estimated, showing that considerable computation and memory resources must be available before codebook mapping can achieve small highband spectral distortion. In any case, achieving less than 1 dB highband spectral distortion DHC will not be possible, and even achieving less than 2 dB will require considerably advanced computational and memory resources compared with those currently available.

Ch 5 / Narrowband-Highband Spectral Envelope Continuity 88

Chapter 5. Narrowband-Highband Spectral Envelope Continuity

[Block diagram of the wideband enhancement scheme: the narrowband speech is analysed; the highband envelope is estimated and matched to the narrowband envelope (narrowband-highband envelope continuity); highband excitation is synthesized and filtered to produce the synthetic highband speech, which is added to the narrowband speech, upsampled by 2 and low pass filtered at 3.8 kHz, to form the output wideband speech.]

5.0 Overview

Regardless of how the highband spectral envelope shape is estimated, an important problem is the matching of the estimated highband envelope to the narrowband envelope in order to minimize error in the highband gain. Existing methods for wideband enhancement of narrowband speech estimate a highband gain, which is applied to the highband spectral envelope after it has been estimated. Methods of this kind only produce acceptable subjective quality in the output speech if both highband spectral envelope shape and highband gain estimation can be performed accurately.

An important perceptual consideration is the matching of the estimated highband envelope to the narrowband envelope so as to ensure that, in the resulting wideband envelope, there is no discontinuity at the junction between the narrowband and estimated highband envelopes. This can be applied as a constraint to highband gain estimation. Alternatively, gain estimation using existing methods can be used in combination with smoothing of the resulting wideband envelope to reduce the perceptual impact of discontinuities.

This chapter reviews existing methods for highband gain estimation, devises new techniques for preserving continuity between the narrowband and estimated highband envelopes, and compares these approaches using an appropriate objective assessment criterion.


Methods outlined in chapters 3 to 5 allow the estimation of a highband spectral envelope and its matching to the narrowband envelope. The aim of section 5.4 is to assess the overall envelope estimation and matching process objectively and subjectively. The performance of the overall wideband enhancement scheme is also subjectively assessed.

The structure of this chapter is as follows. Section 5.1 reviews existing methods for highband gain estimation. Section 5.2 proposes new techniques for narrowband-highband envelope matching, and in section 5.3 existing and novel methods are objectively evaluated. In section 5.4, the performance of the combined highband envelope estimation and narrowband-highband envelope matching scheme is assessed. Conclusions are presented in section 5.5.

5.0.1 Highband spectral distortion criterion

Instead of using the ‘shape only’ highband spectral distortion measure DHC from section 4.0.1, a new measure is introduced which allows for the highband gain as well as the envelope shape. This ‘gain-shape’ highband spectral distortion DH is used extensively throughout chapter 5, and is calculated over K frames as

\[
D_H = \left\{ \frac{1}{K} \sum_{k=1}^{K} \frac{2}{(1-\eta)\omega_s} \int_{\eta\omega_s/2}^{\omega_s/2} \left[ 20 \log_{10}\!\left( G_N \frac{A_k(\omega)}{\tilde{A}_k(\omega)} \right) \right]^2 d\omega \right\}^{1/2}, \tag{5-1}
\]

where

\[
G_N = \frac{2}{\eta\omega_s} \int_{0}^{\eta\omega_s/2} \frac{\tilde{A}_k(\omega)}{A_k(\omega)}\, d\omega, \tag{5-2}
\]

~ and Ak(Z) and Ak (Z ) are the true and estimated envelopes, respectively, of the kth (temporally aligned) frames of wideband speech, Zs is the wideband sampling frequency and K is the proportion of the wideband bandwidth occupied by the narrowband bandwidth. In this instance K = 0.475, corresponding to a 3.8 kHz


narrowband bandwidth. The compensating gain factor G_N has the effect of removing the mean difference between the two log envelopes in the narrowband region only. D_H thus measures the highband spectral distortion after gain estimation has occurred, as illustrated in Fig. 5.1. D_H should be contrasted with the ‘shape only’ highband spectral distortion D_HC defined in (4-1) and illustrated in Fig. 4.1 (see section 4.0.1).

Figure 5.1. Highband spectral distortion measure D_H (spectral distortion shown over the highband region, ηω_s/2 to ω_s/2).

The ‘gain-shape’ spectral distortion D_H is the most suitable criterion for comparing different envelope matching techniques since it measures the closeness of the estimated highband envelope to the true wideband envelope after the gain of the estimated highband envelope has been determined. D_H also provides a more realistic assessment of the complete highband envelope estimation process (i.e. including envelope matching) than the ‘shape only’ spectral distortion D_HC because it employs a narrowband gain correction factor G_N (5-2) which does not require access to the true highband envelope, unlike the highband gain correction factor G_H (4-2) used in the calculation of D_HC.

Note that in cases where the parameterization of the estimated highband spectral envelope Ã_k(ω) does not allow for its being defined in the low band (0 to ηω_s/2), a wideband spectral envelope is formed by combining the estimated highband envelope with the input narrowband envelope after first determining the appropriate matching gain between the two envelopes.
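For a single frame, the ‘gain-shape’ measure of (5-1) and (5-2) can be sketched in code, with the integrals replaced by sums over uniformly spaced frequency samples. This is only an illustrative approximation, not the thesis implementation; the function and variable names are the author's inventions, and G_N follows the linear-ratio form of (5-2) as reconstructed above.

```python
import math

def gain_shape_distortion(true_env, est_env, eta=0.475):
    """Discrete sketch of the 'gain-shape' highband spectral distortion
    D_H of (5-1) for one frame (K = 1).

    true_env, est_env: linear-magnitude envelope samples on a uniform
    frequency grid from 0 to the Nyquist frequency (omega_s / 2).
    eta: fraction of the wideband bandwidth occupied by the narrowband.
    Returns an RMS log spectral error in dB over the highband.
    """
    n = len(true_env)
    split = int(round(eta * n))  # index of the narrowband/highband junction

    # (5-2): narrowband gain correction G_N, the mean ratio of estimated
    # to true envelope over the narrowband region only.
    g_n = sum(est_env[i] / true_env[i] for i in range(split)) / split

    # (5-1): RMS log spectral error over the highband after applying G_N.
    sq_err = [(20.0 * math.log10(g_n * true_env[i] / est_env[i])) ** 2
              for i in range(split, n)]
    return math.sqrt(sum(sq_err) / len(sq_err))
```

As expected from the definition, an estimate that differs from the true envelope only by a uniform gain yields zero distortion, since G_N absorbs the gain difference.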


5.1 Existing methods for highband gain estimation

The objective of highband gain estimation is the calculation of a gain, g_M, which determines the gain of the highband signal g_H relative to that of the narrowband signal g_N:

g_H = g_M g_N .   (5-3)

5.1.1 Range matching

Although there are several existing techniques for matching the highband to the narrowband envelope, none of these explicitly preserves continuity in the resulting spectral envelope. Of these, however, range matching comes close to preserving continuity.

If there is a frequency domain overlap between the input narrowband envelope and the estimated highband envelope, then one approach to highband gain estimation [CAR94a,b] is to match them over some part of that overlapping range, yielding a gain

g_M^2 = \frac{1}{\omega_H - \omega_L} \int_{\omega_L}^{\omega_H} W(\omega) \, \frac{\left| A_N\!\left(e^{j\omega T_N}\right) \right|^2}{\left| A_H\!\left(e^{j\omega T_H}\right) \right|^2} \, d\omega ,   (5-4)

or in discrete form

g_M(\mathrm{dB}) = \frac{1}{\omega_H - \omega_L} \sum_{\omega_i = \omega_L}^{\omega_H} W(\omega_i) \, 20 \log_{10} \frac{\left| A_N\!\left(e^{j\omega_i T_N}\right) \right|}{\left| A_H\!\left(e^{j\omega_i T_H}\right) \right|} ,   (5-5)

where A_N and A_H are the narrowband and highband envelope magnitude responses, and T_N and T_H are the narrowband and highband sampling periods respectively. ω_L and ω_H are the lower and upper frequency limits of the matching range, and are set equivalent to 0 and 4 kHz respectively in [CAR94a,b]. W(ω) is a frequency-dependent weighting


function which is chosen to be unity inside and zero outside the range of the narrowband speech signal in [CAR94a,b]. This gain matching calculation is equivalent to finding the mean difference in log spectra between the narrowband and highband envelopes in the overlapping spectral region, as shown in Fig. 5.2.

Figure 5.2. Range matching of a narrowband and a (0-8 kHz) highband log envelope to produce a matching gain g_M (equal in magnitude to the shaded area between the two log envelopes over the matching range ω_L to ω_H).
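The discrete range matching gain of (5-5) amounts to a (weighted) mean log spectral difference over the overlap region. A minimal sketch follows; the function name is illustrative, and normalizing by the sum of the weights (rather than by ω_H − ω_L) is an assumption made so that unnormalized weights still yield a mean.

```python
import math

def range_matching_gain_db(nb_env, hb_env, weights=None):
    """Sketch of the discrete range matching gain g_M in dB, per (5-5).

    nb_env, hb_env: linear envelope magnitudes |A_N|, |A_H| sampled at
    the same frequencies omega_L .. omega_H inside the overlap region.
    weights: optional frequency-dependent weighting W(omega_i);
    defaults to unity inside the matching range, as in [CAR94a,b].
    """
    if weights is None:
        weights = [1.0] * len(nb_env)
    total = sum(w * 20.0 * math.log10(a_n / a_h)
                for w, a_n, a_h in zip(weights, nb_env, hb_env))
    return total / sum(weights)
```

The corresponding linear gain is g_M = 10^(g_M(dB)/20), which is then applied to the highband envelope before it is joined to the narrowband envelope.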

5.1.2 Range matching with correction factor

Range matching tends to produce large highband spectral distortion in cases where, for example, the highband envelope estimation stage has incorrectly produced an unvoiced highband envelope for a voiced frame. In this example (see Fig. 5.3), range matching overestimates the highband gain due to the narrowband matching criterion.

Figure 5.3. Example of range matching overestimating the highband gain where the highband envelope estimation stage has produced an incorrect estimate (dashed) of the true highband envelope (solid).

A gain correction factor gM2 was proposed by Carl [CAR94a] as part of the codebook mapping scheme, with the objective of compensating for inaccuracies in the gain gM1 estimated during the initial range matching. The value of the correction factor is conditioned by the choice of highband envelope made during the codebook mapping process, as seen in Fig. 5.4.


Figure 5.4. Block diagram of range matching with correction factor g_M2: the narrowband envelope selects the most similar envelope from the narrowband codebook, which indexes both the highband codebook and a gain codebook of correction factors; the correction factor g_M2 is then combined with the range matching gain g_M1 to produce g_M.

The gain codebook is calculated by applying codebook mapping and range matching to each envelope pair in the training data, and computing the mean highband difference between the estimated highband log envelope and the true highband log envelope. The jth gain vector is then calculated as the average of all such differences resulting from a codebook mapping using the jth narrowband-highband code vector pair. The overall matching gain g_M is the product of the range matching gain g_M1 and the linear gain factor (converted from a log gain) g_M2.

5.1.3 Statistical recovery

Statistical recovery (c.f. section 3.2) is a maximum likelihood method for estimating both the wideband envelope shape and gain based upon the narrowband envelope shape and gain [CHE94]. The gain from narrowband to wideband is calculated in terms of the probability of each wideband random source θ_j as

E_{\theta_j} = \frac{ \sum_{t=1}^{T} \frac{H(\mathbf{y}_t)}{H(\mathbf{x}_t)} \, p(\theta_j \mid \mathbf{x}_t, \mathbf{y}_t) }{ \sum_{t=1}^{T} p(\theta_j \mid \mathbf{x}_t, \mathbf{y}_t) } ,   (5-6)

where x_t and y_t are the t'th narrowband-wideband pair of T training vector pairs, H(x_t) and H(y_t) are their respective energies, and the distributions of the sources θ_j are iteratively estimated using the EM algorithm, as described in [CHE94]. Equation (5-6)


is computed as part of Algorithm D.1 (see Appendix D), immediately after equation (D4b) in step 4. More recent work [ENB98] has shown that this gain estimate can be improved by including dependency on the narrowband sources λ_i.

5.1.4 Straight line extrapolation

Perhaps the simplest method of estimating a highband gain is by extrapolation of the narrowband envelope. One means of envelope extrapolation is to determine the average gradient of the narrowband log envelope, form a straight line with this gradient, and extend the straight line into the highband (c.f. Fig. 3.14) [YAS96a]. The straight line is matched to the narrowband envelope using range matching, and its highband component is used to jointly estimate the highband envelope (c.f. section 3.3.5) and gain.

5.2 New methods for narrowband-highband envelope matching

An important property of any spectral envelope is that it is continuous across its entire frequency range. This property arises from the fact that the magnitude response of the vocal tract filter cannot contain discontinuities. The artefact introduced by a discontinuity of more than a few dB in a synthesized spectral envelope can be perceived by a listener (see for example [MOO89]).

The adverse perceptual effects of poor highband gain estimates can be reduced, either by smoothing discontinuities in the estimated wideband spectral envelope, or by making narrowband-highband continuity a constraint in the gain estimation process. This section addresses smoothing of discontinuities by linear interpolation and splicing in sections 5.2.1 and 5.2.2, and continuity-constrained highband gain estimation in sections 5.2.3 to 5.2.6.


5.2.1 Highband gain estimation with linear interpolation

Regardless of which highband gain estimation technique of section 5.1 is employed, the narrowband and highband envelopes will not, in general, match exactly at the junction between narrowband and highband. Therefore, to avoid perceptual artefacts resulting from this mismatch, the log narrowband and highband envelopes can be linearly interpolated in the region surrounding the narrowband-highband junction to produce a final wideband envelope.

Linear interpolation of the log envelope samples between the narrowband envelope magnitude at ω_NH (e.g. 3.8 kHz) and the highband envelope magnitude at ω_HL (e.g. 4.2 kHz) is used in place of the original samples in this range, as illustrated in Fig. 5.5. Once an interpolated wideband log envelope has been determined, a spectral estimation algorithm with a frequency domain minimization criterion, such as spectral linear prediction [MAK75b] or log amplitude modelling [MAL99], is used to estimate a wideband all-pole envelope.

Interpolation could also be achieved simply by fitting an all-pole model directly to the discontinuous wideband spectral envelope samples, although this may cause a spurious peak to occur as a result of the discontinuity.

Figure 5.5. The (solid) wideband envelope is the result of linearly interpolating between the (dashed) narrowband and highband envelopes over the region ω_NH to ω_HL. Note that an all-pole envelope would be fitted to the final (solid) envelope.
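The interpolation step itself is straightforward; a minimal sketch on sampled log envelopes is given below, with illustrative names. Note that this covers only the interpolation, not the subsequent all-pole fit described above.

```python
def interpolate_junction(log_env, i_nh, i_hl):
    """Replace log-envelope samples strictly between indices i_nh
    (narrowband side, e.g. the 3.8 kHz bin) and i_hl (highband side,
    e.g. the 4.2 kHz bin) with a straight line, as in section 5.2.1.

    log_env: wideband log envelope samples (dB). Returns a new list;
    the endpoint samples at i_nh and i_hl are kept unchanged.
    """
    out = list(log_env)
    lo, hi = log_env[i_nh], log_env[i_hl]
    span = i_hl - i_nh
    for i in range(i_nh + 1, i_hl):
        frac = (i - i_nh) / span          # 0 at i_nh, 1 at i_hl
        out[i] = (1.0 - frac) * lo + frac * hi
    return out
```

An all-pole model (e.g. via spectral linear prediction) would then be fitted to the smoothed log envelope to obtain the final wideband envelope.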

5.2.2 Highband gain estimation with splicing

If there is an overlap between the narrowband and highband spectral envelopes, splicing is a more elegant means of attaining continuity. As seen in Fig. 5.6, splicing between


the narrowband envelope A_N(ω) and the highband envelope A_H(ω) to determine a wideband envelope A_W(ω) is achieved as follows:

A_W(\omega) = \begin{cases} A_N(\omega), & \omega \le \omega_{NH} \\[4pt] \dfrac{\omega_{HL} - \omega}{\omega_{HL} - \omega_{NH}} A_N(\omega) + \dfrac{\omega - \omega_{NH}}{\omega_{HL} - \omega_{NH}} A_H(\omega), & \omega_{NH} < \omega \le \omega_{HL} \\[4pt] A_H(\omega), & \omega > \omega_{HL} \end{cases}   (5-7)

Figure 5.6. The (solid) wideband envelope is the result of splicing the dashed portions of the narrowband and highband envelopes. Note that an all-pole envelope would be fitted to the spliced (solid) envelope.
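Equation (5-7) is a simple linear cross-fade between the two envelopes over the overlap region. A direct sketch, with illustrative names, taking the two envelopes as callables:

```python
def splice(a_n, a_h, omega, omega_nh, omega_hl):
    """Wideband envelope value A_W(omega) per (5-7): the narrowband
    envelope A_N below omega_NH, the highband envelope A_H above
    omega_HL, and a linear cross-fade in between.

    a_n, a_h: callables returning the (matched) envelope magnitudes.
    """
    if omega <= omega_nh:
        return a_n(omega)
    if omega > omega_hl:
        return a_h(omega)
    # cross-fade weight: 0 at omega_NH, 1 at omega_HL
    w = (omega - omega_nh) / (omega_hl - omega_nh)
    return (1.0 - w) * a_n(omega) + w * a_h(omega)
```

The highband envelope is assumed to have been gain-matched before splicing, so that the cross-fade bridges only the residual mismatch.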

5.2.3 Point matching

Wherever there is an overlap in the spectral range of the narrowband and highband envelopes, the gain may be estimated at a single chosen frequency ω_M as

g_M = \frac{\left| A_N\!\left(e^{j\omega_M T_N}\right) \right|}{\left| A_H\!\left(e^{j\omega_M T_H}\right) \right|} ,   (5-8)

where A_N and A_H are the narrowband and highband spectral envelopes respectively. This technique guarantees continuity at the transition frequency ω_M between the narrowband and highband envelopes and requires very little computation. The resulting highband spectral distortion is sensitive to any errors in estimated highband envelope shape which are local to ω_M, and care must be taken not to choose ω_M too close to any bandlimiting effects in the narrowband envelope. An all-pole narrowband spectral envelope is a good choice in this instance, since its magnitude is often close to that of the wideband speech at the upper end of the narrowband.


One possibility for reducing the sensitivity to errors in estimated highband envelope shape is to apply range matching to a small frequency band containing ω_M. Over any range matching interval, the envelopes being matched must intersect at least once after matching has occurred. Continuity between the envelopes could then be achieved by defining the junction between the narrowband and highband envelopes as, for instance, the highest frequency intersection (within the matching interval) between the two envelopes after matching has occurred.

One novel method proposed in this thesis for determining a weighting function W(ω_i) for range matching (c.f. equation (5-5) of section 5.1.1) makes use of point matching. Point matching is applied using a matching frequency ω_i to a speech database, and the resulting highband spectral distortion D_H is calculated using (5-1). This procedure is then repeated for a range of different matching frequencies ω_i, and the dependency of the highband spectral distortion upon the choice of ω_i is determined (see for example Fig. 5.7a). A set of frequency-dependent weights W(ω_i) is then derived which gives more emphasis to frequencies ω_i which produce small highband spectral distortion, for example

W(\omega_i) = \frac{D_H(\omega_i)^{-1}}{\sum_i D_H(\omega_i)^{-1}} .   (5-9)

A set of example weights, derived using codebook mapping on 83418 frames of training speech and computing equation (5-5) for a range of 32 frequencies ω_i evenly spaced across the narrowband, is illustrated in Fig. 5.7b. These weights were used as W(ω_i) in (5-5) to estimate a weighted range matching gain, and the spectral distortion results are given in Table 5.1 of section 5.3.1.

Figure 5.7. Example of (a) the dependency of highband spectral distortion D_H (dB) upon point matching frequency ω_i (0-4000 Hz) and (b) the resulting weights W(ω_i).
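Given a set of measured distortions D_H(ω_i) such as those in Fig. 5.7a, the weights of (5-9) are simply the normalized inverse distortions. A minimal sketch, with an illustrative function name:

```python
def matching_weights(distortions):
    """Frequency-dependent range matching weights W(omega_i) per (5-9):
    inverse distortion, normalized to sum to one, so that frequencies
    producing small highband spectral distortion D_H(omega_i) under
    point matching receive larger weight.

    distortions: positive D_H(omega_i) values, one per frequency bin.
    """
    inv = [1.0 / d for d in distortions]
    total = sum(inv)
    return [v / total for v in inv]
```

The resulting weights can be passed directly as W(ω_i) when computing a weighted range matching gain over the narrowband frequencies.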

5.2.4 Reference envelopes

Rather than determine the highband spectral envelope and match it to the narrowband envelope, an alternative is to extrapolate the narrowband envelope in some fashion and estimate the highband envelope in terms of its difference from the extrapolated, or ‘reference’, envelope. This concept is illustrated in Fig. 5.8, using codebook mapping to estimate the highband difference envelope. The reference envelope could be determined by a number of methods, and two are proposed in sections 5.2.5 and 5.2.6.

Figure 5.8. Block diagram of gain estimation using reference envelopes: the narrowband envelope selects the most similar envelope from the narrowband codebook, which indexes a codebook of highband difference envelopes; the selected difference envelope is added to a wideband reference envelope calculated from the narrowband envelope to produce the output wideband envelope.


5.2.5 Reference envelopes: Voicing-controlled straight log envelope

In this approach the log reference envelope is a straight line whose gradient μ_H is linearly related to the estimated narrowband degree of voicing ν:

\mu_H = \alpha \nu ,   (5-10)

similarly to (4-4), where α ∈ ℝ is a constant. In this case, the straight line is ‘anchored’ to the narrowband envelope by point matching at ω_M = ηω_s/2, as seen in Fig. 5.9. This anchoring is also applied as a constraint during the least squares determination of α from a large database of narrowband degree of voicing estimates ν and highband spectral gradients μ_H.

Figure 5.9. Reference envelope (solid line) calculated using voicing controlled straight line extension point matched to the narrowband envelope (dashed line).

Objective assessment results for reference envelopes using a voicing-controlled straight log envelope can be found in section 5.3.1.
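A minimal sketch of constructing such a reference envelope is given below. The names are illustrative, the gradient is expressed in dB per frequency bin for simplicity (the thesis does not specify the units used), and α is assumed to have been fitted offline by least squares as described above.

```python
def reference_log_envelope(nb_log_env, voicing, alpha, n_wide):
    """Voicing-controlled straight log reference envelope (section 5.2.5).

    nb_log_env: narrowband log envelope samples (dB).
    voicing: estimated narrowband degree of voicing (nu).
    alpha: pre-trained constant relating voicing to gradient, per (5-10).
    n_wide: total number of wideband log envelope samples to return.
    """
    grad = alpha * voicing            # (5-10): mu_H = alpha * nu, dB per bin
    anchor = nb_log_env[-1]           # point match at omega_M = eta*omega_s/2
    n_nb = len(nb_log_env)
    ref = list(nb_log_env)
    for i in range(n_nb, n_wide):
        # straight line anchored to the last narrowband sample
        ref.append(anchor + grad * (i - (n_nb - 1)))
    return ref
```

The highband portion of this reference envelope would then be combined with a codebook-mapped highband difference envelope, as in Fig. 5.8.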

5.2.6 Reference envelopes: Analog envelope extension

In recognition of the fact that the estimated envelope derives from a physical model of the vocal tract, which is in reality an analog system, the analog pole positions can be estimated. Their contributions to the highband envelope can then be determined, giving an alternative, physically based, reference envelope, as seen in Fig. 5.10.


Figure 5.10. Reference envelope (solid line) calculated using analog envelope extension, whereby the reference envelope is estimated based on the (dashed) narrowband envelope. The reference envelope is then range matched to the narrowband envelope.

A gradient-based least squares method using log amplitude criteria was used to estimate the analog poles for analog envelope extension. The objective of this method is to fit the magnitude response of an analog all-pole model G(s) to that of a given discrete all-pole model H_d(z) at a series of digital frequencies ω_i. Although this fit occurs only in the narrowband, the resulting magnitude response of G(s) over the entire wideband is then used as a reference envelope. Since the envelope is to be represented in dB, this fit is best achieved in the log domain. Thus if d_i = \log\left| H_d(e^{j\omega_i}) \right| and g_i = \log\left| G(j\Omega_i) \right|, where Ω_i = ω_i F_s and F_s = 1/T is the sampling frequency in Hz, the objective is to minimize

E = \sum_i (g_i - d_i)^2 .   (5-11)

If G(s) has p/2 conjugate pole pairs,

G(s) = \frac{K}{\prod_{k=1}^{p/2} \left( s^2 + a_k s + b_k^2 \right)} ,   (5-12)

where K, a_k, b_k ∈ ℝ. Note that the b_k are squared in order to force the constant coefficient to be greater than zero, thereby avoiding spurious estimates of G(jΩ_i). The log magnitude of G(s) at the frequency s = jΩ_i can then be expressed as

g_i = \log\left| G(j\Omega_i) \right| = \log K - \frac{1}{2} \sum_{k=1}^{p/2} \log\left[ (b_k^2 - \Omega_i^2)^2 + \Omega_i^2 a_k^2 \right] .   (5-13)


The estimation of K, a_k, b_k is achieved by forming a vector

\mathbf{c} = [\, K \;\; a_1 \,\ldots\, a_{p/2} \;\; b_1 \,\ldots\, b_{p/2} \,]' ,   (5-14)

where ' represents vector transpose, and iteratively applying a gradient algorithm update step

\mathbf{c}_{l+1} = \mathbf{c}_l - \mu \left. \frac{\partial E}{\partial \mathbf{c}} \right|_l ,   (5-15)

where c_l represents the value of c at the l'th iteration, μ ∈ ℝ is a convergence constant (or parameter), and \left. \partial E / \partial \mathbf{c} \right|_l represents the value of \partial E / \partial \mathbf{c} at the l'th iteration. Now

\frac{\partial E}{\partial c_k} = 2 \sum_i (g_i - d_i) \frac{\partial g_i}{\partial c_k} ,   (5-16)

so that the calculation of

\frac{\partial g_i}{\partial K} = \frac{1}{K} ,   (5-17a)

\frac{\partial g_i}{\partial a_k} = - \frac{\Omega_i^2 a_k}{(b_k^2 - \Omega_i^2)^2 + \Omega_i^2 a_k^2} ,   (5-17b)

\frac{\partial g_i}{\partial b_k} = - \frac{2 (b_k^2 - \Omega_i^2) b_k}{(b_k^2 - \Omega_i^2)^2 + \Omega_i^2 a_k^2} ,   (5-17c)

and g_i at each iteration l allows c_l to be updated to c_{l+1} for a given number of iterations or until a predetermined convergence criterion is met.
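Equations (5-12) to (5-17) can be sketched directly as a plain gradient descent. This is an illustrative implementation under the reconstructed notation above (the thesis' exact code, step size, and stopping criterion are not given); the function names are the author's inventions.

```python
import math

def log_mag(c, omega):
    """g = log|G(j*omega)| for G(s) = K / prod_k(s^2 + a_k s + b_k^2),
    per (5-13). c = [K, a_1..a_{p/2}, b_1..b_{p/2}]."""
    K = c[0]
    p2 = (len(c) - 1) // 2
    a, b = c[1:1 + p2], c[1 + p2:]
    g = math.log(K)
    for k in range(p2):
        g -= 0.5 * math.log((b[k] ** 2 - omega ** 2) ** 2 + (omega * a[k]) ** 2)
    return g

def fit_analog_poles(targets, omegas, c0, mu=1e-3, iters=3000):
    """Gradient descent (5-15)-(5-17) minimizing E = sum_i (g_i - d_i)^2,
    where targets are the log magnitudes d_i at analog frequencies omegas."""
    c = list(c0)
    p2 = (len(c) - 1) // 2
    for _ in range(iters):
        grad = [0.0] * len(c)
        for d_i, w in zip(targets, omegas):
            err = log_mag(c, w) - d_i          # (g_i - d_i)
            grad[0] += 2.0 * err / c[0]        # (5-16) with (5-17a)
            for k in range(p2):
                a_k, b_k = c[1 + k], c[1 + p2 + k]
                den = (b_k ** 2 - w ** 2) ** 2 + (w * a_k) ** 2
                grad[1 + k] += 2.0 * err * (-(w ** 2) * a_k / den)              # (5-17b)
                grad[1 + p2 + k] += 2.0 * err * (-2.0 * (b_k ** 2 - w ** 2) * b_k / den)  # (5-17c)
        c = [ci - mu * gi for ci, gi in zip(c, grad)]   # (5-15)
    return c
```

As the text notes, convergence of this plain descent is slow, and in practice more sophisticated descent methods (or a line search on μ) would be used.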


Initialization of this algorithm requires a good first estimate of the analog pole positions s_k = σ_k + jΩ_k. One such estimate can be derived by estimating the digital poles z_k = r_k e^{jθ_k} from the given magnitudes \left| H_d(e^{j\omega_i}) \right| (for example using spectral linear prediction [MAK75b]) and transforming them into analog poles using z_k = e^{s_k T}, yielding

s_k = \frac{1}{T} \log z_k = F_s \log r_k + j F_s \theta_k .   (5-18)

While the magnitude spectrum arising from these poles drops away fairly rapidly at higher frequencies in comparison to the discrete model, the poles provide a reasonable initial value for c. No such drop in the magnitude spectrum at higher frequencies (i.e. towards the Nyquist frequency) occurs in the discrete model, due to the effects of poles above the Nyquist frequency which are the conjugates of poles below the Nyquist frequency.
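The pole mapping of (5-18) is a one-liner; a small sketch with an illustrative name:

```python
import cmath
import math

def digital_to_analog_pole(z, fs):
    """Map a digital pole z = r*e^{j*theta} to an initial analog pole
    s = sigma + j*Omega via z = e^{sT}, i.e. (5-18):
    s = Fs*ln(r) + j*Fs*theta, where Fs = 1/T."""
    return complex(fs * math.log(abs(z)), fs * cmath.phase(z))
```

By construction, exponentiating s/F_s recovers the digital pole, which gives a quick sanity check on the mapping.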

The convergence of this algorithm is slow, taking tens of iterations to reach a locally optimum fit; however, this can be improved using more sophisticated descent methods. The algorithm provides a good fit in the frequency range spanned by ω_i (i.e. 0-4 kHz), typically resulting in narrowband spectral distortions between 1.3 and 1.6 dB. This means that range matching works very well, as implied by Fig. 5.10. In summary, this analog envelope extension algorithm provides a new method for estimating the highband behaviour of the analog system which gives rise to the speech signal. The extended analog envelope can then be used as a reference for codebook mapping, as seen in Fig. 5.8 of section 5.2.4. Results of objective tests performed on the analog envelope extension reference envelope technique follow in section 5.3.1.

5.3 Objective assessment of gain estimation

5.3.1 Comparison of gain estimation techniques

For this work, around half an hour (83418 frames of 20ms duration) of 8 kHz bandlimited speech uttered by many speakers, male and female, was collated from


TIMIT as per section 4.4.1. A test speech database of 96 s length (4800 frames) was also formed from TIMIT as described in section 4.4.1. The results of the application of the highband spectral distortion criterion (5-1) to the methods described in sections 5.1 and 5.2 are summarized in Table 5.1. All methods, except statistical recovery and straight line extrapolation, employed codebooks of size 1024. Voicing estimation for the voicing controlled highband straight line method was achieved using the method described in [ATA76].

Table 5.1. Highband spectral distortion comparison of narrowband-highband envelope matching methods

  Description                                               Section   D_H (dB)
  Range matching                                            5.1.1     6.66
  Range matching with correction factor                     5.1.2     6.45
  Statistical recovery, N = 256, M = 256                    5.1.3     11.34
  Straight line extrapolation                               5.1.4     15.83
  Point matching *                                          5.2.3     6.76
  Weighted range matching *                                 5.2.3     6.40
  Reference envelopes: voicing controlled straight line *   5.2.5     6.37
  Reference envelopes: analog envelope extension *          5.2.6     9.90

  * New methods proposed in this thesis.

These results show that at present there is no method which can achieve a small spectral distortion when the highband gain is taken into account. Several existing and new methods can achieve an average spectral distortion of around 6.5 dB, suggesting that this may be close to the best performance likely to be achieved without larger codebooks. It is interesting to note that in this comparison, the computationally efficient voicing controlled straight line reference envelope method produced the smallest spectral distortion. This suggests that the majority of the narrowband information which can be exploited to estimate the highband gain is related to the voicing. Research reported in [NIL00] shows that there is a lower bound of 0.45 bits of mutual information between the narrowband spectral envelope shape and the highband gain. Since spectral envelope shape is often used as a feature in the voicing estimation for the voicing controlled highband straight line, these results tend to concur.

Whilst the introduction of linear interpolation or splicing into highband gain estimation schemes was found to have very little effect on D_H, informal listening tests revealed a slight improvement in the perceptual quality of the output speech.

The overall best result was obtained by combining sub-frame codebook mapping of log envelope samples (codebook size 1024) with range matching, which produced a ‘gain-shape’ highband spectral distortion D_H of 6.17 dB.

5.4 Assessment of highband envelope and gain estimation

5.4.1 Objective assessment of limits to highband envelope and gain estimation

In order to put the large values of spectral distortion obtained in section 5.3.1 in context, it is useful to estimate the expected best possible performance from a realistic highband envelope estimation scheme that employs gain matching for narrowband-highband envelope continuity.

Suppose two narrowband-highband spectral envelope pairs are selected at random from independent speech databases. The highband envelope from one pair is then matched to the narrowband envelope from the other using range matching. Their narrowband spectral distortion is calculated and their highband spectral distortion DH is also computed according to (5-1). If this procedure is then repeated for all combinations of codebook and database envelope pairs, then the resulting data can be used to gain an idea of the distribution of the ‘gain-shape’ highband spectral distortion results which could be expected from any codebook mapping scheme with a given maximum narrowband spectral distortion.

The 10%, 50% and 90% contours of the highband spectral distortion distribution from a large number of narrowband-highband envelope pairs are plotted against the maximum tolerated narrowband spectral distortion in Fig. 5.11.

Figure 5.11. Distribution of ‘gain-shape’ highband spectral distortion D_H (dB) as a function of tolerated (maximum) narrowband spectral distortion (dB). Results are based on around 5×10^6 data points. The percentile contours shown are the 10%, 50% (median) and 90%.

Good performance (median highband spectral distortion of around 2 dB) with codebook mapping schemes is therefore possible only if the narrowband spectral distortion can be contained to less than 1 dB. This would require codebooks consisting of roughly 2^30 vectors, a size which is not feasible under present storage constraints. As an example, sub-frame codebook mapping with range matching achieves a highband spectral distortion D_H of 6.41 dB for a codebook size of 256. The narrowband codebook of size 256 used in this technique produces a mean narrowband spectral distortion of 2.6 dB, and a maximum narrowband spectral distortion of about 5.5 dB, which yields an expected median highband spectral distortion D_H of approximately 6.7 dB.

Thus the better techniques for narrowband-highband envelope matching are roughly in keeping with the median expected ‘gain-shape’ spectral distortion. It is concluded that the present performance of highband envelope estimation methods is only likely to be substantially improved, compared to existing methods, by allowing some knowledge of the true highband speech, rather than relying entirely on narrowband information. This is the subject of the following chapter.


5.4.2 Subjective assessment

In order to assess the performance of highband envelope combined with gain estimation relative to narrowband and true wideband speech, several conditions containing combinations of true and synthetic highband envelopes/gains and excitation were prepared for subjective assessment. 18 listeners (16 male and 12 female, between the ages of 20 and 35) were presented with samples of speech prepared according to each condition and asked to rate them on a five-point quality scale from ‘bad’ to ‘excellent’ (see Appendix A for details).

The synthetic highband envelopes were estimated using codebook mapping with interpolation. Codebooks of size 1024 were employed, and the 5 nearest neighbours were averaged to form the highband envelope shape. The synthetic highband gain was calculated using range matching with splicing. The resulting mean opinion scores (MOS) for each condition are shown in Table 5.2.

Table 5.2. ACR listening test results

  Condition                                                    MOS    95% CI
  True wideband speech                                         4.25   ±0.13
  Synthetic highband envelope/gain, true highband excitation   3.65   ±0.15
  True highband envelope/gain, synthetic highband excitation   3.31   ±0.15
  Synthetic highband excitation, envelope and gain             2.78   ±0.15
  True narrowband speech                                       2.74   ±0.17

When synthetic envelopes and gains were employed in conjunction with the true highband excitation, a surprisingly high MOS was obtained, considering the average highband spectral distortion of around 6.5 dB. A possible reason for this is that parts of the highband speech may be masked [ZWI90, MOO97] by components of the narrowband speech. While this condition did not score as well as the true wideband speech, it scored much higher than the narrowband speech.

When synthetic envelopes and gains were combined with synthetic excitation (as in chapter 2), listeners found little improvement over the narrowband speech. Indeed, the confidence intervals for these two scores overlap, indicating that they cannot be reliably separated using the available statistics. Informal listening tests, on the other hand, found


that listeners almost always preferred the wideband enhanced speech to the narrowband speech.

These results seem to indicate that listeners are slightly less sensitive to errors in envelope and gain estimation than they are to artefacts in the highband excitation synthesis. This appears to concur with findings that, in speech coding, the perceptual weighting at high frequencies should be associated with the temporal waveform of the speech signal [MA94].

5.5 Conclusion

In this chapter, existing methods for highband gain estimation were reviewed in section 5.1, and in section 5.2 new techniques were proposed which attempt to match the estimated highband envelope to the narrowband envelope in order to preserve continuity in the resulting wideband envelope. Most of these new techniques produced a similar performance or a slight improvement over existing methods in the objective comparison of section 5.3.

Highband gain estimation is a difficult problem which has been shown in sections 5.3 and 5.4 to have no entirely satisfactory solution at present. Perceptual artefacts arising from poor estimates of the highband gain can be mitigated to some extent by ensuring that the estimated highband envelope is continuous with the narrowband envelope. Despite the shortcomings of highband gain estimation, wideband speech comprising the true highband excitation, together with synthetic highband envelopes and gain, was rated highly compared with narrowband speech in subjective tests. This suggests that in applications such as wideband speech coding, a high degree of accuracy in modelling the highband portion of the spectral envelope may not be required.

Ch 6 / Very Low Bit Rate Wideband Speech Coding 108

Chapter 6. Very Low Bit Rate Wideband Speech Coding

6.0 Overview Results from objective tests in chapter 5 show that highband spectral envelope and gain estimation are a major source of distortion in the synthesized highband speech. To this point estimation of highband envelopes and gain has been based purely on narrowband information. Relaxing this constraint to allow coding of highband speech produces an interesting application of wideband enhancement to wideband speech coding.

In this chapter, previous work on low bit rate wideband coding is examined, a new wideband enhancement-based wideband coder is presented, and the performance of this coder is evaluated relative to selected existing schemes.

Sections 6.1 and 6.2 review selected previous research on wideband speech coding, with an emphasis on how highband information is coded at very low bit rates. A new wideband coder is presented in section 6.3 which combines any narrowband coder, wideband enhancement, and a few bits per frame of highband information. This new scheme is objectively compared with existing wideband spectral coding techniques in section 6.4, and in section 6.5 results from subjective assessment of the new coder are discussed.

6.1 Existing schemes for wideband excitation coding

To date, many wideband coding schemes employ waveform coding, which is better suited to high quality excitation quantization than parametric coding. High quality wideband speech coding can be obtained at relatively high bit rates (48-64 kbit/s) using the sub-band ADPCM model of the ITU-T G.722 codec [MAI88]. Transform coded


excitation (TCX) [ADO95] and CELP offer lower bit rate approaches capable of achieving subjective test scores close to those of G.722 at 32 kbit/s [ORD91], 16 kbit/s [LAF91, ROY91, FUL92, MCE95, PAU96a, UBA97, NOM98], 13 kbit/s [PAU96b, SCHN98], 6.4 - 14.8 kbit/s [PAU95], and reasonable quality can be obtained at bit rates down to around 7.2 kbit/s [MCE93]. Highband excitation coding using CELP requires a small but not very low bit rate to achieve high quality, for example 2.4 kbit/s or 15% of the total bit stream in one multi-band CELP implementation [UBA97]. Wideband coding has also recently been reported at 8 kbps using MELP excitation coding [LIN00].

More recent wideband coders [AGU00, KOI00, MCCR00, TAO00], of the type proposed in section 6.3, employ existing narrowband coders in conjunction with parametric highband models, resulting in bit rates in the range of 9.6 to 16 kbit/s. In terms of excitation, substantial bit rate savings can be obtained using parametric coding methods such as sinusoidal transform coding. Wideband coding using a sinusoidal coder has been reported at below 8 kbit/s [ERI98], with the potential to achieve rates below 5 kbit/s, although no indication of the resulting speech quality was given. As discussed in chapter 2, sinusoidal synthesis in particular is highly suited to excitation synthesis up to bandwidths wider than narrowband where little or no information regarding the highband excitation is transmitted.

6.2 Existing techniques for wideband spectral coding

Spectral coding is a major issue for very low bit rate narrowband and wideband coders, since a large proportion of the encoded bit stream is allocated to representing the short-term spectral envelope. One report of a low bit rate wideband coder implementation [ERI98] estimates this proportion to be as high as 80% of the total wideband bit stream.

6.2.1 Wideband VQ

Perhaps the most obvious means of achieving wideband spectral coding at a very low bit rate is to vector quantize the entire wideband spectral envelope [ERI98]. Wideband VQ is attractive because its design explicitly attempts to minimize the average wideband spectral distortion. Depending on memory and computational constraints, the size of the codebook is determined directly by the number of bits available for spectral envelope coding.

One disadvantage of wideband VQ is the high dimension usually required (at least 14), making codebook searches computationally intensive. This can be addressed at a small cost to the quantization performance by the use of split VQ or multistage VQ [GER93, ERI98]. Another disadvantage of wideband VQ, where an unweighted distance measure is used, is that the spectral distortion criterion is minimized uniformly over the entire wideband. In some applications it may be desirable to place more emphasis on minimizing the narrowband spectral distortion. Motivation for doing so arises from properties of the human ear such as its decreasing frequency resolution at higher frequencies [PLA90, ZWI90] and its masking of high frequency components by large magnitude lower frequency components [MOO97].

6.2.2 Split band VQ
Wideband speech is often split into sub-bands (e.g. 0-4 and 4-8 kHz) before quantization occurs, and there are several reasons for doing this. Quantization distortion in the higher sub-band has been found to have smaller perceptual importance than distortion in the lower band, and hence many wideband coders [e.g. MAI88, DRO89, ROY91, MCE93, COM99, AGU00, MCCR00, TAO00] allocate between 75 and 90 percent of their bit stream to coding the lower sub-band. Where spectral coding is employed, splitting the wideband signal into sub-bands allows lower order envelope models to be employed for each band, reducing the overall dimension of the envelope parameterization while maintaining good modelling accuracy. In one wideband codec [DRO89], linear prediction orders of 10 and 4 were employed for the narrowband and highband respectively.
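As a concrete illustration of the lower order modelling that the sub-band split permits, a minimal autocorrelation-method LP analysis (Levinson-Durbin recursion) is sketched below. The function name and interface are illustrative only, and the sub-band signals are assumed to have already been filtered and decimated:

```python
import numpy as np

def lp_coefficients(frame, order):
    """Autocorrelation-method LP analysis via the Levinson-Durbin recursion.

    Returns the prediction polynomial a = [1, a_1, ..., a_order] and the
    final prediction error energy.
    """
    n = len(frame)
    # Biased autocorrelation estimates r[0..order]
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                  # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

# Per [DRO89]: order 10 for the 0-4 kHz band, order 4 for the 4-8 kHz band, e.g.
# a_nb, _ = lp_coefficients(narrowband_frame, 10)
# a_hb, _ = lp_coefficients(highband_frame, 4)
```

Separate analyses of this kind keep the per-band model orders, and hence the quantizer dimensions, small.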

Vector quantization of the sub-band spectral envelopes, or split VQ, thus has the advantage over wideband VQ of lower order and more flexible envelope modelling, resulting in less computationally intensive codebook searches. A drawback of split VQ is that separate gains must be coded for the narrowband and highband envelopes, so it has a higher bit rate than wideband VQ, which requires only one gain. The highband gain may be coded somewhat more efficiently by quantizing its log difference from the narrowband gain. Although many existing techniques for very low bit rate wideband spectral coding (cf. sections 6.2.3 through 6.2.5) do employ a sub-band approach, no examples of the use of wideband split VQ were found in this literature survey.

Analysis of wideband split VQ for a given narrowband codebook size can provide insight into the optimum bit allocation between the highband envelope and gain. An experiment was conducted in this thesis to calculate the wideband spectral distortion resulting from the optimum bit allocation between highband envelope and gain, for a narrowband codebook size of 1024 and an unquantized narrowband gain. The results of this experiment, shown in Fig. 6.1, reveal that allocating the first four highband spectral coding bits to the highband gain is by far the most effective scheme, while additional bits should be allocated primarily to the highband envelope.

[Plot omitted: wideband spectral distortion (dB), approximately 2.5-6 dB, against the total number of highband bits, 1-12.]

Figure 6.1. Wideband spectral distortion graphed against the number of bits allocated to highband spectral coding, using split VQ with a narrowband codebook of size 1024. The optimum bit allocations were:

Total highband bits:     1  2  3  4  5  6  7  8  9  10  11  12
Highband envelope bits:  0  0  0  0  1  2  3  4  5   6   6   7
Highband gain bits:      1  2  3  4  4  4  4  4  4   4   5   5

6.2.3 Highband low order LP

One efficient highband spectral envelope coding method is to quantize the narrowband and highband spectral envelopes separately, using a lower analysis order for the highband envelope. A low analysis order at higher frequencies is easy to justify perceptually, since the ear has poorer frequency resolution at those frequencies. By employing a quantized second order LP envelope and gain to represent the highband envelope, wideband spectral coding was achieved at a bit rate of 500 bit/s above that of narrowband speech [MCE93, SEY97]. A sixth order highband LP analysis was applied every 10 ms in the wideband codec of [TAO00], and this was coded along with a highband gain (every 2.5 ms) to obtain a highband bit rate of 2.3 kbit/s. Subjective tests reported in [TAO00] indicate that their 16 kbit/s overall scheme compared well with the 56 kbit/s G.722 codec, despite the fact that their highband excitation consisted purely of random noise.

6.2.4 Fixed highband envelope and log gain quantization

The highband spectral envelope can take on a fixed shape, for example in the case of [PAU96a], where the 6-7 kHz highband has a flat spectrum whose gain is adjusted according to the short-term energy in that band. In this example, the highband gain is encoded for each 2.5 ms sub-frame using logarithmic quantization, resulting in a 1.2 kbit/s increase in the bit rate over the narrowband coding rate.

6.2.5 Codebook mapping with differential log gain quantization
As discussed in chapters 4 and 5, codebook mapping can provide a reasonable approximation to the highband spectral envelope shape, but fails to estimate the highband gain with much accuracy. The transmission of information concerning the highband gain allows this shortcoming to be remedied, and this was first proposed in Appendix A of [ENB98]. In this instance, the short-term energy of the highband $E_y$ was calculated in terms of the short-term energy of the narrowband $E_x$ as shown in (6-1),

$E_y = 2^n E_x$,    (6-1)

where $n$ is an integer in the range $-2 \le n \le 13$ for a 4 bit gain quantizer. A similar differential gain quantization scheme was employed in [MCCR00], where once again 4 bits per frame were used for this purpose. The performance of this method, along with that of other methods from this section, is objectively evaluated in section 6.4.
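A minimal sketch of a 4 bit quantizer of the form (6-1) is given below; the function names and interface are mine, and only the integer exponent would actually be transmitted:

```python
import math

def encode_highband_energy(e_hb, e_nb):
    """Quantize the highband energy as E_y = 2^n * E_x per (6-1),
    with the integer n restricted to [-2, 13] (a 4 bit index)."""
    n = round(math.log2(e_hb / e_nb))
    return max(-2, min(13, n))

def decode_highband_energy(n, e_nb):
    """Reconstruct the highband energy from the transmitted exponent."""
    return (2.0 ** n) * e_nb
```

The one-octave step size of this quantizer is coarse, which motivates the trained differential gain codebook discussed in section 6.3.4.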


6.3 A new wideband speech coder with a narrowband bit rate

6.3.1 Overview
The proposed structure of the new wideband speech coder is based around a sub-band structure, where the lower band is coded using any existing narrowband coder, as seen in Fig. 6.2. This basic structure is similar to those independently suggested by other researchers for low bit rate wideband coding [NOM98, AGU00, KOI00, MCCR00, TAO00], and is also similar to that used in the G.722 standard [MAI88].

[Block diagram omitted: the wideband encoder comprises a highband encoder and a narrowband encoder producing highband and narrowband bit streams; the wideband decoder comprises the corresponding decoders.]

Figure 6.2. Wideband encoder and decoder with embedded narrowband codec (dashed box).

In the novel approach of this thesis, the higher band is coded using a combination of wideband enhancement and a small amount of highband envelope and gain information, as seen in Fig. 6.3. This configuration is well suited to coders which must primarily satisfy a set of narrowband performance criteria, but which can accommodate a few bits per frame of highband information. Bit allocation between the narrowband and highband in this codec is flexible, due to its sub-band structure.


[Block diagram omitted: the encoder performs narrowband and highband analysis, narrowband coding, narrowband-highband mapping, and highband envelope and gain coding; the decoder performs the corresponding decoding, mapping and synthesis steps.]

Figure 6.3. New wideband encoder and decoder based on the wideband extension paradigm.

6.3.2 Highband excitation synthesis

The highband excitation signal is generated according to the STC-based approach of section 2.4. As discussed in chapter 2, a parametric model for speech excitation gives rise to a sinusoidal excitation synthesis method which does not require any coded highband information, an attractive option for very low bit rate wideband coding. If a sinusoidal narrowband coder were chosen for the lower sub-band, excitation could be computed very efficiently and consistently across the entire wideband.

If CELP-style narrowband coding were used, the pitch and voicing information could be input directly into the highband sinusoidal synthesis. As an alternative to sinusoidal excitation, narrowband information could be used to estimate highband CELP-style excitation, producing a sub-band wideband CELP scheme similar to those mentioned in section 6.1. If waveform coding were employed in the narrowband, a new module would need to be added to the decoder of Fig. 6.3 to calculate the pitch and degree of voicing from the narrowband signal before highband synthesis, substantially increasing the computation required by the overall coder.

6.3.3 Highband spectral coding

The allocation of a few bits per frame to highband spectral envelope and/or gain coding allows substantial improvements to be made to highband envelope and gain estimation techniques. Codebook mapping relies on classifying narrowband envelope shapes in order to estimate highband envelope shapes, but the classification performance is limited by the ‘one-to-many’ relationship between narrowband and highband envelopes (cf. section 4.1.1). The addition of highband bits allows decisions to be made between many highband envelope shapes for one given narrowband envelope shape.

A new method for highband spectral coding is to vector quantize the highband envelope and gain using a small partition of a full highband codebook, where the selection of vectors comprising the partition is based upon the shape of the narrowband envelope. The narrowband code vector most similar to the input narrowband spectral envelope is first selected. Each narrowband code vector contains 2^R indices to the highband codebook, where R is the number of highband bits employed. The input highband envelope and highband gain are compared with all 2^R vectors in the highband codebook whose indices are contained in the selected narrowband code vector, and the most similar highband code vector is chosen. The index of the chosen highband code vector is then coded using R bits. This classified codebook mapping scheme, illustrated in Fig. 6.4, is similar to the reduced storage codebook mapping technique of section 4.1.10, except that a narrowband envelope may map to more than one highband envelope.

[Block diagram omitted: the narrowband envelope selects the most similar narrowband code vector; the highband envelope and gain are then compared with the highband code vectors indexed by that narrowband code vector, and the index of the most similar is transmitted.]

Figure 6.4. Block diagram of a classified codebook mapping-based highband envelope and gain encoder using R = 1 highband bit per frame.
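The selection procedure described above can be sketched as follows. The interface is hypothetical, and squared Euclidean distance stands in for the spectral distortion measure actually used:

```python
import numpy as np

def classified_cbm_encode(nb_env, hb_vec, nb_codebook, hb_codebook, hb_index_table):
    """Encode a highband envelope+gain vector using classified codebook mapping.

    nb_codebook    : (N_nb, d_nb) narrowband envelope code vectors
    hb_codebook    : (N_hb, d_hb) highband envelope+gain code vectors
    hb_index_table : (N_nb, 2**R) indices into hb_codebook stored alongside
                     each narrowband code vector
    Returns the R-bit index within the selected partition; the decoder
    recovers hb_codebook[hb_index_table[i][j]].
    """
    # 1. Classify the input narrowband envelope.
    i = np.argmin(np.sum((nb_codebook - nb_env) ** 2, axis=1))
    # 2. Search only the 2^R highband vectors allowed for this class.
    allowed = hb_index_table[i]
    j = np.argmin(np.sum((hb_codebook[allowed] - hb_vec) ** 2, axis=1))
    return j
```

Only the R-bit index j is transmitted; the decoder repeats step 1 on its copy of the (quantized) narrowband envelope to recover the same partition.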

This scheme can also be interpreted as a form of split band vector quantization in which the choice of highband envelope vector is constrained by the shape of the narrowband envelope. Classified codebook mapping shares the advantages of smaller vector dimensions and smaller codebook storage requirements with split band VQ. It also takes advantage of the inter-sub-band dependency (mutual information) within the overall wideband envelope, and has a narrowband-highband matching gain built in to the highband vectors.

The training procedure for this method begins with a trained narrowband and a trained highband codebook. The narrowband and highband codebooks are trained independently in this instance, since the mapping is performed via an index to the highband codebook rather than one-to-one mapping between codebooks. If gains are included in the highband codebook, then the spectral distortion measure used in the highband codebook training accounts for this gain also.

The indices to the highband codebook for the ith narrowband code vector are determined from a large training set of narrowband and highband envelopes (and highband gains) by forming a partition in the highband training data from those highband envelopes (and gains) whose corresponding narrowband envelopes are closest in shape to the ith narrowband code vector. The 2^R highband code vectors which yield the smallest highband spectral distortions after quantizing the same partition of highband training data are then selected, and their indices are stored alongside the ith narrowband code vector.
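Under hedged assumptions (Euclidean distance in place of the spectral distortion measure, and code vectors ranked individually rather than selected jointly), the training step might be sketched as:

```python
import numpy as np

def train_hb_index_table(nb_codebook, hb_codebook, nb_train, hb_train, R):
    """For each narrowband code vector, store the indices of the 2^R highband
    code vectors giving the lowest total distortion over the partition of
    highband training vectors whose narrowband counterparts fall in its cell."""
    # Nearest narrowband code vector for every training frame
    d_nb = ((nb_train[:, None, :] - nb_codebook[None, :, :]) ** 2).sum(-1)
    classes = d_nb.argmin(axis=1)
    table = np.zeros((len(nb_codebook), 2 ** R), dtype=int)
    for i in range(len(nb_codebook)):
        part = hb_train[classes == i]
        if len(part) == 0:
            part = hb_train  # empty cell: fall back to the whole training set
        # Total distortion incurred by each highband code vector over the cell
        totals = ((part[:, None, :] - hb_codebook[None, :, :]) ** 2).sum(-1).sum(0)
        table[i] = np.argsort(totals)[:2 ** R]
    return table
```

The resulting table is stored with the narrowband codebook at both the encoder and the decoder.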

The classified codebook mapping scheme depicted in Fig. 6.4 combines the functions implicit in the ‘envelope mapping’ and ‘highband envelope and gain coding’ blocks in Fig. 6.3, and requires the storage of the narrowband and highband codebooks in both the coder and the decoder. The narrowband codebook could, alternatively, be integrated with the narrowband coder if the narrowband envelope is quantized, since an n-bit envelope quantization of any kind is equivalent to an envelope codebook of size 2n. Integration of this kind would save computation and possibly reduce the storage required in both the encoder and the decoder.

6.3.4 Highband gain considerations

The spectral coding technique of section 6.3.3 allows the highband gain to be estimated jointly with the highband spectral envelope; however, this is only one of several very low bit rate design options. If the highband gain is to be coded separately from the highband envelope, then the gain needs to be quantized as efficiently as possible. While the highband gain coding schemes described in sections 6.2.4 and 6.2.5 both rely on a logarithmic highband gain quantizer, this is not, in general, the optimal quantizer (see for example [LLO82]). By accumulating the differences

$G_{HB} - G_{NB}$,    (6-2)

between the highband and narrowband gains $G_{HB}$ and $G_{NB}$ (in dB), and using these to train a codebook of differential gains, a more efficient quantizer can be designed. As seen in Fig. 6.5 of section 6.4.2, use of a differential gain codebook results in a substantial improvement in spectral distortion performance over the logarithmic quantizer of section 6.2.5.
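One simple way to design such a differential gain codebook is the Lloyd (k-means) algorithm applied to the accumulated dB differences. The sketch below assumes a one-dimensional codebook with quantile initialization:

```python
import numpy as np

def train_gain_codebook(gain_diffs_db, bits, iters=50):
    """Train a 2^bits-level scalar codebook for highband-minus-narrowband
    gain differences (in dB) using the Lloyd (k-means) algorithm."""
    levels = 2 ** bits
    # Initialize code values at the data quantiles
    cb = np.quantile(gain_diffs_db, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        # Nearest-neighbour partition of the training data
        idx = np.abs(gain_diffs_db[:, None] - cb[None, :]).argmin(axis=1)
        for j in range(levels):
            cell = gain_diffs_db[idx == j]
            if len(cell):
                cb[j] = cell.mean()  # centroid update
    return np.sort(cb)
```

Unlike a fixed logarithmic quantizer, the trained code values concentrate where the gain differences actually occur in the training data.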

6.3.5 Implementation considerations
The complexity of the proposed wideband coder is only slightly greater than that of a narrowband parametric coder, provided the narrowband and highband codebooks are not too large. As mentioned in section 6.3.2, if this wideband coder employed a waveform coder in the narrowband, it may require considerably more computation than a narrowband waveform coder alone.

The storage required for this wideband coder is a potential implementation problem, particularly if a large bit rate is required for the narrowband and highband envelopes. The possibility of integrating the narrowband codebook with the narrowband envelope coding, discussed in section 6.3.3, offers avenues for reducing the storage requirement. The size of the highband codebook is very flexible, and could be adjusted to provide a suitable compromise between spectral distortion and memory usage.

The proposed wideband coder offers many possibilities for variable bit rate implementation. One example is variation of the number of bits used to code the highband spectral envelope based upon different classes of speech frame being transmitted. Sub-band entropy [MCC97] has been shown to provide effective classification of this kind for narrowband coding.


A bandwidth-scalable scheme could easily be developed from the proposed model, whereby highband codebooks representing successively higher frequency sub-bands could be added to the narrowband envelope coding scheme as the bit rate is increased. The STC-based highband excitation scheme of section 2.4 allows the excitation bandwidth to be varied from one frame to the next as required. Bandwidth-scalable wideband coding schemes have very recently attracted considerable research interest, for example [NOM98, AGU00, KOI00].

6.4 Objective assessment

6.4.1 Comparison criteria and methods

Codebook mapping with log gain quantization (cf. section 6.2.5), codebook mapping with a gain codebook (cf. section 6.3.4), split band VQ (cf. section 6.2.2) and classified codebook mapping (cf. section 6.3.3) schemes were designed from a database of 83418 frames of wideband speech containing many different speakers. In all cases a narrowband codebook of size 1024 was employed. The bit allocation used in split VQ was the optimum allocation identified in section 6.2.2. These schemes were used to code a database of around 1000 frames of speech whose source was different from that of the training data.

The coded speech was then compared with the uncoded speech, and the wideband spectral distortion was calculated as

$D_W = \frac{1}{K} \sum_{k=1}^{K} \sqrt{ \frac{2}{\omega_s} \int_0^{\omega_s/2} \left[ 20 \log_{10} \left( \frac{G_W A_k(\omega)}{\tilde{A}_k(\omega)} \right) \right]^2 d\omega }$,    (6-3)

where

$G_W = \frac{1}{\omega_m} \int_0^{\omega_m} \frac{\tilde{A}_k(\omega)}{A_k(\omega)} \, d\omega$,    (6-4)

$\tilde{A}_k(\omega)$ and $A_k(\omega)$ are the true and coded envelopes of the $k$th (temporally aligned) frames of wideband speech respectively, and $\omega_s$ is the wideband sampling frequency (16 kHz). The value of $\omega_m$ was taken as $0.25\omega_s$, since all coding schemes compared used a narrowband (0-4 kHz) gain.
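A discrete approximation of (6-3)-(6-4), assuming the true and coded envelopes are sampled on a uniform grid over 0 to $\omega_s/2$ and that $\omega_m$ corresponds to the lower half of that grid, might be computed as:

```python
import numpy as np

def wideband_spectral_distortion(true_env, coded_env):
    """Average wideband spectral distortion D_W in dB per (6-3)-(6-4).

    true_env, coded_env : (K, M) arrays of positive magnitude envelopes,
    sampled uniformly over 0 .. omega_s/2; the first M/2 bins span 0-4 kHz.
    """
    K, M = true_env.shape
    m = M // 2                          # omega_m = 0.25 * omega_s
    total = 0.0
    for k in range(K):
        g_w = np.mean(true_env[k, :m] / coded_env[k, :m])   # gain term (6-4)
        err_db = 20.0 * np.log10(g_w * coded_env[k] / true_env[k])
        total += np.sqrt(np.mean(err_db ** 2))              # per-frame RMS, (6-3)
    return total / K
```

Note that the gain normalization $G_W$ makes the measure insensitive to an overall level mismatch in the narrowband, so only envelope shape errors and highband gain errors contribute.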

6.4.2 Results
According to the wideband spectral distortion criterion DW, classified codebook mapping is clearly the most effective method for wideband spectral coding using one highband bit per frame, and marginally so at higher bit rates also (see Fig. 6.5). Split VQ performs well at 3 bits per frame, presumably due to its emphasis on the highband gain. The use of a gain codebook produced a substantial improvement over log gain quantization, although both methods suffer at higher bit rates due to their allocation of bits solely to the highband gain.

[Plot omitted: wideband spectral distortion (dB), approximately 3-9 dB, against the number of highband bits, 0-7.]

Figure 6.5. Wideband spectral distortion vs. number of highband bits for codebook mapping with log gain quantization, codebook mapping with gain codebook, split VQ and classified codebook mapping, with 0-bit codebook mapping shown for reference.


6.5 Subjective assessment

In order to assess the performance of the wideband coded speech relative to narrowband and true wideband speech, 18 listeners (16 male and 12 female, between the ages of 20 and 35) were presented with samples of speech prepared according to each condition and asked to rate them on an absolute category rating (ACR) five-point quality scale from ‘bad’ to ‘excellent’ (see Appendix A for details). Conditions comprised wideband coded speech along with uncoded narrowband speech and various combinations of true, synthetic and coded highband excitation, envelopes and gain. The resulting mean opinion scores (MOS) for each condition are shown in Table 6.1.

Table 6.1. ACR test results for 3 bit highband coded wideband speech

Condition                                                              MOS    95% CI
True wideband speech                                                   4.25   ±0.13
3 bit coded highband envelope and gain, true highband excitation       3.99   ±0.15
Synthetic highband envelope/gain, true highband excitation             3.65   ±0.15
True highband envelope/gain, synthetic highband excitation             3.31   ±0.15
3 bit coded highband envelope and gain, synthetic highband excitation  2.83   ±0.15
Synthetic highband excitation, envelope and gain                       2.78   ±0.15
True narrowband speech                                                 2.74   ±0.17

The 3 bit per frame coding of the highband envelope and gain was rated nearly as highly as the true wideband speech when the true highband excitation was employed. The improvement of the 3 bit coded speech with true excitation over the synthetic highband envelope and gain with true excitation was perhaps not as large as the corresponding improvement in spectral distortion would suggest. Once again, the introduction of the synthetic highband excitation had a substantial detrimental effect on listeners’ perceptions. Indeed, the improvement of the 3 bit coded speech with synthetic highband excitation over the narrowband speech was not large enough to distinguish the two with the available test results.
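For reference, mean opinion scores and 95% confidence intervals of the kind reported in Table 6.1 can be computed from raw ACR ratings using the standard normal approximation (a generic formula, not specific to this thesis's test procedure):

```python
import numpy as np

def mos_with_ci(ratings):
    """Mean opinion score and 95% confidence half-width from 1-5 ACR ratings,
    using the normal approximation 1.96 * s / sqrt(n)."""
    r = np.asarray(ratings, dtype=float)
    mos = r.mean()
    half_width = 1.96 * r.std(ddof=1) / np.sqrt(len(r))
    return mos, half_width
```

Overlapping confidence intervals, as between the bottom three conditions in Table 6.1, indicate that those conditions cannot be reliably distinguished with the available test results.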


6.6 Conclusion

A new scheme for wideband speech coding at a bit rate slightly greater than that of narrowband coding has been presented. This scheme is a sub-band coder whose lower band is coded using any existing narrowband codec, and whose upper band is based upon wideband enhancement techniques. The quality of the upper band speech is significantly better than that obtained using wideband enhancement alone, due to the transmission of a small amount of highband spectral envelope and gain information. Listening tests show that a significant quality improvement over narrowband speech can be efficiently achieved using this scheme, although more attention to the quality of the highband excitation is required.

Wideband speech coders of the type proposed in chapter 6 show a great deal of potential for achieving higher speech quality at low bit rates, as evidenced by the number of schemes for wideband coding based on existing narrowband coders independently proposed in very recent literature [AGU00, KOI00, JAX00, MCCR00, TAO00, VALI00].

Ch 7 / Conclusion 122

Chapter 7. Conclusion

7.1 Wideband enhancement and coding

The growing need for higher quality speech communication at lower bit rates is resulting in a gradual increase in research attention towards wider bandwidth speech coding. In the future, speech bandwidths in the range 5-15 kHz will be employed in various sub-classes of applications [JAY92], including mobile, satellite and internet telephony, high quality tele- and video-conferencing, voicemail and media broadcasts. A good deal of flexibility will be required of the bit rates and bandwidths at which these services are delivered (refer to such recent work as [AGU00, KOI00]). Wideband enhancement offers a means of upgrading narrowband speech services to a wider bandwidth either as an add-on to an existing narrowband network or as a very low bit rate wideband coding solution.

Chapter 2 of this thesis has reviewed a number of existing methods for generating highband excitation using only narrowband information, and has proposed a new method for mixed excitation synthesis based on sinusoidal transform coding (STC) techniques. This method is capable of accurately modelling any mix of voiced and unvoiced highband excitation, producing perfectly harmonic voiced excitation and achieving a flat spectrum before spectral shaping. It provides a new model for highband parametric representation of speech which can be employed either as a component of a wideband enhancement scheme, or as a method for efficiently coding speech excitation at higher frequencies.

In comparison with other prevalent highband excitation synthesis techniques, the STC-based method was shown to have a number of advantages, such as harmonic highband sinusoids, accurate highband voicing control and smooth phase transitions at frame boundaries. Informal listening tests revealed that the novel mixed voiced-unvoiced highband excitation model resulted in higher quality speech than harmonic-only or random-only excitation.


In subjective listening tests, narrowband speech enhanced by STC-based highband excitation with true highband envelopes was clearly preferred to the input narrowband speech. During some informal listening tests, subjects had difficulty differentiating narrowband speech enhanced by STC-based highband excitation from the true wideband speech.

A survey of existing literature concerning highband envelope estimation was reported in chapter 3, revealing three classes of techniques: codebook mapping, statistical recovery and linear mapping. In chapter 4, codebook mapping was analyzed in greater detail in order to optimize the design of the codebooks. Experiments showed that, for a highband spectral envelope estimation application, the amount of training data required to effectively train a pair of codebooks for codebook mapping is on the order of 250 times the desired codebook size.

Three new methods for highband envelope estimation based upon codebook mapping were presented in chapter 4. Narrowband voicing information was used to refine the codebook mapping scheme by splitting the mapping into voiced and unvoiced cases. Temporal evolution of the narrowband envelope was exploited in a sub-frame codebook mapping approach which improved the accuracy of highband envelope estimates. Interpolation between highband envelope vectors was applied to highband envelope estimation with improvements over codebook mapping. New developments to statistical recovery and linear techniques for highband envelope estimation were also reported.

In an extensive highband spectral distortion comparison, the new methods described in chapter 4 produced smaller distortions than existing techniques. Estimates were made of the theoretical optimum performance likely to be obtained from highband envelope estimation using codebook mapping. These estimates show that the new methods for highband envelope estimation produce approximately the optimum results likely to be achieved under the present implementation constraints.

In chapter 5, existing methods for combining the estimated highband spectral envelope with the narrowband spectral envelope were assessed using a new highband spectral distortion criterion which accounts for highband gain. The shortcomings of these methods were discussed, and new techniques were proposed which ensure continuity in the resulting wideband envelope. The new techniques made small improvements on existing methods in terms of spectral distortion. Estimates of the theoretical optimum performance likely to be achieved using highband envelope estimation followed by envelope matching were made. These showed that little improvement is likely to be made over the existing and new methods for narrowband-highband envelope matching.

Subjective tests in chapter 5 showed that narrowband speech which had been enhanced using the new highband envelope estimation and matching techniques in conjunction with the true highband excitation was strongly preferred to the input narrowband speech. The tests also revealed that subjects were surprisingly tolerant of perceptual artefacts resulting from highband envelope and gain inaccuracies. On the other hand, the results suggest that artefacts originating from the synthetic highband excitation tend to be exacerbated by inaccuracies in the synthetic highband envelope or gain.

In chapter 6, a new wideband coder was presented that combines any narrowband coder with wideband enhancement and one or more bits per frame of highband spectral envelope and gain information. The novel inclusion of parametric highband excitation in this wideband coder allows the entire excitation to be synthesized based upon narrowband parameters such as pitch and voicing, without transmitting any extra information. A new technique for highband spectral coding using classified codebook mapping was introduced, and in wideband spectral distortion comparisons it was shown to outperform split-band vector quantization.

In subjective tests on the new coding scheme in combination with the true highband excitation, the addition of just 3 bits per frame of highband envelope and gain information produced speech which scored almost comparably with the true wideband speech. Using the new STC-based highband excitation synthesis technique with 3 bits per frame of highband envelope and gain information, the resulting wideband speech did not score as well; however, it was still preferred to the narrowband speech.

The various methods for speech enhancement and coding presented in this thesis offer possibilities for higher quality speech to be transmitted more efficiently across limited-bandwidth channels. Thus, a likely advantage of these methods is more natural speech communication at lower cost for telecommunications service providers and consumers.

This thesis has investigated wideband enhancement in depth, revealing that there are many promising methods which exploit the structure of the speech signal to extend the bandwidth of the excitation and spectral envelope and produce higher quality speech. Examination and development of these methods has shown that wideband speech coding can be achieved at or near narrowband bit rates.


7.2 Limitations of the research

7.2.1 Highband excitation synthesis

Due to the time required for their implementation, not all of the existing methods for highband excitation synthesis were implemented, so a full assessment of them could not be made. While informal listening tests were conducted throughout the research, formal MOS testing of the new wideband enhancement scheme under various conditions (Table 6.1) was conducted quite late in the research. The MOS test results suggested that more attention should be paid to the highband excitation synthesis. Unfortunately, these results became known too late for due attention to be directed towards further improvements to the STC-based technique.

7.2.2 Highband envelope estimation
While the design of codebook pairs for the codebook mapping methods in this thesis begins with minimization of the spectral distortion within the narrowband codebook, this approach is not necessarily optimal. The joint optimization of distortion within the narrowband and highband codebooks has been addressed in image enhancement [RAO96, MIL96, FEK98a, FEK98b], and the application of this approach to codebook pair design for highband envelope estimation would have been pursued given more time.

7.2.3 Very low bit rate wideband coding

The new very low bit rate wideband coding scheme introduced in chapter 6 differs considerably from the majority of existing wideband codecs in that highband coding is based on parametric techniques. In order to evaluate this new coding scheme in the more general context of wideband coding, substantial subjective comparisons need to be made with well-known existing wideband coders. Some comparisons of this type have very recently been made [AGU00, KOI00, MCCR00, TAO00]. In this thesis, due to time constraints, subjective comparison was only made between the novel coding scheme and uncoded wideband speech.


7.3 Suggestions for future research

7.3.1 Highband excitation synthesis

Further subjective testing of the STC-based highband excitation synthesis technique could be employed to isolate elements of the excitation which could be developed further. The proliferation of parametric highband excitation in recent wideband coding literature [AGU00, KOI00, MCCR00, TAO00] suggests that detailed subjective comparisons of the STC-based method of section 2.4 with alternative highband excitation synthesis approaches are now also required.

A faster frame update rate may provide highband excitation which is better able to model speech transients, particularly in view of the increasing temporal resolution of the human ear with increasing frequency [PLA90, ZWI90], due to its critical band structure [SCHA70].

Many telephony standards are based on a speech bandwidth of around 0.3 to 3.4 kHz, and some of the previous literature on wideband enhancement also deals with the problem of low frequency regeneration [e.g. MIE00]. The new method reported in section 2.4 would be extremely well suited to the regeneration of the 0-300 Hz region of the wideband speech (this is suggested in [CAR94b]), and its performance in this application could be compared with other existing techniques.

7.3.2 Highband spectral envelope estimation

Research topics such as speaker adaptation, image enhancement and neural networks could be further investigated in an attempt to find new envelope estimation methods which improve on those reviewed and proposed in this thesis. As the processing and memory resources of computers improve, many of the methods of this thesis should be re-trained using a much larger speech corpus, since this is expected to produce better envelope estimation performance. However, the results of section 4.4.3 suggest that only limited improvements will be possible.


7.3.3 Highband gain estimation

The large highband spectral distortion resulting from all methods examined in this thesis clearly shows the need for improved highband gain estimation techniques. Any improvements to this aspect of wideband enhancement are anticipated to have a direct positive impact on the subjective quality of the output wideband speech, although future improvements in highband spectral distortion may be limited (see section 5.4.1).

7.3.4 Very low bit rate wideband coding

As mentioned in section 7.3.1, the human ear has improved temporal resolution at higher frequencies, which suggests that the highband spectral envelope and gain should be coded on a sub-frame basis (e.g. 2.5-5ms sub-frames). The novel method for highband envelope/gain coding presented in section 6.3 would be ideal for sub-frame highband spectral coding since very few bits (e.g. 1-3) are required to achieve perceptually acceptable highband envelope/gain estimates for each sub-frame.

The new technique for wideband spectral coding presented in section 6.3.3 could be applied to narrowband coding with some modification to the codebook bandwidths, and may prove an effective means of quantizing the narrowband spectral envelope at low bit rates. The technique may also prove very effective for parametric coding of wideband speech in applications where a long delay is acceptable and a very low bit rate is desired, such as voice storage, using methods similar to those described in [MUD98].

7.3.5 Wideband subjective testing

More work is needed towards identifying the most appropriate methods for subjective comparison between wideband coders, and in particular between coders with different bandwidths. Although many experiments to date show that subjects generally prefer wider bandwidth speech [e.g. ROY91, JAY92, UBA97], their preferences need to be unambiguously quantified, particularly where comparisons between speech coded using entirely different techniques are made.


7.3.6 Wideband auditory masking

The potential for bit rate reduction in wideband coders through exploitation of the properties of the human auditory system has been suggested widely throughout the literature [e.g. JAY92, NOL93, ADO95]. Results in this thesis confirm that much less information needs to be transmitted for the highband speech than the lowband speech to maintain high quality speech transmission, and suggest that listeners have a relatively high tolerance for highband envelope distortion. Further work into the perceptually important features in the higher frequency components of speech would throw light on improved approaches to highband excitation and envelope coding.

Appendix A / Subjective Assessment 130

Appendix A.

Subjective Assessment

A.1 Listener opinion tests

Listener opinion tests can be classified into three classes [DIM91]: absolute category rating (ACR), degradation category rating (DCR) and equality threshold rating (ETR) assessments.

A.1.1 Absolute category rating

ACR assessment involves presenting the listener with a number of randomly ordered ‘samples’ (two to three sentences, separated by at least one second) of speech prepared according to each ‘condition’. The listener may be given 5 to 10 seconds in which to record their opinion before the next sample is heard. The scale on which listeners’ opinions are scored is often taken as that recommended by the ITU-T [IEEE69, CCI88b], shown in Table A.1. Listeners’ opinions for each condition can be averaged to obtain a mean opinion score (MOS).

Table A.1. ITU-T five point quality scale

Score   Quality rating
5       Excellent
4       Good
3       Fair
2       Poor
1       Bad

The speech source data used in MOS testing should include at least 4 different speakers (2 female, 2 male) with different voice characteristics. At least 12 listeners should be employed; 24 is common and 32 is preferable.

A.1.2 Degradation category rating

DCR assessment uses (A-B) stimuli pairs or (A-B-A-B) repeated pairs, where A is the high quality reference signal and B is the same sample after degradation has been


introduced by the processing technique under test. The ITU-T [CCI88c] recommends a four speaker corpus as follows:

Speaker 1 reading A then B
Speaker 2 reading A then B
Speaker 3 reading A then B
Speaker 4 reading A then B

Each sample is then rated on the scale shown in Table A.2. The average of the scores obtained over many listeners is then calculated to obtain a DMOS (degradation mean opinion score). DCR is well suited to the evaluation of high quality speech processing schemes.

Table A.2. ITU-T five point degradation category scale

Score   Degradation category
5       Degradation is inaudible
4       Degradation is audible but not annoying
3       Degradation is slightly annoying
2       Degradation is annoying
1       Degradation is very annoying

A.1.3 Absolute vs. comparative rating for wideband enhanced speech

DCR is a comparative rating method which provides a sensitive measure of the degradation of a speech ‘sample’ relative to a high quality reference sample, and is thus very useful for assessing how close the quality of a high quality speech coder is to that of the uncoded speech. The value of a DCR test is perhaps more questionable where the degradation is much larger, for example in a comparison between wideband enhanced speech and the true wideband speech.

A variant of DCR is comparative category rating (CCR), which can be used to assess enhancement or degradation relative to a reference sample. CCR may be useful for assessing the improvement of enhanced speech over the unprocessed speech, for example the improvement of wideband enhanced speech over the narrowband speech from which it was derived.


In general, if degradations of different kinds are tested simultaneously, then an absolute rating system is preferable [DIM91]. Considering the range of conditions to be tested in this thesis, this makes ACR an attractive choice; however, the following limitations of ACR should be noted. ACR tests need to be ‘anchored’ by reference conditions which give subjects a better understanding of the different quality levels on the five-point scale. Often the modulated noise reference unit (MNRU) [CCI88a] is used to introduce such reference conditions from ‘bad’ to ‘good’ quality, but the type of degradation introduced by the MNRU is not always appropriate to the conditions being tested. Another limitation of ACR tests is the ordering effect, whereby different scores may be given to the same sample depending on the quality of the sample which preceded it. This effect can be overcome by sufficiently randomizing the presentation order of samples and collecting a large total number of scores for each condition.

A.2 Experiment design

A speech database containing 64 sentences spoken by 8 North American speakers, 4 male and 4 female, was used to prepare 7 different conditions (listed in Table 6.1 of section 6.5). This test database was entirely separate from speech data used to train any of the algorithms applied to it. After the test database was processed for each condition, it was segmented into 7 × 64 sentences of 3 to 5 seconds in length.

Eighteen subjects, 16 male and 2 female, between the ages of 20 and 30, were selected at random from the student population of the University of New South Wales. Subjects were asked to wear headphones that were preset to a comfortable listening level (held constant for all subjects). The subjects were given a set of instructions, listed in Fig. A.1, and were presented with 7 randomly selected training conditions, to familiarize themselves with the presentation format.

Subjects were presented with 10 sentences for each of the 7 conditions. For each condition, the sentences were chosen randomly from a database of 64 sentences, except that no sentences were presented twice. The presentation of the 70 sentences for each subject was random, except that for each bracket of 7 sentences, all conditions were heard once.
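The balanced presentation described above — each bracket of 7 trials containing every condition exactly once, with no sentence repeated for a condition — can be sketched as follows. This is an illustrative reconstruction, not the actual test software; the function name and random seed are arbitrary.

```python
import random

def presentation_order(n_conditions=7, n_brackets=10, n_sentences=64, seed=0):
    # Balanced ACR presentation: each bracket of n_conditions trials contains
    # every condition exactly once, and no sentence is repeated per condition.
    rng = random.Random(seed)
    # For each condition, draw 10 distinct sentences from the 64 available.
    sentences = {c: rng.sample(range(n_sentences), n_brackets)
                 for c in range(n_conditions)}
    order = []
    for b in range(n_brackets):
        bracket = list(range(n_conditions))
        rng.shuffle(bracket)  # randomize condition order within the bracket
        order += [(c, sentences[c][b]) for c in bracket]
    return order

trials = presentation_order()  # 70 (condition, sentence) trials per subject
```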


No effort was made to ensure that each subject heard an equal number of male and female sentences, except that, over all subjects, approximately an equal number of each were presented.

Thank you for participating in this study.

You will be listening to a number of speech samples through headphones, so please ensure that you are sitting comfortably and that the headphones are correctly adjusted before you begin.

After each sample has been played, a prompt will be displayed on the monitor:

Score (1 - 5):

Please wait until the speech sample has finished playing before entering a score. You must score the speech samples you have just heard based on the following scale:

Score   Quality rating
5       Excellent
4       Good
3       Fair
2       Poor
1       Bad

You must enter one of the whole numbers shown on the scale. If you do not, you will see the following message:

Please re-enter score or ask for assistance Score (1 - 5):

If you have any problems, please stop and ask for assistance.

Do not enter a score until you have heard the entire sample; however, try to enter a score within about 10 seconds of hearing the entire sample.

The study will begin with a short training sequence to help familiarize you with the procedure.

If you have any questions, please ask now.

Figure A.1. Instruction sheet for ACR subjective test


A.3 Analysis

Typically the mean of all listeners’ opinions for several repetitions of a single condition, or the mean opinion score (MOS), is quoted when ACR and DCR assessments are made. Since the objective of the test is to compare the means of different conditions, it is necessary to estimate the statistical significance of the difference between a pair of means. The difference between two means is often declared statistically significant if their 95% confidence intervals do not overlap. The 95% confidence interval (CI) about a mean is calculated as

CI = \pm z_{0.025} \sqrt{\sigma^2 / n},          (A-1)

where \sigma^2 is the variance and n is the number of scores for a given condition. z_{0.025} = 1.96 is the z-value which corresponds to 97.5% of the area under the normal curve.
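As a concrete illustration, (A-1) and the interval-overlap test can be computed directly. The scores below are hypothetical, chosen only to show the calculation.

```python
import math

def mos_ci95(scores):
    # Mean opinion score and 95% CI half-width per (A-1):
    # CI = z_0.025 * sqrt(variance / n), with z_0.025 = 1.96.
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n
    return mean, 1.96 * math.sqrt(var / n)

# Hypothetical scores for two conditions (180 scores each).
m1, ci1 = mos_ci95([4, 4, 5, 3, 4, 4, 5, 4, 3, 4] * 18)
m2, ci2 = mos_ci95([3, 3, 4, 2, 3, 3, 4, 3, 2, 3] * 18)
# Non-overlapping intervals -> the difference is declared significant.
significant = (m1 - ci1) > (m2 + ci2)
```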

Appendix B / Alternative Distance Measures 135

Appendix B. Alternative Distance Measures

The most commonly used [MAK85] distance between two vectors x, \hat{x} \in R^k is the mean square error or Euclidean measure

d(x, \hat{x}) = [(x - \hat{x})^T (x - \hat{x})]^{1/2} = ( \sum_{i=1}^{k} (x_i - \hat{x}_i)^2 )^{1/2}.          (B-1)

Often it is desired to weight the mean square error, and one method of achieving this is to introduce a square matrix W containing the weights

d(x, \hat{x}) = [(x - \hat{x})^T W (x - \hat{x})]^{1/2} = ( \sum_{i=1}^{k} w_i (x_i - \hat{x}_i)^2 )^{1/2},          (B-2)

where W is a diagonal matrix with diagonal elements w_1, . . . , w_k. The weights w_i may be derived from perceptual information or by experimental estimation. For example, in the space of narrowband LSF vectors of dimension k = 10, the weights 1, 1, . . . , 1, 0.8, 0.4 were derived in [PAL93] for narrowband coding purposes.
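A minimal sketch of the weighted Euclidean distance (B-2), using the [PAL93] LSF weighting quoted above; the test vectors are arbitrary illustrations.

```python
import math

def weighted_euclidean(x, xhat, w):
    # Weighted Euclidean distance of (B-2) with a diagonal weight matrix W:
    # d = sqrt(sum_i w_i * (x_i - xhat_i)^2).
    return math.sqrt(sum(wi * (xi - xh) ** 2
                         for wi, xi, xh in zip(w, x, xhat)))

# LSF-style weights from [PAL93]: de-emphasize the two highest coefficients.
w = [1.0] * 8 + [0.8, 0.4]
x = [0.1 * i for i in range(1, 11)]            # illustrative 10-dim vectors
xhat = [0.1 * i + 0.05 for i in range(1, 11)]
d = weighted_euclidean(x, xhat, w)
```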

The Mahalanobis distance is a measure of the distance from a vector x to a class with class mean \hat{x} and covariance matrix C, and is used in maximum likelihood problems. The distance is formulated as

d(x, \hat{x}) = (x - \hat{x})^T C^{-1} (x - \hat{x}),          (B-3)

which is simply a weighted squared Euclidean distance. The distance d in (B-3) is often employed as the exponent of a Gaussian (or other) probability density function, unlike the weighted Euclidean distance.
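For illustration, (B-3) with a diagonal covariance reduces to a per-component weighted squared Euclidean distance; the diagonal-C restriction is an assumption made here purely for brevity.

```python
def mahalanobis_sq(x, mean, cov_inv_diag):
    # Squared Mahalanobis distance of (B-3), (x - mean)^T C^-1 (x - mean),
    # written out for a diagonal inverse covariance.
    return sum(ci * (xi - mi) ** 2
               for ci, xi, mi in zip(cov_inv_diag, x, mean))

# With C = I the distance reduces to the squared Euclidean distance.
d2 = mahalanobis_sq([1.0, 2.0], [0.0, 0.0], [1.0, 1.0])
```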


Where linear predictor (LP) coefficients [MAK75a] are used in speech spectral analysis, the Itakura-Saito distance [ITA75b] can be applied. The minimum mean square error (MSE) arising from the LP analysis of a frame of speech is given by a^T R a, where the vector a = [1, a_1, . . . , a_p]^T contains the LP coefficients for that frame, R is the autocorrelation matrix with R_{ij} = R(i-j), and R(m) is the autocorrelation sequence of that frame. A measure of the similarity of another LP coefficient vector \hat{a} to a can be defined as the log ratio of the MSE arising from the prediction of the original frame of speech by \hat{a} to the minimum MSE

d(a, \hat{a}) = \log ( \hat{a}^T R \hat{a} / a^T R a ),          (B-4)

the Itakura-Saito distance measure.
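The ratio in (B-4) can be evaluated directly from the autocorrelation sequence; the first-order example below is purely illustrative.

```python
import math

def lp_error(a, R):
    # Prediction error a^T R a, with a = [1, a_1, ..., a_p] and Toeplitz
    # autocorrelation matrix R_ij = R(|i - j|), R given as a sequence R(m).
    n = len(a)
    return sum(a[i] * a[j] * R[abs(i - j)]
               for i in range(n) for j in range(n))

def itakura_saito(a, a_hat, R):
    # Distance of (B-4): log ratio of the MSE of a_hat on the frame to the
    # minimum MSE achieved by the frame's own LP vector a.
    return math.log(lp_error(a_hat, R) / lp_error(a, R))

# First-order example: R(0) = 1, R(1) = 0.9; the optimal a_1 is -R(1)/R(0).
R = [1.0, 0.9]
d = itakura_saito([1.0, -0.9], [1.0, -0.5], R)  # suboptimal predictor, d > 0
```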

Appendix C / Proof of Jensen’s Inequality 137

Appendix C. Proof of Jensen’s Inequality

Consider

H(\lambda | \lambda_0) - H(\lambda_0 | \lambda_0)
  = \sum_{s_k} \log p(s_k | X, Y, \lambda) p(s_k | X, Y, \lambda_0) - \sum_{s_k} \log p(s_k | X, Y, \lambda_0) p(s_k | X, Y, \lambda_0)
  = \sum_{s_k} [\log p(s_k | X, Y, \lambda) - \log p(s_k | X, Y, \lambda_0)] p(s_k | X, Y, \lambda_0)
  = \sum_{s_k} \log [ p(s_k | X, Y, \lambda) / p(s_k | X, Y, \lambda_0) ] p(s_k | X, Y, \lambda_0)
  \le \sum_{s_k} [ p(s_k | X, Y, \lambda) / p(s_k | X, Y, \lambda_0) - 1 ] p(s_k | X, Y, \lambda_0),          (C-1)

since \log x \le x - 1. Now

\sum_{s_k} [ p(s_k | X, Y, \lambda) / p(s_k | X, Y, \lambda_0) - 1 ] p(s_k | X, Y, \lambda_0)
  = \sum_{s_k} p(s_k | X, Y, \lambda) - \sum_{s_k} p(s_k | X, Y, \lambda_0) = 0,          (C-2)

therefore H(\lambda | \lambda_0) \le H(\lambda_0 | \lambda_0) as required.
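The inequality can also be checked numerically: for any two distributions, replacing the current parameters inside the logarithm can only decrease the auxiliary sum. The distributions below are arbitrary illustrations.

```python
import math

def aux(p, q):
    # Auxiliary sum of the form used above: sum_k q_k * log(p_k), where q
    # plays the role of p(s_k | X, Y, lambda_0) and p that of
    # p(s_k | X, Y, lambda).
    return sum(qk * math.log(pk) for pk, qk in zip(p, q))

q = [0.5, 0.3, 0.2]   # posterior under the current parameters
p = [0.2, 0.3, 0.5]   # posterior under any other parameters
assert aux(p, q) <= aux(q, q)   # H(lambda|lambda0) <= H(lambda0|lambda0)
```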

Appendix D / Statistical Recovery Algorithm 138

Appendix D. Statistical Recovery Algorithm [CHE92a, CHE92b, CHE94]

1. Initialize parameters: arbitrarily choose initial sources \lambda_i, 1 \le i \le N, and \gamma_j, 1 \le j \le M, represented as vectors of narrowband and wideband autocorrelation samples r_{\lambda_i} and r_{\gamma_j} respectively. Set \alpha_{ij} = 1 for all i, j and choose p(\lambda_i) \ne 0 (e.g. 1/N), 1 \le i \le N. Construct sequences of narrowband training data x_t and wideband training data y_t, 1 \le t \le T, from T frames of wideband speech to form a database {x_t, y_t}. Normalize each frame by dividing by the square root of its energy per sample.

2. Initialize cumulatives:

r^{num}_{\lambda_i}[k] = 0, r^{den}_{\lambda_i}[k] = 0,   1 \le k \le p+1
r^{num}_{\gamma_j}[k] = 0, r^{den}_{\gamma_j}[k] = 0,   1 \le k \le q+1
p^{cum}(\lambda_i) = 0,   1 \le i \le N
\alpha^{num}_{ij} = 0, \alpha^{den}_{ij} = 0,   1 \le i \le N, 1 \le j \le M

3. Set t = 1.

i) Calculate autocorrelation sequences r_x and r_y from x_t and y_t, the t'th narrowband and wideband frames respectively.

ii) Compute the LP parameters a_{\lambda_i} and a_{\gamma_j} of r_{\lambda_i} and r_{\gamma_j}, using the Levinson-Durbin recursive algorithm [DUR60]. Calculate the autocorrelation sequences r_{a,\lambda_i} and r_{a,\gamma_j} of \lambda_i's and \gamma_j's LP parameters.

iii) Calculate p(y_t, x_t, \lambda_i, \gamma_j) as follows:

p(x_t | \lambda_i) = \exp[ -( r_{x,t}[0] r_{a,\lambda_i}[0] + 2 \sum_{k=1}^{p} r_{x,t}[k] r_{a,\lambda_i}[k] ) ]          (D-1a)

p(y_t | \gamma_j) = \exp[ -( r_{y,t}[0] r_{a,\gamma_j}[0] + 2 \sum_{k=1}^{q} r_{y,t}[k] r_{a,\gamma_j}[k] ) ]          (D-1b)

p(y_t, x_t, \lambda_i, \gamma_j) = p(y_t | \gamma_j) \alpha_{ij} p(x_t | \lambda_i) p(\lambda_i)          (D-1c)

iv) Compute

p(\lambda_i | x_t, y_t) = \sum_{j=1}^{M} p(y_t, x_t, \lambda_i, \gamma_j) / \sum_{i=1}^{N} \sum_{j=1}^{M} p(y_t, x_t, \lambda_i, \gamma_j)          (D-2a)

p(\gamma_j | x_t, y_t) = \sum_{i=1}^{N} p(y_t, x_t, \lambda_i, \gamma_j) / \sum_{i=1}^{N} \sum_{j=1}^{M} p(y_t, x_t, \lambda_i, \gamma_j)          (D-2b)

v) Update cumulatives

r^{num}_{\lambda_i}[k] \leftarrow r^{num}_{\lambda_i}[k] + p(\lambda_i | x_t, y_t) r_{x,t}[k]          (D-3a)
r^{den}_{\lambda_i}[k] \leftarrow r^{den}_{\lambda_i}[k] + p(\lambda_i | x_t, y_t)          (D-3b)
r^{num}_{\gamma_j}[k] \leftarrow r^{num}_{\gamma_j}[k] + p(\gamma_j | x_t, y_t) r_{y,t}[k]          (D-3c)
r^{den}_{\gamma_j}[k] \leftarrow r^{den}_{\gamma_j}[k] + p(\gamma_j | x_t, y_t)          (D-3d)
p^{cum}(\lambda_i) \leftarrow p^{cum}(\lambda_i) + p(\lambda_i | x_t, y_t)          (D-3e)
\alpha^{num}_{ij} \leftarrow \alpha^{num}_{ij} + p(y_t, x_t, \lambda_i, \gamma_j) / \sum_{i=1}^{N} \sum_{j=1}^{M} p(y_t, x_t, \lambda_i, \gamma_j)          (D-3f)
\alpha^{den}_{ij} \leftarrow \alpha^{den}_{ij} + p(\lambda_i | x_t, y_t)          (D-3g)

4. Set t \leftarrow t + 1 and repeat Step 3 until t = T. Then update parameters

r_{\lambda_i}[k] = r^{num}_{\lambda_i}[k] / r^{den}_{\lambda_i}[k] = \sum_{t=1}^{T} p(\lambda_i | x_t, y_t) r_{x,t}[k] / \sum_{t=1}^{T} p(\lambda_i | x_t, y_t)          (D-4a)

r_{\gamma_j}[k] = r^{num}_{\gamma_j}[k] / r^{den}_{\gamma_j}[k] = \sum_{t=1}^{T} p(\gamma_j | x_t, y_t) r_{y,t}[k] / \sum_{t=1}^{T} p(\gamma_j | x_t, y_t)          (D-4b)

p(\lambda_i) = p^{cum}(\lambda_i) / T          (D-4c)

\alpha_{ij} = \alpha^{num}_{ij} / \alpha^{den}_{ij}          (D-4d)

5. Test the stop criterion: if satisfied, then stop; otherwise repeat steps 2-4.

Algorithm D.1. Parameter design for statistical recovery using the EM algorithm. Note that the exponents in (D-1a) and (D-1b) are a form of the Itakura-Saito distance [ITA75b] (see Appendix B).
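The posterior-weighted accumulation and normalization of steps 3(v) and 4 can be sketched for the narrowband templates alone (updates (D-3a), (D-3b) and (D-4a)); the frames and posteriors below are toy values, not trained quantities.

```python
def reestimate_templates(frames, posteriors):
    # EM re-estimation sketch: each source's autocorrelation template becomes
    # the posterior-weighted average of the frame autocorrelations, i.e.
    # accumulate (D-3a)/(D-3b) over t, then normalize as in (D-4a).
    n_src = len(posteriors[0])
    n_lag = len(frames[0])
    num = [[0.0] * n_lag for _ in range(n_src)]
    den = [0.0] * n_src
    for r, post in zip(frames, posteriors):
        for i, p in enumerate(post):
            den[i] += p                      # (D-3b)
            for k in range(n_lag):
                num[i][k] += p * r[k]        # (D-3a)
    return [[v / den[i] for v in num[i]] for i in range(n_src)]  # (D-4a)

# Two frames, two sources; hard posteriors assign one frame to each source.
templates = reestimate_templates([[1.0, 0.5], [1.0, -0.5]],
                                 [[1.0, 0.0], [0.0, 1.0]])
```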

References 140

References

[ABE95]

Abe, M., and Yoshida, Y. (1995). “More natural sounding voice quality over the telephone!”, NTT Review, vol. 7, no. 3, pp. 104-109, May.

[ABU90]

Abut, H. (1990). Vector Quantization, IEEE Press, New York.

[ADO95]

Adoul, J.-P., and Lefebvre, R. (1995). “Wideband speech coding”, in Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal (Eds), Elsevier, Amsterdam, Chapter 4, pp. 289-309.

[AGU00]

Aguilar, G., Chen, J.-H., Dunn, R. B., McAulay, R. J., Sun, X., Wang, W., and Zopf, R. (2000). “An embedded sinusoidal transform codec with measured phases and sampling rate scalability”, in Proc. Int. Conf. on Acoustics, Speech and Signal Processing, vol. 2, pp. 1141-1144.

[ALM82]

Almeida, L. B., and Tribolet, J. M. (1982). “Harmonic coding: a low bit-rate, good-quality speech coding technique”, in Proc. Int. Conf. on Acoustics, Speech and Signal Processing (Paris, France), pp. 1664-1667.

[ATA76]

Atal, B. S., and Rabiner, L. R. (1976). “A pattern recognition approach to voiced-unvoiced-silence classification with applications to speech recognition”, IEEE Trans. Acoust., Sp., and Sig. Proc., vol. ASSP-24, no. 3, pp. 201-212, June.

[ATA86]

Atal, B. S. (1986). “High-quality speech at low bit rates: multi-pulse and stochastically excited linear predictive coders”, in Proc. Int. Conf. on Acoustics, Speech and Signal Processing, pp. 1681-1684.

[AVE95]

Avendano, C., Hermansky, H., and Wan, E. A. (1995). “Beyond Nyquist: towards the recovery of broad-bandwidth speech from narrow-bandwidth speech”, in Proc. 4th European Conf. on Speech Commun. and Technol., EUROSPEECH (Madrid, Spain), vol. 1, pp. 165-168, Sept.

[BAU72]

Baum, L. E. (1972). “An inequality and associated maximization


technique in statistical estimation for probabilistic functions of Markov processes”, Inequalities, vol. 3, pp. 1-8.

[BER79]

Berouti, M., Schwartz, R., and Makhoul, J. (1979). “Enhancement of speech corrupted by acoustic noise”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 208-211.

[BEZ81]

Bezdek, J. C. (1981). Pattern recognition with fuzzy objective function algorithms, Plenum Press, New York.

[BI97]

Bi, N., and Qi, Y. (1997). “Application of speech conversion to alaryngeal speech enhancement”, IEEE Transactions on Speech and Audio Processing, vol. 5, no. 2, pp. 97-105, March.

[CAM86]

Campbell, J. P., and Tremain, T. E. (1986), “Voiced/unvoiced classification of speech with applications to the U.S. Government LPC-10E algorithm”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (Tokyo, Japan), vol. 1, pp. 473-476.

[CAR94a]

Carl, H. (1994). “Untersuchung verschiedener Methoden der Sprachkodierung und eine Anwendung zur Bandbreitenvergrößerung von Schmalband-Sprachsignalen”, Ph.D. Dissertation, Ruhr-Universität, Bochum.

[CAR94b]

Carl, H., and Heute, U. (1994). “Bandwidth enhancement of narrowband speech signals”, SIGNAL PROCESSING VII, Theories and Applications, EUSIPCO, vol. 2, pp. 1178-1181.

[CCI88a]

CCITT (1988). “Subjective performance assessments of digital processes using the modulated noise reference unit (MNRU)”, Suppl. 14, ‘Blue Book’, Vol. V, pp. 341-360, Melbourne.

[CCI88b]

CCITT (1988). “Absolute category rating (ACR) method for subjective testing of digital processes”, Suppl. 14, ‘Blue Book’, Vol. V, pp. 346-351, Melbourne.

[CCI88c]

CCITT (1988). “Subjective performance assessment of digital encoders using the degradation category rating procedure (DCR)”, Suppl. 14, ‘Blue Book’, Vol. V, pp. 351-356, Melbourne.

[CHA96]

Chan, C-F., and Hui, W-K. (1996). “Wideband enhancement of narrowband coded speech using MBE re-synthesis”, in Proc. Int. Conf. on Signal Processing, ICSP, vol. 1, pp. 667-670.


[CHEN96]

Chen, C.-Q., Koh, S.-N., and Sivaprakaspillai, P. (1996). “A modified Generalised Lloyd Algorithm for VQ codebook design”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 542-545.

[CHE92a]

Cheng, Y. M., O’Shaughnessy, D., and Mermelstein, P. (1992). “Statistical signal mapping: a general tool for speech signal processing”, Proc. IEEE 6th Workshop on Statistical and Array Processing, pp. 436-439.

[CHE92b]

Cheng, Y. M., O’Shaughnessy, D., and Mermelstein, P. (1992). “Statistical recovery of wideband speech from narrowband speech”, Int. Conf. on Spoken Language Processing, Banff.

[CHE94]

Cheng, Y. M., O’Shaughnessy, D., and Mermelstein, P. (1994). “Statistical recovery of wideband speech from narrowband speech”, IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, pp. 544-548, October.

[COH91]

Cohn, R. P. (1991). “Robust voiced/unvoiced speech classification using a neural net”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 1, pp. 437-440.

[COM99]

Combescure, P., Schnitzler, J., Fischer, K., Kirchherr, R., Lamblin, C., Le Guyader, A., Massaloux, D., Quinquis, C., Stegmann, J., and Vary, P. (1999) “A 16, 24, 32 kbit/s wideband speech codec based upon ATCELP”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 5-8.

[CRO72]

Croll, M. G. (1972). Sound quality improvement of broadcast telephone calls, BBC Research Department Report No. 1972/26.

[DAS95]

Das, A., and Gersho, A. (1995). “Variable dimension spectral coding of speech at 2400 bps and below with phonetic classification”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 492-495.

[DEL93]

Deller, J. R., Proakis, J. G., and Hansen, J. H.L. (1993). Discrete-Time Processing of Speech Signals, Macmillan, New York.

[DEM77]

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). “Maximum likelihood from incomplete data using the EM algorithm”, Ann. Royal


Stat. Soc., pp. 1-38.

[DIM89]

Dimolitsas, S. (1989). “Objective speech distortion measures and their relevance to speech quality assessments”, IEE Proc. I, Commun. Speech and Vision, vol. 136, pt. 1, no. 5, pp. 317-324, October.

[DIM91]

Dimolitsas, S. (1991). “Subjective quality quantification of digital voice communications systems”, IEE Proc. I, Commun. Speech and Vision, vol. 138, no. 6, pp. 585-595, December.

[DRO89]

Drogo de Jacovo, R., Montagna, R., Perosino, F., and Sereno, D. (1989). “Some experiments of 7 kHz audio coding at 16 kbit/s”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 1, pp. 192-195.

[DUR60]

Durbin, J. (1960). “The fitting of time-series models”, Review of Institute for International Statistics, vol. 28, pp. 233-243.

[ELJ91]

El-Jaroudi, A., and Makhoul, J. (1991). “Discrete all-pole modelling”, IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 39, no. 2, pp. 411-423, February.

[ENB98]

Enbom, N. (1998). Bandwidth Expansion of Speech, Thesis, Royal Institute of Technology (KTH).

[ENB99]

Enbom, N., and Kleijn, W. B. (1999). “Bandwidth expansion of speech based on vector quantization of the mel frequency cepstral coefficients”, in Proc. IEEE Workshop on Speech Coding (Porvoo, Finland), pp. 171-173, June.

[EPH90]

Ephraim, Y. (1990) “A minimum mean square error approach for speech enhancement”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 2, pp. 829-832.

[EQU87]

Equitz, W. (1987). “Fast algorithms for vector quantization picture coding”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 725-728.

[ERI98]

Eriksson, T., Kang, H-G., and Stylianou, Y. (1998). “Quantization of the spectral envelope for sinusoidal coders”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 37-40.

[FAN60]

Fant, G. (1960). Acoustic theory of speech production, Mouton & Co, Gravenhage, The Netherlands.


[FED91]

Federal Standard 1016 (1991). Telecommunications: analog to digital conversion of radio voice by 4,800 bit/second code excited linear prediction (CELP), National Communications System, Office of Technology and Standards, Washington, DC20305-2010.

[FEK98a]

Fekri, F., Mersereau, R. M., Schafer, R. W. (1998). “A generalized interpolative VQ method for jointly optimal quantization and interpolation of images”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 2657-2660.

[FEK98b]

Fekri, F., Mersereau, R. M., Schafer, R. W. (1998). “Enhancement of text images using a context based nonlinear interpolative vector quantization method”, in Proc. IEEE Int. Conf. on Image Processing, pp. 237-241.

[FUJ70]

Fujisaki, H., and Nagashima, S. (1970). “A new type of baseband vocoder and its computer simulation”, Annual Report of the Eng. Research Inst., Fac. of Eng., Univ. of Tokyo, vol. 29, pp. 187-194, Jun.

[FUL92]

Fuldseth, A., Harborg, E., Johansen, J. T., and Knudsen, J. E. (1992) “Wideband speech coding at 16 kbit/s for a videophone application”, Speech Communication, vol. 11, no. 2-3, pp. 139-148, June.

[GER83]

Gersho, A. (1983). “Vector quantization: a pattern-matching technique for speech coding”, IEEE. Commun. Mag., vol. 21, pp. 15-21, Dec.

[GER90]

Gersho, A. (1990). “Optimal nonlinear interpolative vector quantization”, IEEE Trans. Comm., vol. 38, no. 9, pp. 1285-1287, Sept.

[GER93]

Gersho, A., and Gray, R. M. (1993). Vector Quantisation and Signal Compression, Kluwer Academic Publishers, Boston.

[GHI91]

Ghiselli-Crippa, T., and El-Jaroudi, A. (1991). “A fast neural net training algorithm and its application to voiced-unvoiced-silence classification of speech”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 1, pp. 441-444.

[GRA80]

Gray, R. M., Buzo, A., Gray, A. H., and Matsuyama, Y. (1980). “Distortion measures for speech processing”, IEEE Trans. Acoust., Speech, Signal Process., vol. 28, no. 4, pp. 367-376, August.

[GRA84]

Gray, R. M. (1984). “Vector Quantization”, IEEE ASSP Magazine,


vol. 1, pp. 4-29, April.

[GRI88]

Griffin, D. W., and Lim, J. S. (1988). “Multiband excitation vocoder”, IEEE Trans. Acoust., Speech, Signal Process., vol. 36, no. 8, pp. 12231235, August.

[HED81]

Hedelin, P. (1981). “A tone-oriented voice-excited vocoder”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 205-208.

[HEI98]

Heide, D. A., and Kang, G. S. (1998). “Speech enhancement for bandlimited speech”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 1, pp. 393-396.

[HES83]

Hess, W. (1983). Pitch Determination of Speech Signals: Algorithms and Devices, Springer-Verlag, Berlin.

[HOL90]

Holmes, W. J., Holmes, J. N., and Judd, M. W. (1990). “Extension of the JSRU parallel-formant synthesizer for high quality synthesis of male and female speech”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 313-316.

[IEEE69]

IEEE (1969). “IEEE recommended practice for speech quality measurements”, IEEE Trans. Audio and Electroacoust., vol. AU-17, no. 3, pp. 227-246, September.

[ITA75a]

Itakura, F. (1975). “Line spectrum representation of linear predictive coefficients of speech signals”, J. Acoust. Soc. Amer., vol. 57, S35(A).

[ITA75b]

Itakura, F. (1975). “Minimum prediction residual principle applied to speech recognition”, IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-23, no. 1, pp. 67-72, February.

[IWA95]

Iwahashi, N., and Sagisaka, Y. (1995). “Speech spectrum conversion based on speaker interpolation and multi-functional representation with weighting by radial basis function networks”, Speech Communication, vol. 16, pp. 139-151.

[JAX00]

Jax, P., and Vary, P. (2000). “Wideband extension of telephone speech using a Hidden Markov Model”, to appear in Proc. IEEE Workshop on Speech Coding (Delavan, USA), September 18-20.

[JAY92]

Jayant, N., Johnston, J. D., and Shoham, Y. (1992). “Coding of wideband speech”, Speech Communication, vol. 11, pp. 127-138.


[JUA96]

Juang, B.-H., Chou, W., and Lee, C.-H. (1996). “Statistical and discriminative methods for speech recognition”, in Automatic Speech and Speaker Recognition, Lee, C.-H., Soong, F. K., and Paliwal, K. K. (Eds), Kluwer, Boston, Chapter 5, pp. 109-132.

[KAT94]

Katsavounidis, I., Kuo, C.-C. J., and Zhang, Z. (1994). “A new initialization technique for Generalized Lloyd Iteration”, IEEE Sig. Proc. Letters, vol. 1, no. 10, pp. 144-146, October.

[KLE95a]

Kleijn, W. B., and Paliwal, K. K. (1995). “An introduction to speech coding”, in Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal (Eds), Elsevier, Amsterdam, Chapter 1, pp. 3-47.

[KLE95b]

Kleijn, W. B. and Haagen, J. (1995). “Waveform interpolation for coding and synthesis”, in Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal (Eds), Elsevier, Amsterdam, Chapter 4, pp. 175-207.

[KLE95c]

Kleijn, W. B. and Haagen, J. (1995). “A speech coder based on decomposition of characteristic waveforms”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 508-511.

[KOI00]

Koishida, K., Cuperman, V., and Gersho, A. (2000). “A 16 kbit/s bandwidth scalable audio coder based on the G.729 standard”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 2, pp. 1149-1152.

[KUB91]

Kubichek, R. F. (1991). “Standards and technology issues in objective voice quality assessment”, Digital Signal Processing, vol. 1, pp. 38-44.

[KUW95]

Kuwabara, H., and Sagisaka, Y. (1995). “Acoustic characteristics of speaker individuality: Control and conversion”, Speech Communication, vol. 16, pp. 165-173.

[LAF91]

Laflamme, C., Adoul, J-P., Salami, R., Morrissette, S., and Mabilleau, P. (1991), “16 kbps wideband speech coding technique based on algebraic CELP”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 13-16.

[LI99]

Li, C., Cuperman, V., and Gersho, A. (1999). “Robust closed-loop pitch estimation for harmonic coders by time scale modification”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing,


vol. 1, pp. 257-260.

[LIN00]

Lin, W., Koh, S. N., and Lin, X. (2000). “Mixed excitation linear prediction coding of wideband speech at 8 kbps”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 2, pp. 1137-1140.

[LIND80]

Linde, Y., Buzo, A., and Gray, R. M. (1980). “An algorithm for vector quantiser design”, IEEE Trans. Commun., vol. COM-28, no. 1, pp. 84-95, January.

[LLO82]

Lloyd, S. P. (1982). “Least squares quantization in PCM”, IEEE Trans Inf. Theory, vol. IT-28, no. 2, pp. 129-137, March.

[MA94]

Ma, C., and O’Shaughnessy, D. (1994). “The masking of narrowband noise by broadband harmonic complex sounds and implications for the processing of speech sounds”, Speech Communication, vol. 14, pp. 103-118.

[MAI88]

Maitre, X. (1988). “7 kHz audio coding within 64 kbit/s”, IEEE J. Sel. Areas Commun., vol. 6, no. 2, pp. 283-298, February.

[MAK75a]

Makhoul, J. (1975). “Linear prediction: a tutorial review”, Proc. IEEE, vol. 63, no. 4, pp. 561-580, April.

[MAK75b]

Makhoul, J. (1975). “Spectral linear prediction: properties and applications”, IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-23, no. 3, pp. 283-296, June.

[MAK78]

Makhoul, J., Viswanathan, R., Schwartz, R., and Higgins, A. W. F. (1978). “A mixed-source model for speech compression and synthesis”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 163-166.

[MAK79]

Makhoul, J., and Berouti, M. (1979). “High frequency regeneration in speech coding systems”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 428-431, April.

[MAK85]

Makhoul, J., Roucos, S., and Gish, H. (1985). “Vector quantization in speech coding”, Proc. IEEE, vol. 73, no. 11, pp. 1551-1588, November.

[MAL99]

Malik, N., and Holmes, W. H. (1999). “Log amplitude modelling of sinusoids in voiced speech”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (Phoenix, Arizona), vol. 1, pp. 465-468.

[MAR72]

Markel, J. D. (1972). “The SIFT algorithm for fundamental frequency estimation”, IEEE Trans. Audio and Electroacoust., vol. AU-20, no. 5, pp. 367-377.

[MARA93]

Maragos, P., Quatieri, T. F., and Kaiser, J. F. (1993). “On amplitude and frequency demodulation using energy operators”, IEEE Trans. Sig. Process., vol. 41, no. 4, pp. 1532-1550, April.

[MAR76]

Markel, J. D., and Gray, A. H. (1976). Linear Prediction of Speech, Springer-Verlag, Berlin.

[MAT92]

Matsumoto, H., and Inoue, H. (1992). “A piecewise linear spectral mapping for supervised speaker adaptation”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. I, pp. 449-452.

[MCA86]

McAulay, R. J., and Quatieri, T. F. (1986). “Speech analysis/synthesis based on a sinusoidal representation”, IEEE Trans. Acoust., Speech and Signal Processing, vol. ASSP-34, no. 4, pp. 744-754, August.

[MCA87]

McAulay, R. J., and Quatieri, T. F. (1987). “Multirate sinusoidal transform coding at rates from 2.4 kbps to 8 kbps”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 38.7.1-38.7.4.

[MCA88]

McAulay, R. J., and Quatieri, T. F. (1988). “Computationally efficient sine-wave synthesis and its application to sinusoidal transform coding”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 2558-2561.

[MCA90]

McAulay, R. J., and Quatieri, T. F. (1990). “Pitch estimation and voicing detection based on a sinusoidal speech model”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 249-252.

[MCA91]

McAulay, R. J., and Quatieri, T. F. (1991). “Sine-wave phase coding at low data rates”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 577-580.

[MCA92]

McAulay, R. J., and Quatieri, T. F. (1992). “Low-rate speech coding based on the sinusoidal model”, in Advances in Speech Signal Processing, S. Furui and M. Sondhi (Eds), Marcel Dekker, New York, pp. 165-208.

[MCA95]

McAulay, R. J., and Quatieri, T. F. (1995). “Sinusoidal coding”, in Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal (Eds), Elsevier, Amsterdam, Chapter 4, pp. 121-173.

[MCC97]

McClellan, S., and Gibson, J. D. (1997). “Variable-rate CELP based on subband flatness”, IEEE Trans. Speech and Audio Proc., vol. 5, no. 2, pp. 120-130, March.

[MCCR00]

McCree, A. (2000). “A 14 kb/s wideband speech coder with a parametric highband model”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 2, pp. 1153-1156.

[MCE93]

McElroy, C., Murray, B., and Fagan, A. D. (1993). “Wideband speech coding in 7.2 kb/s”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 2, pp. 620-623.

[MCE95]

McElroy, C., Murray, B., and Fagan, A. D. (1995). “Wideband speech coding using multiple codebooks and glottal pulses”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 2, pp. 253-256.

[MIE00]

Miet, G., Gerrits, A., and Valière, J. C. (2000). “Low-band extension of telephone band speech”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 3, pp. 1851-1854.

[MIL96]

Miller, D., Rao, A. V., Rose, K., and Gersho, A. (1996). “A global optimization technique for statistical classifier design”, IEEE Trans. Sig. Proc., vol. 44, no. 12, pp. 3108-3122, December.

[MOO89]

Moore, B. C. J., Oldfield, S. R., and Dooley, G. J. (1989). “Detection and discrimination of spectral peaks and notches at 1 and 8 kHz”, J. Acoust. Soc. Am., vol. 85, no. 2, pp. 820-836, February.

[MOO97]

Moore, B. C. J. (1997). An Introduction to the Psychology of Hearing, Academic Press, San Diego.

[MUD98]

Mudugamuwa, D. J., and Bradley, A. B. (1998). “Optimal transform for segmented parametric speech coding”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 53-56.

[NAK89]

Nakamura, S., and Shikano, K. (1989). “Speaker adaptation applied to HMM and Neural Networks”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 89-92.

[NAK90]

Nakamura, S., and Shikano, K. (1990). “A comparative study of spectral mapping for speaker adaptation”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 1, pp. 157-160.

[NAKA97]

Nakatoh, Y., Tsushima, M., and Norimatsu, T. (1997). “Generation of broadband speech from narrowband speech using piecewise linear mapping”, in Proc. 5th European Conf. on Speech Commun. and Technol., EUROSPEECH (Rhodes, Greece), vol. 3, pp. 1643-1646, Sept.

[NAY90]

Naylor, J. A. (1990). “A neural network algorithm for enhancing delta modulation/LPC tandem connections”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 1, pp. 221-224.

[NEY83]

Ney, H. (1983). “Dynamic programming algorithm for optimal estimation of speech parameter contours”, IEEE Trans. Sys., Man and Cyber., vol. SMC-13, no. 3, pp. 208-214, March/April.

[NIL00]

Nilsson, M., Andersen, S. V., and Kleijn, W. B. (2000). “On the mutual information between frequency bands in speech”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 3, pp. 1327-1330.

[NIS93]

Nishiguchi, M., Matsumoto, J., Wakatsuki, R., and Ono, S. (1993). “Vector quantized MBE with simplified V/UV division at 3.0 kbps”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (Minneapolis, MN), vol. 2, pp. 492-495, April.

[NOL66]

Noll, A. M. (1967). “Cepstrum pitch determination”, J. Acoust. Soc. Am., vol. 41, no. 2, pp. 293-309.

[NOL93]

Noll, P. (1993). “Wideband speech and audio coding”, IEEE Communications Magazine, pp. 34-44, November.

[NOM98]

Nomura, T., Iwadare, M., Serizawa, M., and Ozawa, K. (1998). “A bitrate and bandwidth scalable CELP coder”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 341-344.

[OPP75]

Oppenheim, A. V., and Schafer, R. W. (1975). Digital Signal Processing, Prentice-Hall, New Jersey, USA.

[ORD91]

Ordentlich, E., and Shoham, Y. (1991). “Low-delay code-excited linear-predictive coding of wideband speech at 32 kbps”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 1, pp. 9-12.

[OSH87]

O’Shaughnessy, D. (1987). Speech Communication: Human and Machine, Addison-Wesley, Reading, Massachusetts.

[PAK93]

Paksoy, E., Srinivasan, K., and Gersho, A. (1993). “Variable rate speech coding with phonetic segmentation”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. II, pp. 155-158.

[PAL93]

Paliwal, K. K., and Atal, B. S. (1993). “Efficient vector quantization of LPC parameters at 24 bits/frame”, IEEE Trans. Speech and Audio Proc., vol. 1, no. 1, pp. 3-13, January.

[PAN98]

Panchapakesan, K., Bilgin, A., Marcellin, M. W., and Hunt, B. R. (1998). “Joint compression and restoration of images using wavelets and non-linear interpolative vector quantization”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 5, pp. 2649-2651.

[PAR00]

Park, K.-Y., and Kim, H. S. (2000). “Narrowband to wideband conversion of speech using GMM based transformation”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 2, pp. 1843-1846.

[PAT81]

Patrick, P. J., and Xydeas, C. S. (1981). “Speech quality enhancement by high frequency band generation”, in Digital Processing of Signals in Communications, Loughborough (IERE Conf. Proc. no. 49), pp. 365-373, 7th-10th April.

[PAUL81]

Paul, D. B. (1981). “The spectral envelope estimation encoder”, IEEE Trans. Acoust., Speech and Sig. Proc., vol. ASSP-29, no. 4, pp. 786-794.

[PAU95]

Paulus, J. W. (1995). “Variable bitrate wideband speech coding using perceptually motivated thresholds”, Proc. IEEE Workshop on Speech Coding for Telecommunications, Annapolis, Maryland, USA, pp. 35-36, September.

[PAU96a]

Paulus, J. W., and Schnitzler, J. (1996). “16 kbit/s wideband speech coding based on unequal subbands”, in Proc. Int. Conf. Acoust., Speech, Signal Processing, ICASSP, Atlanta, Georgia, USA, pp. 157-160.

[PAU96b]

Paulus, J. W., and Schnitzler, J. (1996). “Wideband speech coding for the GSM fullrate channel?”, ITG-Fachtagung Sprachkommunikation, 17th-18th September.

[PLA90]

Plack, C. J., and Moore, B. C. J. (1990). “Temporal window shape as a function of frequency and level”, J. Acoust. Soc. Am., vol. 87, no. 5, pp. 2178-2187, May.

[QUA88]

Quackenbush, S. R., Barnwell III, T. P., and Clements, M. A. (1988). Objective Measures of Speech Quality, Prentice-Hall, Englewood Cliffs NJ.

[RAB93]

Rabiner, L. R., and Juang, B.-H. (1993). Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, N.J.

[RAO96]

Rao, A., Miller, D., Rose, K., and Gersho, A. (1996). “A generalized VQ method for combined compression and estimation”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 2032-2035.

[ROB87]

Roberts, R. A., and Mullis, C. T. (1987). Digital Signal Processing, Addison-Wesley, Reading, Massachusetts.

[ROS71]

Rosenberg, A. E. (1971). “Effect of glottal pulse shape on the quality of natural vowels”, J. Acoust. Soc. Am., vol. 49, no. 2, pp. 583-590.

[ROS92]

Rose, K., Gurewitz, E., and Fox, G. C. (1992). “Vector quantization by deterministic annealing”, IEEE Trans. Inf. Theory, vol. 38, no. 4, pp. 1249-1257, July.

[ROY91]

Roy, G., and Kabal, P. (1991). “Wideband CELP speech at 16 kbits/sec”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 1, pp. 17-20.

[SAD95]

Sadegh Mohammadi, H. R. (1995). Efficient coding of the short-term speech spectrum, Ph.D. Thesis, The University of New South Wales.

[SCHA70]

Scharf, B. (1970). “Critical Bands”, in Tobias, J. V. (Ed), Foundations of Modern Auditory Theory, Academic Press, New York, pp.159-202.

[SCHN98]

Schnitzler, J. (1998). “A 13.0 kbits/s wideband speech codec based on SB-ACELP”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 1, pp. 157-160.

[SCHR85]

Schroeder, M. R., and Atal, B. S. (1985). “Code-excited linear prediction (CELP): high quality speech at very low bit rates”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 2, pp. 937-940.

[SEN93]

Sen, D., Irving, D. H., and Holmes, W. H. (1993). “Use of an auditory model to improve quality or lower the bit rate of speech coders”, J. Electrical and Electronics Engineering Australia, vol. 13, no. 2, pp. 123-128, Jun.

[SEY97]

Seymour, C. W., and Robinson, A. J. (1997). “A low-bit-rate speech coder using adaptive line spectral frequency prediction”, in Proc. 5th European Conf. on Speech Commun. and Technol., EUROSPEECH, (Rhodes, Greece), vol. 3, pp. 1319-1322, Sept.

[SHE98a]

Sheppard, D. G., Bilgin, A., Nadar, M. S., Hunt, B. R., and Marcellin, M. W. (1998). “A vector quantizer for image restoration”, IEEE Trans. Image Proc., vol. 7, no. 1, pp. 119-124, January.

[SHE98b]

Sheppard, D. G., Panchapakesan, K., Bilgin, A., Hunt, B. R., and Marcellin, M. W. (1998). “Lapped nonlinear interpolative vector quantization and image super-resolution”, Proc. Asilomar Conf. on Signals, Systems and Computers, vol. 1, pp. 224-228.

[SHI86]

Shikano, K., Lee, K-F., and Reddy, R. (1986). “Speaker adaptation through vector quantization”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 2643-2646.

[SOO93]

Soong, F. K., and Juang, B.-H. (1993). “Optimal quantization of LSP parameters”, IEEE Trans. Speech and Audio Proc., vol. 1, no. 1, pp. 15-24, January.

[STE96]

Stegmann, J., Schröder, G., and Fischer, K. (1996). “Robust classification of speech based on the dyadic wavelet transform with application to CELP coding”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 1, pp. 546-549.

[SUN97]

Sun, X., Plante, F., Cheetham, B. M. G., and Wong, K. W. T. (1997). “Phase modelling of speech excitation for low bit-rate sinusoidal transform coding”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 1691-1694.

[SUZ96]

Suzuki, J., Nagami, K., and Yashima, H. (1996). “Generation of wideband speech from bandlimited speech by use of an autocorrelation function”, Proc. 3rd Meeting: Acoust. Societies of America and Japan, in J. Acoust. Soc. Am., vol. 100, no. 4, pt. 2, pp. 2761-2762, October.

[TAN97]

Tanaka, K., and Abe, M. (1997). “A new fundamental frequency modification algorithm with transformation of spectrum envelope according to F0”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 951-954.

[TAO99]

Taori, R., and Sluijter, R. J. (1999). “Closed-loop tracking of sinusoids for speech and audio coding”, in Proc. IEEE Workshop on Speech Coding (Porvoo, Finland), pp. 1-3, June.

[TAO00]

Taori, R., Sluijter, R. J., and Gerrits, A. J. (2000). “Hi-BIN: An alternative approach to wideband speech coding”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 2, pp. 1157-1160.

[TOL98]

Tolba, H., and O’Shaughnessy, D. (1998). “On the application of the AM-FM model for the recovery of missing frequency bands of telephone speech”, in Proc. Int. Conf. on Spoken Language Processing, ICSLP (Sydney, Australia), pp. 1115-1118, December.

[UBA97]

Ubale, A., and Gersho, A. (1997). “A multi-band wideband speech coder”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 1367-1370.

[VAL92]

Valbret, H., Moulines, E., and Tubach, J. P. (1992). “Voice transformation using PSOLA technique”, Speech Communication, vol. 11, pp. 175-187.

[VALI00]

Valin, J.-M., and Lefebvre, R. (2000). “Bandwidth extension of narrowband speech for low bit-rate wideband coding”, to appear in Proc. IEEE Workshop on Speech Coding (Delavan, USA), September 18-20.

[VOI77]

Voiers, W. D. (1977). “Diagnostic acceptability measure for speech communication systems”, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 204-207.

[WANG96]

Wang, X., Chan, E., Mandal, M. K., and Panchanathan, S. (1996). “Wavelet-based image coding using non-linear interpolative vector quantization”, IEEE Trans. Image Proc., vol. 5, no. 3, pp. 518-522, March.

[YAS95a]

Yasukawa, H. (1995). “Frequency domain adaptive digital filter for spectrum extrapolation of band limited speech”, 2nd Asia Pacific Conf. on Commun., vol. 2, pp. 550-554.

[YAS95b]

Yasukawa, H. (1995). “Spectrum broadening of telephone band signals using multirate processing for speech quality enhancement”, IEICE Trans. Fundamentals, vol. E78-A, no. 8, pp. 996-998, August.

[YAS96a]

Yasukawa, H. (1996). “Restoration of wide band signal from telephone speech using linear prediction error processing”, in Proc. Int. Conf. on Spoken Language Processing, ICSLP, pp. 901-904.

[YAS96b]

Yasukawa, H. (1996). “Implementation of frequency domain digital filter for speech enhancement”, in Proc. 3rd IEEE Int. Conf. on Electronics, Circuits and Systems, ICECS, vol. 1, pp. 518-521.

[YOS94]

Yoshida, Y., and Abe, M. (1994). “An algorithm to reconstruct wideband speech from narrowband speech based on codebook mapping”, in Proc. Int. Conf. on Spoken Language Processing, ICSLP, Yokohama, pp. 1591-1594.

[ZWI90]

Zwicker, E., and Fastl, H. (1990). Psychoacoustics – Facts and Models, Springer-Verlag, Berlin.