
Speech Coding Based On Sparse Modeling

AIN SHAMS UNIVERSITY FACULTY OF ENGINEERING

Electronics Engineering and Electrical Communications

Speech Coding Based on Sparse Modeling
A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Electrical Engineering

Submitted by Eng. Ahmed Mohammed Naguib Elsayed Omara M.Sc. of Electrical Engineering

Supervised by Prof. Abdelhalim Abdelnaby Zekry Prof. Alaa Abdelfattah Hefnawy

Cairo, Egypt 2016

AIN SHAMS UNIVERSITY FACULTY OF ENGINEERING CAIRO - EGYPT Electronics and Communications Engineering Department

Examiners Committee
Name: Ahmed Mohammed Naguib Elsayed Omara
Thesis: Speech Coding Based On Sparse Modeling
Degree: Doctor of Philosophy in Electrical Engineering (Electronics and Communications Engineering)
Title, Name and Affiliation

Signature

1. Prof. Samia Abdelrazek Mashali Computers and Systems Department, Electronics Research Institute

…………… (Member)

2. Prof. Salwa Hussein Elramly Electronics and Communications Eng. Dept. Faculty of Engineering - Ain Shams University

…………… (Member)

3. Prof. Abdelhalim Abdelnaby Zekry Electronics and Communications Eng. Dept. Faculty of Engineering - Ain Shams University

…………… (Supervisor)

4. Prof. Alaa Abdelfattah Hefnawy Head of Computers and Systems Department, Electronics Research Institute

…………… (Supervisor)

Date: 15 /8 / 2016

STATEMENT This Thesis is submitted for the degree of Doctor of Philosophy to the Department of Electronics and Communications Engineering, Faculty of Engineering, Ain Shams University, 2016. The work included in this thesis was carried out by the author in the Department of Electronics and Communications Engineering, Ain Shams University, and in the Computers and Systems Department, Electronics Research Institute. No part of this Thesis has been submitted for a degree or a qualification at any other university or institute.

Name: Ahmed Mohammed Naguib Elsayed Omara Signature: Date:

Researcher Data
Name: Ahmed Mohammed Naguib Elsayed Omara
Date of birth: 29-11-1980
Place of birth: Cairo - Egypt
Last academic degree: Master of Science – Electrical Engineering
Field of specialization: Wireless Networks – Signal Processing
University issued the degree: Ain Shams University
Date of issued degree: 2011
Current job: Assistant Researcher in ERI

ACKNOWLEDGEMENTS
All gratitude is due to "ALLAH", who guided me to bring this thesis to light. I would like to express my special appreciation and thanks to my advisors, Prof. Abdelhalim Zekry and Prof. Alaa Abdel Fattah Hefnawy; they are my mentors. I would like to thank them for encouraging my research and for allowing me to grow as a research scientist. Their advice on both my research and my career was priceless. I also want to thank them for letting my defense be an enjoyable moment, and for their brilliant comments and suggestions. Special thanks go to my family. Words cannot express how grateful I am to my father (Mohammed Naguib), mother (Kh. Hammouda), grandfather (Sayed Omara), grandmother (Ra. Mansour), brother (Ehab Omara), and my sister (Eman Omara) for all of the sacrifices that they have made for me. Your prayers for me are what sustained me thus far. I would also like to thank all of my friends who supported me in writing and encouraged me to strive towards my goal. I would also like to express appreciation to my beloved wife (Ma. Abdel Rahman), who spent sleepless nights with me and was always my support in the moments when there was no one to answer my queries. Finally, I thank my children (Hussam and Ziad), to whom this thesis is dedicated.

AIN SHAMS UNIVERSITY FACULTY OF ENGINEERING Electronics Engineering and Electrical Communications Supervised by Prof. Abdelhalim Zekry, Prof. Alaa Abdelfattah Hefnawy

ABSTRACT
In this thesis, we introduce three contributions in the field of sparse-based speech compression. The main contribution of this thesis is a new backward technique, called the Backward Replacement (BRe), that takes into consideration the impact of backward processing on signal compression. The new technique does not exploit the correlations to eliminate weights; instead, it exploits the converged weights to replace the sparse vector with a sparse symmetric matrix which can be encoded efficiently. As for the second contribution, we introduce a lossy-based rate-saving enhancement of the BRe using an optimized approach that exploits the correlation among the atoms to reduce the replacement errors. Finally, we introduce a lossless-based rate-saving enhancement which is based on hiding rows and columns of the obtained sparse matrix during the index encoding process. By comparing our approach with the backward elimination algorithms in the field of speech compression, we concluded that the BRe enhanced the compression capabilities of the forward modeling by 47% and outperforms the other backward elimination techniques, whose enhancements reach at most 15%. The obtained results also show that the proposed algorithm has the ability to encode speech signals at different bit rates with reasonable quality. The proposed strategy not only proved its effectiveness for compression but also has low complexity in comparison to the elimination-based backward algorithms.

As for the second and third contributions in this study, they are attempts to enhance the rate savings of the OMP-BRe coder. The second contribution is a lossy-based RS enhancement approach, termed the Optimized BRe (OBRe). The corresponding coder is termed OMP-OBRe, and the results show that it outperforms the OMP-BRe coder in enhancing the rate savings under certain conditions by approximately 3%, but in return more computation time is needed to fulfill the backward process. As for the third contribution, it is a lossless-based RS enhancement approach, and the results show that the enhancements obtained by this approach are weak and do not exceed 1%.

Keywords: Speech coding, sparse modeling, forward greedy pursuit, backward elimination, backward replacement.

Published Papers

• A. N. Omara, A. A. Hefnawy, A. A. Zekry, "On Sparse Compression Complexity of Speech Signals," Indonesian Journal of Electrical Engineering and Computer Science, vol. 1, no. 2, pp. 329-340, Feb. 2016.

• A. N. Omara, A. A. Hefnawy, A. A. Zekry, "Sparse Modeling with Applications to Speech Processing: A Survey," Indonesian Journal of Electrical Engineering and Computer Science, vol. 2, no. 1, pp. 161-167, Apr. 2016.

• A. N. Omara, A. A. Hefnawy, A. A. Zekry, "A Compression-based Backward Approach for the Forward Sparse Modeling with Application to Speech Coding," Computers and Electrical Engineering, under publication.

Table of contents

List of figures
List of tables
List of Abbreviations
List of Symbols

1 Introduction
  1.1 Speech processing
  1.2 Sparse Representation
  1.3 Speech Processing Based on Sparse Representation
    1.3.1 Speaker Identification and Blind Source Separation
    1.3.2 Speech Enhancement
    1.3.3 Speech Recognition
    1.3.4 Speech Compression
  1.4 Motivation and Objectives
  1.5 Summary of contributions
  1.6 Organization of the thesis

2 Speech Coding
  2.1 Time Domain-Based Encoders
  2.2 Frequency Domain-Based Encoders
  2.3 LPC Vocoders
  2.4 Multi-Band Excitation Vocoders
  2.5 Hybrid Codecs
  2.6 Compressed Sensing-based Codecs
  2.7 Quality Measurements
    2.7.1 Subjective measures
    2.7.2 Objective measures
  2.8 Summary

3 Forward and Backward Greedy Pursuit
  3.1 Forward Pursuit
    3.1.1 Matching Pursuit
    3.1.2 Orthogonal Matching Pursuit
    3.1.3 Optimized Orthogonal Matching Pursuit
    3.1.4 Other Techniques
    3.1.5 Forward Pursuit Complexity
  3.2 Backward Pursuit
    3.2.1 Discrete Backward Stage (DBS)
    3.2.2 Feedback-based Backward Stage (FbBS)
    3.2.3 Backward Pursuit Complexity
  3.3 Summary

4 Backward Replacement
  4.1 Greedy Pursuit and Signal Compression
  4.2 Backward Elimination Drawbacks
  4.3 The Backward Replacement Approach
    4.3.1 Objective Function
    4.3.2 From wk to Wk
    4.3.3 Index Grouping
    4.3.4 Replacement Weight and Replacement Error
    4.3.5 Distortion Matrix
    4.3.6 Pair Selection Criterion
    4.3.7 The Algorithm
    4.3.8 Complexity and Memory Requirements of BRe
  4.4 Rate Analysis
    4.4.1 Compression Ratio and Rate Saving
    4.4.2 Index Encoding
    4.4.3 Runs Encoding and Decoding
  4.5 Summary

5 OMP-BRe Speech Coder Evaluation
  5.1 Introduction
  5.2 Encoder Configuration
  5.3 Performance Indexes
  5.4 Results and Discussion
    5.4.1 Rate-Distortion Evaluation
    5.4.2 Rate saving evaluation
    5.4.3 Complexity evaluation
    5.4.4 Quality and intelligibility assessment
  5.5 Summary

6 Extra Rate Savings in OMP-BRe Codec
  6.1 Introduction
  6.2 Lossy-based RS Enhancements
    6.2.1 Optimized BRe algorithm (OBRe)
    6.2.2 OBRe Complexity
    6.2.3 Results and discussions
  6.3 Lossless-based RS Enhancements
    6.3.1 I-Based Hidden Indices HI
    6.3.2 P-Based Hidden Indices HP
    6.3.3 Furthest Index-Based Hidden Indices HFI
    6.3.4 Forced Hidden Indices HF
    6.3.5 Results and discussions
  6.4 Summary

7 Conclusions and Future Works
  7.1 Summary and Conclusions
  7.2 Future Works

References

Appendix A Rate-Distortion Results
  A.1 ISOLET's Results
    A.1.1 No Lossless Compression
    A.1.2 Lossless Compression
  A.2 TIMIT's Results
    A.2.1 No Lossless Compression
    A.2.2 Lossless Compression

Appendix B Chapter 5 Results (Rate vs. SegSNR)
  B.1 ISOLET's Results
    B.1.1 No Lossless Compression
    B.1.2 Lossless Compression
  B.2 TIMIT's Results
    B.2.1 No Lossless Compression
    B.2.2 Lossless Compression

Appendix C Chapter 5 Results (Rate Savings vs. SegSNR)

Appendix D Chapter 6 Results
  D.1 OMP-OBRe coder's Results (RS vs SegSNR)
    D.1.1 ISOLET's Results
    D.1.2 TIMIT's Results
  D.2 Hidden Indices Results (RS vs N)
    D.2.1 OMP-HBRe's Results
    D.2.2 OMP-HOBRe's Results

List of figures

2.1 Speech codecs
2.2 Compressed sensing-based speech codecs
3.1 MP Algorithm
3.2 OMP Algorithm
3.3 Failure of forward greedy algorithm
3.4 BGA's pseudo code
3.5 FBP's pseudo code
4.1 Compressed sensing-based coding system
4.2 BRe Algorithm
5.1 (Rate vs. Distortion): ISOLET, N = 2M, No RLE
5.2 Effect of N and type of atoms on the average distortion levels of OMP over all sparse levels. Error bars represent 95% confidence intervals.
5.3 Effect of N and runs encoding on the rate levels averaged over all sparse levels for SDic. Error bars represent 95% confidence intervals.
5.4 Effect of N and runs encoding on the rate levels averaged over all sparse levels for LDic. Error bars represent 95% confidence intervals.
5.5 Average error difference
5.6 Average correlation
5.7 Rate resolution (bits/sample) required to increase SegSNR by 1 dB, Case 1: SDic, No RLE, N = yM
5.8 Rate resolution (bits/sample) required to increase SegSNR by 1 dB, Case 2: SDic, RLE, N = yM
5.9 Rate resolution (bits/sample) required to increase SegSNR by 1 dB, Case 3: LDic, No RLE, N = yM
5.10 Rate resolution (bits/sample) required to increase SegSNR by 1 dB, Case 4: LDic, RLE, N = yM
5.11 CPU time elapsed by backward processing. Error bars represent 95% confidence intervals.
5.12 Interval plot of MOS-LQO and STOI (95% CI for the mean) at 8 kbps
5.13 Interval plot of MOS-LQO and STOI (95% CI for the mean) at 16 kbps
5.14 Interval plot of MOS-LQO and STOI (95% CI for the mean) at 32 kbps
5.15 Interval plot of MOS-LQO and STOI (95% CI for the mean) at 64 kbps
5.16 Interval plot of MOS-LQO and STOI (95% CI for the mean) of standard codecs
6.1 OBRe Algorithm
6.2 Interval plot of η (95% CI for the mean)
6.3 Effect of N and RLE on the rate levels averaged over all sparse levels (95% CI for the mean)
6.4 CPU time elapsed by backward processing (95% CI for the mean)
6.5 (BRe vs OBRe), Rate resolution (bits/sample) required to increase SegSNR by 1 dB
6.6 MOS-LQO and STOI improvements obtained by OBRe over BRe
6.7 I-Based hidden indices in Wkr
6.8 P-Based hidden indices in Wkr
6.9 FI-Based hidden indices in Wkr
6.10 (BRe vs HBRe), Rate resolution (bits/sample) required to increase SegSNR by 1 dB
6.11 (OBRe vs HOBRe), Rate resolution (bits/sample) required to increase SegSNR by 1 dB
A.1 (Rate vs. Distortion): ISOLET, N = 2M, No RLE
A.2 (Rate vs. Distortion): ISOLET, N = 4M, No RLE
A.3 (Rate vs. Distortion): ISOLET, N = 8M, No RLE
A.4 (Rate vs. Distortion): ISOLET, N = 16M, No RLE
A.5 (Rate vs. Distortion): ISOLET, N = 32M, No RLE
A.6 (Rate vs. Distortion): ISOLET, N = 2M, RLE
A.7 (Rate vs. Distortion): ISOLET, N = 4M, RLE
A.8 (Rate vs. Distortion): ISOLET, N = 8M, RLE
A.9 (Rate vs. Distortion): ISOLET, N = 16M, RLE
A.10 (Rate vs. Distortion): ISOLET, N = 32M, RLE
A.11 (Rate vs. Distortion): TIMIT, N = 2M, No RLE
A.12 (Rate vs. Distortion): TIMIT, N = 4M, No RLE
A.13 (Rate vs. Distortion): TIMIT, N = 8M, No RLE
A.14 (Rate vs. Distortion): TIMIT, N = 16M, No RLE
A.15 (Rate vs. Distortion): TIMIT, N = 32M, No RLE
A.16 (Rate vs. Distortion): TIMIT, N = 2M, RLE
A.17 (Rate vs. Distortion): TIMIT, N = 4M, RLE
A.18 (Rate vs. Distortion): TIMIT, N = 8M, RLE
A.19 (Rate vs. Distortion): TIMIT, N = 16M, RLE
A.20 (Rate vs. Distortion): TIMIT, N = 32M, RLE
B.1 (Rate vs. SegSNR): ISOLET, N = 2M, No RLE
B.2 (Rate vs. SegSNR): ISOLET, N = 4M, No RLE
B.3 (Rate vs. SegSNR): ISOLET, N = 8M, No RLE
B.4 (Rate vs. SegSNR): ISOLET, N = 16M, No RLE
B.5 (Rate vs. SegSNR): ISOLET, N = 32M, No RLE
B.6 (Rate vs. SegSNR): ISOLET, N = 2M, RLE
B.7 (Rate vs. SegSNR): ISOLET, N = 4M, RLE
B.8 (Rate vs. SegSNR): ISOLET, N = 8M, RLE
B.9 (Rate vs. SegSNR): ISOLET, N = 16M, RLE
B.10 (Rate vs. SegSNR): ISOLET, N = 32M, RLE
B.11 (Rate vs. SegSNR): TIMIT, N = 2M, No RLE
B.12 (Rate vs. SegSNR): TIMIT, N = 4M, No RLE
B.13 (Rate vs. SegSNR): TIMIT, N = 8M, No RLE
B.14 (Rate vs. SegSNR): TIMIT, N = 16M, No RLE
B.15 (Rate vs. SegSNR): TIMIT, N = 32M, No RLE
B.16 (Rate vs. SegSNR): TIMIT, N = 2M, RLE
B.17 (Rate vs. SegSNR): TIMIT, N = 4M, RLE
B.18 (Rate vs. SegSNR): TIMIT, N = 8M, RLE
B.19 (Rate vs. SegSNR): TIMIT, N = 16M, RLE
B.20 (Rate vs. SegSNR): TIMIT, N = 32M, RLE
D.1 (RS vs SegSNR): ISOLET, N = 2M
D.2 (RS vs SegSNR): ISOLET, N = 4M
D.3 (RS vs SegSNR): ISOLET, N = 8M
D.4 (RS vs SegSNR): ISOLET, N = 16M
D.5 (RS vs SegSNR): ISOLET, N = 32M
D.6 (RS vs SegSNR): TIMIT, N = 2M
D.7 (RS vs SegSNR): TIMIT, N = 4M
D.8 (RS vs SegSNR): TIMIT, N = 8M
D.9 (RS vs SegSNR): TIMIT, N = 16M
D.10 (RS vs SegSNR): TIMIT, N = 32M
D.11 (RS vs N): ISOLET
D.12 (RS vs N): TIMIT
D.13 (RS vs N): ISOLET
D.14 (RS vs N): TIMIT

List of tables

2.1 Narrowband and broadband speech codecs [1]
3.1 The computational complexity of MP
3.2 The computational complexity of OMP
5.1 Simulation parameters
5.2 Simulation experiments
6.1 Improvements in rate savings of BRe by OBRe, for ISOLET and TIMIT datasets
6.2 r/k for B = 1
6.3 r/k for B = 2
6.4 Improvements in rate savings obtained by hidden indices strategy, for ISOLET and TIMIT datasets
C.1 Rate Saving (%) vs. SegSNR (dB), ISOLET
C.2 Rate Saving (%) vs. SegSNR (dB), TIMIT
C.3 Improvements in rate savings of OMP by BGA, FBP and BRe, for ISOLET and TIMIT datasets, N = 2M
C.4 Improvements in rate savings of OMP by BGA, FBP and BRe, for ISOLET and TIMIT datasets, N = 4M
C.5 Improvements in rate savings of OMP by BGA, FBP and BRe, for ISOLET and TIMIT datasets, N = 8M
C.6 Improvements in rate savings of OMP by BGA, FBP and BRe, for ISOLET and TIMIT datasets, N = 16M
C.7 Improvements in rate savings of OMP by BGA, FBP and BRe, for ISOLET and TIMIT datasets, N = 32M

List of Abbreviations

bps            bits per sample
kbps           kilobits per second
ACELP          Algebraic Code Excited Linear Prediction
ADPCM          Adaptive Differential Pulse Code Modulation
AMR            Adaptive Multi-Rate
AMR-WB         Adaptive Multi-Rate Wide Band
ASR            Automatic Speech Recognition
BE             Backward Elimination
BGA            Backward Greedy Algorithm
BOOMP          Backward Optimized Orthogonal Matching Pursuit
BP             Basis Pursuit
BRe            Backward Replacement
BSS            Blind Source Separation
CELP           Codebook Excitation Linear Prediction
CI             Confidence Interval
CS-ACELP       Conjugate Structure Algebraic Code Excited Linear Prediction
CoSaMP         Compressive Sampling Matching Pursuit
DALT           Diagnostic Alliteration Test
DBS            Discrete Backward Stage
DCT            Discrete Cosine Transform
DFT            Discrete Fourier Transform
DPCM           Differential Pulse Code Modulation
DRT            Diagnostic Rhyme Test
DS             Dantzig Selector
DUET           Degenerate Un-mixing Estimation Technique
DiA            DiAgonal scanning method
EOB            End-of-Block
EVS            Enhanced Voice Services
FBP            Forward Backward Pursuit
FFT            Fast Fourier Transform
FG             Forward Greedy
FOCUSS         FOcal Underdetermined System Solver
FWT            Fast Wavelet Transform
FbBS           Feedback-based Backward Stage
FoBa           Forward-Backward
HBRe           Hidden indices-based BRe
HD             High Definition
HMM            Hidden Markov Model
HOBRe          Hidden indices-based OBRe
ICA            Independent Component Analysis
K-SVD          K-means Singular Value Decomposition
KLT            Karhunen-Loève Transform
LARS           Least Angle Regression
LASSO          Least Absolute Shrinkage and Selection Operator
LD-CELP        Low Delay Codebook Excitation Linear Prediction
LDic           Learned Dictionary
LP             Linear Prediction
LtR            Left-to-Right scanning method
MBE            Multi-Band Excitation
MDCT           Modified Discrete Cosine Transform
MELP           Mixed Excitation Linear Prediction
MMP            Molecular Matching Pursuit
MOD            Method of Optimal Directions
MOS            Mean Opinion Score
MOS-LQO        Mean Opinion Score-Listening Quality Objective
MP             Matching Pursuit
MPE            Multi-Pulse Excited codecs
NESTA          Nesterov's Algorithm
NMSE           Normalized Mean Square Error
NP-hardness    Non-deterministic Polynomial-time hardness
OBRe           Optimized Backward Replacement
OLS            Orthogonal Least Squares
OMP            Orthogonal Matching Pursuit
OOMP           Optimized Orthogonal Matching Pursuit
PCA            Principal Component Analysis
PCM            Pulse Code Modulation
PESQ           Perceptual Evaluation of Speech Quality
R-D            Rate-Distortion
RCELP          Relaxed Codebook Excitation Linear Prediction
RIP            Restricted Isometry Property
RLE            Run Length Encoding
ROMP           Regularized Orthogonal Matching Pursuit
RPE            Regular-Pulse Excited codecs
SDic           Structured Dictionary
SNR            Signal to Noise Ratio
STFT           Short Time Fourier Transform
STOI           Short Time Objective Intelligibility
SegSNR         Segmental Signal to Noise Ratio
SpaRSA         Sparse Reconstruction by Separable Approximation
StOMP          Stagewise Orthogonal Matching Pursuit
VBR            Variable Bit Rate
VSELP          Vector Sum Excited Linear Prediction
dB             decibel

List of Symbols

a                    lower-case bold-face characters represent vectors
A                    upper-case bold-face characters are used for matrices
|z|                  the absolute value, or modulus, of z
|g| ≡ Card(g)        the cardinality of a set g
∥a∥2                 the Frobenius norm of vector a
∥a∥0                 the zero norm of vector a, i.e., the number of non-zeros in a
aT, AT               the transpose of a and of A
⌊.⌋                  the largest integer smaller than or equal to its argument
⌈.⌉                  the smallest integer larger than or equal to its argument
Γ = {1, ..., N}      an index set whose cardinality is |Γ| = N
Λk = {1, ..., k}     another index set whose cardinality is |Λk| = k
Γk = {γλ : λ ∈ Λk}   a subset of Γ such that |Γk| = k
Γk(λ, λ̄)             returns the corresponding indices (γ, γ̄) ∈ Γk
1 ∈ RN               a column of N ones
x ∈ RM               a vector of M samples
x̂ ∈ RM               estimation of x
Q(.) ∈ RM            quantized M samples
O(.)                 big-O notation of complexity
Φ = {ϕγ : γ ∈ Γ}     the dictionary matrix of N normalized atoms ϕγ ∈ RM
Ov(Φ)                over-completeness of Φ, equal to N/M
ΦΓk                  a submatrix containing the columns of Φ with indices in Γk
wk ∈ RN              weights vector with k non-zero elements
wkb ∈ RN             weights vector with k − b non-zero elements
[a]i                 entry i in vector a
wi = [wk]i           entry i in vector wk
wΓk                  a subvector of wk consisting of the k weights only
V(wΓk)               a function that recovers vector wk from subvector wΓk
diag(a)              returns a square matrix whose diagonal entries are the entries of a
[A]i,j               entry i in column j of matrix A
[A]j                 column j of matrix A
P                    a set of pairs of integers
[A]P                 a pair of entries [A]i,j and [A]j,i
→en                  residual error after n forward iterations
←en                  residual error after n backward iterations
r                    number of replacements
Wkr                  sparse symmetric matrix with k − 2r non-zeros in the diagonal
CR                   Compression Ratio
RS                   Rate Saving
∆q                   quantization step
                     number of additions
                     number of multiplications
G                    Gram matrix
Υk ∈ Rk×k            distortion matrix
η                    efficiency factor of the backward replacement approach
E[.]                 the mean value of a given random variable

Chapter 1
Introduction

1.1 Speech processing

Speech processing is an active field of signal processing that has drawn the attention of researchers over the past few decades. Nowadays, different applications utilize many tools related to speech processing, including speech analysis, synthesis, enhancement, recognition and speech compression (coding). The process of converting the raw recordings of the speech signal into more characteristic representations is called speech analysis, and this process facilitates access to descriptive and invariant attributes of the signal. The reciprocal process of speech analysis is speech synthesis. In the speech synthesis task, we try to regenerate the original speech signal using the attributes extracted by the analysis process, or to generate an artificial speech signal. The speech signal can be generated by mixing pieces of recorded speech segments that are stored in a large database. Moreover, systems diverge in the size of the stored speech units; a system that stores phones or diphones is capable of supplying the largest output range, but may suffer from a lack of clarity. For particular usage fields, the storage of whole sentences or words permits high-quality output. Alternatively, a synthesizer can incorporate human voice characteristics and a model of the vocal tract to create a completely synthetic voice output. With speech recognition, we can develop methodologies and technologies that enable the recognition and translation of uttered language into text by computers and computerized equipment such as those labeled as smart technologies and robotics. With speech enhancement, we aim to improve speech quality by using algorithms that improve the intelligibility and the overall perceptual quality of a degraded speech signal. Enhancing speech corrupted by noise, or noise alleviation, is the most important area of speech enhancement, and it is used in many recent applications


such as mobile phones, teleconferencing systems, VoIP and hearing aids. Finally, speech coding, or speech compression, is an operation that tries to represent a speech signal using as few bits as possible while preserving a reasonable level of speech quality. Current coders can be split into different categories such as waveform-based speech coders, parametric coders and hybrid coders. Parametric coders do not attempt to reproduce the original waveform; rather, they aim at reproducing a perception of the original. On the other hand, waveform-approximating coders are able to reproduce the original waveform at high bit rates. Hybrid coding is an integration of two or more coders of any type for the best subjective performance at a given bit rate.

1.2 Sparse Representation

The sparse representation was first presented in [2], [3] as a technique to find linear combinations of basis functions to encode natural images sparsely. Sparse decomposition of signals is a growing field of research which aims at finding a set of prototype signals, called atoms ϕ ∈ RM, which form a dictionary [4], or codebook, Φ ∈ RM×N that can be utilized to represent a given signal x ∈ RM sparsely using a few atoms of the dictionary. Mathematically, for the given signal, we need to find the suitable atoms in Φ such that

x = Φwk + e                                     (1.1)

where wk is a k-sparse vector which contains the k non-zero weights of the linear combination, and e is the reconstruction error vector. The main idea of sparse modeling is to find the best k atoms of the dictionary Φ to represent the given signal in a compressed sense, i.e., k ≪ M. Furthermore, the atoms can be categorized according to their inter-correlations into orthogonal, quasi-orthogonal or non-orthogonal atoms [5]. There are different methods for solving sparse problems. The greedy methods form one family of sparse solution strategies, such as MP (Matching Pursuit) [6], OMP (Orthogonal Matching Pursuit) [7], OOMP (Optimized Orthogonal Matching Pursuit) [8], CoSaMP (Compressive Sampling Matching Pursuit) [9] and StOMP (Stagewise Orthogonal Matching Pursuit) [10]. These greedy methods are heuristics for the exact sparse recovery problem, which is NP-hard, so another efficient trend has been employed based on convex optimization strategies such as LASSO (Least Absolute Shrinkage and Selection Operator) and LARS (Least Angle Regression) [11], BP (Basis Pursuit) and basis pursuit denoising [12]. Recently, fast solvers for the LASSO problem have been introduced in [13] and [14]. Non-convex


optimization methods include FOCUSS [15], sparse Bayesian learning [16, 17] and Monte Carlo-based methods like those in [18], [19], [20] and [21]. The sparse representation of a signal takes into account not only the selection of the k atoms, but also the internal structure of the atom itself. From this point of view, dictionaries of atoms are classified into two categories, namely the Structured Dictionary (SDic) and the Learned Dictionary (LDic). Structured dictionaries include Gabor wavelets, Haar wavelets, Daubechies wavelets, steerable wavelets, undecimated wavelets, contourlets, curvelets, etc. Although structured dictionaries allow any signal to be transformed rapidly, their performance in sparse representation is poor. Unlike the structured dictionaries, a learned dictionary is built from training signals that are similar to those anticipated in the application, so it is able to adapt to any type of signal according to the training database. The main cost of this trend is the higher computational complexity compared to the structured dictionaries; another drawback of the training methodologies is their limitation to low-dimensional signals. The most common training mechanisms are the K-Singular Value Decomposition algorithm (K-SVD) [22] and the Method of Optimal Directions (MOD) [23].
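To make the forward greedy idea concrete, the following is a minimal NumPy sketch of OMP for the model in Eq. (1.1); the random dictionary, the test signal, the sparsity level k and all variable names are illustrative assumptions, not the configuration used in this thesis (see Chapter 5).

    import numpy as np

    def omp(x, Phi, k):
        """Minimal Orthogonal Matching Pursuit: approximate x with k atoms of Phi.
        Phi is an M x N dictionary with unit-norm columns; returns the
        N-dimensional k-sparse weight vector and the residual."""
        M, N = Phi.shape
        residual = x.copy()
        support = []
        w = np.zeros(N)
        for _ in range(k):
            # Forward selection: atom most correlated with the current residual.
            idx = int(np.argmax(np.abs(Phi.T @ residual)))
            if idx not in support:
                support.append(idx)
            # Re-fit all selected weights by least squares (the "orthogonal" step).
            sub = Phi[:, support]
            w_sub, *_ = np.linalg.lstsq(sub, x, rcond=None)
            residual = x - sub @ w_sub
        w[support] = w_sub
        return w, residual

    # Toy usage: a random overcomplete dictionary and a synthetic 3-sparse signal.
    rng = np.random.default_rng(0)
    M, N, k = 64, 256, 3
    Phi = rng.standard_normal((M, N))
    Phi /= np.linalg.norm(Phi, axis=0)          # normalized atoms, as in Eq. (1.1)
    x = Phi[:, [5, 40, 200]] @ np.array([1.0, -0.5, 2.0])
    w, e = omp(x, Phi, k)
    print(np.nonzero(w)[0], np.linalg.norm(e))  # recovers the three atoms with near-zero error

At each iteration the atom most correlated with the residual is selected and all selected weights are re-fitted by least squares; this re-fitting is the orthogonalization step that distinguishes OMP from plain MP.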

1.3 Speech Processing Based on Sparse Representation

The sparse representation of a signal has drawn the attention of researchers across several communities, including information theory, signal processing and optimization [24–26]. It also has ubiquitous applications in speech and audio processing, including model regularization, acoustic modeling, acoustic/audio feature selection, dimensionality reduction, speech compression, speech recognition, blind source separation, and many others. This section presents some efforts on speech processing using sparse modeling.

1.3.1 Speaker Identification and Blind Source Separation

The authors in [27] introduced a novel method for speaker identification, i.e., determining an unknown speaker's identity, based on a sparse model that allows the use of low-power voice transmission recorded by a sensor. Besides, this method is robust to the background noise in the recorded signal. As for Blind Source Separation (BSS), it is a challenging problem that has been studied extensively in recent years. In [28], the author presented a promising method for the BSS of speech signals based on sparse representation with adaptive dictionary learning. In


another work [29], the author illustrated that the sparse modeling can be utilized with an appropriate signal dictionary to provide high-quality blind source separation. In [30], the author addressed the convolutive BSS issue and suggested a solution using sparse Independent Component Analysis (ICA).

1.3.2 Speech Enhancement

Recently, sparse modeling has been widely used for speech processing in noisy conditions, although many problems remain to be solved because of the particularities of speech. In [31], a novel view of signal enhancement was applied successfully to speech using the K-SVD [22]. As illustrated in Section 1.2, the K-SVD algorithm is designed for training a set of atoms that best suits a collection of given signals. Another speech enhancement technique was suggested in [32], where the author proposed an exemplar-based technique for noisy speech. The technique works by locating a sparse representation of the degraded speech in a dictionary containing both speech and noise examples, and employs the activated dictionary atoms to create a time-varying filter to enhance the degraded speech. A good effort was made in [33]; the author proposed an effective dual-channel de-noising algorithm based on sparse modeling. This approach consists of four steps. First, overlapping patches sampled from the two channels are used to train a learned dictionary with the K-SVD algorithm. Second, the sparse weights of these patches are obtained using the orthogonal matching pursuit algorithm (OMP) and the learned dictionary. Third, the de-noised signal is acquired by updating the weights. In the last step, the previous steps are iterated until a preset condition is reached. Experimental results prove that the dual-channel algorithm works better than the single-channel one. Another speech denoising method based on adaptive dictionary learning was proposed in [34]. The algorithm builds a user-defined complete dictionary whose elements clearly encode local features of the speech signal. The performance of the algorithm was compared to that of the Principal Component Analysis (PCA) method, and it was found that it gives a good estimate of the signal even as the number of atoms in the reconstruction decreases considerably; it was also noted that the algorithm has good tolerance to noise, comparable to that of PCA. The enhancement of a speech signal corrupted by a non-stationary interferer was addressed in [35]. The author presented a monaural speech enhancement method based on sparse coding of noisy speech signals in a mixed learned dictionary. This dictionary consists of a speech dictionary and an interferer dictionary, both being possibly over-complete. The speech dictionary is learned off-line and the interferer dictionary is learned on-line during speech pauses.
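As a rough single-channel illustration of the patch-based sparse-coding denoising loop outlined above (not the dual-channel algorithm of [33]), the sketch below uses scikit-learn's DictionaryLearning as a stand-in for K-SVD and OMP for the coding step; the signal, patch size, hop and all parameters are illustrative assumptions.

    import numpy as np
    from sklearn.decomposition import DictionaryLearning

    def frames(sig, size=32, hop=4):
        # Overlapping patches sampled from the (noisy) signal.
        return np.stack([sig[i:i + size] for i in range(0, len(sig) - size + 1, hop)])

    rng = np.random.default_rng(2)
    t = np.arange(4000) / 8000.0
    clean = np.sin(2 * np.pi * 300 * t) * np.hanning(t.size)
    noisy = clean + 0.1 * rng.standard_normal(t.size)

    X = frames(noisy)
    # Steps 1-2: learn a dictionary from the noisy patches, then sparse-code them with OMP.
    dico = DictionaryLearning(n_components=64, transform_algorithm='omp',
                              transform_n_nonzero_coefs=4, max_iter=10, random_state=0)
    codes = dico.fit(X).transform(X)
    # Step 3: reconstruct de-noised patches from the sparse codes.
    patches = codes @ dico.components_
    # Step 4: overlap-add the patches back into a full-length estimate (simple averaging).
    est = np.zeros_like(noisy)
    counts = np.zeros_like(noisy)
    for row, start in zip(patches, range(0, len(noisy) - 32 + 1, 4)):
        est[start:start + 32] += row
        counts[start:start + 32] += 1
    est /= np.maximum(counts, 1)
    print(round(float(np.std(noisy - clean)), 3), round(float(np.std(est - clean)), 3))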

1.3.3 Speech Recognition

Most automatic speech recognition (ASR) technologies are based on hidden Markov models (HMMs), which model a time-varying speech signal using a sequence of states, each of which is associated with a distribution of acoustic features. While HMMs reach relatively high performance in good conditions, they have problems in modeling the wide variance of natural speech signals, such as speech in natural environments, which is often corrupted by environmental noise. Recently, some studies [36], [37], [38], [39], [40] and [41] have aimed at ASR using sparse representations of speech. They use a linear combination of speech atoms as a time-frequency representation of speech. Benefits of the existing systems range from improved recognition accuracy to an easy incorporation of robustness to additive noise. Some of these systems construct the dictionary of atoms used in the sparse representation from exemplars of speech, which are realizations of speech in the training data spanning multiple time frames [41]. When the weights of the sparse representation are used directly in the recognition, the fundamental problem that arises is associating higher-level information with the atoms in the dictionary to enable the recognition. In [37], the author trained a neural network to map the weights of the atoms directly to phoneme classes, whereas in [40], the author associated each atom with one phonetic class, and recognition was done by finding the phoneme class with the highest sum of weights. Also, in [39], the author used a dictionary consisting of both acoustic information and higher-level phonetic information, while in [38], the author used the index of the speech atom with the highest weight as an additional feature for a Dynamic Bayesian Network recognizer. Beside the foregoing efforts, there is more research on speech recognition based on sparse representations. In [42], the author enhanced the Least Absolute Shrinkage and Selection Operator (LASSO) algorithm to improve speech recognition rates. In [43], the author used the sparse representation to estimate the missing weights of the speech signal. In [44], the author evaluated the sparsity assumptions incorporated in sparse component analysis in the framework of the Degenerate Un-mixing Estimation Technique (DUET) for speech recognition in a multi-speaker environment. In [45], the author proposed a state-based labeling for acoustic patterns of speech and a method for using this labeling in noise-robust automatic speech recognition. In [46], a framework for an exemplar-based, de-convolutive speech recognition system was presented.

1.3.4 Speech Compression

As for sparse-based speech compression efforts, in [47] the author presented the Molecular Matching Pursuit (MMP) algorithm, which is suitable for speech coding. The main goal of MMP is to make the decomposition practical by identifying and removing more than two orthogonal atoms at every iteration. Although MMP gives a sub-optimal approximation, it is significantly faster, since the inner-product update step is performed for a vast number of atoms at every iteration. Also, the use of a Modified Discrete Cosine Transform (MDCT) for speech coding was investigated in [48]. This approach gives a sparser decomposition than the traditional MDCT-based transform and permits better coding performance at low bit rates. As opposed to recent low bit-rate coders, which are based on pure parametric or hybrid representations, the approach is capable of providing transparency.

1.4 Motivation and Objectives

Despite the evolution of hybrid speech codecs such as the Adaptive Multi-Rate codec (AMR) [49], the AMR-wideband codec (AMR-WB) [50] and the Enhanced Voice Services codec (EVS) [51], waveform-based speech codecs such as Pulse Code Modulation (PCM) [52] and adaptive differential PCM (ADPCM) [53] are still used in applications; for example, they can be used to send audio over fiber-optic long-distance lines as well as to record it on electronic devices such as CDs, USB drives and others. Also, there are recent enhancements to ADPCM, such as NUT-ADPCM [54], that yield improvements in both signal-to-noise ratio and average bit rate. In this work, we consider the usage of sparse modeling as a waveform-based speech encoder that tries to reproduce the speech waveform without regard to the statistics of the signal. As we will see later (Chapter 3), sparse modeling consists of two stages; the mandatory stage is the forward modeling that aims at reconstructing the signal with few atoms, and the secondary stage is the backward processing that aims at increasing the sparseness by eliminating more atoms and compensating for the elimination error. To our knowledge, there is no study about the effect of the existing backward processing techniques on signal compression, but there are a few efforts, such as [55] and [56], that discuss the Rate-Distortion (R-D) performance of the forward modeling techniques. According to these studies, the forward algorithms have poor performance due to the lack of a tie between their distortion and the required bit rates. The previous studies enhanced the R-D performance of the forward algorithms by replacing the sparse-based objective function with a rate-based objective function. This replacement procedure converted the


methodologies from heuristic-based to optimization-based. The main drawback of the optimized solutions is the computational complexity, which increases with the number of non-zero elements. Therefore, as stated by their authors, this type of solution is suitable for low bit rates, at which the sparsity level is low too. As mentioned, sparse coding is one type of waveform-based encoder, which does not work well at low bit rates. Consequently, the R-D optimized strategies are also unsuitable for quality-based compression such as speech and image compression, because they are expected to be computationally expensive at high bit rates. In this thesis, our strategy is to introduce another backward, heuristic-based methodology that can work efficiently at different bit rates to match the requirements of quality-based applications such as speech compression, which is the main goal of this work.

1.5 Summary of contributions

In this thesis, we introduce three contributions in the field of sparse-based speech compression. The main contribution of this thesis is a new backward technique, called the Backward Replacement (BRe), that takes into consideration the impact of backward processing on signal compression. While the other techniques exploit the correlations to eliminate weights, our proposed technique exploits the converged weights to replace the sparse vector with a sparse symmetric matrix which can be encoded efficiently. As for the second contribution, we introduce a lossy-based rate-saving enhancement of the BRe using an optimized approach that exploits the correlation among the atoms to reduce the replacement errors. Finally, we introduce a lossless-based rate-saving enhancement which is based on hiding rows and columns of the obtained sparse matrix during the index encoding process.

1.6 Organization of the thesis

Here is a brief description of the organization of the thesis. Chapter 2 presents an overview of speech coding categories, with emphasis on waveform and source coding. In Chapter 3, a detailed overview of the forward and backward greedy algorithms is presented, with emphasis on their methodologies and complexity levels. In Chapter 4, a detailed discussion of the first contribution is introduced, including the methodology, rate analysis and complexity analysis.


In Chapter 5, by conducting a simulation study with varying parameters, an evaluation is introduced for the backward approach discussed in Chapter 4, together with statistical tests that measure the significance of the obtained improvements. Chapter 6 presents a detailed analysis of the other contributions, including the suggested methodologies and the evaluation study. Finally, Chapter 7 summarizes the thesis and points out a list of future work.

Chapter 2
Speech Coding

Like natural sensors, current artificial sensors are capable of sensing the world and gathering signals of physical processes. But these sensors are usually not conscious of the physical process underlying the phenomena they "see," and hence they often sample the signal at a higher rate than the effective dimension of the process. Nowadays, there are advanced artificial sensors that have the ability to record signals with high quality; current digital cameras are able to record HD videos and high-resolution images, and current microphones record speech and audio signals with high quality. Due to the evolution of speech communication technologies, we selected the speech signal as the main one-dimensional test signal in this thesis. For storage and communication purposes, the sampled speech signal x ∈ RM should be represented efficiently, or in other words the speech signal representation should use few bits (speech compression or coding). As illustrated in Figure 2.1, there are three categories of speech codecs, namely: waveform codecs, vocoders and hybrid coders.

2.1 Time Domain-Based Encoders

The representation of the speech signal x through its M sample values, or through the differences between consecutive quantized samples, gives the simplest possible representations, such as Pulse Code Modulation (PCM) [52], Differential Pulse Code Modulation (DPCM) and the adaptive version of DPCM, ADPCM [53]. In PCM, the continuous speech signal x(t) passes through a low-pass filter, and the obtained signal is then sampled to obtain a discrete speech signal x(nTs), where 1/Ts is the sampling frequency, which is lower bounded by the Nyquist rate, 1/Ts ≥ 2fm, where fm is the speech signal bandwidth. Before coding, the obtained samples are quantized by a scalar quantizer that maps the sample amplitudes to a finite set of amplitudes.

Fig. 2.1 Speech codecs

In DPCM, the encoder exploits the correlation among contiguous samples. The sample of order n, x(n), is compared to a level xp(n) which is predicted by the encoder, and the difference between them, e(n), is called the prediction error or prediction residual. Since e(n) has a smaller range than x(n), fewer bits are needed to quantize e(n) for a preset distortion level. As for ADPCM, it adapts the prediction coefficients using a gradient descent method. Moreover, the obtained error is quantized at a rate of 2-5 bits per sample using a well-known quantizer called the "companded PCM quantizer", and the quantization step size adapts to the level of the error signal e(n). Another name for this sample-based representation is time-domain encoding, which falls under the category of waveform-based encoders. These coders are high-bit-rate coders, i.e., above 16 kbps. Other common trends among the waveform-based encoders are the frequency domain-based encoders and the compressed sensing-based encoders, which will be discussed in the next sections.
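A minimal closed-loop DPCM sketch of the prediction-and-quantization loop just described is given below; the first-order predictor coefficient, the quantizer step and the 200 Hz test tone are illustrative assumptions, and the adaptive step size and companded quantizer of ADPCM are omitted.

    import numpy as np

    def dpcm_encode(x, step, a=0.95):
        """First-order DPCM: quantize the prediction residual e(n) = x(n) - a*x_hat(n-1)."""
        codes, x_prev = [], 0.0
        for sample in x:
            pred = a * x_prev                    # predicted level x_p(n)
            e = sample - pred                    # prediction residual e(n)
            q = int(np.round(e / step))          # uniform scalar quantizer (ADPCM would adapt `step`)
            codes.append(q)
            x_prev = pred + q * step             # keep the decoder-side reconstruction in the loop
        return codes

    def dpcm_decode(codes, step, a=0.95):
        out, x_prev = [], 0.0
        for q in codes:
            x_hat = a * x_prev + q * step
            out.append(x_hat)
            x_prev = x_hat
        return np.array(out)

    # Toy usage on a synthetic low-frequency tone sampled at 8 kHz.
    t = np.arange(0, 0.01, 1 / 8000.0)
    x = np.sin(2 * np.pi * 200 * t)
    codes = dpcm_encode(x, step=0.05)
    x_hat = dpcm_decode(codes, step=0.05)
    print(np.max(np.abs(x - x_hat)))             # roughly bounded by step/2 when the quantizer does not overload

Because the encoder predicts from the reconstructed (not the original) previous sample, the quantization error does not accumulate at the decoder, which is the key design point of closed-loop DPCM.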

2.2 Frequency Domain-Based Encoders

In this category of speech encoders, the speech vector x of highly correlated samples is transformed to another vector with less correlated weights. Each weight represents


a power level of a frequency or a sub-band of frequencies. So, this category consists of two trends, namely transform coding and sub-band coding. In transform coders [57–61], the main idea is to remove redundancy in the speech vector x ∈ RM by transforming it to a new vector, w, of the same dimension. The vector w contains the weights of M synthesis vectors {ϕi}Mi=1, and hence the speech vector can be expanded linearly as follows:

\[
\tilde{x} = \Phi w = [\phi_1\ \phi_2\ \cdots\ \phi_M]
\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_M \end{bmatrix}
= \sum_{i=1}^{M} w_i \phi_i \tag{2.1}
\]

where the tilde indicates possibly approximated quantities. For common transforms, such as the Discrete Cosine Transform (DCT) [62], the Discrete Fourier Transform (DFT) [63], the Short Time Fourier Transform (STFT) and the Karhunen-Loève Transform (KLT) [64], the synthesis vectors form an orthogonal basis for RM. If the synthesis vectors are not orthogonal, they should still form a basis for RM; i.e., the synthesis matrix Φ should be a full-rank matrix, since this is necessary for the inverse, Φ−1, to exist. The bit-rate reduction in this type of coder is due to the fact that unitary transforms produce uncorrelated transform components that can be coded independently. In sub-band coders, a filter bank is first utilized to split the signal into frequency bands, and each band is represented by a few bits according to a certain criterion. This type of coder also exploits the statistics of the signal to encode each band using a small number of bits. As is well known, it is difficult to obtain high quality from sub-band coders at low bit rates, so these techniques have usually been used for wide-band, medium-to-high bit rate speech coders and for audio coding. For instance, the standard G.722 [65, 66] employs ADPCM within two sub-bands, and a preset bit allocation is utilized to achieve a bit rate of 64 kbps for the 7 kHz speech signal. In [67–69], a robust and flexible sub-band speech coding scheme is proposed. In this coding scheme there is no speech production model, so it is robust to noise and to signals generated by non-speech sources. Compression with good quality can be obtained by adding masking attributes of the human auditory system, as depicted in [70, 71].


Even though sub-band coding is not widely used in speech coders today, it is expected that new sub-band coding standards will be oriented towards wide-band coding and rate-adaptive schemes.
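The sketch below illustrates the transform-coding expansion of Eq. (2.1) with the orthogonal DCT as the synthesis basis; the frame length, sampling rate, test signal and the number of retained weights are illustrative assumptions, and a real coder would also quantize and entropy-code the retained weights.

    import numpy as np
    from scipy.fft import dct, idct

    def dct_transform_code(frame, keep):
        """Expand the frame in an orthogonal DCT basis, keep only the `keep`
        largest-magnitude weights, and resynthesize (cf. Eq. (2.1))."""
        w = dct(frame, type=2, norm='ortho')     # analysis: weights of the synthesis vectors
        idx = np.argsort(np.abs(w))[:-keep]      # indices of the small weights
        w[idx] = 0.0                             # discard them; retained ones would be quantized
        return idct(w, type=2, norm='ortho'), np.count_nonzero(w)

    # Toy usage: one 20 ms frame (160 samples at 8 kHz) of a synthetic voiced-like signal.
    t = np.arange(160) / 8000.0
    frame = np.sin(2 * np.pi * 150 * t) + 0.3 * np.sin(2 * np.pi * 450 * t)
    approx, nnz = dct_transform_code(frame, keep=16)
    snr = 10 * np.log10(np.sum(frame**2) / np.sum((frame - approx)**2))
    print(nnz, round(snr, 1))                    # 16 retained weights, high SNR for this tonal frame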

2.3 LPC Vocoders

The second trend in speech representation is to represent the speech signal using a set of parameters; this kind of coder is termed a vocoder or parametric encoder. Unlike the waveform-based speech encoders, the parametric encoders work on maintaining some speech features of a given production model, such as the pitch, spectral envelope and energy contour. The quality of the LPC output does not converge towards the transparent quality of the original signal, and this quality does not depend on the quantization level of the model parameters. This failure of the LP coders is due to the constraints of the speech production model used. Besides, they do not maintain waveform similarity, so the SNR measurement is meaningless and the quality should be assessed subjectively. This category of coders works well at low bit rates, i.e., from 2 to 5 kbps. The Linear Prediction Coder (LPC) [72] and Mixed Excitation Linear Prediction (MELP) [73] are examples of this category.
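As a small illustration of the LP analysis underlying this family of vocoders, the following sketch estimates the predictor coefficients of one frame with the autocorrelation method; the frame length, order and synthetic test signal are illustrative assumptions (a practical coder would also apply windowing, use the Levinson-Durbin recursion, and encode an excitation model).

    import numpy as np

    def lpc_coefficients(frame, order=10):
        """Autocorrelation-method LP analysis: solve the normal equations R a = r
        for the predictor coefficients used by LPC-style vocoders."""
        r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
        R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
        a = np.linalg.solve(R, r[1:order + 1])
        return a                                 # x(n) is predicted as sum_i a[i] * x(n-1-i)

    # Toy usage: estimate the predictor of a synthetic frame and inspect the residual energy.
    rng = np.random.default_rng(1)
    t = np.arange(240) / 8000.0                  # 30 ms frame at 8 kHz (illustrative)
    frame = np.sin(2 * np.pi * 120 * t) + 0.05 * rng.standard_normal(t.size)
    a = lpc_coefficients(frame, order=10)
    pred = np.array([a @ frame[n - 10:n][::-1] for n in range(10, frame.size)])
    residual = frame[10:] - pred
    print(round(np.sum(residual**2) / np.sum(frame[10:]**2), 4))   # small residual-to-signal energy ratio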

2.4 Multi-Band Excitation Vocoders

Multiband Excitation Coding (MBE) was developed by Griffin and Lim [74]. Griffin's coder depends upon a model that treats the spectrum of the short-time speech signal as the product of a vocal tract envelope and an excitation spectrum. The excitation spectrum can be modeled by a combination of random and harmonic contributions. This approach to modeling depends on the fact that the spectrum of blended sounds or corrupted speech contains both unvoiced and voiced regions, in other words random and harmonic regions. Thus, the spectrum of the speech signal is partitioned into several sub-bands and each sub-band is labeled as an unvoiced sub-band or a voiced sub-band. Many studies have shown that the MBE coder can operate with little drop in quality at rates equal to or below 2.4 kbps if more efficient quantization is incorporated. The authors in [75–81] have shown that the bit rate can be reduced significantly if the spectral modeling is replaced with an LP modeling technique. Vector quantization of the spectral magnitudes that does not use linear prediction models has been reported in [82] and [83] for 2.4 kbps and 3 kbps, respectively. Another approach was proposed in [84] for 2.4 kbps, where a post-processor following the MBE analysis picks out three

2.5 Hybrid Codecs

13

fixed windows of the bandwidth and sends only the spectral information for these bands. Also, in an earlier work [85], a variable rate MBE encoder with high-quality was declared, which combines MBE with phonetic classification and a new spectral vector quantization technique .

2.5

Hybrid Codecs

As parametric-based codecs cannot achieve high speech quality because of the use of simple classification of speech segments into either voiced or unvoiced speech and simple representation of voiced speech with impulse period train, hybrid coding techniques were proposed to combine the features of both waveform-based and parametric-based coding (and hence the name of hybrid coding). It keeps the nature of parametric coding which includes vocal tract filter and pitch period analysis, and voiced/unvoiced decision. Instead of using an impulse period train to represent the excitation signal for voiced speech segment, it uses waveform-like excitation signal for voiced, unvoiced or transition (containing both voiced or unvoiced) speech segments. Many different techniques are explored to represent waveform-based excitation signals such as multi-pulse excitation, codebook excitation and vector quantization. The most well known one, so called “Codebook Excitation Linear Prediction (CELP)”[86] has created a huge success for hybrid speech codecs in the range of 4.8 kbps to 16 kbps for mobile/wireless/satellite communications achieving toll quality (MOS over 4.0) or communications quality (MOS over 3.5). Almost all modern speech codecs (such as G.729 [87], G.723.1[88], AMR[89], iLBC[90] and SILK[91] codecs) belong to the hybrid compression coding with majority of them based on CELP techniques. Variants of CELP include Conjugate-Structure Algebraic CELP (CS-ACELP)[92], Relaxed CELP (RCELP)[93], Low-Delay CELP (LDCELP)[94] and Vector Sum Excited Linear Prediction (VSELP) [95], and others. CELP is a generic term for a class of algorithms and not for a particular codec. Simply, we can say that the CELP is similar to the vector quantization based Analysis-by-Synthesis codec. The main disadvantage of this approach is the computational and space complexity, because it needs a huge space to store the codebooks and high computational complexity to pick out the best codes from the given codebooks. If the rate of transmission is too low, the speech quality of CELP degrades quickly since the size of the codebooks is limited. A more recent 2009 study by Karapantazis et al [1] gives an even better overview for the narrowband and broadband speech codecs as illustrated in Table 2.1.

14

Speech Coding

Table 2.1 Narrowband and broadband speech codecs [1]. Codec Narrowband codecs G.711 G.723.1 G.723.1 G.726 G.726 G.726 G.728 G.729 G.729A G.729D G.729E GSM-FR GSM-HR GSM-EFR AMR-NB iLBC iLBC Speex-NB BV16 Broadband codecs G.722 G.722.1 AMR-WB (G.722.2) Speex-WB iSAC BV32

Bitrate (kbps)

Frame (ms)

Bits per frame

Algorithmic Codec delay (ms) delay (ms)

Compression type

MOS

64 6.3 5.3 16 24 32 16 8 8 6.4 11.8 13 5.6 12.2 4.75-12.2 13.33 15.2 2.15-24.6 16

0.125 30 30 0.125 0.125 0.125 0.625 10 10 10 10 20 20 20 02 20 20 20 5

8 189 159 2 3 4 10 80 80 64 118 260 112 244 95-244 400 304 43-492 80

0.125 37.5 37.5 0.125 0.125 0.125 0.625 15 15 15 15 20 24.4 20 25 40 25 30 5

0.25 67.5 67.5 0.25 0.25 0.25 1.25 25 25 25 25 40 44.4 40 45 60 40 50 10

PCM MP-MLQ ACELP ADPCM ADPCM ADPCM LD-CELP CS-ACELP CS-ACELP CS-ACELP CS-ACELP LPC RPE-LTP VSELP ACELP ACELP LPC LPC CELP TSNFC

4.1 3.8 3.6 3.5 4.1 3.61 3.92 3.7 3.8 4 3.6 3.5 4.1 3.5-4.1 3.8 3.9 2.8-4.2 4

48, 56, 64 24, 32

0.0625 20

3-4 480-640

1.5 40

1.5625 60

SB-ADPCM MLT

∼ 4.1 ∼4

6.6 - 23.85 20

132-477

25

45

ACELP

Various

4-44.2 Variable 10-32 32

80-884 AdaptiveVariable 160

34 Frame + 3ms 5

50 Adaptive 63-123 10

CELP Transform coding TSNFC

Various

20 Adaptive 30-60 ms 5

Various ∼ 4.1

2.6 Compressed Sensing-based Codecs

2.6

15

Compressed Sensing-based Codecs

A common method for signal representation is to transform the M quantized samples into k coefficients or weights using a transform or a filter bank. If the number of weights equals the number of samples then the transform process is just called a complete representation and the represented signal can be expanded as a linear combination of M synthesis vectors. To achieve the efficient representation of the sampled signal, we have to reduce its dimension. In other words, the signal has to be linearly represented with a few k weights such that k ≪ M . Such representations often yield superior signal processing algorithms. Recent theory informs us that, with high probability, a relatively small number of random projections of a signal can contain most of its relevant information. One of the effective signal representations is the sparse modeling. As illustrated in Section 1.3 these representations have several applications in speech processing field like speech recognition, enhancement, speech coding and many more [96]. As depicted in Figure 2.2, the speech codec based on sparse modeling is similar to the transform coding, except that the transform filter is replaced with the sparse filter. In sparse filter, the speech waveform can be reconstructed by a linear combination P of k atoms ki=1 wi ϕi chosen from a dictionary Φ. The over-completed dictionary are matrices whose columns "atoms" are larger than their bases, and they are needed to reconstruct the speech signals sparsely [97]. But choosing the proper dictionary is more difficult and needs further complex algorithms. Also, the type of dictionary; i.e., SDic or LDic plays an important role in compressing the speech signal. The obtained sparse vector should be quantized by a quantizer, then output bit stream passes through a statistical-based lossless compression stage such as the Huffman encoding or arithmetic coding. At the decoder, and after both entropy decoding and de-quantization processes, an approximated speech waveform is reconstructed as follows ˜ = ΦQ(wk ) x

(2.2)

As shown in 2.2, the sparse decoder can not recover the output of the sparse coder exactly. The error between ΦQ(wk ) and Φwk represents the quantization error vector. According to the Restricted Isometric Property (RIP)[98], we can find the range of quantization error as follows (1 − δk )∥Q(wk ) − wk ∥22 ≤ ∥ΦQ(wk ) − Φwk ∥22 ≤ (1 + δk )∥Q(wk ) − wk ∥22

(2.3)

16

Speech Coding

Fig. 2.2 Compressed sensing-based speech codecs. where δk is the RIP constant of the dictionary Φ, and the equality in 2.3 is satisfied if and only if the k selected atoms are orthogonal to each other. If ∆q is the quantization step size of an uniform quantizer, then the second norm of Q(wk ) − wk is upper bounded as follows [4] ∥Q(wk ) −

wk ∥22

k∆2q ≤ 4

(2.4)

In this thesis we will introduce another speech coder based on sparse modeling. The suggested approach depends on splitting the sparse coding stage into two sub-stages namely: forward and backward stages. The forward stage obtains the best k atoms to represent the speech signal at a predefined distortion level, then the suggested backward stage replaces the sparse vector of weights with a sparse symmetric matrix using a lossy procedure. Before discussing the suggested approach, next chapter discusses in more details the most common forward and backward algorithms used in sparse modeling. In addition, it discusses the complexity of each algorithm.

2.7

Quality Measurements

Choosing the appropriate quality assessment parameter is one of the most difficulties that confront the design process of any speech encoder. Also, it is well known that, this is an attractive trend in the research field of speech processing. For instance, early encoders of the speech signal were assessed only by one criterion: intelligibility, so far it is not sufficient to assess the performance by the intelligibility only. The demand of the

2.7 Quality Measurements

17

consumers nowadays is to hear a natural sound. Hence, several quality measures have been developed to quantify "naturalness" of the speech signal.

2.7.1

Subjective measures

In subjective measures we need human listeners to make judgment on the speech coder quality. Also this type of measures is called psychophysical measures and labeled as either intelligibility, Mean Opinion Score (MOS) or explicit comparison [99]. By the intelligibility, each coder is utilized to encode a number of words, and then the listeners will be asked to determine the words they hear, and then we can judge on the quality of each coder by computing the percentage of correct transcriptions. The most common intelligibility test are called Diagnostic Rhyme Test (DRT)[100, 101] and Diagnostic Alliteration Test (DALT) and both of them are using a vocabulary which is controlled to test the intelligibility loss. In addition, the quality assessment can be achieved by means of the Mean Opinion Score (MOS). The MOS is calculated by coding a collection of uttered phrases using different coders, and then the listener will be asked to compare between the degraded speech signal and the original signal, and hence he writes the result of the comparison as a five point numerical scale: 1 = bad, 2 = poor, 3 = f air, 4 = good and 5 = excellent. Sometimes, it is so hard to assess the significance between the obtained MOS values of two different coders. Therefore, a more robust statistical test should be employed if coders are assessed in explicit comparisons. In a comparative test, a listener should hear the same phrase after coding it by two different encoders, and then selects the best one.

2.7.2

Objective measures

As mentioned before, the objective measures of the quality are usually calculated from both the clean and the degraded signals using different mathematical equations which its arguments are those signals. This type of quality assessment doesn’t need listeners, so it is less time consuming and not expensive. Usually, this type of quality measures are utilized to obtain a robust estimation for the quality. Now, we will turn our attention to focus on describing some examples of widely used objective measures. Perceptual Evaluation of Speech Quality (PESQ) is known to be one of the most accurate objective methods to estimate subjective MOS. PESQ estimates the MOS values from both the clean and degraded signal and the estimated MOS is called MOS-LQO (Listening Quality Objective) [102].

18

Speech Coding

Signal-to-Noise Ratio (SNR) is one of the oldest objective measures, and its calculations are very simple, but it needs both degraded and non-degraded samples. The SNR unit is the dB and can be calculated as follows SN R = 10 log10

∥x∥22 ˜ ∥22 ∥x − x

(2.5)

Sometimes we need to calculate the SNR for short speech segment and then we take the average of all measures. This measurement is known by the segmental SNR which is denoted SegSNR, and can be defined as: SegSN R =

F 10 X ∥xi ∥22 log10 ˜ i ∥22 F i=1 ∥xi − x

(2.6)

where F is the number of speech frames (segments), and frame length is usually set between 10 and 20 ms. More information on subjective and objective measures can be found in [53], [103], [104]. [105], [106], [107] and [108].

2.8

Summary

This chapter presented an overview of the speech coders. As illustrated, the speech coders are either waveform-based speech coders or source coding-based speech coders. While the source coding-based coders are suitable for the low bit rate applications, the waveformbased coders are suitable for high bit rate applications. Taking into consideration the sparse modeling, we showed that, the sparse modeling is similar to the transform encoders, but they transform the speech signal to another domain called "atoms domain". Those atoms are either pre-structured atoms like wavelets or learned atoms. The evolution of choosing the atoms space leads to more efficient sparse modeling and consequently more impact on the signal compression.

Chapter 3 Forward and Backward Greedy Pursuit The problem of sparse representation can be formulated as a non-convex or zero-norm (l0 ) optimization problem of finding the linear combination Φw, which satisfies min ∥x − Φw∥22

s.t.

∥w∥0 = k

(3.1)

where k is some predefined threshold which controls the sparseness of the representation and ∥w∥0 denotes the (l0 ) pseudo norm which counts the number of non-zero elements of the vector w. This problem can alternately be formulated as min ∥w∥0

s.t.

∥x − Φw∥22 ≤ ϵ

(3.2)

where ϵ is the reasonable threshold of the reconstruction error. Though the solutions to Eq.(3.1) and Eq.(3.2) need not be the same mathematically, both solutions are similar in essence to what the sparse problem aims at achieving. This problem thus involves a choice of the dictionary and a sparse linear combination of the atoms in the dictionary to represent each desired signal. Commonly used strategies for solving the (l0 ) sparse optimization problem are called greedy algorithms, such as Matching Pursuit (MP)[6], Orthogonal Matching Pursuit (OMP) [7], and Orthogonal Least Squares (OLS) [109]. OMP usually offers greatly superior performance to MP, however, OMP is more costly in both computation time and memory requirements. Using the l0 norm in the sparse approximation problem makes it an NP-Hard with a reduction to NP-complete subset selection problems in combinatorial optimization [110], [111]. A convex relaxation of the problem can instead be obtained by taking the

20

Forward and Backward Greedy Pursuit

first-norm (l1 ) instead of the l0 norm, where ∥w∥1 = N i=1 wi . The l1 norm induces sparsity under certain conditions [7]. The solution of the convex optimization problem will be in the form of P

min ∥x − Φw∥22

∥w∥1 = k

(3.3)

∥x − Φw∥22 ≤ ϵ

(3.4)

s.t.

or min ∥w∥1

s.t.

There are different sparse solutions based on the convex relaxation. An example is the FOcal Underdetermined System Solver (FOCUSS)[112], Basis Pursuit DeNoising (BPDN)[12] (also known as LASSO ) and Dantzig Selector (DS) [113]. The methods called NESTerov’s Algorithm (NESTA) [114] and Sparse Reconstruction by Separable Approximation (SpaRSA) [115] are efficient approaches for solving the optimization problem and suitable for large sparse problems. In this research, we are interested in the greedy pursuit algorithms as the sparse problem solutions. So, in the rest of this chapter we will discuss the forward and backward strategies of this type of solutions. This discussion will illustrate the main aspects of the most common algorithms including the main idea, starting arguments, weights updating procedure, stopping criterion, and finally the algorithm complexity.

3.1

Forward Pursuit

The main idea behind the forward greedy algorithms is to start with the maximum − convergence error → e 0 = x, zero weights vector w0 = 0 and empty support set Γ0 = ∅. At the first iteration, the Forward Greedy (FG) algorithm selects the nearest atom in Φ to − the error → e 0 . The index of the selected atom in the first iteration γ1 is obtained using a predefined picking criterion then it is added to the index set Γ1 = Γ0 ∪ γ1 . After the first − iteration we will have a convergence error → e 1 = x − ΦΓ1 wΓ1 . At the nth iteration, the − error becomes → e n = x − ΦΓn wΓn such that Γn = Γn−1 ∪ γn and the number of non-zero weights is ∥wn ∥0 = n. The above procedure is repeated until the termination conditions are satisfied. The most common termination rules in FG algorithms are the error limit − ∥→ e n ∥22 ≤ ϵ or the sparse limit ∥wn ∥0 = k.

21

3.1 Forward Pursuit 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20:

procedure MP(x, Φ, Γ, k, ϵ) → − e 0 ← x, w0 ← 0, Γ0 ← ∅, n ← 1 loop − γn = argγ∈Γ\Γn−1 max|ϕTγ → e n−1 | Γn ← Γn−1 ∪ γn − e n−1 )1Tγn wn ← wn−1 + (ϕTγn → → − e n ← x − Φwn − − − − − − − Sparse stopping criterion − − − − − − − if n < k then n←n+1 else break end if − − − − − − − Error stopping criterion − − − − − − − − if ∥→ e n ∥22 > ϵ then n←n+1 else break end if end loop return (wk , ΦΓk ) end procedure Fig. 3.1 MP Algorithm

3.1.1

Matching Pursuit

Early signs of FG pursuit ideas appeared in a pioneering work by Stephane Mallat and Zhifeng Zhang in 1993. They introduced the Matching Pursuit (MP) [6], which is an iterative algorithm that offers suboptimal solutions for decomposing any signal in terms of normalized atoms in an over-complete dictionary Φ. At the nth iteration, the algorithm − picks the atom ϕγn that satisfies the correlation criterion γn = argγ∈Γ max|ϕTγ → e n−1 |, and then determines the weight of the atom according to its inner product with the residue P → − − − e n−1 . The MP expansion after the nth iteration will be x = nl=1 (ϕTγl → e l−1 )ϕγl + → e n. The sub-optimality of MP can be traced back to that, at the iteration n the greedy − algorithm guarantees that the new residual error → e n is only orthogonal to the last → − selected atom ϕγn = ΦΓn \ΦΓn−1 , i.e., e n ̸⊥ ΦΓn , where ̸⊥ is the non-perpendicular operator. While there is no guarantee that MP computes sparse representations that approximate the signal x at best. MP is easily implemented, converges quickly, and has good approximation properties [6, 116, 117]. Figure 3.1 shows the pseudo code of the MP algorithm.

22 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20:

Forward and Backward Greedy Pursuit procedure OMP(x, Φ, Γ, k, ϵ) → − e 0 ← x, w0 ← 0, Γ0 ← ∅, n ← 1 loop − γn ← argγ∈Γ\Γn−1 max|ϕTγ → e n−1 | Γn ← Γn−1 ∪ γn wΓn ← (ΦTΓn ΦΓn )−1 ΦTΓn x → − e n ← x − ΦΓn wΓn − − − − − − − Sparse stopping criterion − − − − − − − if n < k then n←n+1 else break end if − − − − − − − Error stopping criterion − − − − − − − − if ∥→ e n ∥22 > ϵ then n←n+1 else break end if end loop return (wk , ΦΓk ) end procedure Fig. 3.2 OMP Algorithm

3.1.2

Orthogonal Matching Pursuit

At the same year of MP release, Y.C. Pati introduced a refinement approach termed Orthogonal Matching Pursuit (OMP) [7]. The OMP differs from the MP by optimizing all the weights values after each iteration, thus the OMP gives a better approximation but − is computationally more expensive. What is done in the OMP is that the residual → e n, at iteration n, is orthogonalized to the space spanned by the previously chosen atoms, − i.e., → e n ⊥ ΦΓn . The algorithm starts exactly like the MP algorithm. The difference occurs when the second atom vector has been selected. As described previously, in MP the inner product between the residual and the atom vector is the weight value, and the weight value used with the previously chosen atoms is kept. In OMP both these weights are optimized making the best possible approximation of x using just these two atoms. − At the iteration n, OMP achieves the orthogonality → e n ⊥ ΦΓn by updating the weights such that wΓn = (ΦTΓn ΦΓn )−1 ΦTΓn x where the term (ΦTΓn ΦΓn )−1 ΦTΓn is called Moore− Penrose pseudoinverse. Figure 3.2 illustrates the pseudo code of the OMP algorithm.

3.1 Forward Pursuit

3.1.3

23

Optimized Orthogonal Matching Pursuit

An improved alternative to the OMP algorithm is presented by Laura Rebollo-Neira [8] and called Optimized Orthogonal Matching Pursuit (OOMP). Usually, the atoms in any overcompleted dictionary Φ are not orthogonal to each other, and hence the criterion of OMP that selects the atom is not optimal with respect to the error minimization. Therefore, OOMP uses another selection criterion γn = arg minγ∈Γ\Γn−1 minwΓn ∥x − ΦΓn−1 ∪γ wΓn ∥22 that is optimal with respect to the residual minimization: the algorithm utilizes all atoms of Φ that have not been used so far and picks out the one that produces the smallest residual. Because of the projection with respect to each candidate atom, the OOMP is computationally much more complex than the OMP. Further, the gain in performance over the OMP is not significantly high [118].

3.1.4

Other Techniques

In addition to the foregoing FG algorithms, there are many algorithms on the same trend, but here we will discuss briefly some of them. Stagewise Orthogonal Matching Pursuit (StOMP) [10] is an OMP-based algorithm. The main difference between StOMP and OMP is that where OMP constructs the sparse solution by adding one atom per forward iteration StOMP uses several atoms. Per each forward iteration the StOMP selects a predefined number of atoms whose correlations with the residual are high, and then it uses a least squares method to find the approximation. Regularized Orthogonal Matching Pursuit (ROMP) [119] is again an OMP-based algorithm and similar to StOMP in selecting a cluster of atoms per each forward iteration. The main difference between StOMP and ROMP is that, the ROMP algorithm does not use a preset number of atoms, but it uses the atoms which have a similar inner product with the current error. It selects all atoms which have an inner product above half the size of the largest inner product. Last but not least, Compressive Sampling Matching Pursuit (CoSaMP) [9] like StOMP and ROMP is an OMP-based algorithm and it selects all atoms which produce the largest inner product with the residue. CoSaMP then restricts the construction process to the required level of sparsity by removing all except the required amount of entries.

3.1.5

Forward Pursuit Complexity

Before discussing the complexity of the forward pursuit algorithms, we have to illustrate that, the complexity here is described by the number of multiplications and additions

24

Forward and Backward Greedy Pursuit Table 3.1 The computational complexity of MP Procedure Atom Selection Residual Update

Routine → − e Θ = ΦT

(n)

(n)





M (N − n + 1)

(M − 1)(N − n + 1)

γn = argmax|Θ|

N −n+1

0

x − ΦΓn wΓn

Mn

M

Γ\Γn−1

n−1

Total Complexity

O(M N )

k-Sparse Complexity

O(kM N )

required by each forward iteration. The number of multiplications and additions at the (n) (n) nth forward iteration are denoted by GΠ and GΣ respectively. For the MP, it is considered the fastest forward greedy pursuit algorithm. This property of MP backs to the fact that, there is no need to update the weights in each iteration. Table 3.1 summarizes GΠ and GΣ for MP. As shown, the computational − complexity at the nth iteration is dominated by the procedure ΦTΓ\Γn−1 → e n−1 . This → − step computes the correlation among the residual vector e n−1 of the prior iteration − and the N − n + 1 contents of ΦΓ\Γn−1 . For → e n−1 ∈ RM , the process of correlation needs N − n + 1 multiplications among vectors, and each vector multiplication needs M elementary products and M − 1 elementary additions respectively. Finally, the total complexity of any iteration of MP algorithm can be denoted approximately by O(M N ) [120, 121], and for the k−sparse solution the complexity becomes O(kM N ) [122]. In contrast to MP algorithm, in OMP, the computational complexity at the iteration n is dominated by the selection and weights updating steps. Note that, the levels of GΠ and GΣ for the Gram inverse G ¸ −1 = (ΦTΓn ΦΓn )−1 had been obtained from [123]. As illustrated in Table 3.2, at the nth iteration, the algorithm should update the set − of selected indices Γn and this requires a matrix-vector product (ΦTΓ\Γn−1 → e n−1 ) whose complexity can be denoted by O(M N ). But the weights updating procedure consists of another complicated process at which the algorithm should apply. As illustrated − previously, to update the weights, the OMP should minimize the previous residual → e n−1 by making orthogonal projection on the new selected subspace ΦΓn , and the complexity of this projection can be denoted by O(M n2 ). Over the last few years, several techniques were suggested for minimizing the computational complexity of the pursuit algorithms. Most of these techniques can be classified into several categories such as the transformation techniques, clustering techniques, factorization techniques and last but not least the optimization techniques.

25

3.1 Forward Pursuit Table 3.2 The computational complexity of OMP Procedure

Routine

Atom Selection

(n)

(n)





− e n−1 Θ = ΦTΓ\Γn−1 →

M (N − n)

(M − 1)(N − n)

γn = argmax|Θ|

N −n

0

M n2

(M − 1)n2

n3

n3 − 2n2 + n

B = AΦTΓn

M n2

M (n2 − n)

C = Bx

Mn

(M − 1)n

Mn

M

Wights Update

G ¸ = ΦTΓn ΦΓn (G ¸ )−1 ΦTΓn x

Residual Update

Sub Routine

x − ΦΓn wΓn

A=G ¸ −1

Total Complexity

O(M N + M n2 )

k-Sparse Complexity

O(kM N + k 3 M )

For the transformation techniques, the FG algorithm tries to exploit the high speed computations of some transforms such as the Fast Fourier Transforms (FFT)[124], or the Fast Wavelet Transforms (FWT)[125]. In these types of transforms, the number of mathematical processes in matrix-vector product ΦT x reduces from M N to N log2 M . As stated previously, although the structured dictionaries lead to high speed sparse decomposition, they are not designed to obtain the high quality sparse representation because it doesn’t consider the signal features and characteristics during the sparse representation. In clustering-based techniques, each technique exploits the non orthogonality among the contents of Φ. The non orthogonality among the atoms backs to the fact that, the over-completed dictionary consists of atoms greater than the basis of the dictionary space, so there are highly correlated atoms; i.e., non orthogonal, that have analogous features. So, by grouping the identical set of atoms we can reduce the search time in those new clusters. One of these approaches had been proposed in [126], in this approach the analogous atoms are grouped with each other, and each group is labeled as a molecule. By employing the clustering step in a recursive fashion on both atoms and molecules we get a tree structure, which can be utilized to design a high speed search algorithm.

26

Forward and Backward Greedy Pursuit

To accelerate the matrix operations especially the inverse of Gram matrix G ¸ = ΦTi Φi (see Table 3.2), there are different matrix factorization techniques can be utilized to reduce the complexity of those operations. One of the most important matrix factorization methods is the Cholesky factorization which is used by the OMP in [63, 127] to reduce the time spent by matrix operations. Also, the QR factorization method is very important to reduce the complexity of the OMP as proposed in [128, 129]. The Cholesky factorization is based on decomposing a positive-definite, Hermitian G ¸ into the product G ¸ TL G ¸ L where the subscript L denotes the lower-triangular matrix and the superscript T denotes the conjugate transpose matrix. For the QR factorization, it decomposes square, real matrix into the product of an upper triangular matrix R and an orthogonal matrix Q. The main drawback of these methods is that they require additional storage. So, another efficient trend was suggested in [121] that requires less storage. The required storage can become an issue for large dictionaries. In optimization-based techniques, we seek to accelerate the orthogonal projections of the residuals on the remaining space of atoms. Fast solvers for the linear system of equations had been utilized for estimating the orthogonal projection, for instance, in [121] the least squares method had been replaced with another fast approach which is called the conjugate gradient [130].

3.2

Backward Pursuit

A major fault of the FG method is that it can not correct the earlier mistakes at all. For instance, consider the situation plotted in Figure 3.3. In the figure, x can be represented as a linear combination of ϕ1 and ϕ3 but ϕ2 is so close to x. By utilizing the FG algorithm, we find ϕ2 first, then ϕ3 and ϕ1 according to the correlation selection criterion. Now, we have found that all features of x can be matched only by ϕ1 and ϕ3 , but we can not bring out the atom ϕ2 which is picked out in the first step. This situation confirms that the FG procedure is inappropriate for atom selection. Because it works well with the near orthogonal subsets of atoms [131]. Shortly, Figure 3.3 shows that FG algorithm makes errors that can not be corrected in a later step. In order to edit the problem, the so-called Backward Elimination (BE) algorithm has been − widely used. Assume that Γn , → e n are the support set of indices after n forward iterations − and the corresponding forward error respectively, such that → e n = x − ΦΓn wΓn . Also, ← − assume that Γ(n,¯n) , e n¯ are the support set of indices after n ¯ backward elimination steps − and the corresponding backward error respectively, such that ← e n¯ = x − ΦΓ(n,¯n) wΓ(n,¯n) . (n,¯ n) Note that the cardinality of the set Γ equals the new sparse level and equals n − n ¯.

3.2 Backward Pursuit

27

Fig. 3.3 Failure of forward greedy algorithm. − − − Any BE method aims to keep the level of ∥← e n¯ ∥22 less than ∥→ e n−¯n ∥22 where → e n−¯n is the forward error after n − n ¯ forward iterations. All introduced BE methods can be classified according to the position of the backward stage with respect to the forward stage into two groups. The first group is called the Discrete Backward Stage DBS-BE methods and the other is called the Feedback-based Backward Stage FbBS-BE methods.

3.2.1

Discrete Backward Stage (DBS)

In DBS-BE methods, the backward processing starts after finishing the forward decomposition. Sometimes the forward stage can be ignored when the backward stage is − e 0 = x − ΦΓ wΓ . The main idea behind the initialized with the full model such that ← DBS-BE method is to remove atoms from the initialized support set greedily, i.e., remove one atom at a time. Although the backward greedy method seems to be a reasonable approach that addresses the issue of the FG algorithm, it is computationally very costly because it starts with all atoms which were selected by the forward algorithm. One of the earlier DBS-BE algorithms is the Backward Greedy Algorithm (BGA) [132] that is introduced by Harikumar, Couvreur, and Bresler since 1998. This algorithm is presented as a competitor to the FG algorithm not to be a follower to it. So, the idea of BGA is to start by the full model of N linear combinations ΦΓ wΓ , and then remove one atom at a time until k atoms are left. The atoms that is brought out at each step is chosen to − bring down the increment in the least-squares residual error ∥← e n¯ ∥22 . By looking at the complexity of the BGA, we find that it starts with a number of atoms N and then bring them out one by one until only k atoms are left, if k is small with respect to N , the computational cost of the backward steps will be higher than that of the forward steps required to obtain the same number of atoms k. Conversely, if k is close to N , the cost of BGA will be smaller than that of the forward algorithm. The main limitation of BGA is that, it works only with a complete or under-complete

28 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18:

Forward and Backward Greedy Pursuit procedure BGA(wk ,Φ,Γk , ϵ) − − n ¯ ← 1, Γ(k,0) ← Γk , ← e0 ←→ ek loop ∆n ← Γ(k,¯n−1) γn¯ ← arg minn ∥x − Φ∆n \γ w∆n \γ ∥22 ∀γ∈∆

w∆n \γn¯ ← min∥x − Φ∆n \γn¯ Ω∥22 Ω − ∥← e n¯ ∥22 ← ∥x − Φ∆n \γn¯ w∆n \γn¯ ∥22 − if ∥← e ∥2 < ϵ then

◃ Update the remaining weights

n ¯ 2

Γ(k,¯n) ← ∆n \γn¯ n ¯←n ¯+1 else n ¯←n ¯−1 wΓ(k,¯n) ← min∥x − ΦΓ(k,¯n) Ω∥22 Ω break end if end loop return (wΓ(k,¯n) , Γ(k,¯n) ) end procedure Fig. 3.4 BGA’s psudeo code

dictionary. In this situation, atoms are sequentially omitted from a dictionary to get a sparse solution. An algorithm was declared in [133] which improves computationally the BGA, but is still restricted to the case of under-complete dictionary. A new version of BGA is presented by Shane F. Cotter [134]. In contrast to the old BGA, the new version works with the over-complete dictionaries. Also, it sets eyes on a new criteria to choose which atoms are deleted from the dictionary other than the minimum error. At the backward iteration n ¯ the new BGA selects the atom of index γn¯ that satisfies the minimum p norm of the remaining weights where 0 ≤ p ≤ 1. In [135], Laura Rebollo-Neira introduced the recursive version of OOMP, which is termed by the Backward OOMP (BOOMP). The basic idea of BOOMP is to get the weights vector wΓk of OOMP after k forward steps then it removes one atom per iteration till the maximum acceptable level of error is reached. At each backward iteration, it updates the weights and the matrix of biorthogonal atoms. So, the BOOMP is a special backward greedy algorithm because it follows a certain FG algorithm OOMP and depends on its output such as the biorthogonal atoms.

3.2 Backward Pursuit

3.2.2

29

Feedback-based Backward Stage (FbBS)

In FbBS-BE methods, the forward and backward stages work together during decomposing x; i.e, the FG adds α ≥ 1 atoms per iteration then the backward stage works to remove β ≥ 0 atoms per iteration, where we call α and β are the forward and backward step size respectively. Any FbBS-BE method repeats the forward and backward processing till the number of selected atoms reaches its required level or the convergence error reaches the satisfied level. Unlike the DBS-BE, the FbBS-BE methods are more efficient because the backward stage works with little number of atoms. In [136], the author proposed an adaptive forward and backward algorithm which is also called FoBa. This algorithm avoids the shortcomings of both forward and backward method, and combines their strength. FoBa is designed to make balance between the next two aspects. First, it should take reasonably offensive backward steps to avoid all errors caused by forward steps. The second consideration is that the algorithm should take backward step adaptively such that the obtained error per backward iteration does not rub out the gain made in the last forward step. This implies that FoBa always making convergence. In FoBa, the backward step is considered successful if and only if the increase in error power is no more than half of that decrease in the earlier corresponding forward step. In [137], the author addressed the theoretical and computational issues associated with the FoBa. He introduced a new algorithm referred to as "gradient" FoBa (FoBa-gdt) which significantly improves the computational efficiency of FoBa. The key difference is that FoBa-gdt only evaluates gradient information in individual forward selection steps rather than solving a large number of single variable optimization problems. In [138], another algorithm is introduced and called Forward-Backward Pursuit (FBP) FBP. Like FoBa, it is an iterative algorithm that combines two sequential stages. The first stage is responsible for expanding the support set by α > 1 atoms and this is the forward stage. These α indices are chosen in one iteration, as is the case in the StOMP algorithm, which are maximally correlated with the residue. Then, by means of the orthogonal projection the algorithm calculates the weights of the selected support. As for the second stage, it will be the backward step at which the algorithm prunes the support set by removing β < α indices with smallest contributions to the projection. By means of the least squares method the algorithm recalculates the weights of the pruned support set. These forward and backward steps are repeated until the energy of the residue either vanishes, or is less than a preset level. Unlike FoBa, as illustrated, the backward processing in FBP selects a predefined number of atoms to remove regardless of the corresponding error which may remove the gain introduced by the forward step.

30

1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19:

Forward and Backward Greedy Pursuit

procedure FBP(Φ, x, k, ϵ, α, β) Γα ← ∅, Γβ ← ∅, e ← x, n ← 1, Γ(0,0) ← ∅ loop Forward Step: Γα ← arg min ∥Φ∆ e∥1 ∆:|∆|=α ((n−1)α,(n−1)β)

Γ(nα,nβ) ← Γ ∪ Γα wΓ(nα,nβ) ← arg min∥x − ΦΓ(nα,nβ) Ω∥22 Ω Backward Step: Γβ ← arg min ∥w∆ ∥1 ∆:|∆|=β (nα,nβ)

Γ(nα,nβ) ← Γ \Γβ wΓ(nα,nβ) ← arg min∥x − ΦΓ(nα,nβ) Ω∥22 Ω e = x − ΦΓ(nα,nβ) wΓ(nα,nβ) if ∥e∥22 < ϵ or (nα − nβ) < k then n←n+1 else break end if end loop return (w, Γ(nα,nβ) ) end procedure Fig. 3.5 FBP’s pseudo code

◃ ∆ ̸∈ Γ(nα,nβ)

◃ ∆ ∈ Γ(nα,nβ)

3.3 Summary

3.2.3

31

Backward Pursuit Complexity

By looking to the BGA algorithm, we find that, it needs to select an atom ϕγn¯ at the backward iteration n ¯ to remove it from the subspace ΦΓk . The main core of the search step is a matrix-vector product, such that the matrix dimension is M ×(k−¯ n +1) and the vector dimension is (k − n ¯ +1)×1. For total β backward iterations, the number of multiplications P P and additions are βn¯ =1 M (k − n ¯ + 1) and βn¯ =1 (M − 1)(k − n ¯ + 1) respectively. The complexity of this step can be denoted by O(βkM ). Then, as illustrated in Figure 3.4, the remaining weights should be updated and this step requires to orthogonalize the vector x against the remaining subspace. The complexity of the updating step was discussed in details in [139] which showed that, the updating step complexity is dominated by the QR factorization process and it can be denoted by O((k − n ¯ )2 M ) at the backward iteration n ¯. Unlike BGA, the FBP repeats the backward elimination after each forward iteration. So, its complexity depends upon the maximum sparse level k desired to be reached, and k k with a complexity O(( α−β )2 M ). the total backward iterations is α−β

3.3

Summary

In this chapter we discussed the forward and backward strategies of the greedy pursuit algorithms. This discussion illustrated the main aspects of the most common algorithms including the main idea, starting arguments, weights updating procedure, stopping criterion, and finally the algorithm complexity. One of the main conclusions of this chapter is that, the correlation among the atoms in Φ raises the need to the backward elimination algorithms that will exploit these correlation to decompose some atoms partially in terms of others, and this backward decomposition process increases the sparse level of the compressed sensing representation.

Chapter 4 Backward Replacement In this chapter, we introduce a new backward processing technique called the Backward Replacement (BRe). The new technique will compete the backward elimination methods on many fronts, including the compression ratio, working with orthogonal bases and the computational complexity. As illustrated in earlier chapters, the sparse approximation problems request a good approximation of an input signal x ∈ RM as a linear combination of a few atoms belonging to a dictionary Φ ∈ RM ×N . This class of problems arises throughout applied mathematics, statistics, and electrical engineering. One of the main tasks of sparse approximation in the digital signal processing is the signal compression which is the main theme of our thesis. As it is already known, for compression purposes, the data encoder consists of two main stages that work separately to minimize the number of bits assigned to each input sample. The first stage is called the lossy compression that ends with the quantization process, and the second stage is called the lossless compression and ends with the entropy encoding. The sparse modeling is a lossy compression that precedes the quantizer and works as a transform encoder. As depicted in Figure 4.1, the sparse decomposition process may include two procedures. The main procedure is the forward decomposition, and there are several FG algorithms such as MP[6], OMP[7], OOMP[8], StOMP[10], CoSaMP[9]; etc. All forward algorithms try to get a good approximation using only k atoms such that k < M by solving the following objective function. min ∥x − Φw∥22

s.t.

∥w∥0 = k

(4.1)

The solution of (4.1) is a vector of k nonzero weights or wk and the obtained approximation is Φwk . The second procedure is called the backward processing. This task handles the

34

Backward Replacement

Fig. 4.1 Compressed sensing-based coding system.

vector wk to increase the sparsity level by solving the following objective function. min ∥wk ∥0

s.t.

− ∥x − Φwk ∥22 ≤ δ∥→ e k−1 ∥22

(4.2)

where δ is a real number less than or equal 1 that guarantees that the backward processing − e k−1 is the doesn’t eliminate the overall gain obtained by the last forward step, and → residual vector at the forward step of order k − 1. As discussed in Chapter 3, all existing backward processing techniques such as BOOMP [135], FoBa [136], FBP [138] and BGA [132] are based on the fact that the selected atoms by any forward algorithm can be reduced by eliminating some of them and the elimination error can be compensated if and only if the atoms are not orthogonal to each other. After the backward processing, b atoms are eliminated from wk , then the new vector becomes wkb and hence the output should be quantized by a quantizer and the obtained vector becomes Q(wkb ), where Q(.) denotes the quantizer operator. In this work, we are interested in the backward processing and its impact on the signal compression. From the point of view of sparsity representation, the efficiency of the Backward Elimination (BE) algorithms is directly proportional to the atoms’ correlations. So, they go down in case of the orthogonality or quasi-orthogonality. Also, their efficiency is inversely proportional to the recovery performance of the forward greedy algorithms according to the Restricted Isometric Property (RIP) [98]. On the side of signal compression, it is expected that the existing backward processing would not introduce any significant impact due to the strong dependency on the atoms’ correlations. So, the main contribution of this thesis is introducing a new backward technique so-called

35

4.1 Greedy Pursuit and Signal Compression

the Backward Replacement (BRe) that takes into consideration the impact of backward processing on the signal compression.

4.1

Greedy Pursuit and Signal Compression

To our knowledge, there is no study about the effect of the existing backward processing techniques upon the signal compression. But there are few efforts such as [55] and [56] that discuss the Rate-Distortion (R-D) performance of the FG algorithms. According to these studies the FG algorithms have a poor performance due to the lack of tying up FG’s distortion and the required bit rates. The previous studies enhanced the R-D performance of the FG algorithms by replacing the objective function of Eq.(4.1) with another one whose constraint is the rate budget. This replacement procedure converted the methodologies from a heuristic-based to be optimal-based. The main drawback of the optimized solutions is the computational complexity that increases with increasing of the nonzero elements in w. Therefore, as stated by their authors, this type of solutions are suitable for low bit rates at which the sparse level k is low too. As mentioned, sparse coding is a type of waveform-based encoders, which doesn’t work well at low bit rates. Consequently, the R-D optimized strategies are also unsuitable for qualitybased compression such as speech and image compression because it is expected to be computationally expensive at high bit rates. Therefore, we suggested another backward heuristic-based methodology that can work efficiently at different bit rates to match the requirements of quality-based applications such as the speech compression which is the main goal of this work. Back to the FG algorithm, if the output of this task is a sparse level k with a distortion level ϵk , then we can calculate the corresponding bit rate (bits/sample) at the same distortion level as follows Rk (ϵk ) =

Bh +

q j=1 (Bj

Pk

M

+ Bji )

(4.3)

where Bjq is the assigned bits by a quantizer for the nonzero weight wj whose index in the vector wk needs Bji bits, and B h is the code length of the End-of-Block header P (EOB). For any BE algorithm, the upper limit of in Eq.(4.3) is changed to k − b after eliminating b atoms such that the corresponding approximation error should lie within (ϵk−1 , ϵk ). Simply, the impact of any BE algorithm on the compression back to the elimination of (weight, index) pair which is called the complete elimination. As for our proposed solution, we seek to increase the compression within the same range of

36

Backward Replacement

distortion by reducing the error per backward iteration. This condition triggered our idea which is called the Backward Replacement or BRe. The idea of this algorithm is based on the fact that the weights vector wk can be represented by a diagonal matrix Wk such that its diagonal elements are the same as the vector elements. By looking at the diagonal nonzero elements as pairs of wights, then we can convert the diagonal matrix to a symmetric matrix Wkr that includes k − r different weights. This reduction in the number of weights contributes to the signal compression more and more. Next section shows the main aspects of our algorithm.

4.2

Backward Elimination Drawbacks

The main drawback of the backward elimination algorithms is that, they need to compensate the removed energy by the remaining atoms. This compensation process tries to adjust the weights per backward removal, and this could take more time than the elimination decision. In addition to that, the backward elimination algorithm doesn’t work well with the orthogonal dictionaries. − As stated before in section 3.2, any BE method aims to keep the level of ∥← e n¯ ∥22 less − − than ∥→ e n−¯n ∥22 where → ¯ forward iterations. But this e n−¯n is the forward error after n − n condition can’t be achieved in the case of orthogonal bases as explained in the next corollary. Corollary 1. If the output of the forward greedy algorithm is a subspace of k orthogonal atoms ΦΓk then the backward processing iterations are just considered the reverse-forward iterations. Proof. Most of the forward greedy algorithms work in a similar fashion with the orthogonal atoms. Also, the weights degrade in proportional to the convergence error. For the MP algorithm, the approximation after k iterations can be represented by

− − − x = max ϕTγ x ϕγ1 + max1 ϕTγ → e 1 ϕγ2 + · · · + maxk−1 ϕTγ → e k−1 ϕγk + → ek



γ∈Γ



γ∈Γ\Γ

= max ϕTγ x ϕγ1 γ∈Γ

· · · + maxk−1 γ∈Γ\Γ

= max ϕTγ x ϕγ1 γ∈Γ



+ max1 γ∈Γ\Γ





γ∈Γ\Γ

T ϕγ (x − wγ1 ϕγ1 ) ϕγ2 +

k−1 X T ϕγ (x − wγi ϕγi ) ϕγk

− +→ ek

i=1

− + max1 ϕTγ x ϕγ2 + · · · + maxk−1 ϕTγ x ϕγk + → ek

γ∈Γ\Γ





γ∈Γ\Γ



37

4.3 The Backward Replacement Approach Also, we have





γ∈Γ





max ϕTγ x ≥ max1 ϕTγ x ≥ · · · ≥ γ∈Γ\Γ



maxk−1 ϕTγ x

γ∈Γ\Γ

So, we can conclude that wγ1 ≥ wγ2 ≥ · · · ≥ wγk

(4.4)

As illustrated previously, the BE algorithm is designed to remove the atoms with little contributions. So, for orthogonal atoms, the BE algorithm selects ϕγk , ϕγk−1 , · · · , ϕγk−n+1 at 1st , 2nd and nth backward iteration respectively which is considered reverse-forward iterations.

4.3 4.3.1

The Backward Replacement Approach Objective Function

Like the BE algorithms, the BRe handles the obtained nonzero weights of wk in a backward fashion that increases the distortion level per backward iteration such that the overall losses less than the gain obtained by the last step of FG process. Unlike the BE, the BRe have another objective function, as follows max (r)

s.t.

− ∥x − ΦWkr 1T ∥22 ≤ δ∥→ e k−1 ∥22

(4.5)

where r is the replacement order, and Wkr is a sparse symmetric matrix with k nonzero elements and its diagonal has k − 2r nonzeros. As illustrated in (4.5), the weights vector wk is replaced with Wkr 1T that can be encoded in a compressed sense. The proposed bit stream of BRe’s output will be discussed later in this work. In the next section we answer the main question of BRe which is "How does BRe build the matrix Wkr in a backward iterative fashion?"

4.3.2

From wk to Wk

The idea of BRe stands on the fact that, the weights vector wk can be represented in terms of a diagonal matrix Wk as follows: w k = Wk 1 T

(4.6)

38

Backward Replacement

where 1T denotes the N dimensional vector of ones. In Wk , if a pair of weights on the main diagonal positions (γ, γ) and (¯ γ , γ¯ ) are forced to be equal to each other, then the new weight can be positioned at two new positions (γ, γ¯ ) and (¯ γ , γ). If the previous procedure is repeated over r different pair of weights then the resulting matrix is symmetric and denoted by Wkr . Theoretically, the number of weights in Wkr reduced to k − r weight such that k − 2r wights are positioned on the main diagonal and the others lie below or above the main diagonal.

4.3.3

Index Grouping

Unlike the backward elimination techniques, the BRe method doesn’t eliminate any atom but it groups the indices of the selected atoms into two main sets ΓP and ΓI . First set ΓP is called the pair indices set and consists of r subsets P1 , P2 , ..., Pr such that Pi = {γi , γ¯i }, γi ̸= γ¯i and the union of all possible intersections can be expressed as follows

[

h

i

Pi ∩ Pj = ∅

(4.7)

1≤(i,j)≤r i̸=j

The second set ΓI is called the individual weights set and consists of the remainder of the indices such that |ΓI | = k − 2r, ΓI ∩ ΓP = ∅ and ΓI ∪ ΓP = Γk .

4.3.4

Replacement Weight and Replacement Error

The main idea behind this kind of index grouping is to replace the corresponding pair of weights with another weight that minimizes the replacement error cost function. Assume the pair {γ, γ¯ } belongs to the set Γk and the corresponding weights are {wγ , wγ¯ } that are replaced with a new weight wγ¯γ , then the replacement error is obtained as follows e = (wγ ϕγ + wγ¯ ϕγ¯ ) − wγ¯γ (ϕγ + ϕγ¯ )

(4.8)

Corollary 2 (Minimum Replacement Error). The value of wγ¯γ that minimizes 4.8 is the average of the pair {wγ , wγ¯ } and the corresponding mean squared error is defined as: ∥e∥22

(wγ − wγ¯ )2 = (1 − µγ¯γ ) 2

where µγ¯γ is the inner product between ϕγ and ϕγ¯ .

(4.9)

39

4.3 The Backward Replacement Approach

Proof. To minimize the cost function ∥e∥22 , the error vector must be orthogonal to the sum vector, i.e., e ⊥ (ϕγ + ϕγ¯ ). Let’s first find the inner product formula between e and (ϕγ + ϕγ¯ ). eT (ϕγ + ϕγ¯ ) = wγ (1 + µγ¯γ ) + wγ¯ (1 + µγ¯γ ) − 2wγ¯γ (1 + µγ¯γ ) = (wγ + wγ¯ − 2wγ¯γ )(1 + µγ¯γ ) To achieve the condition e ⊥ (ϕγ + ϕγ¯ ), the inner product eT (ϕγ + ϕγ¯ ) must equal zero and the unique value of wγ¯γ that achieves previous condition is defined as: wγ¯γ =

wγ + wγ¯ 2

(4.10)

The minimum replacement error can be obtained by substituting 4.10 in 4.8 wγ + wγ¯ (ϕγ + ϕγ¯ ) 2 wγ¯ − wγ wγ − wγ¯ ϕγ + ϕγ¯ 2 2 wγ − wγ¯ (ϕγ − ϕγ¯ ) 2 (wγ − wγ¯ )2 ∥ϕγ − ϕγ¯ ∥22 4 (wγ − wγ¯ )2 (1 − µγ¯γ ) 2

e = (wγ ϕγ + wγ¯ ϕγ¯ ) − = = ∴ ∥e∥22 = =

As explained earlier, the backward elimination process fails in case of the orthogonal set of bases ΦΓk , because it is impossible to compensate the elimination error using this set of bases; i.e., the corresponding error of the elimination of wγ equals wγ2 . One of the main pros of the BRe method is that, the replacement error ∥e∥22 depends not only on the orthogonality of the atoms, but also on the converged weights as shown in Eq.(4.9). Corollary 3 (Upper and Lower Bounds of wγ and wγ¯ ). To obtain a replacement error ∥e∥22 < wγ2 and ∥e∥22 < wγ¯2 then the values of wγ and wγ¯ should be bounded as follows !

!

1 − ∆γ¯γ wγ < wγ¯ < 1 + ∆γ¯γ wγ !

(4.11)

!

1 − ∆γ¯γ wγ¯ < wγ < 1 + ∆γ¯γ wγ¯

(4.12)

40

Backward Replacement

where ∆γ¯γ =

q

2 1−µγ γ¯

and µγ¯γ is the inner product between ϕγ and ϕγ¯ .

Proof. To obtain the bounds of wγ¯ , substitute 4.9 in the inequality ∥e∥22 < wγ2 , then we have ∥e∥22 < wγ2 1 (wγ − wγ¯ )2 (1 − µγ¯γ ) < wγ2 2 !2 wγ¯ 2 1− < wγ 1 − µγ¯γ wγ¯ 1− wγ

!

s

k, the complexity of the distortion matrix can be described by O(M k 2 ). In addition to the 2 aforementioned complexity, there are r iterations includes comparing k2 − k elements in Υk to get the index of the minimum element and this procedure have a complexity O(k 2 ). Also, updating the weights have a complexity of O(1). Finally, to update the error you need to multiply k weights by k atoms that have a complexity of O(kM ). So, the overall complexity of the r iterations can be denoted by O(rkM ). Lastly, the complexity of the BRe algorithm is described by O(M k 2 ) which is much smaller than the complexity of the forward greedy algorithm because k ≪ N . On the other side, the storage complexity of the BRe is fixed O(k 2 ), and doesn’t increase over time. As for the backward elimination complexity, it was discussed in details in [139]. It was found that, each backward elimination complexity is dominated by the QR factorization process. This factorization process is necessary per backward iteration to upgrade the weights and to determine the atom to be eliminated. As shown in [139] the complexity is denoted by O(M (k − n ¯ )2 ) in the backward iteration n ¯ . By comparing this result by the BRe complexity per backward iteration, we find the proposed algorithm is faster because the complexity per backward replacement can be denoted by O( 1r M k 2 ).

4.4

Rate Analysis

The weights matrix Wkr which is generated by the BRe algorithm passes through the quantizer at the end of the lossy compression stage. The quantized weights are then sequenced and losslessly packed into the output bit stream. As it is known, the corresponding bit stream of a vector of weights Q(wk ) consists of k quantized weights and their indices such that each weight is followed by its index then the stream ends

44

Backward Replacement

with a special header such as the End-of-Block. As for the quantized output of BRe or Q(Wkr ), we suggested a special bit stream that splits the bit stream of Q(wk ) into two sub-streams Is and Ps . For the sub-stream Is , it carries all individual weights which are allocated in the main diagonal of Q(Wkr ) and the other sub-stream carries the weights which are allocated below the main diagonal of Q(Wkr ), and each sub-stream ends with a special header. This splitting procedure is inevitable to differentiate between the indices of the diagonal weights and the others. Assume Rrk denotes the bit rate (bits per sample) of the signal x ∈ RM after the BRe processing, and can be computed as follows Rrk =

BI + BP M

(4.19)

where BI and BP are the total bits required for coding the sub-streams Is and Ps respectively. Our approach assumes that, both sub-streams have the same quantization and header bits and they are denoted by B q and B h respectively, but they differ from each other in respect of the index encoding bits.

For the index encoding, if the index is encoded by the position in the weights sequence, then we need BIi bits to encode the index of the substream Is such that BIi = ⌈log2 N ⌉

(4.20)

, and BPi bits to encode the index of the substream Ps such that l

BPi = log2

N (N − 1) m ≈ 2BIi − 1 2

(4.21)

But if the index is encoded by the number of zeros (runs encoding) among any two successive nonzero weights, then we need a number of bits equal to BIi = ⌈log2 (runs(Is ))⌉

(4.22)

BPi = ⌈log2 (runs(Ps ))⌉

(4.23)

where runs(.) equals the maximum number of runs for a given sequence of weights. If we want to calculate Rrk in terms of BIi , BPi , Bq and Bh , then we need to know the number of weights in each substream. For the substream Is , it has k − 2r after making r

45

4.4 Rate Analysis

replacements. As for the substream Ps , it has only r weights. So, we can rewrite the main formula of Rrk as follows: Rrk =

i 1h (k − 2r)BIi + rBPi + (k − r)B q + 2B h M

The minimum value of Rrk could be obtained when r = [Rrk ]min

4.4.1

=

k/2 Rk

k 2

(4.24)

and can be expressed by

"

1 k i = (BP + B q ) + 2B h M 2

#

(4.25)

Compression Ratio and Rate Saving

Here, we define the compression ratio CRBRe of the backward stage as the ratio of the output bit length to the input bit length. The output bit length is Rrk but the input is RF which is the output bit length of the FG algorithm. So, the CRBRe can be calculated as follows Rr CRBRe = k (4.26) RF For the forward greedy algorithm, we know that there are k weights. So, the general expression of RF can be written as follows RF =

i 1h k(BFi + BFq ) + BFh M

(4.27)

In addition to the foregoing, if the run length encoder is not used then the maximum number of indices equals N like the weights in BRe model, and hence BFi = BIi . let BFq = B q and BFh = B h . Then, the expression of RF can be rewritten as follows RF =

i 1h k(BIi + B q ) + B h M

(4.28)

By substituting 4.24 and 4.28 into 4.26 we obtain the next expression for CRBRe CRBRe = 1 −

r(B q + 1) − B h k(BIi + B q ) + B h

(4.29)

In terms of N , we can rewrite 4.29 as follows CRBRe = 1 −

r(B q + 1) − B h k(⌈log2 N ⌉ + B q ) + B h

(4.30)

It is interesting to note that the value of CRBRe is minimized theoretically at the maximum level of replacements r = k2 .

46

Backward Replacement

Corollary 4. To obtain a desirable compression ratio CR < 1 under the assumption of non-run length encoding then the number of replacements is lower bounded as follows r(B q + 1) − B h > 0 or r>

Bh Bq + 1

(4.31)

From 4.31, we can conclude that, the header which is added by the BRe algorithm affects on the number of replacements required to obtain CRBRe < 1. Also, it could be concluded that, the best size of B h is to be smaller than B q + 1 to make CRBRe < 1 for any value of r; i.e. r > 0. Therefore, in this work we assume that B h = B q , and hence the corresponding CRBRe can be written as follows: CRBRe = 1 −

(r − 1)B q + r (k + 1)B q + k⌈log2 N ⌉

(4.32)

As for the Rate Saving (RS BRe ), it measures the bit savings obtained by the backward approach per sample, assume that each input sample of x is quantized by B q bits such as the output weights, then the rate saving is obtained as follows: RS BRe = 1 −

Rrk Bq

(4.33)

In terms of CRBRe , we can rewrite 4.33 as follows: RS BRe = 1 − CRBRe × (1 − RS F )

(4.34)

, where RS F is rate savings obtained by the FG algorithm and can be calculated in terms of RF as follows, RF RS F = 1 − q (4.35) B Sparse Representation ̸≡ Rate Saving One of the main conclusions in this section is that, it is not inevitable for the sparse representation to mean acquiring rate savings. This conclusion backs to the fact that, for some values of k; i.e. k ′ ≤ k < M , the obtained rate savings are negative. By substituting 4.27 in 4.35 we obtain k(BFi + B q ) + B h RS F = 1 − (4.36) M Bq

47

4.4 Rate Analysis

By neglecting the effect of B h we can obtain the following inequality which is considered the condition of the negative rate savings k(BFi + B q ) > M B q

(4.37)

So, the range of k which gives negative rate savings can be obtained as follows: M Bq ≤k