
Proceedings of the 15th Asia-Pacific Conference on Communications (APCC 2009)-060

Adaptive Compressed Sensing of Speech Signal Based on Data-Driven Dictionary

Tingting Xu, Zhen Yang, and Xi Shao

Compressed Sensing (CS) [1], also named Compressive Sampling [2], is an emerging sensing/sampling paradigm that goes against the common wisdom in data acquisition. CS theory asserts that signals which are sparse on some basis can be recovered from fewer samples, or so-called measurements, than traditional methods require. However, most CS research today deals with time sequences that are themselves sparse, or with image signals that have a sparse representation in the wavelet domain. Far fewer results have been reported on CS of speech signals. The main problem is that CS has a very important precondition: the signal must be sparse on some basis. As a natural signal, speech has very complicated components and does not show enough sparsity on traditional bases such as the DCT (discrete cosine transform) or the DWT (discrete wavelet transform). In this paper, we solve this problem by constructing a data-driven dictionary specialized for speech signals. CS sampling and reconstruction are realized based on this dictionary, and the reconstruction error is much smaller than with traditional bases. Furthermore, we propose to use an adaptive sensing matrix, chosen according to the energy distribution of the original speech signal, instead of a random sensing matrix, which significantly improves the reconstruction quality. The rest of this paper is organized as follows. In Section 2, the basic principle of CS is described. Sections 3 and 4 explain the detailed procedure for building the sparse basis and the adaptive sensing matrix, respectively. Experimental results are shown in Section 5, and conclusions are drawn in Section 6.

Abstract—Compressed Sensing (CS) is an emerging signal acquisition theory that provides a universal approach for characterizing signals which are sparse or compressible on some basis at a sub-Nyquist sampling rate. This paper focuses on the realization of CS on natural speech signals. We construct an over-complete data-driven dictionary as the sparse basis specialized for speech signals. Based on this dictionary, CS sampling and reconstruction of speech signals are realized. Furthermore, we propose to choose the sensing matrix adaptively, according to the energy distribution of the original speech signal. Experimental results show a significant improvement in speech reconstruction quality with this adaptive approach compared with a traditional random sensing matrix.

Index Terms—adaptive sensing matrix, compressed sensing, K-SVD, overcomplete dictionary, speech signal

I. INTRODUCTION

Speech is among the most important media for people to exchange information, and speech communication plays an indispensable role in modern society. The conventional approach to speech signal acquisition follows Shannon’s celebrated sampling theory: for accurate representation of a signal by its time samples, the sampling rate must be at least twice the maximum frequency present in the signal (the so-called Nyquist rate). On the other hand, given the characteristics of the human auditory system and the correlation between speech time samples, it is recognized that speech signals are highly compressible, which is important and necessary for practical speech communications. The dominant approach currently used for speech redundancy reduction is to first sample the speech at the Nyquist rate and then eliminate redundancy using various compression schemes. However, such methods do not take into account the time-variant characteristic of the speech signal, i.e. in most scenarios there may be only low-frequency components in a speech signal, while high-frequency components are present only momentarily. Thus, fixing the sampling rate according to an invariant highest frequency leads to an unnecessarily large amount of data and redundancy, which is inconvenient for signal collection, storage, processing and transmission.

II. COMPRESSED SENSING THEORY BACKGROUND

A. Basic Principle of Compressed Sensing

Compressed sensing is a newly introduced concept in signal processing which aims at reconstructing a sparse or compressible signal from its compressed measurements. Let $x_0 \in \mathbb{R}^N$ be a real-valued signal of length $N$. Assume that $x_0$ is $k$-sparse or compressible on a particular sparse basis $\Psi \in \mathbb{R}^{N \times D}$ ($D \ge N$), i.e. $x_0$ can be represented by only $K$ ($K$ [...] $0$ that for $\forall \theta \in \mathbb{R}^D$, we have:

III. CONSTRUCT THE SPARSE BASIS FOR SPEECH SIGNAL

Constructing an appropriate sparse basis is the preliminary step of applying CS to speech signals. Using a basis suited to a particular class of signals increases the sparsity and leads to better CS performance, i.e. fewer measurements and a smaller reconstruction error. For traditional signal representation methods, such as DCT, DFT or various wavelet transforms, the basis Ψ is orthogonal


selected to build a dictionary for speech signals, as it has characteristics that are present in many voiced sound classes [12]. To compare against the CS reconstruction quality of the proposed data-driven dictionary, an over-complete symmlet dictionary and an over-complete discrete cosine dictionary are also tested. Both dictionaries are 4 times overcomplete, the same size as our proposed dictionary.

$$(1 - \delta_k)\,\|\theta\|_2^2 \le \|\Phi\Psi\theta\|_2^2 \le (1 + \delta_k)\,\|\theta\|_2^2 \qquad (7)$$

That is:

$$(1 - \delta_k)\,\|\theta\|_2^2 \le \|y\|_2^2 \le (1 + \delta_k)\,\|\theta\|_2^2 \qquad (8)$$

From equation (8), we see that the energy distribution of the measurements represents the energy distribution of $\theta$. On the other hand, most of the energy of a speech signal is concentrated in the $k$ non-zero coefficients. As a result, by analyzing the measurements, we can obtain the energy distribution of the original signal. Measurements obtained with a random sensing matrix do not reflect the energy distribution of the original speech signal, so we propose to build the sensing matrix adaptively, according to the energy distribution of the measurements.

First generate a full-rank random sensing matrix $\hat{\Phi} \in \mathbb{R}^{N \times N}$. Using this matrix, we create a measurement vector $\hat{y} \in \mathbb{R}^N$ which has the same length as the speech signal $x_0$:

$$\hat{y} = \hat{\Phi} x_0 = \hat{\Phi}\Psi\theta \in \mathbb{R}^N \qquad (9)$$

Fig. 3 shows the reconstruction quality of speech signals based on the proposed dictionary, the over-complete cosine dictionary and the over-complete symmlet dictionary, using both the BP and OMP reconstruction algorithms. With our data-driven dictionary, CS reconstruction achieves a good SNRseg when a sufficient number of measurements is given, and it outperforms the traditional DCT and wavelet bases for the same number of measurements. Moreover, for the same sparse basis, BP gives better reconstruction quality than OMP, because BP methods can compute sparse solutions in situations where greedy algorithms fail [5].
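To make the greedy side of this comparison concrete, the following is a minimal sketch of OMP [4] applied to a synthetic sparse vector. It is not the authors' implementation: the cosine atom formula, the sizes `N`, `D`, `M`, `K`, the Gaussian sensing matrix and all helper names are assumptions chosen for illustration, and the cosine dictionary only stands in for the over-complete discrete cosine dictionary mentioned in the text. BP is not shown because it needs an l1 solver.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine_dictionary(n, d):
    """Return d unit-norm cosine atoms of length n (over-complete when d > n)."""
    atoms = []
    for k in range(d):
        atom = [math.cos(math.pi * (i + 0.5) * k / d) for i in range(n)]
        norm = math.sqrt(dot(atom, atom))
        atoms.append([v / norm for v in atom])
    return atoms

def solve(a, b):
    """Solve the small square system a x = b (Gaussian elimination, partial pivoting)."""
    m = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, m):
            f = aug[r][c] / aug[c][c]
            for j in range(c, m + 1):
                aug[r][j] -= f * aug[c][j]
    x = [0.0] * m
    for r in range(m - 1, -1, -1):
        x[r] = (aug[r][m] - sum(aug[r][j] * x[j] for j in range(r + 1, m))) / aug[r][r]
    return x

def omp(columns, y, k):
    """Pick up to k columns greedily; refit all coefficients by least squares each step."""
    residual, support, coef = y[:], [], []
    for _ in range(k):
        if dot(residual, residual) < 1e-12:
            break  # y is already explained; avoid selecting a redundant column
        # column most correlated with the current residual
        best = max((j for j in range(len(columns)) if j not in support),
                   key=lambda j: abs(dot(columns[j], residual)))
        support.append(best)
        gram = [[dot(columns[p], columns[q]) for q in support] for p in support]
        coef = solve(gram, [dot(columns[p], y) for p in support])
        residual = [y[i] - sum(c * columns[j][i] for c, j in zip(coef, support))
                    for i in range(len(y))]
    return dict(zip(support, coef))

random.seed(0)
N, D, M, K = 32, 128, 16, 3                      # illustrative sizes, not the paper's
psi = cosine_dictionary(N, D)                    # columns of Psi (4 times over-complete)
phi = [[random.gauss(0.0, 1.0) for _ in range(N)] for _ in range(M)]  # random Phi
a_cols = [[dot(row, atom) for row in phi] for atom in psi]            # columns of Phi*Psi
theta = {5: 1.0, 40: -0.7, 90: 0.5}              # assumed K-sparse coefficient vector
y = [sum(c * a_cols[j][i] for j, c in theta.items()) for i in range(M)]
estimate = omp(a_cols, y, K)
residual = [y[i] - sum(c * a_cols[j][i] for j, c in estimate.items()) for i in range(M)]
print(sorted(estimate), dot(residual, residual) / dot(y, y))
```

Because neighbouring atoms of such a dictionary are highly correlated, the sketch does not promise that the selected support equals the true one; it only shows the select-then-refit loop that distinguishes OMP from a global l1 optimization.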

Fig. 3. Reconstructed speech quality based on different dictionaries. [Plot: SNRseg (dB) versus number of measurements (20 to 60); curves: BP-DCT, BP-DWT, BP-proposed, OMP-DCT, OMP-DWT, OMP-proposed.]

Traverse all dimension-$M$ subsets of $\hat{y}$ and find the subset with maximum energy:

$$E^{M}_{\max} = \max_{E^M \subset \{1, 2, \ldots, C_N^M\}} \|\hat{y}_{E^M}\|_2^2 = \max_{E^M \subset \{1, 2, \ldots, C_N^M\}} \|\hat{\Phi}_{E^M} x_0\|_2^2 \qquad (10)$$

Pick the $M$ row vectors of $\hat{\Phi}$ indicated by $E^{M}_{\max}$ and construct the sub-matrix $\hat{\Phi}_{E^{M}_{\max}} \in \mathbb{R}^{M \times N}$. Use $\hat{\Phi}_{E^{M}_{\max}}$ as the adaptive sensing matrix and generate the adaptive measurement vector, which reflects the energy distribution of the original speech signal:

$$\hat{y}_{E^{M}_{\max}} = \hat{\Phi}_{E^{M}_{\max}} x_0 \in \mathbb{R}^M \qquad (11)$$

The schematic procedure of adaptive sensing is described in Fig. 2.

Fig. 2. Adaptive CS. [Diagram labels: $x_0 \in \mathbb{R}^N$, $\hat{\Phi}$, $\hat{y} \in \mathbb{R}^N$, $\hat{\Phi}_{E^{M}_{\max}}$, $\hat{y}_{E^{M}_{\max}} \in \mathbb{R}^M$.]
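The selection step of equations (9) to (11) can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the function name `adaptive_sensing`, the Gaussian entries of the random matrix and the small sizes are assumptions. The sketch relies on one observation: the size-$M$ subset of entries with maximum energy is simply the $M$ entries with the largest squared magnitude, so the search over all $C_N^M$ subsets can be replaced by a sort. The brute-force search is included only to check that equivalence on a small example.

```python
import itertools
import random

def adaptive_sensing(phi_hat, x0, m):
    """Return (row indices, sub-matrix, measurements) for the m most energetic entries."""
    # Equation (9): full-length measurement vector y_hat = phi_hat * x0
    y_hat = [sum(p * x for p, x in zip(row, x0)) for row in phi_hat]
    # Equation (10): the maximum-energy subset is the m largest squared entries
    order = sorted(range(len(y_hat)), key=lambda i: y_hat[i] ** 2, reverse=True)
    rows = sorted(order[:m])
    phi_m = [phi_hat[i] for i in rows]   # adaptive sensing matrix, m x N
    y_m = [y_hat[i] for i in rows]       # equation (11): adaptive measurements
    return rows, phi_m, y_m

random.seed(1)
N, M = 16, 6                             # small illustrative sizes (the paper uses N = 128)
# Gaussian entries are an assumption; such a matrix is full rank with probability 1
phi_hat = [[random.gauss(0.0, 1.0) for _ in range(N)] for _ in range(N)]
x0 = [random.uniform(-1.0, 1.0) for _ in range(N)]

rows, phi_m, y_m = adaptive_sensing(phi_hat, x0, M)
adaptive_energy = sum(v * v for v in y_m)

# Brute-force check of equation (10) over all C(N, M) subsets
y_hat = [sum(p * x for p, x in zip(row, x0)) for row in phi_hat]
brute_energy = max(sum(y_hat[i] ** 2 for i in subset)
                   for subset in itertools.combinations(range(N), M))
print(rows, adaptive_energy, brute_energy)
```

The sort costs O(N log N) per frame, whereas enumerating subsets grows combinatorially with N, which matters at the frame length of 128 used in the experiments.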

B. Improve the Reconstruction Quality by Adaptive Sensing

V. EXPERIMENTAL RESULT

In our experiment, the test speech is sampled at 8 kHz and quantized with 16 bits per sample. CS sampling and reconstruction are performed frame by frame, with a frame length of N = 128 samples. All test results are averaged over 100 runs under the same conditions. Because CS reconstruction algorithms aim at recovering the time-domain speech samples with the smallest reconstruction error, the objective index average segmental signal-to-noise ratio (SNRseg) of the speech waveform is used to evaluate the reconstruction quality of the speech signal:

$$\mathrm{SNRseg} = 10 \lg\left(\frac{1}{n}\sum_{k=1}^{n} \mathrm{SNRseg}_k\right) \qquad (12)$$

where

$$\mathrm{SNRseg}_k = \sum_{i=1}^{N} x_k^2(i) \Big/ \sum_{i=1}^{N} \left[x_k(i) - x_k'(i)\right]^2$$

Fig. 4. Reconstructed speech quality using random sensing and adaptive sensing. [Plot: SNRseg (dB) versus number of measurements (20 to 60); curves: BP with adaptive measurement, BP with random measurement, OMP with adaptive measurement, OMP with random measurement.]
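The SNRseg measure of equation (12) can be computed as in the sketch below. The function name `snr_seg` and the synthetic test signal are assumptions made for illustration; the formula follows equation (12) as printed, which takes the logarithm of the averaged per-frame energy ratios (a commonly used variant of segmental SNR instead averages the per-frame values in dB).

```python
import math

def snr_seg(original, reconstructed, frame_len=128):
    """Equation (12): 10*lg of the mean per-frame ratio of signal energy to error energy."""
    ratios = []
    # Only complete frames are used; a trailing partial frame is ignored.
    for start in range(0, len(original) - frame_len + 1, frame_len):
        x = original[start:start + frame_len]
        x_rec = reconstructed[start:start + frame_len]
        signal_energy = sum(v * v for v in x)
        error_energy = sum((a - b) ** 2 for a, b in zip(x, x_rec))
        # Assumes the frame is not reconstructed exactly (error_energy > 0).
        ratios.append(signal_energy / error_energy)
    return 10.0 * math.log10(sum(ratios) / len(ratios))

# Synthetic check: a reconstruction with a uniform 10% amplitude error has an
# energy ratio of 1 / 0.1**2 = 100 in every frame, i.e. 20 dB.
original = [math.sin(2 * math.pi * 200 * t / 8000) for t in range(1024)]
reconstructed = [0.9 * v for v in original]
value = snr_seg(original, reconstructed)
print(value)  # approximately 20.0
```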

Fig. 4 demonstrates the improvement obtained with the proposed adaptive sensing matrix over a random sensing matrix, based on our proposed data-driven dictionary. Reconstruction quality improves for both BP and OMP when we take the

A. Reconstruction Quality Based on Different Dictionaries

According to [11], DCT is believed to give sparse representations for speech, and in [12] the symmlet wavelet is


components caused by reconstruction error in some detailed parts of the speech signal. Increasing the number of measurements effectively reduces the reconstruction error, but at the cost of a larger amount of data and longer computation time. The number of measurements is therefore chosen according to the specific application and its speech quality requirements.

energy distribution of the original speech into consideration. Moreover, when both use adaptive sensing, BP still performs better than OMP, which is consistent with the results based on random sensing.

C. Reconstruction Quality under Different Compressed Ratio

Finally, we analyze the CS reconstruction quality under different compressed ratios, where the compressed ratio is defined as Rc = M/N. The following test results are obtained with the proposed data-driven dictionary and the adaptive sensing matrix, using the BP reconstruction algorithm.
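As a small worked example of the definition Rc = M/N, the sketch below lists the number of measurements per frame for the three ratios examined in this section, using the frame length and sampling rate stated in the experimental setup. The helper name `measurements_per_frame` is an assumption introduced for illustration.

```python
FRAME_LEN = 128       # N, frame length used in the experiments
SAMPLE_RATE = 8000    # Hz, sampling rate of the test speech

def measurements_per_frame(rc, frame_len=FRAME_LEN):
    """M = Rc * N, rounded to a whole number of measurements."""
    return round(rc * frame_len)

table = {rc: measurements_per_frame(rc) for rc in (1.0, 0.75, 0.5)}
# Measurements needed per second of speech at each ratio
per_second = {rc: m * SAMPLE_RATE / FRAME_LEN for rc, m in table.items()}
print(table)       # {1.0: 128, 0.75: 96, 0.5: 64}
print(per_second)  # {1.0: 8000.0, 0.75: 6000.0, 0.5: 4000.0}
```

A smaller Rc therefore means fewer measurements per frame, i.e. stronger compression.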

[Fig. 5 panels: Original speech signal; Reconstructed speech (Rc=0.5); Reconstructed speech (Rc=0.75). Axes: Sample (horizontal), Amplitude (vertical).]

VI. CONCLUSION

Compressed sensing is an emerging sensing/sampling paradigm that can recover signals which are sparse on some basis from fewer measurements, sampled at a sub-Nyquist rate. This paper realizes CS on natural speech signals. We construct an over-complete data-driven dictionary for speech signals, which leads to better CS reconstruction quality than traditional bases. We further improve the reconstruction quality by choosing the sensing matrix adaptively, according to the energy distribution of the original speech signal. The feasibility of compressed sensing of speech signals creates a new direction for future research on speech signal processing.

REFERENCES

[1] D. Donoho, “Compressed sensing,” IEEE Trans. on Information Theory, vol. 52(4), 2006, pp. 1289-1306.
[2] E.J. Candes, M.B. Wakin, “An introduction to compressive sampling,” IEEE Signal Processing Magazine, Mar. 2007, pp. 21-30.
[3] http://www.dsp.ece.rice.edu/cs/
[4] J. Tropp and A. Gilbert, “Signal recovery from random measurements via orthogonal matching pursuit,” IEEE Trans. on Information Theory, vol. 53(12), 2007, pp. 4655-4666.
[5] S. Chen, D.L. Donoho and M.A. Saunders, “Atomic decomposition by basis pursuit,” SIAM J. Sci. Comp., vol. 20(1), pp. 33-61, 1999.
[6] R. Gribonval and M. Nielsen, “Sparse representations in unions of bases,” IEEE Trans. on Information Theory, vol. 49, no. 12, pp. 3320-3325, 2003.
[7] M. Yaghoobi, L. Daudet and M.E. Davies, “Parametric dictionary design for sparse coding,” Proceedings of the 2nd Workshop on Signal Processing with Adaptive Sparse Structured Representations (SPARS'09), Saint-Malo, France, 2009.
[8] M. Yaghoobi, T. Blumensath, and M. Davies, “Dictionary learning for sparse approximations with the majorization method,” accepted for publication in IEEE Trans. on Signal Processing.
[9] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Trans. on Signal Processing, vol. 54, no. 11, pp. 4311-4322, Nov. 2006.
[10] E. Candès and T. Tao, “Decoding by linear programming,” IEEE Trans. on Information Theory, vol. 51, no. 12, pp. 4203-4215, 2005.
[11] V. Tan, C. Fevotte, “A study of the effect of source sparsity for various transforms on blind audio source separation performance,” Proc. SPARS, 2005.
[12] F.M. Martinez, J.C. Goddard, A.E. Martinez, and H.L. Rufiner, “Basis pursuit decomposition: An analysis of Spanish words,” SPECOM2006, pp. 349-354.

Fig. 5. Waveform of original speech and CS reconstructed speech signal under different compressed ratios.

[Fig. 6 panels: Original speech; Reconstructed speech (Rc=1.0); Reconstructed speech (Rc=0.75); Reconstructed speech (Rc=0.5). Axes: Time(s) (horizontal), Frequency(kHz) (vertical).]

Fig. 6. Spectrogram of original speech signal and CS reconstructed speech under different compressed ratios.

Figs. 5 and 6 show the time-domain waveforms and the spectrograms, respectively, of the original speech signal and the CS reconstructed speech for compressed ratios of 1.0, 0.75 and 0.5. As the figures show, with our proposed data-driven dictionary and the adaptive sensing matrix, a good reconstruction result is obtained when a sufficient number of measurements is given. When the number of measurements equals the frame length (Rc = 1.0), the reconstruction is accurate: the differences in both waveform and spectrogram between the original and the reconstructed speech are negligible. As the compression becomes stronger (smaller Rc), some information loss begins to appear in the high-frequency
