Design and implementation of a space domain spherical microphone array with application to source localization and separation

Mingsian R. Bai,a) Yueh Hua Yao, Chang-Sheng Lai, and Yi-Yang Lo
Department of Power Mechanical Engineering, National Tsing Hua University, No. 101, Section 2, Kuang-Fu Road, Hsinchu 30013, Taiwan
a) Electronic mail: [email protected]

(Received 11 August 2015; revised 2 February 2016; accepted 11 February 2016; published online 3 March 2016)

In this paper, four delay-and-sum (DAS) beamformers formulated in the modal domain and the space domain for open and solid spherical apertures are examined through numerical simulations. The resulting beampatterns reveal that the mainlobe of the solid spherical DAS array is only slightly narrower than that of the open array, whereas the sidelobes of the modal domain array are more significant than those of the space domain array due to the discrete approximation of the continuous spherical Fourier transformation. To verify the theory experimentally, a three-dimensionally printed spherical array on which 32 micro-electro-mechanical system microphones are mounted is utilized for localization and separation of sound sources. To overcome the basis mismatch problem in signal separation, source localization is first carried out using a minimum variance distortionless response beamformer. Next, Tikhonov regularization (TIKR) and compressive sensing (CS) are employed to extract the source signal amplitudes. Simulations and experiments are conducted to validate the proposed spherical array system. An objective perceptual evaluation of speech quality test and a subjective listening test are undertaken in the performance evaluation. The experimental results demonstrate better separation quality achieved by the CS approach than by the TIKR approach, at the cost of computational complexity. © 2016 Acoustical Society of America. [http://dx.doi.org/10.1121/1.4942639]

Pages: 1058–1070

I. INTRODUCTION

Spherical microphone arrays have received much research attention in recent years. They find applications in sound field analysis, teleconferencing, direction of arrival (DOA) estimation, noise source identification, room response measurement, etc.1 One of the advantages of spherical arrays lies in the rotational invariance of the beampattern due to the perfect symmetry of the sphere. There are ample results on spherical microphone arrays in the literature. Meyer and Elko,2 Abhayapala and Ward,3 and others4 presented spherical beamformers based on the spherical harmonic decomposition of plane-wave fields. Park and Rafaely5 employed the same approach to transform the sound signals into the spherical harmonics domain in formulating beamformers. This approach is also referred to as the phase-mode array. Rafaely6 compared the phase-mode array and the space domain delay-and-sum (DAS) beamformer for open spheres with a continuous integral formulation. Rafaely7 performed a theoretical analysis of spatial aliasing for various sphere sampling schemes and demonstrated how high-order spherical harmonics are aliased into the lower orders. To increase robustness at the frequencies associated with the zeros of the spherical Bessel functions in designing open spherical arrays, Rafaely8 examined the robustness issue by considering the condition numbers of the spherical harmonics transformation matrix. In the work by Rafaely,9 modal spherical arrays are closely examined and documented.

Several robust array configurations were suggested in that work. Yan et al.10 proposed a modal domain beamformer based on constrained optimization to trade off various performance measures, including directivity index (DI), robustness, array gain, sidelobe level, and mainlobe width. Rafaely11 and Huleihel and Rafaely12 applied the modal domain multiple signal classification (MUSIC)13 algorithm to obtain room responses by using a solid spherical array.

The aforementioned spherical microphone arrays were predominantly formulated in the modal domain. This generally leads to a simple formulation owing to the decoupled characteristics of the orthogonal eigenmodes. Although a spherical microphone array can be formulated directly in the space domain, the related literature is not as extensive as that on spherical harmonics domain arrays. Koretz and Rafaely14 formulated a Dolph-Chebyshev beamformer in the space domain, whose array weights are computed for a prescribed mainlobe width and sidelobe level. Sun et al.15 applied convex optimization to design space domain beamformers for spherically isotropic noise fields.

Mathematically, the modal formulation and the spatial formulation are equivalent. The modal domain, or eigenspace, formulation refers to the discretization scheme based on modal data associated with spatial eigenmodes. The space domain, or element space, formulation refers to the discretization scheme based on spatial data sampled with discrete transducer elements. Discretization in these two domains could, however, make some difference numerically. In this paper, four spherical array configurations, namely, the modal domain and open aperture array, the modal domain and solid aperture array, the space domain and open aperture array, and the space domain and solid aperture array, are compared. In particular, the formulation of the space domain spherical array is given in matrix formalism. DAS and minimum variance distortionless response (MVDR)16 beamformers are designed for the preceding four configurations. The comparison of the four configurations is based on two performance measures: DI and white noise gain (WNG).

In addition to the analysis of beamforming formulations, this paper also examines the feasibility of spherical arrays when applied to source localization and separation problems. Source localization is based on the minimum power distortionless response (MPDR)15 beamformer, whereas source separation is based on Tikhonov regularization (TIKR) and compressive sensing (CS).17,18 Instead of performing localization and separation in one shot, this two-stage procedure is used to prevent angle mismatch problems.19 Source localization and separation experiments are conducted in an anechoic room by using a three-dimensionally (3D) printed spherical array, with 32 micro-electro-mechanical system (MEMS) microphones mounted on the surface of the sphere. The microphones are positioned similarly to those of the Eigenmike.20,21 The spherical array is utilized to localize and separate source signals in various objective and subjective tests. The objective test is based on the perceptual evaluation of speech quality (PESQ).22,23 The mean opinion scores (MOS) of these methods are calculated to assess the separation performance of the array. In addition, subjective listening tests are conducted, with the results processed by analysis of variance (ANOVA) and regression analysis. To help readers follow the formulations, the symbols used throughout the paper are summarized in Table I. Subscripts "SD" and "MD" denote "space domain" and "modal domain," respectively.

II. SPHERICAL ARRAY FORMULATIONS

A. Plane-wave expansion based on spherical harmonics

The spherical coordinates (r, \theta, \phi) are adopted in the following formulation, and the time-harmonic convention e^{-i\omega t} is assumed (\omega is the angular frequency and i = \sqrt{-1}). Consider a unit-amplitude plane wave, p(\mathbf{r}, t) = e^{i(\mathbf{k}\cdot\mathbf{r} - \omega t)}, written as a function of the space coordinates \mathbf{r} and time t with \mathbf{k} being the wave vector, impinging on a solid sphere of radius a from the direction \Omega_0 = (\theta_0, \phi_0). The sound pressure at the look angle on the sphere, \Omega_s = (\theta_s, \phi_s), due to the plane wave can be expanded in spherical harmonics as

p(ka, \Omega_0, \Omega_s) = \sum_{n=0}^{\infty} b_n(ka) \sum_{m=-n}^{n} [Y_n^m(\Omega_0)]^* Y_n^m(\Omega_s),    (1)

where k = \omega/c is the wave number, \omega is the angular frequency, and c is the sound speed. The superscript "*" denotes complex conjugation, and the coefficient b_n(ka) depends on the sphere configuration:25

b_n(ka) = 4\pi i^n j_n(ka)  (open sphere),
b_n(ka) = 4\pi i^n \left[ j_n(ka) - \frac{j_n'(ka)}{h_n'(ka)} h_n(ka) \right]  (solid sphere),    (2)

where j_n and h_n are the nth-order spherical Bessel function and Hankel function, respectively, and the prime denotes the first derivative. Y_n^m is the spherical harmonic of order n and degree m, defined as

Y_n^m(\Omega) = Y_n^m(\theta, \phi) = \sqrt{ \frac{2n+1}{4\pi} \frac{(n-m)!}{(n+m)!} } \, P_n^m(\cos\theta) \, e^{im\phi},    (3)

where P_n^m(\cdot) denotes the associated Legendre function of order n and degree m. For terms with orders higher than ka, the function b_n(ka) is significantly diminishing. Therefore, by choosing an array order of

N = \lceil ka \rceil,    (4)

where \lceil \cdot \rceil denotes the ceiling function, the spherical harmonics expansion can be truncated to order N without significant error for frequencies associated with kr < N.1
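As an illustrative sketch, not part of the original formulation, the modal coefficients of Eq. (2) and the order rule of Eq. (4) can be evaluated numerically with SciPy. The spherical Hankel function is built here from the first-kind convention; the kind only follows the assumed time convention and merely conjugates b_n, so the open/solid comparison is unaffected in magnitude.

```python
import numpy as np
from scipy.special import spherical_jn, spherical_yn

def b_n(n, ka, solid=True):
    """Modal coefficient b_n(ka) of Eq. (2) for an open or a solid (rigid) sphere."""
    jn = spherical_jn(n, ka)
    if not solid:
        return 4.0 * np.pi * (1j ** n) * jn
    # Rigid sphere: scattered term built from h_n = j_n + i*y_n (first kind);
    # the choice of kind depends on the assumed time convention.
    yn = spherical_yn(n, ka)
    jn_p = spherical_jn(n, ka, derivative=True)
    yn_p = spherical_yn(n, ka, derivative=True)
    hn, hn_p = jn + 1j * yn, jn_p + 1j * yn_p
    return 4.0 * np.pi * (1j ** n) * (jn - (jn_p / hn_p) * hn)

# Array order rule of Eq. (4) for the 5 cm sphere used later in the paper
c, a, f_max = 343.0, 0.05, 4000.0          # speed of sound, radius, design frequency
ka = 2.0 * np.pi * f_max / c * a
N = int(np.ceil(ka))                       # -> 4, matching Sec. III
```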

B. Spherical arrays formulated in the modal domain

This section provides the mathematical background necessary for the spherical array formulation. We begin with the orthonormality of the spherical harmonics:

\int_{\Omega \in S^2} Y_{n'}^{m'}(\Omega) [Y_n^m(\Omega)]^* d\Omega = \delta_{nn'} \delta_{mm'},    (5)

TABLE I. Nomenclature list. The subscripts "SD" and "MD" denote "space domain" and "modal domain," respectively.

Description                          Space domain   Modal domain
Array data vector                    x_SD           x_MD
Array weight vector                  w_SD           w_MD
Array manifold vector                a_SD           a_MD
Array manifold matrix                A_SD           A_MD
Additive noise vector                v_SD           v_MD
Data spatial correlation matrix      R_xx,SD        R_xx,MD
Noise spatial correlation matrix     R_vv,SD        R_vv,MD

where the integral \int_{\Omega \in S^2} d\Omega = \int_0^{2\pi} \int_0^{\pi} \sin\theta \, d\theta \, d\phi represents the surface integration over a unit sphere, and \delta_{nn'} and \delta_{mm'} are the Kronecker deltas. The continuous surface integral in Eq. (5) can only be approximated in practice by spatially sampling the sound pressure at the microphone positions \Omega_s, s = 1, \ldots, M, where M is the number of microphones. That is,

\sum_{s=1}^{M} \alpha_s Y_{n'}^{m'}(\Omega_s) [Y_n^m(\Omega_s)]^* \approx \delta_{nn'} \delta_{mm'},    (6)


where \alpha_s is a constant which depends on the spatial sampling scheme for the sphere. For uniform and nearly uniform sampling1 on a sphere, \alpha_s = 4\pi/M. For a spherical aperture insonified with a unit-amplitude plane wave, the spherical Fourier transform (SFT) of the pressure function p(ka, \Omega_0, \Omega) and the inverse transform are defined as

p_{nm}(ka, \Omega_0) = \int_{\Omega \in S^2} p(ka, \Omega_0, \Omega) [Y_n^m(\Omega)]^* d\Omega = b_n(ka) [Y_n^m(\Omega_0)]^*,    (7)

p(ka, \Omega_0, \Omega) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} p_{nm}(ka, \Omega_0) Y_n^m(\Omega).    (8)

However, for a general incident plane wave with arbitrary amplitude function x_SD(ka, \Omega_0, \Omega_s) arriving from \Omega_0, the discrete SFT and the inverse transform can be written as

x_{nm}(ka, \Omega_0) = \sum_{s=1}^{M} \alpha_s \, x_{SD}(ka, \Omega_0, \Omega_s) [Y_n^m(\Omega_s)]^*,    (9)

x_{SD}(ka, \Omega_0, \Omega_s) = \sum_{n=0}^{N} \sum_{m=-n}^{n} x_{nm}(ka, \Omega_0) Y_n^m(\Omega_s),    (10)

where x_SD(ka, \Omega_0, \Omega_s) is the sound pressure measured at \Omega_s and N is the highest order in the truncated expansion. The pressure field sampled at the M positions can be stacked into the space domain array data vector

x_{SD}(ka, \Omega_0) = [\, x_{SD}(ka, \Omega_0, \Omega_1) \; \cdots \; x_{SD}(ka, \Omega_0, \Omega_M) \,]^T \in \mathbb{C}^{M \times 1},    (11)

where the superscript "T" denotes matrix transpose. It is straightforward to show that Eq. (10) can be rewritten by using the matrix definition in Eq. (11) as

x_{SD}(ka, \Omega_0) = Y x_{MD}(ka, \Omega_0),    (12)

where x_MD(ka, \Omega_0) = [\, x_{00} \; \cdots \; x_{NN} \,]^T \in \mathbb{C}^{(N+1)^2 \times 1} is the modal space array data vector and the spherical harmonics matrix Y is defined as

Y = \begin{bmatrix} y(\Omega_1) \\ y(\Omega_2) \\ \vdots \\ y(\Omega_M) \end{bmatrix} = \begin{bmatrix} Y_0^0(\Omega_1) & Y_1^{-1}(\Omega_1) & \cdots & Y_N^N(\Omega_1) \\ Y_0^0(\Omega_2) & Y_1^{-1}(\Omega_2) & \cdots & Y_N^N(\Omega_2) \\ \vdots & \vdots & \ddots & \vdots \\ Y_0^0(\Omega_M) & Y_1^{-1}(\Omega_M) & \cdots & Y_N^N(\Omega_M) \end{bmatrix} \in \mathbb{C}^{M \times (N+1)^2},    (13)

with

y(\Omega) = [\, Y_0^0(\Omega) \; \cdots \; Y_N^N(\Omega) \,] \in \mathbb{C}^{1 \times (N+1)^2}.    (14)

It should be noted that, for an Nth-order spherical array, the number of microphones must satisfy

M \geq (N+1)^2    (15)

for the system to be overdetermined. Spatial aliasing will arise if this condition is violated. The modal array data vector can be obtained by inverting Eq. (12):

x_{MD}(ka, \Omega_0) = Y^{+} x_{SD}(ka, \Omega_0) = (Y^H Y + \beta I)^{-1} Y^H x_{SD}(ka, \Omega_0),    (16)

where \beta is a regularization parameter, the superscript "+" denotes the pseudoinverse, and the superscript "H" denotes the Hermitian transpose. The coefficients \alpha_s in Eq. (6) result from the pseudoinverse of Y in Eq. (16). To be specific,

x_{MD}(ka) = Y^{+} x_{SD}(ka) \;\Rightarrow\; x_{nm}(ka) = \sum_{s=1}^{M} a_s^{nm} x_e(ka, \Omega_s),    (17)

where a_s^{nm} = \{Y^{+}\}_{nm,s} and x_SD(ka) = [\, x_e(ka, \Omega_1) \; x_e(ka, \Omega_2) \; \cdots \; x_e(ka, \Omega_M) \,]^T, with x_e(ka, \Omega_s) being the surface pressure data received at the angle \Omega_s. For configurations of nearly uniform sampling, \alpha_s = 4\pi/M, and

x_{MD}(ka, \Omega_0) \approx \frac{4\pi}{M} Y^H x_{SD}(ka, \Omega_0).    (18)

Next, we define the following symbols relevant to the modal domain formulation:

a_{MD} = \mathrm{vec}(\{[p_{nm}]_{m=-n}^{n}\}_{n=0}^{N}), \quad x_{MD} = \mathrm{vec}(\{[x_{nm}]_{m=-n}^{n}\}_{n=0}^{N}), \quad w_{MD} = \mathrm{vec}(\{[w_{nm}]_{m=-n}^{n}\}_{n=0}^{N}), \quad B = \mathrm{diag}(b_0, b_1, b_1, b_1, \ldots, b_N),    (19)

where \mathrm{vec}(\cdot) denotes a (N+1)^2 \times 1 column vector stacking all the entries in the parentheses and B is a (N+1)^2 \times (N+1)^2 diagonal matrix. Combining Eqs. (7), (14), and (19) leads to the modal manifold vector a_MD:

a_{MD}(ka, \Omega_0) = B y^H(\Omega_0).    (20)
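A minimal sketch of the space-to-modal transformation, assuming the microphone angles are given in radians: it builds the matrix Y of Eq. (13) with SciPy and implements both the regularized pseudoinverse of Eq. (16) and the scaled transform of Eq. (18). Function names are illustrative only.

```python
import numpy as np
from scipy.special import sph_harm

def sh_matrix(theta, phi, N):
    """M x (N+1)^2 spherical harmonics matrix Y of Eq. (13).
    theta: polar angles (rad), phi: azimuth angles (rad) of the M microphones."""
    cols = []
    for n in range(N + 1):
        for m in range(-n, n + 1):
            # scipy's sph_harm argument order is (m, n, azimuth, polar)
            cols.append(sph_harm(m, n, phi, theta))
    return np.stack(cols, axis=1)

def space_to_modal(x_sd, Y, beta=0.0):
    """Regularized pseudoinverse transform of Eq. (16)."""
    G = Y.conj().T @ Y + beta * np.eye(Y.shape[1])
    return np.linalg.solve(G, Y.conj().T @ x_sd)

def space_to_modal_uniform(x_sd, Y):
    """Scaled discrete SFT of Eq. (18), valid for (nearly) uniform sampling."""
    M = Y.shape[0]
    return (4.0 * np.pi / M) * (Y.conj().T @ x_sd)
```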

An important matrix, the array spatial correlation matrix, can also be defined in the modal domain:

R_{xx,MD} = E\{ x_{MD} x_{MD}^H \},    (21)

where E\{\cdot\} denotes the expectation operation. For spherically isotropic acoustic noise, the modal domain noise spatial correlation matrix simplifies to8

R_{vv,MD} = \frac{\sigma_n^2}{4\pi} \mathrm{diag}\{ |b_0(ka)|^2, |b_1(ka)|^2, |b_1(ka)|^2, |b_1(ka)|^2, \ldots, |b_N(ka)|^2 \} = \frac{\sigma_n^2}{4\pi} B^H B,    (22)

where \sigma_n^2 denotes the spectral power density of the noise. The modal domain array output signal can be expressed as

y(ka) = w_{MD}^H(k) x_{MD}(ka) = w_{MD}^H(k) a_{MD}(ka, \Omega_0) s(k) + w_{MD}^H(k) v_{MD}(k),    (23)

in which v_MD(ka) is an uncorrelated noise term. With the preceding notation, the weight vectors of three widely used modal domain beamformers are summarized next. First, the weight vector of the modal DAS beamformer is given by

w_{MD-DAS}(ka) = \frac{ a_{MD}(ka, \Omega_0) }{ a_{MD}^H(ka, \Omega_0) a_{MD}(ka, \Omega_0) }.    (24)

The weight vector of the modal MVDR beamformer is given by

w_{MD-MVDR}(ka) = \frac{ R_{vv,MD}^{-1}(ka) a_{MD}(ka, \Omega_0) }{ a_{MD}^H(ka, \Omega_0) R_{vv,MD}^{-1} a_{MD}(ka, \Omega_0) }.    (25)

The weight vector of the phase-mode beamformer4 is given by5,6

w_{PM}(ka) = \mathrm{vec}(\{ [w_{mn}]_{m=-n}^{n} \}_{n=0}^{N}),    (26)

where

w_{mn} = \frac{ [Y_n^m(\Omega_0)]^* }{ b_n(ka) } = \frac{ p_{mn}^*(ka, \Omega_0) }{ |b_n(ka)|^2 }.    (27)

In fact, the phase-mode beamformer described by Eqs. (26) and (27) is motivated by the completeness relation24

\sum_{n=0}^{\infty} \sum_{m=-n}^{n} [Y_n^m(\theta, \phi)]^* Y_n^m(\theta_0, \phi_0) = \delta(\cos\theta - \cos\theta_0)\,\delta(\phi - \phi_0),    (28)

where \delta(\cdot) is the Dirac delta function. Recalling from Eq. (1) that

p(ka, \Omega_0, \Omega_s) = \sum_{n=0}^{\infty} b_n(ka) \sum_{m=-n}^{n} [Y_n^m(\theta_0, \phi_0)]^* Y_n^m(\theta_s, \phi_s)

for a unit-amplitude plane wave arriving from \Omega_0, it is plausible to choose the equalizing weight as w_{mn}(k) = [Y_n^m(\Omega_0)/b_n(ka)]^*, which gives a sharp peak at the source direction. As a side effect, however, the phase-mode beamformer breaks down at the zeros of the spherical Bessel functions for open spheres, which entails regularization to avoid numerical error. It will be shown next that a phase-mode beamformer is equivalent to an unregularized data-independent modal MVDR beamformer for a spherically isotropic noise field. From Eqs. (26) and (27), we can write the modal weight vector of the phase-mode array as

w_{PM}(ka) = \mathrm{diag}\{ |b_0(ka)|^{-2}, |b_1(ka)|^{-2}, |b_1(ka)|^{-2}, |b_1(ka)|^{-2}, \ldots, |b_N(ka)|^{-2} \} a_{MD} = \frac{\sigma_n^2}{4\pi} R_{vv,MD}^{-1}(ka) a_{MD}(ka, \Omega_0).    (29)

On the other hand, it can be shown by using the Unsold theorem24 that

a_{MD}^H(ka, \Omega_0) R_{vv,MD}^{-1}(ka) a_{MD}(ka, \Omega_0) = (N+1)^2 / \sigma_n^2.    (30)

In practice, R_vv,MD is regularized by appropriate diagonal loading, i.e., R_vv,MD is replaced by (R_vv,MD + \epsilon I), with \epsilon being a regularization parameter. Equation (30) enables the weight vector of the unregularized modal MVDR beamformer to be written as

w_{MD-MVDR}(ka) = \frac{ R_{vv,MD}^{-1}(ka) a_{MD}(ka, \Omega_0) }{ a_{MD}^H(ka, \Omega_0) R_{vv,MD}^{-1} a_{MD}(ka, \Omega_0) } = \frac{\sigma_n^2}{(N+1)^2} R_{vv,MD}^{-1}(ka) a_{MD}(ka, \Omega_0).    (31)

By comparing Eqs. (29) and (31), we may conclude that the phase-mode beamformer and the modal MVDR beamformer are equivalent up to a scaling constant, which gives identical beampatterns for the two beamformers.
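Because R_vv,MD of Eq. (22) is diagonal, the modal DAS and MVDR weights of Eqs. (24) and (25) reduce to element-wise operations. The following hedged sketch illustrates this; the helper names and the diagonal-loading value are assumptions of the sketch, not quantities from the paper.

```python
import numpy as np

def modal_manifold(y_omega0, b):
    """a_MD(ka, Omega_0) = B y^H(Omega_0), Eq. (20).
    y_omega0: row vector y(Omega_0); b: diagonal of B, i.e., b_n repeated (2n+1) times."""
    return b * y_omega0.conj()

def w_md_das(a_md):
    """Modal DAS weights, Eq. (24)."""
    return a_md / np.real(a_md.conj() @ a_md)

def w_md_mvdr(a_md, b, sigma_n2=1.0, eps=1e-8):
    """Modal MVDR weights for spherically isotropic noise, Eqs. (22) and (25).
    eps is diagonal loading guarding the open-sphere Bessel-function zeros."""
    rvv_diag = sigma_n2 / (4.0 * np.pi) * np.abs(b) ** 2 + eps
    rinv_a = a_md / rvv_diag
    return rinv_a / np.real(a_md.conj() @ rinv_a)
```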

C. Spherical arrays formulated in the space domain

In this section, as rarely done in the prior literature, the spherical beamformers are reformulated in the space domain with the aid of full matrix notation. In the space domain, the array manifold vector is given by

a_{SD} = [\, v(ka, \Omega, \Omega_1) \; v(ka, \Omega, \Omega_2) \; \cdots \; v(ka, \Omega, \Omega_M) \,]^T    (32)

= \begin{bmatrix} Y_0^0(\Omega_1) & \cdots & Y_N^N(\Omega_1) \\ \vdots & \ddots & \vdots \\ Y_0^0(\Omega_M) & \cdots & Y_N^N(\Omega_M) \end{bmatrix} \begin{bmatrix} b_0(ka) [Y_0^0(\Omega)]^* \\ \vdots \\ b_N(ka) [Y_N^N(\Omega)]^* \end{bmatrix} = Y(\Omega_m) a_{MD}(\Omega) = Y(\Omega_m) B y^H(\Omega),    (33)

where v(ka, \Omega, \Omega_m), m = 1, \ldots, M, are the unit-variance spherically isotropic acoustic noise signals received by the M microphones on the sphere. The elements of the manifold vector stand for a plane wave with a noise amplitude function; thus, the manifold vector is the array response vector due to a unit-variance acoustic noise signal spherically and isotropically distributed in space. Note that, in the space domain with an open aperture, the array weights do not have the "divide by zero" problem encountered in the modal domain, because the zeros of the spherical Bessel functions are "scrambled" in formulating the space domain manifold vector. On the other hand, the space domain spatial correlation matrix is defined as

R_{xx,SD} = E\{ x_{SD} x_{SD}^H \}.    (34)

For spherically isotropic acoustic noise, the spatial correlation matrix has a closed-form solution (see the Appendix):

R_{vv,SD} = \sigma_n^2 A_{MD}^H(ka, \Omega) A_{MD}(ka, \Omega) \in \mathbb{C}^{M \times M},    (35)

where A_MD(ka, \Omega) = [\, a_MD(ka, \Omega_1) \; \cdots \; a_MD(ka, \Omega_M) \,] = B Y^H. It follows that the space domain R_vv,SD is a full (but diagonally dominant) matrix, whereas the modal domain R_vv,MD is a diagonal matrix having the zeros of the spherical Bessel functions in its diagonal entries. An interesting insight is borne out of the preceding space domain correlation matrix. Rewriting R_vv,SD using the relation A_MD(ka, \Omega) = B Y^H, we have

R_{vv,SD} = \sigma_n^2 A_{MD}^H(ka, \Omega) A_{MD}(ka, \Omega) = \sigma_n^2 Y B^H B Y^H = U \Lambda U^H.    (36)

This is readily identified to be the eigenvalue decomposition (EVD) of R_vv with U = Y and \Lambda = \sigma_n^2 B^H B. From the analysis above, we may conclude that the modal space MVDR beamformer is equivalent to the conventional eigenspace MVDR beamformer in the literature,26 with x_MD = U^H x_SD and a_MD = U^H a_SD for uniform sampling. Therefore, the modal space arrays share the same advantages as eigenspace arrays in that the computational complexity is reduced in the transformed domain and the processing can be carried out in a reduced-dimensional space at low frequencies.

With the preceding notation, the space domain array output can be written as

y(ka, \Omega_0) = w_{SD}^H(ka) x_{SD}(ka, \Omega_0) + v_{SD}(ka),    (37)

with v_SD(ka) being an uncorrelated noise term. Proceeding with the same logic as in the modal beamformers, the weight vectors of the space domain DAS and MVDR beamformers are given by15

w_{SD-DAS}(ka) = \frac{ a_{SD}(ka, \Omega_0) }{ a_{SD}^H(ka, \Omega_0) a_{SD}(ka, \Omega_0) }    (38)

and

w_{SD-MVDR}(ka) = \frac{ R_{vv,SD}^{-1}(ka) a_{SD}(ka, \Omega_0) }{ a_{SD}^H(ka, \Omega_0) R_{vv,SD}^{-1}(ka) a_{SD}(ka, \Omega_0) }.    (39)
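For reference, a sketch of the space domain quantities of Eqs. (32) through (39), reusing the spherical harmonics matrix Y and the stacked modal coefficients b introduced earlier; the helper names and loading value are illustrative assumptions.

```python
import numpy as np

def space_manifold(Y, b, y_omega0):
    """a_SD(ka, Omega_0) = Y B y^H(Omega_0), Eqs. (32)-(33)."""
    return Y @ (b * y_omega0.conj())

def rvv_space(Y, b, sigma_n2=1.0):
    """Closed-form isotropic-noise correlation matrix of Eq. (35):
    R_vv,SD = sigma_n^2 * A_MD^H A_MD with A_MD = B Y^H."""
    A_md = b[:, None] * Y.conj().T
    return sigma_n2 * (A_md.conj().T @ A_md)

def w_sd_mvdr(a_sd, Rvv, eps=1e-8):
    """Space domain MVDR weights, Eq. (39), with diagonal loading eps."""
    R = Rvv + eps * np.eye(Rvv.shape[0])
    ra = np.linalg.solve(R, a_sd)
    return ra / np.real(a_sd.conj() @ ra)
```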

D. Source localization and separation algorithms

In the following, the spherical microphone array is applied to source localization and separation problems. A two-stage procedure has been developed to prevent the basis mismatch problem.17 For source localization, the MPDR beamformer is used. The MPDR beamformer differs from the preceding data-independent MVDR beamformer only in that the array data correlation matrix R_xx,MD, estimated from the measured noisy signals, is used in place of the closed-form R_vv,MD used in the MVDR. The weight vector and angular spectrum of the MPDR beamformer are given in the modal domain as

w_{MD-MPDR}(ka) = \frac{ R_{xx,MD}^{-1}(ka) a_{MD}(ka, \Omega_0) }{ a_{MD}^H(ka, \Omega_0) R_{xx,MD}^{-1}(ka) a_{MD}(ka, \Omega_0) },    (40)

S_{MD-MPDR}(ka, \Omega_0) = \frac{ 1 }{ a_{MD}^H(ka, \Omega_0) R_{xx,MD}^{-1}(ka) a_{MD}(ka, \Omega_0) }.    (41)

Similarly, the space domain MPDR weight vector and angular spectrum can be written as

w_{SD-MPDR}(ka) = \frac{ R_{xx,SD}^{-1}(ka) a_{SD}(ka, \Omega_0) }{ a_{SD}^H(ka, \Omega_0) R_{xx,SD}^{-1}(ka) a_{SD}(ka, \Omega_0) },    (42)

S_{SD-MPDR}(ka, \Omega_0) = \frac{ 1 }{ a_{SD}^H(ka, \Omega_0) R_{xx,SD}^{-1}(ka) a_{SD}(ka, \Omega_0) }.    (43)
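A compact sketch of the space domain MPDR angular spectrum of Eq. (43), with the sample correlation matrix estimated from frequency-domain snapshots and diagonal loading added, as done later in Sec. IV; the snapshot count and loading value here are placeholders.

```python
import numpy as np

def mpdr_spectrum(X, A_grid, eps=1e-5):
    """Space domain MPDR angular spectrum, Eq. (43), at a single frequency bin.
    X:      M x L matrix of frequency-domain snapshots (L frames);
    A_grid: M x Q matrix whose columns are a_SD for Q candidate directions."""
    M, L = X.shape
    Rxx = X @ X.conj().T / L + eps * np.eye(M)      # sample correlation + diagonal loading
    Rinv_A = np.linalg.solve(Rxx, A_grid)
    denom = np.real(np.sum(A_grid.conj() * Rinv_A, axis=0))
    return 1.0 / denom                               # peaks mark the source directions
```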

Proceeding forward, the second stage, source separation, is carried out by using TIKR for over-determined problems or CS for under-determined problems.27 The pressure vector at the array can be expressed as

x_{SD}(\omega) = A_{SD} s(\omega) + v_{SD}(\omega),    (44)

where A_SD = [\, a_{SD1}(\Omega_1) \; \cdots \; a_{SD N_D}(\Omega_{N_D}) \,] is an M x N_D manifold matrix, N_D is the number of sources, the vector s = [\, s_1(\omega) \; \cdots \; s_{N_D}(\omega) \,]^T denotes the Fourier transform of the amplitude vector of the source signals, and v_SD(\omega) denotes an additive noise vector. The separation problem amounts to finding the source amplitude vector s by solving the problem above. The solution can be expressed as

s = A_{SD}^{\dagger} x_{SD},    (45)

where "\dagger" symbolizes some kind of inverse operation. Since the matrix A_SD can be singular at some frequencies, TIKR is commonly utilized for over-determined problems (M > N_D). TIKR is based on the least-squares optimization problem

\min_{s} \left( \| A_{SD} s - x_{SD} \|_2^2 + \beta \| s \|_2^2 \right),    (46)

where \|\cdot\|_2 denotes the vector 2-norm and \beta is a regularization parameter. The optimal solution is

\hat{s} = ( A_{SD}^H A_{SD} + \beta I )^{-1} A_{SD}^H x_{SD}.    (47)

On the other hand, for under-determined source separation problems (M < N_D), a CS problem can be formulated and solved by using convex (CVX) optimization:28,29

\min_{\hat{s}} \| \hat{s} \|_1 \quad \mathrm{s.t.} \quad \| A_{SD} \hat{s} - x_{SD} \|_2 \leq \eta,    (48)

where \|\cdot\|_1 denotes the vector 1-norm and \eta is a constant threshold.
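The two separation stages of Eqs. (47) and (48) can be prototyped as follows. The CS branch is written with cvxpy as a stand-in for the MATLAB CVX toolbox cited by the paper; this substitution and the default parameter values are assumptions of this sketch.

```python
import numpy as np

def separate_tikr(A, x, beta=1e-3):
    """Over-determined separation by Tikhonov regularization, Eq. (47)."""
    AhA = A.conj().T @ A + beta * np.eye(A.shape[1])
    return np.linalg.solve(AhA, A.conj().T @ x)

def separate_cs(A, x, eta=1e-2):
    """Under-determined separation as the constrained l1 problem of Eq. (48)."""
    import cvxpy as cp
    s = cp.Variable(A.shape[1], complex=True)
    problem = cp.Problem(cp.Minimize(cp.norm(s, 1)),
                         [cp.norm(A @ s - x, 2) <= eta])
    problem.solve()
    return s.value
```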

III. NUMERICAL SIMULATIONS

Simulations are conducted to examine the spherical microphone arrays. The radius of the sphere is chosen to be 5 cm. The array is designed for a maximum frequency of 4 kHz, which covers the telephone speech bandwidth. According to the aforementioned design criterion, N = \lceil ka \rceil, the order N = 4 is selected. Furthermore, to satisfy M \geq (N+1)^2 with a nearly uniform sampling scheme, M = 32 is chosen. These 32 sensors are mounted on the vertices and the centers of the faces of an icosahedron.

FIG. 1. (Color online) Beampatterns (w_MD^H x_MD) for the modal domain and solid aperture DAS beamformer. The modal array data vectors are calculated using two modal transformation approaches: (a) pseudoinverse, x_MD = Y^+ x_SD; (b) discrete SFT, x_MD = (4\pi/M) Y^H x_SD.

FIG. 2. (Color online) Beampatterns of four DAS beamformers: (a) the modal domain and open aperture array, (b) the modal domain and solid aperture array, (c) the space domain and open aperture array, (d) the space domain and solid aperture array.

The beampatterns are presented intentionally in the frequency range 0 to 8 kHz (which corresponds to N = 8) to show spatial aliasing at frequencies above the spatial aliasing frequency of 4369 Hz. Beampattern, DI, and WNG are used as performance measures for the simulated cases. For simplicity, the beampattern is plotted as a function of frequency and azimuth angle at a fixed elevation angle \theta = 90° because the azimuth and elevation angles follow similar trends. The DI of the modal domain array is calculated using

DI(\omega) = 10 \log \left( \frac{ w_{MD}^H(\Omega_0, \omega) x_{MD}(\omega) }{ \frac{1}{N_D} \sum_{d=1}^{N_D} w_{MD}^H(\Omega_d, \omega) x_{MD}(\omega) } \right).    (49)

The DI of the space domain array is defined similarly with the weight and data vectors replaced by their space domain versions. The WNG of the modal domain array is calculated using1

WNG(\omega) = 10 \log \left( \frac{ M }{ 4\pi \, w_{MD}^H(\Omega_0, \omega) w_{MD}(\Omega_0, \omega) } \right).    (50)

FIG. 3. (Color online) Performance measures calculated for the four DAS beamformers in Fig. 2. (a) DI, (b) WNG.

The WNG of the space domain array is defined similarly with the weight and data vectors replaced by their space domain versions. To quantify the sidelobes in the modal and space domains, a metric \xi(f) is defined for the beamformers formulated in the two domains. First, the array output, y = w^H x, is normalized by its maximum at each frequency f (Hz). Next, the boundary that separates the mainlobe region from the sidelobe region is determined by the -3 dB criterion. Finally, the root-mean-square sidelobe level (SLLrms) corresponding to the grid points lying in the sidelobe region (with levels below -3 dB) is computed. The preceding three steps can be summarized in the following equation:

\xi(f) = \sqrt{ \frac{1}{L} \sum_{l=1}^{L} | \bar{y}(\phi_l, \pi/2, f) |^2 },    (51)

where \bar{y} is the normalized array output signal, l is the running index associated with the grid points in the sidelobe region, and L is the number of grid points in the sidelobe region at the frequency f.

FIG. 4. (Color online) Beampatterns of modal domain and solid aperture MVDR beamformers calculated by using two different approaches. (a) The ideal array response with no discrete approximation effect (y = w_MD^H a_MD); (b) the array output response with the discrete approximation effect of the SFT considered (y = w_MD^H x_MD).
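A possible implementation of the SLLrms metric of Eq. (51), assuming the beampattern has been sampled on an azimuth grid at elevation \pi/2 and normalized to unit maximum; the -3 dB mainlobe boundary is applied per frequency.

```python
import numpy as np

def sll_rms(beam, threshold_db=-3.0):
    """Root-mean-square sidelobe level of Eq. (51) at one frequency.
    beam: linear-magnitude beampattern on the azimuth grid, normalized to max 1."""
    beam_db = 20.0 * np.log10(np.maximum(np.abs(beam), 1e-12))
    sidelobe = np.abs(beam)[beam_db < threshold_db]   # grid points outside the mainlobe
    return np.sqrt(np.mean(sidelobe ** 2)) if sidelobe.size else 0.0
```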

FIG. 5. (Color online) The root-mean-square sidelobe level (SLLrms) of beampattern plotted versus frequency, computed for the four configurations of spherical microphone arrays.

A. DAS beamformers formulated in the modal domain and the space domain

First, Fig. 1 compares beampatterns based on modal array data obtained using the two discrete modal transformation methods. Specifically, the modal array data of Figs. 1(a) and 1(b) are calculated using Eqs. (16) and (18), respectively. The sidelobes at high frequencies above 2 kHz calculated using the matrix pseudoinverse in Fig. 1(a) are less pronounced than those calculated using Y^H in Fig. 1(b). Therefore, Eq. (16) is employed for the modal transformation hereafter.

FIG. 6. (Color online) Beampatterns of four MVDR beamformers: (a) the modal domain and open aperture array; (b) the modal domain and solid aperture array; (c) the space domain and open aperture array; and (d) the space domain and solid aperture array.


Figures 2(a)–2(d) illustrate the beampatterns of the DAS beamformers in the four configurations: the modal domain and open aperture array, the modal domain and solid aperture array, the space domain and open aperture array, and the space domain and solid aperture array. For frequencies above 4 kHz (bearing in mind that the theoretical spatial aliasing frequency is 4369 Hz), spatial aliasing is visible in all beampatterns. In particular, aliasing is more pronounced for the two modal domain beamformers than for the two space domain counterparts. To further examine this distinction, the DIs of the four beamformers are calculated as a function of frequency. The result in Fig. 3(a) shows that the DIs of the space domain beamformers are indeed higher than those of the modal domain beamformers, suggesting that the space domain beamformers are more resilient to spatial aliasing than the modal domain counterparts. For the space domain beamformers, the DI of the solid array is slightly higher than that of the open array at low frequencies but lower at high frequencies, except at 7–8 kHz; overall, they are comparable. For the modal DAS beamformers, the DI of the solid array is predominantly higher than that of the open array, presumably due to the zeros of the spherical Bessel functions. Figure 3(b) shows the WNG calculated for the four DAS beamformers. The WNGs of the two open arrays in the space and modal domains are nearly identical, and the WNGs of the two solid arrays in the space and modal domains are also close to each other. In general, the WNGs of the solid beamformers increase with frequency.

A natural question arises as to why the space domain arrays are more resilient to spatial aliasing than the modal domain arrays. To answer this question, the beampatterns of the ideal array response function (y = w_MD^H a_MD) and the array output response subject to the discrete approximation of the SFT (y = w_MD^H x_MD) are calculated for the modal solid DAS beamformer. A distinction in the beampatterns is clearly visible in the result of Fig. 4. The beampattern in Fig. 4(b) exhibits lower directivity and higher sidelobes than that in Fig. 4(a). This suggests that the discrete approximation of the continuous SFT in calculating the modal data is indeed the primary source of error of the modal beamformers, which leads to an increased sidelobe level over that of the space domain beamformers. Incidentally, this aspect is reminiscent of the independent modal space control (IMSC)30 invented by Leonard Meirovitch, where numerical error can arise because of the discrete approximation of a distributed "modal filter." Figure 5 shows the SLLrms plotted versus frequency, computed using Eq. (51) for the four configurations of spherical microphone arrays. The result reveals that the sidelobe levels of the modal domain beamformers are predominantly higher than those of the space domain counterparts, whether the aperture is open or solid.

B. MVDR beamformers formulated in the modal domain and the space domain

In addition to the DAS beamformers presented above, MVDR beamformers are also examined. Figures 6(a)–6(d) show the beampatterns of the four MVDR beamformers: the modal domain and open aperture array, the modal domain and solid aperture array, the space domain and open aperture array, and the space domain and solid aperture array. All MVDR beamformers are apparently more directional than the preceding DAS counterparts, particularly at low frequencies. However, the two space domain MVDR beamformers yield markedly narrower mainlobes and lower sidelobes than those of the modal domain beamformers. Figure 7 shows the DIs and WNGs of the MVDR beamformers for the preceding four configurations. The DIs of the space domain MVDR beamformers are clearly higher than those of the modal domain beamformers. The DI of the space domain MVDR beamformer with a solid aperture is slightly lower than that of the open array because of the increased sidelobes at high frequencies. This suggests that the solid array is slightly more susceptible to aliasing than the open array in the context of space domain MVDR design. Nevertheless, the DI of the modal domain MVDR beamformer with a solid aperture is predominantly higher than that of the open array, as in the preceding DAS beamformers, due to the zeros of the spherical Bessel functions associated with the open sphere. The WNGs of the modal domain beamformers are higher than those of the space domain beamformers at the cost of decreased DIs. Therefore, it can be concluded from the preceding results of the DAS and MVDR beamformers that the space domain beamformers can attain higher directivity than the modal domain beamformers in general.

FIG. 7. (Color online) Performance measures calculated for the four MVDR beamformers in Fig. 6. (a) DI, (b) WNG.

IV. SOURCE LOCALIZATION AND SEPARATION EXPERIMENTS

In this section, a solid spherical microphone array is applied in an experiment to source localization and separation problems. The sampling rate and fast Fourier transform (FFT) frame size are 16 kHz and 1024, respectively. A Hanning window with 50% overlap is used.31

Figure 8(a) shows the experimental arrangement. The experiment is conducted in a 5.4 m x 3.5 m x 2 m anechoic room, in which three sources broadcast speech signals from three azimuth angles. A sphere of radius 5 cm, on which 32 Knowles SPM0404HE5H-PB MEMS microphones are mounted on the vertices and the centers of the faces of an icosahedron, is constructed using 3D-printing technology, as shown in Fig. 8(b). The sensitivity of the MEMS microphones is -42 dB re 1 V/Pa at 1 kHz. Figures 9(a) and 9(b) show the beampatterns of the DAS beamformer formulated in the modal domain and the space domain, measured by using a white noise source positioned at the direction (\theta, \phi) = (90°, 180°). The measured beampatterns resemble the simulation results in Fig. 2. As in the simulations, the spatial aliasing phenomenon is more pronounced in the beampattern of the modal domain beamformer than in that of the space domain beamformer.

FIG. 8. (Color online) The experimental arrangement for source localization and separation using a spherical microphone array. (a) Spherical microphone array positioned in the center of the anechoic room with three speech source signals; (b) close-up view of the 3D-printed spherical microphone array with 32 MEMS microphones mounted on the surface.
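For readers reproducing the processing chain, the STFT front end stated above (16 kHz sampling, 1024-point FFT, Hanning window, 50% overlap) can be sketched as follows; the frame layout is an assumption consistent with the stated parameters, not code from the authors.

```python
import numpy as np

def stft_frames(x, nfft=1024, overlap=0.5):
    """Hanning-windowed, 50%-overlap frequency-domain frames of one microphone signal."""
    hop = int(nfft * (1.0 - overlap))
    win = np.hanning(nfft)
    n_frames = 1 + (len(x) - nfft) // hop
    frames = np.stack([x[i * hop:i * hop + nfft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)      # n_frames x (nfft/2 + 1) bins
```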

FIG. 9. (Color online) Measured beampatterns of the spherical microphone array for (a) the modal domain DAS beamformer and (b) the space domain DAS beamformer.

FIG. 10. (Color online) Experimental results of source localization using the space domain MPDR beamformer. (a) Angular spectrum versus azimuth and frequency; (b) frequency-averaged angular spectrum magnitude versus azimuth angle.

FIG. 11. (Color online) The MOS of PESQ calculated for the signals separated by the TIKR and CS algorithms, in comparison with the clean signals (reference) and the mixed signal picked up at one of the array microphones. The experiment is performed in the anechoic room.

In the following example, we localize and separate three speech source signals located at the directions (\theta, \phi) = (90°, 60°), (90°, 180°), and (90°, 280°). Space domain beamforming, which yields better directivity than modal domain beamforming (as indicated by the simulation results), is chosen for the experiment. Figures 10(a) and 10(b) show the angular spectra calculated using the space domain MPDR beamformer in terms of (azimuth, frequency) and azimuth, respectively. The spatial correlation matrix R_xx,SD is calculated by averaging 30 frames of frequency-domain microphone signals obtained from the FFT. Diagonal loading \epsilon = 10^{-5} is assumed. The angular spectrum in Fig. 10(b) clearly shows three peaks at (\theta, \phi) = (90°, 63°), (90°, 181°), and (90°, 285°), which gives the source locations.

Next, the source separation performance is evaluated objectively through a PESQ test. The MOS of the PESQ are summarized in Fig. 11. The results demonstrate the improved voice quality of the extracted signals achievable by the TIKR and CS methods. The MOS of the CS method is slightly higher than that of the TIKR method.

In addition to the objective tests, subjective listening tests are conducted to assess the separation performance in terms of three attributes: separation quality, distortion, and overall preference. Fourteen subjects, male and female, participate in the listening test. Listeners are given instructions on the definitions of the subjective attributes prior to the test. The perceived performance in terms of the subjective attributes is graded with a score ranging from 1 (very bad) to 5 (very good). The listening test is performed in accordance with the multiple stimuli with hidden reference and anchor (MUSHRA)32 procedure, which requires the use of a reference signal and an anchor signal. The clean speech signal is used as the reference, whereas the mixed signal picked up at one of the array microphones is used as the anchor signal. The acquired test data are processed by ANOVA to evaluate the statistical differences existing in the data. Figure 12 shows the results of the listening test. CS performs better than TIKR in separation quality, but worse in distortion. However, the overall preferences of TIKR and CS are comparable. The linear dependence of the overall preference on the separation quality and distortion is examined with the aid of a multiple regression analysis. The linear regression model obtained is

Overall = 0.192 + 0.570 \times Separation + 0.370 \times Distortion.    (52)

The coefficients in the preceding model reveal that separation quality presents a stronger influence on the overall preference than distortion. With a 95% confidence level, however, the p-values summarized in Table II suggest that separation quality is the only subjective attribute that is statistically significant, with a p-value less than 0.05. Therefore, it may be concluded that TIKR and CS perform comparably; CS performs better in separation quality at a minor cost of distortion, although the latter difference is not statistically significant.

FIG. 12. (Color online) The ANOVA output of the subjective listening test conducted in the anechoic room. The mean and 95% spread are indicated in the figure.

TABLE II. The p-values of the experimental results for the three attributes: separation quality, distortion, and overall preference. A p-value less than 0.05 indicates that a significant statistical difference exists among the compared data.

Attribute    Separation    Distortion    Overall
p-value      0.01942       0.15667       0.18841
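The statistical post-processing can be outlined as below; the raw listener scores are not published here, so the ANOVA helper only indicates how p-values such as those in Table II could be computed, and the regression function simply evaluates the fitted model of Eq. (52). Whether a one-way design matches the authors' analysis is an assumption of this sketch.

```python
import numpy as np
from scipy import stats

def attribute_p_value(*score_groups):
    """One-way ANOVA p-value for one subjective attribute (cf. Table II).
    Each argument is an array of listener scores for one processing condition
    (e.g., TIKR, CS, anchor); the raw scores themselves are not reproduced here."""
    _, p = stats.f_oneway(*score_groups)
    return p

def predicted_overall(separation, distortion):
    """Fitted multiple-regression model of Eq. (52)."""
    return 0.192 + 0.570 * separation + 0.370 * distortion
```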

V. CONCLUSIONS

In conclusion, the contribution of this work is twofold. First, the traditionally overlooked space domain spherical array is investigated in comparison to the mainstream modal domain formulation. Second, the proposed method is experimentally validated in the context of source localization and separation problems. Four generic configurations of spherical arrays are examined in the modal and space domains with open and solid apertures. The results reveal that the space domain beamformers outperform the modal domain beamformers in achievable directivity and resilience to spatial aliasing. The major source of error in the modal domain beamformers can be attributed to the discrete approximation of the SFT. A 32-microphone spherical array constructed by using 3D-printing technology was applied to a source localization and separation experiment. By using the space domain MPDR beamformer, the directions of arrival can be determined. In addition, objective and subjective tests suggest that the TIKR and CS algorithms are effective in separating the source signals. CS performs better in separation quality at a minor cost of distortion, although the latter is not statistically significant.

ACKNOWLEDGMENTS

The work was supported by the National Science Council (NSC) in Taiwan, Republic of China, under project number NSC 102-2221-E-007-029-MY3.

APPENDIX

The array spatial correlation matrix for the spherically isotropic noise field can be written as

\{R_{vv}\}_{lj} = \sigma_n^2 \int_{\Omega \in S^2} \left[ \sum_{n=0}^{\infty} \sum_{m=-n}^{n} b_n(ka) Y_n^m(\Omega_l) [Y_n^m(\Omega)]^* \right] \left[ \sum_{n'=0}^{\infty} \sum_{m'=-n'}^{n'} b_{n'}(ka) Y_{n'}^{m'}(\Omega_j) [Y_{n'}^{m'}(\Omega)]^* \right]^* d\Omega
= \sigma_n^2 \sum_{n=0}^{\infty} \sum_{m=-n}^{n} \sum_{n'=0}^{\infty} \sum_{m'=-n'}^{n'} b_n(ka) b_{n'}^*(ka) Y_n^m(\Omega_l) [Y_{n'}^{m'}(\Omega_j)]^* \int_{\Omega \in S^2} [Y_n^m(\Omega)]^* Y_{n'}^{m'}(\Omega) d\Omega,    (A1)

where \sigma_n is the root-mean-square noise amplitude. By orthonormality,

\{R_{vv,SD}\}_{lj} = \sigma_n^2 \sum_{n=0}^{\infty} \sum_{m=-n}^{n} |b_n(ka)|^2 Y_n^m(\Omega_l) [Y_n^m(\Omega_j)]^* = \sigma_n^2 a_{MD}^H(ka, \Omega_l) a_{MD}(ka, \Omega_j).    (A2)

Let A_MD(ka, \Omega) be the Nth-order modal domain manifold matrix,

A_{MD}(ka, \Omega) = [\, a_{MD}(ka, \Omega_1) \; \cdots \; a_{MD}(ka, \Omega_M) \,] \in \mathbb{C}^{(N+1)^2 \times M};    (A3)

we may then rewrite R_vv,SD as

R_{vv,SD} = \sigma_n^2 A_{MD}^H(ka, \Omega) A_{MD}(ka, \Omega) \in \mathbb{C}^{M \times M}.    (A4)

Alternatively, by angular averaging,

R_{vv,SD} = E_{\Omega}[ v_{SD} v_{SD}^H ] = \sigma_n^2 E_{\Omega}[ a_{SD} a_{SD}^H ] = \sigma_n^2 E_{\Omega}[ Y a_{MD} a_{MD}^H Y^H ] = \sigma_n^2 E_{\Omega}[ Y B y^H(\Omega) y(\Omega) B^H Y^H ] = \sigma_n^2 Y B E_{\Omega}[ y^H(\Omega) y(\Omega) ] B^H Y^H,

where

E_{\Omega}[ y^H(\Omega) y(\Omega) ] = \int_{\Omega \in S^2} \begin{bmatrix} Y_0^0(\Omega) \\ \vdots \\ Y_N^N(\Omega) \end{bmatrix}^* [\, Y_0^0(\Omega) \; \cdots \; Y_N^N(\Omega) \,] d\Omega.

By orthonormality,

\int_{\Omega \in S^2} Y_n^m(\Omega) [Y_{n'}^{m'}(\Omega)]^* d\Omega = \delta_{mm'} \delta_{nn'} \;\Rightarrow\; E_{\Omega}[ y^H(\Omega) y(\Omega) ] = I,

so that R_vv,SD = \sigma_n^2 Y B E_{\Omega}[ y^H(\Omega) y(\Omega) ] B^H Y^H = \sigma_n^2 Y B B^H Y^H. Since

B B^H = \mathrm{diag}\{ |b_0(ka)|^2, |b_1(ka)|^2, |b_1(ka)|^2, |b_1(ka)|^2, \ldots, |b_N(ka)|^2 \} = B^H B and A_{MD}(ka, \Omega) = B Y^H \in \mathbb{C}^{(N+1)^2 \times M},

R_{vv,SD} = \sigma_n^2 Y B B^H Y^H = \sigma_n^2 Y B^H B Y^H = \sigma_n^2 A_{MD}^H(ka, \Omega) A_{MD}(ka, \Omega) \in \mathbb{C}^{M \times M},    (A5)

which gives the same result as Eq. (A4).

1. B. Rafaely, Fundamentals of Spherical Array Processing (Springer, Berlin, 2015), 189 pp.
2. J. Meyer and G. Elko, "A highly scalable spherical microphone array based on an orthonormal decomposition of the soundfield," in IEEE International Conference on Audio Speech and Signal Processing (ICASSP), Orlando, FL (2002), Vol. 2, pp. 1781-1784.
3. T. D. Abhayapala and D. B. Ward, "Theory and design of high order sound field microphones using spherical microphone array," in IEEE International Conference on Audio Speech and Signal Processing (ICASSP), Orlando, FL (2002), Vol. 2, pp. 1949-1952.
4. S. Yan, "Optimal modal beamforming for spherical microphone arrays," IEEE Trans. Signal Process. 19(2), 361-371 (2010).
5. M. Park and B. Rafaely, "Sound-field analysis by plane wave decomposition using spherical microphone array," J. Acoust. Soc. Am. 118(5), 3094-3103 (2005).
6. B. Rafaely, "Phase-mode versus delay-and-sum spherical microphone array processing," IEEE Signal Process. Lett. 12(10), 713-716 (2005).
7. B. Rafaely, "Spatial aliasing in spherical microphone arrays," IEEE Trans. Signal Process. 55(3), 1003-1010 (2007).
8. B. Rafaely, "The spherical-shell microphone array," IEEE Trans. Audio Speech Lang. Process. 16(4), 740-747 (2008).
9. B. Rafaely, Fundamentals of Spherical Array Processing (Springer, Berlin, 2015), 73 pp.
10. S. Yan, H. Sun, and X. Ma, "Optimal modal beamforming for spherical microphone array," IEEE Trans. Audio Speech Lang. Process. 19(2), 361-371 (2011).
11. B. Rafaely, "Acoustic analysis by spherical microphone array processing of room impulse responses," J. Acoust. Soc. Am. 132(1), 261-270 (2012).
12. N. Huleihel and B. Rafaely, "Spherical array processing for acoustic analysis using room impulse responses and time-domain smoothing," J. Acoust. Soc. Am. 133(6), 3995-4007 (2013).
13. R. O. Schmidt, "Multiple emitter location and signal parameter estimation," IEEE Trans. Antennas Propag. 34(3), 276-280 (1986).
14. A. Koretz and B. Rafaely, "Dolph-Chebyshev beampattern design for spherical arrays," IEEE Trans. Signal Process. 57(6), 2417-2420 (2009).
15. H. Sun, S. Yan, and U. P. Svensson, "Space domain optimal beamforming for spherical microphone arrays," in IEEE International Conference on Audio Speech and Signal Processing (ICASSP), Dallas, TX (2010), pp. 117-120.
16. M. R. Bai, J. G. Ih, and J. Benesty, Acoustic Array Systems: Theory, Implementation, and Application (Wiley, Singapore, 2013), 536 pp.
17. G. F. Edelmann and C. F. Gaumond, "Beamforming using compressive sensing," J. Acoust. Soc. Am. 130(4), EL232-EL237 (2011).
18. A. Xenaki and P. Gerstoft, "Compressive beamforming," J. Acoust. Soc. Am. 136(1), 260-271 (2014).
19. M. R. Bai and C. H. Kuo, "Acoustic source localization and deconvolution-based separation," J. Comp. Acoust. 23, 1-23 (2015).
20. G. W. Elko, R. A. Kubli, and J. M. Meyer, "Audio system based on at least second-order eigenbeams," U.S. patent 7587054 (September 8, 2009).
21. G. W. Elko, R. A. Kubli, and J. M. Meyer, "Audio system based on at least second-order eigenbeams," U.S. patent 8433075 (April 30, 2013).
22. ITU-T Recommendation P.862: Perceptual Evaluation of Speech Quality (PESQ): An Objective Method for End-to-End Speech Quality Assessment of Narrow-Band Telephone Networks and Speech Codecs (International Telecommunication Union, Geneva, Switzerland, 2001), 21 pp.
23. ITU-T Recommendation P.862.2: Wideband Extension to Recommendation P.862 for the Assessment of Wideband Telephone Networks and Speech Codecs (International Telecommunication Union, Geneva, Switzerland, 2007), 4 pp.
24. E. G. Williams, Fourier Acoustics: Sound Radiation and Nearfield Acoustical Holography (Academic Press, New York, 1999), 305 pp.
25. M. Abramowitz and I. Stegun, Handbook of Mathematical Functions With Formulas, Graphs, and Mathematical Tables (Courier Corporation, Washington, DC, 1964), 1046 pp.
26. H. L. Van Trees, Optimum Array Processing (Wiley, New York, 2002), pp. 559-567.
27. M. R. Bai, Y. S. Hua, and C. C. Kuo, "An integrated recording and reproduction array system for spatial audio," in 21st International Congress on Sound and Vibration (ICSV 2014), Beijing, China (July 13-17, 2014).
28. S. Boyd and L. Vandenberghe, Convex Optimization (Cambridge University Press, New York, 2004), 716 pp.
29. M. Grant and S. Boyd, CVX: MATLAB Software for Disciplined Convex Programming (version 1.21), http://cvxr.com/cvx (Last viewed June 14, 2013).
30. L. Meirovitch and H. Baruh, "Robustness of the independent modal-space control method," J. Guid. Control Dyn. 6(1), 20-25 (1983).
31. A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, 3rd ed. (Prentice-Hall, London, 2009), 1132 pp.
32. ITU-R Recommendation BS.1534-1: Method for the Subjective Assessment of Intermediate Quality Levels of Coding Systems (International Telecommunication Union, Geneva, Switzerland, 2003), 18 pp.
